Disaster recovery

When something goes wrong — recovery

Stolen laptop. Locked Anthropic account. Corrupted ~/.claude/. Dead droplet. Blown MCP auth. An overzealous edit that wiped a memory file. Each of these has a procedure; this page is the catalogue.

Read once now. Practice the lock-down checklist next time you're bored on a flight. The middle of an actual incident is the wrong moment to learn what your backup strategy was.

On this page

The disasters
Backup strategy
Stolen / lost laptop
Locked Anthropic account
Corrupted ~/.claude/
Lost API keys
Blown MCP auth
Dead droplet
Dead NAS
Memory restore
Audit & early warning
The fire drill
Pitfalls
Related

01 — The disasters

What can actually break

Disaster	Likelihood	Blast radius	Practiced response time
Laptop stolen / lost	Once a decade	SSH key + auth.json compromised; need to rotate	15 min
Account locked / suspended	Rare	Claude.ai access; managed agents; mobile	Days (Anthropic timeline)
Corrupted `~/.claude/`	A few times a year	Settings/skills broken on one machine	5 min
Lost API key	Annual	Local CLI on servers; CI	10 min
MCP auth expired	Monthly-ish	One MCP tool stops working	2 min
Dead droplet	Once per provider lifetime	All public sites; remote claude	2–3 hours rebuild
Dead NAS / failed RAID	Every 5–8 years	Backup chain; queue files	Hours to days
Memory file overwritten	Rare but real	Lost a memory; subtle behaviour drift	2 min if versioned
Sync conflict on settings.json	Common	Hook silently stops firing	10 min

The point isn't that all of these will happen. The point is that none of them feel survivable when they happen and you don't have a plan. The procedures below are the plans.

02 — Strategy

The six-leg push, applied to `~/.claude/`

WholeTech already runs a 6-leg backup pattern for the websites tree (droplet, B2, GitHub, Drive, HS NAS, CC NAS). Apply the same pattern to your Claude config and you get disaster recovery basically for free.

Leg	What lives there	How fresh
Primary PC `~/.claude/`	Live, edited daily	Now
GitHub (private repo)	Whatever's been pushed	Last commit (target: same day)
Secondary PCs	Last `git pull`	Last nightly pull (within 24h)
B2 bucket	Compressed snapshot	Nightly
Google Drive	Rclone mirror	Nightly
HS NAS + CC NAS	Local mirror	Nightly

The hierarchy is:

Last 24 hours → secondary PCs (fastest, online).
Last week → GitHub repo history (versioned, diffable).
Last month → B2 daily snapshots.
Older → Drive + NAS cold storage.

If you only do one thing: commit ~/.claude/ to a private GitHub repo, with credentials such as auth.json kept out of it. That single leg covers 95% of recovery scenarios: corrupted config, lost machine, rolling back a mistake. The other five legs are insurance for the remaining 5%.
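A minimal sketch of that one leg, assuming your credentials live in auth.json and *.env files (adjust the ignore list to whatever your install actually stores). The function name and the repo URL in the comments are placeholders, not part of any Claude tooling.

```shell
# write_claude_gitignore [DIR]: keep credentials out of the config repo.
# The ignore list below is an assumption; extend it for your own secrets.
write_claude_gitignore() {
  wg_dir=${1:-"$HOME/.claude"}
  cat > "$wg_dir/.gitignore" <<'EOF'
# credentials and per-machine secrets stay out of the repo
auth.json
*.env
# local salvage copies
*.bak
EOF
}

# Typical first push (the remote URL is a placeholder):
#   write_claude_gitignore ~/.claude
#   cd ~/.claude && git init && git add -A && git commit -m "Initial snapshot"
#   git remote add origin git@github.com:you/claude-config.git
#   git push -u origin HEAD
```

Writing the ignore file before the first `git add` matters: a secret that lands in history once has to be rotated, not just deleted.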

03 — Stolen / lost

Laptop is gone

Treat as compromised. The threat isn't "they might find the files" — full-disk encryption handles that. The threat is "they might be holding a logged-in machine right now." Lock down accounts the laptop had access to, in order of value.

1. Revoke the Anthropic session. claude.ai → Settings → Sessions → revoke the missing device. This forces a re-login.
2. Rotate the SSH key. From any other machine in the fleet that still has access, log in to the droplet, edit ~/.ssh/authorized_keys, and delete the laptop's public key. Verify: a grep for the laptop key's comment in authorized_keys should return nothing. If you hold an escrowed copy of the private key, ssh -i missing-key root@wholetech.com should now fail.
3. Rotate the Anthropic API key if one was on the laptop. console.anthropic.com → API Keys → revoke the named key, issue a new one. Update every other machine that needed it.
4. Rotate the GoDaddy API key if godaddy.env was on the laptop. developer.godaddy.com → revoke + reissue. Update ~/.secrets/godaddy.env on the remaining PCs.
5. Rotate the GitHub PAT if one was on the laptop. github.com/settings/tokens → revoke. Reissue, redeploy.
6. Rotate the rclone remote credentials if rclone.conf was on the laptop. Each remote (B2, Google, etc.) has its own console; rotate the keys there.
7. Lock and erase remotely. Find My (Apple) or Find My Device (Google), if the laptop is enrolled in one of them.
8. Replace the laptop. From a fresh machine: re-run the bootstrap procedure (clone the dotfiles repo, run claude login, replace the per-PC secrets from your secrets vault).

Time matters. Step 1 (revoke session) and step 2 (rotate SSH key) in the first 15 minutes; the rest within a few hours. Don't wait until tomorrow.

Per-machine SSH keys make this much easier

If every machine has its own SSH key (laptop has laptop-ed25519, desk PC has desk-ed25519, etc.), rotation is one line in ~/.ssh/authorized_keys on the droplet. If they all share one key, you're rotating one key on every machine. Always per-machine.
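As a sketch of that one-line rotation, the helper below removes an entry from an authorized_keys file by its trailing comment and leaves a .bak copy beside it. It assumes each key carries a single-word comment such as laptop-ed25519; revoke_key is a hypothetical name.

```shell
# revoke_key FILE COMMENT: drop every key line whose last field equals COMMENT.
revoke_key() {
  rk_file=$1
  rk_comment=$2
  cp "$rk_file" "$rk_file.bak" || return 1        # keep a rollback copy
  # The comment is the last whitespace-separated field of a key line.
  # Writing with > truncates in place, so the file keeps its mode and owner.
  awk -v c="$rk_comment" '$NF != c' "$rk_file.bak" > "$rk_file"
}
```

On the droplet this would be run as `revoke_key ~/.ssh/authorized_keys laptop-ed25519`. With a shared key there is no comment that identifies only the stolen machine, which is the practical argument for per-machine keys.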

04 — Locked out

Anthropic account is locked

Rare. Happens if Anthropic flags unusual activity, payment failure, or a violation. The frustrating part: there's no instant fix; you're on Anthropic's support timeline. But you can keep working in the meantime.

Contact Anthropic support immediately. support.anthropic.com — describe what you were doing right before the lock. Honest, factual.
Don't try to log in repeatedly — that often extends the lock.
Switch local Claudes to a different account temporarily. If you have a team/business workspace, use that. If not, a colleague or family account can keep you running for non-sensitive work.
Pay-per-use via the Anthropic API. An API key from a different Anthropic Console account (or your team account) keeps Claude Code working: export ANTHROPIC_API_KEY=... before launching claude. The CLI keeps working even if your claude.ai account is locked.
Mobile and web are dead. Until the lock lifts. Plan around that.
Document for review. What you were doing, when, on which IP, with which key. Send the bundle to support; speeds up resolution.

Prevention: keep two paths to Claude, one via your primary claude.ai account and one via an API key on a separate (or team) Anthropic account. Either alone can fail; both at once is rare.

05 — Corrupted config

~/.claude/ is broken

A bad JSON edit; a Dropbox conflict; a permissions glitch; an OS upgrade that scrambled file ownership. The most common shape: Claude starts and immediately errors on parsing settings.json, or hooks silently stop firing.

Diagnosis first

Try claude --version. If this works, the CLI binary is fine. If not, reinstall.
Run jq . ~/.claude/settings.json and see whether it errors. If it does, settings.json is invalid JSON; that's by far the most common cause.
Try claude --debug. The verbose logs usually point at the broken file.
Look for an auto-backup directory. Some Claude Code versions write timestamped copies under ~/.claude/backups/ before saving. If that directory exists on your machine, roll back to the most recent copy.

If you have the git repo

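$ cd ~/.claude
$ git status                          # see what changed
$ git diff settings.json              # inspect the damage before discarding it
$ git checkout HEAD -- settings.json  # revert the file to the last commit
$ git log -p settings.json            # still wrong? read the history and pick an older commit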

If you don't

# on another machine in the fleet, copy the file across
$ scp other-pc:.claude/settings.json ~/.claude/settings.json

# or pull from your B2 snapshot
$ rclone copy b2:walhus-backups/claude-config/latest/settings.json ~/.claude/

Nuclear option

Move the whole ~/.claude/ aside (don't delete it; you might need to copy memory back later), clone your dotfiles repo into its place, then re-authenticate.

$ mv ~/.claude ~/.claude.bak                  # keep the old tree for salvage
$ git clone <repo> ~/.claude                  # restore the tracked config
$ claude login                                # re-authenticate...
$ cp ~/.claude.bak/auth.json ~/.claude/       # ...or reuse the old credentials instead, if they are intact

06 — Lost keys

Lost or leaked API keys

Revoke the key. console.anthropic.com → API Keys → revoke the named key. Immediately. Don't wait to figure out if it actually leaked.
Issue a replacement, with the same scope. Name it so you'll know what it is in six months: laptop-cli-2026-05, droplet-cron-2026-05, ci-builds-2026-05.
Update everywhere it was used. Each machine's ANTHROPIC_API_KEY env var, each CI job's secret, the droplet's /etc/environment, any rclone-driven config sync.
Audit recent usage. The Anthropic console shows usage by key. Look for activity that wasn't yours; that tells you whether the key actually was used by an attacker.
Set a spend limit on the workspace that holds the new key if you didn't have one. Limits the blast radius of the next leak.

Naming convention worth adopting: <host>-<purpose>-<year-month>. droplet-cron-2026-05. laptop-cli-2026-05. Makes audit and rotation trivial.
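A small sketch of how the convention pays off at audit time: one helper builds a name, the other reads the date suffix back and reports the key's age in months. Both function names are hypothetical, and the age check only works for keys that follow the convention.

```shell
# key_name HOST PURPOSE [YYYY-MM]: build a <host>-<purpose>-<year-month> name.
key_name() {
  printf '%s-%s-%s\n' "$1" "$2" "${3:-$(date +%Y-%m)}"
}

# key_age_months NAME [YYYY-MM]: months between NAME's date suffix and now.
key_age_months() {
  ka_now=${2:-$(date +%Y-%m)}
  ka_made=$(printf '%s\n' "$1" | sed -n 's/.*-\([0-9]\{4\}-[0-9]\{2\}\)$/\1/p')
  [ -n "$ka_made" ] || return 1          # name does not follow the convention
  ka_y1=${ka_made%-*}; ka_m1=${ka_made#*-}
  ka_y2=${ka_now%-*};  ka_m2=${ka_now#*-}
  # ${var#0} strips a leading zero so "08" is not parsed as octal
  echo $(( (ka_y2 - ka_y1) * 12 + ${ka_m2#0} - ${ka_m1#0} ))
}
```

For example, `key_age_months droplet-cron-2026-05 2027-02` prints 9, which is enough to drive a "rotate anything older than N months" reminder.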

07 — MCP auth

An MCP server stopped working

Almost always auth-related. Google OAuth tokens refresh quietly until they don't; GitHub PATs expire; database passwords change. Symptoms: Claude says it can't reach that MCP tool, or the tool list is shorter than usual.

Identify which server. Run /mcp inside Claude — the status of each is shown. Red/yellow = auth issue.
Re-authenticate. For OAuth-based ones (Google, Slack), select the server in the /mcp menu and start the authentication flow from there. For token-based ones (GitHub, Postgres), update the env var or the secret file.
Restart the Claude session. The MCP server is a subprocess; it reads the env at startup. Quit and reopen.
Check ~/.claude/mcp-needs-auth-cache.json — Claude tracks which servers are pending auth here. If it's stale, delete it; it regenerates.

The renewal cadence to remember

MCP server type	Token lifetime	Renewal
Google OAuth (Drive, Gmail, Calendar)	Refresh tokens last months; access tokens auto-refresh	Mostly silent; re-auth every 6-12 months
Slack OAuth	Long-lived	Rarely needs intervention
GitHub PAT (classic)	You set it; up to 1 year	Calendar this; rotate on a schedule
GitHub PAT (fine-grained)	Up to 1 year, default 90 days	Renew via console; update env vars
Postgres / DB	Until someone rotates the password	Update `DATABASE_URL` env var

08 — Dead droplet

The droplet is down

DigitalOcean (or whoever) is having a bad day; the droplet was rebuilt accidentally; the disk filled and nginx fell over. Distinguish "down" (will come back) from "lost" (need to rebuild).

If "down"

Check provider status page. If known incident, wait.
If droplet is up but services are down, SSH in. Look at journalctl -u nginx -n 100.
Most common: disk full. Confirm with df -h, then run journalctl --vacuum-size=200M to shrink the journal under /var/log and trim old /var/www/*/sessions/ dirs.

If "lost" — rebuild from scratch

Spin up a fresh droplet. Same size, same OS, same hostname (for DNS continuity).
Update DNS A records if the new IP differs. GoDaddy API per the runbook.
Restore the websites tree. From your most recent backup leg (CC NAS or B2). rclone copy b2:walhus-backups/var-www/ /var/www/.
Reinstall nginx and certbot. apt install nginx certbot python3-certbot-nginx. Restore vhosts from the backup of /etc/nginx/sites-available/.
Re-issue certs. certbot --nginx -d <domain> for each — DNS already points at the new IP, so the HTTP-01 challenge works.
Reinstall Claude Code. npm i -g @anthropic-ai/claude-code. Set ANTHROPIC_API_KEY in /etc/environment.
Restore the cron jobs. From the backup of /etc/cron.d/ or the user crontab dump.
Verify. curl -sI https://<each-site> → all 200. ssh root@wholetech.com 'claude --version'.
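The verify step can be scripted so that it fails loudly instead of relying on eyeballing curl output. A sketch, assuming curl is installed; fetch_status and verify_sites are illustrative names, and the domain list is whatever the droplet serves.

```shell
# fetch_status DOMAIN: print the HTTP status of a HEAD request to https://DOMAIN/.
fetch_status() {
  curl -s -o /dev/null -I -w '%{http_code}' --max-time 10 "https://$1/"
}

# verify_sites DOMAIN...: list every domain that does not answer 200.
# Returns non-zero if at least one domain failed.
verify_sites() {
  vs_bad=0
  for vs_domain in "$@"; do
    vs_code=$(fetch_status "$vs_domain")
    if [ "$vs_code" != "200" ]; then
      echo "FAIL $vs_domain (${vs_code:-no response})"
      vs_bad=1
    fi
  done
  return "$vs_bad"
}
```

Run it as `verify_sites wholetech.com <other-domains>`. Sites that redirect at the root will report 301 or 302, so point the check at a URL you expect to answer 200 directly.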

Rebuild time: with the runbook handy and current backups, count on 2-3 hours for a full rebuild. Without them, count on a weekend.

09 — Dead NAS

The NAS is dead

Drive failure; controller failure; power surge. Less acute than a dead droplet (no public services go down) but slower to fully recover (drives to source, RAID to rebuild, hours to days of resync).

Identify scope. Single drive failure → RAID handles it; rebuild after replacement. Multiple drives or controller → restore from other legs.
Pause writes. Stop scheduled jobs that mirror to this NAS. They'll queue locally; resume when the NAS is back.
Restore from B2. Your overnight cloud backup is the next-best source. rclone copy b2:walhus-backups/ /mnt/nas/restore/.
Restore from the other NAS. If HS NAS is dead, CC NAS still has the mirror (and vice versa). This failure is exactly what the two-NAS topology exists to survive.
Replace, rebuild, resume. Source drives, rebuild the array, re-establish the mirror jobs, verify checksums.

The Claude angle: if your task queue lives on this NAS, queue work pauses until restore. Workers should keep heartbeating; new tasks queue at the orchestrator until the queue layer is back.
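For the "verify checksums" step, one possible approach is a SHA-256 manifest written on the healthy side and checked on the rebuilt side. This sketch assumes GNU coreutils sha256sum (on macOS, shasum -a 256 is the nearest equivalent); both function names are hypothetical.

```shell
# make_manifest DIR: print one SHA-256 line per file under DIR, paths relative to DIR.
make_manifest() {
  ( cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2 )
}

# verify_manifest DIR MANIFEST: succeed only if every listed file still matches.
verify_manifest() {
  # Resolve the manifest to an absolute path so it survives the cd below.
  vm_manifest=$(cd "$(dirname "$2")" && pwd)/$(basename "$2")
  ( cd "$1" && sha256sum -c "$vm_manifest" >/dev/null 2>&1 )
}
```

Typical use: `make_manifest /mnt/cc-nas/mirror > /tmp/mirror.sha256` on the surviving NAS, then `verify_manifest /mnt/hs-nas/mirror /tmp/mirror.sha256` after the resync. The mount points are placeholders.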

10 — Memory

Restoring an auto-memory file

A particularly insidious failure mode: a memory file gets overwritten or deleted, and you don't notice for a week — by which time Claude has been operating without that context. By then the symptom is "Claude keeps making the mistake I corrected last month."

If memory is in git (recommended)

cd ~/.claude
git log -p projects/<slug>/memory/ — see what changed, when.
git checkout <good-commit> -- projects/<slug>/memory/<file>.md
Verify MEMORY.md still references the restored file. If you'd already removed the index entry, restore that too.

If memory is in OneDrive / Dropbox

Both services keep file version history. OneDrive: right-click the file in Explorer → Version history. Dropbox: right-click → Version history (or web UI → the file → Version history). Restore the version from before the mistake.

If memory is gone everywhere

This is why I keep saying "version your memory." If you've lost a memory and have no version history, reconstruct from:

Your conversation history: scroll back in claude.ai or in CLI ~/.claude/projects/<slug>/history.jsonl — find the session where the memory was first written; Claude probably summarised it back to you at the time.
The PR or commit where it was first introduced: often memories about a codebase have a corresponding commit you can read.
Your physical notes, if you have any.

11 — Audit

Logging for early warning

The cheapest disaster is the one you catch a minute after it happens, not a week. Three lightweight monitors that pay back hundreds of times their setup cost:

1. Heartbeat per machine

Every machine writes $(hostname) $(date) to a shared file (NAS or droplet) every 10 minutes. A separate process checks the file every hour; any host whose heartbeat is >30 min old fires an alert. Catches "the scheduled task on the laptop quit working" within an hour.

# cron entry, every 10 minutes
*/10 * * * * echo "$(hostname)  $(date -Iseconds)  alive" >> /mnt/nas/fleet-heartbeats.log
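The hourly checker could look something like the sketch below. It assumes GNU date (for date -d) and the log format written by the cron entry above; stale_hosts is a hypothetical name, and the 1800-second default matches the 30-minute threshold.

```shell
# stale_hosts LOG [MAX_AGE_SECONDS] [NOW_EPOCH]: print hosts whose newest
# heartbeat is older than MAX_AGE_SECONDS (default 1800).
stale_hosts() {
  hb_log=$1
  hb_max=${2:-1800}
  hb_now=${3:-$(date +%s)}
  # The log is append-only, so the last line seen for a host is its newest.
  awk '{ latest[$1] = $2 } END { for (h in latest) print h, latest[h] }' "$hb_log" |
    sort |
    while read -r hb_host hb_ts; do
      hb_last=$(date -d "$hb_ts" +%s 2>/dev/null) || continue   # skip unparseable lines
      if [ $((hb_now - hb_last)) -gt "$hb_max" ]; then
        echo "$hb_host"
      fi
    done
}
```

Run hourly, any output from `stale_hosts /mnt/nas/fleet-heartbeats.log` is the alert body. A host that has never written a heartbeat will not appear, so seed the log once per machine.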

2. settings.json validity check

Once a day, on each machine, validate that ~/.claude/settings.json parses. If it doesn't, ping the notification channel.

# cron entry, once a day at 09:00
0 9 * * * jq . ~/.claude/settings.json > /dev/null 2>&1 || curl -d "settings.json broken on $(hostname)" https://ntfy.sh/walhus-claude
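If jq is not installed on every machine, the same check can fall back to Python's built-in JSON parser. A sketch under that assumption; json_ok and check_settings are illustrative names, and treating an empty file as broken is a choice made here, not something Claude Code specifies.

```shell
# json_ok FILE: succeed only if FILE is non-empty and parses as JSON.
json_ok() {
  [ -s "$1" ] || return 1                          # missing or empty counts as broken
  if command -v jq >/dev/null 2>&1; then
    jq empty "$1" >/dev/null 2>&1
  elif command -v python3 >/dev/null 2>&1; then
    python3 -m json.tool "$1" >/dev/null 2>&1      # fallback when jq is absent
  else
    return 2                                       # no validator available
  fi
}

# check_settings [FILE]: print a one-line alert when validation fails.
check_settings() {
  cs_file=${1:-"$HOME/.claude/settings.json"}
  json_ok "$cs_file" || echo "settings.json broken on $(hostname): $cs_file"
}
```

check_settings prints nothing when the file is healthy, so a wrapper only needs to forward non-empty output to the notification channel.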

3. Disk space

The droplet running out of disk has been the cause of more weekend incidents than anything else. df -h at 09:00 daily; warn if /var is >80%.
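A minimal sketch of that daily check, built on df -P so the column layout is predictable. The function names are hypothetical and the 80% default mirrors the threshold above.

```shell
# parse_use_pct: read `df -P` output on stdin, print the last row's Use% as a bare number.
parse_use_pct() {
  awk 'NR > 1 { gsub(/%/, "", $5); pct = $5 } END { print pct }'
}

# disk_alert PATH [THRESHOLD]: print a warning if PATH's filesystem is fuller
# than THRESHOLD percent (default 80).
disk_alert() {
  da_pct=$(df -P "$1" | parse_use_pct)
  if [ "$da_pct" -gt "${2:-80}" ] 2>/dev/null; then
    echo "disk ${da_pct}% full at $1 on $(hostname)"
  fi
}
```

Scheduled at 09:00, `disk_alert /var` prints nothing on a healthy droplet, so any output can be forwarded straight to the notification channel.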

12 — Practice

The fire drill

Once a quarter, practise. Pick a scenario, set a timer, run the procedure. You'll find out which steps are stale, which secrets you can't actually locate, which backups didn't actually back up.

Three drills worth running

15 minutes

"Laptop stolen"

Revoke session, rotate SSH key, rotate API key. Time yourself end-to-end. If you can't do it in 15 minutes from memory, your runbook is incomplete.

2 hours

"Restore the dotfiles"

On a spare VM: wipe ~/.claude/, restore from scratch using only the backups. Identify which files needed manual recovery (auth.json, secrets) and document the gap.

half a day

"Rebuild the droplet"

Spin up a new droplet, restore the websites tree from B2, re-issue certs, verify all sites. If a domain doesn't come back, you know about a hole in your backup before you really needed it.

After every drill: update the runbook with whatever you learned. The drill that doesn't change the runbook didn't teach you anything — and the drill that surfaced ten changes paid for itself ten times over.

13 — Pitfalls

What goes wrong in recovery

stale

The backup that didn't run

The scheduled task disabled itself silently after a permission change. Your "nightly" backup is six weeks stale. Heartbeat your backup jobs — not just their existence, their last-success time.
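One way to track last-success time rather than mere existence: have each backup job write a timestamp only when it finishes cleanly, and have a separate check read it. A sketch; the function names and the stamp-file path in the usage note are placeholders.

```shell
# mark_success STAMP_FILE: record "now" as the last successful run.
# Call it as the final step of the backup job, after everything else succeeded.
mark_success() {
  date +%s > "$1"
}

# backup_age_hours STAMP_FILE [NOW_EPOCH]: whole hours since the last success.
# Fails if the stamp file is missing, which is itself worth alerting on.
backup_age_hours() {
  [ -r "$1" ] || return 1
  ba_last=$(cat "$1")
  ba_now=${2:-$(date +%s)}
  echo $(( (ba_now - ba_last) / 3600 ))
}
```

In the job itself: `rclone sync ~/.claude b2:walhus-backups/claude-config/latest && mark_success ~/.cache/backup-b2.ok`. A nightly job whose age climbs past 26 hours or so has missed a run.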

missing

The credential you forgot to back up

You restored everything except the godaddy.env file because it was correctly excluded from git. Now you can't change DNS. Maintain a per-PC secrets manifest (1Password, Bitwarden) separate from the git-tracked config.

version

Restoring a config that's too old

Your settings.json from six months ago references MCP servers and skills that have since been renamed. Restore the newest copy that predates the breakage and fix it forward by hand; restore-and-fix is usually faster than restore-and-pray.

no-creds

You can't authenticate to your own backups

The B2 access key was on the laptop that died. Always keep at least two paths to your backup storage — primary access key on machine A, recovery key in 1Password.

trust

The "I'll fix it later" trap

The MCP auth has been broken for three weeks; you've been working around it. By the time you fix it, you've forgotten what the working state looked like. Fix soon; document while it's fresh.

scope

Restoring more than you need

Don't restore the entire ~/.claude/ when only one file is corrupt. Surgical restores are faster and don't risk reverting unrelated good changes.
