atproto pds migration IN ANGER

%at=2025-07-24T23:26:49.639Z

#author_luna #atproto #sysadmin-notes #blogpost

an edited neon genesis evangelion frame showing 3 large crosses, overlaid with @dholms.xyz's diagram of how a PDS migration works, where each actor in the migration is a cross: "Old PDS", "Client", and "New PDS"

the context #

it's not obvious to everyone, but I'm on bluesky (same handle as the website!). I've been going through it and exploring its design since around feb/march iirc, with some projects:

this does mean a subset of my friends are also on bluesky, and a subset of those are inside (thunder sfx) The United Kingdom (thunder sfx). Laurens Hof puts the "current happening" around UK users pretty well on his Bluesky Report:

Bluesky has announced it is rolling out an age verification system in order to comply with the UK’s Online Safety Act. Users in the UK will be asked to verify their age, using a variety of options. Bluesky uses Epic’s Kids Web Service for age verification, which allows users to verify via an ID scan, payment card verification or face scan. If users in the UK choose not to verify their age they can still use the Bluesky app, and only adult content as well as the DMs will be made inaccessible. Bluesky is implementing the system as a compliance with the Online Safety Act, which requires all platforms that contain adult content and can potentially be accessed by children in the UK to implement a “highly effective age assurance” system. This part of the law goes in effect on July 25th, and non-compliance risks a fine of £18 million. Bluesky PBC is implementing this age verification system in their own apps only, and other Bluesky clients have their own responsibility to implement such an age verification system. Other clients have not yet announced to be implementing an age verification system, meaning that users in the UK who do not want to share their information can sidestep this barrier by using another client to access the network.

I'm not from the UK, but I take a pretty strong stance against sharing my personal identity data in that way, and some friends of mine also share the same vision. thankfully, the way Bluesky has complied with this is through a request to the user's PDS. while bluesky's official PDSes implement the relevant identity check, users outside of the bluesky PDS can "MITM" the request and just return that the identity has been verified.

quick atproto primer #

there are various ways of explaining how the AT protocol works; I'll give a shorter one here, but I recommend ATProto for distributed systems engineers.

okay, so, in the AT protocol there are 4 main "roles" in the network. these roles are taken by specific servers:

identity is done in atproto via the DID scheme; most users are on the did:plc method. at a VERY high level, the did plc service is a public centralized blockchain. it's not truly a blockchain in the cryptocurrency sense because the chains are per-user, not global. it's centralized for efficiency purposes but auditable, since all events in the plc are signed and verifiable against a public key. you can look at my own plc data at https://plc.directory/did:plc:ghmhveudel6es5chzycsi2hi; you can see that the identity document links to a "pds", which is described next

data storage in atproto is done via the PDS (Personal Data Server). it holds all the records that you make (collectively called a repository, or just repo, authenticated via the plc key), as well as those of every other user inside the PDS. this lets the PDS have extremely low operational overhead per user (say, a user has a direct cost of only ~20MB of storage total inside the PDS, though someone who really likes making memes may be in the hundreds of MB or a gigabyte), and as such there are thousands of them on the internet. you can even see mine! right now it's pds.bsky.ln4.net.

data aggregation in atproto is desirable because, with thousands of PDSes, you need some way to aggregate them all into one computationally efficient entity on the network that you can just plug yourself into and get the... firehose of data (pun intended). on the network that's called a "relay", and its whole job is discovering new PDSes (either by the PDS announcing itself to the relay (it's an env var you set on the PDS) or via accounts referencing each other through likes, posts, etc), connecting to them, and providing a realtime data stream. with this component you would be able to keep a full clone of all public data on the network (but that's not required for the relay anymore! see sync 1.1), and my own vibe-check estimates would put it in the <10TB range as of 2025-07.

a user-facing frontend in atproto is required because you need some way to actually interact with the network. atproto is a neutral protocol; bluesky is an application developed on top of it. the identity, pds, and relay do not care what kind of application you're developing, only that there's some authenticated data that has to be shipped in real time to everyone else that does care. currently this component is called the "appview" and what exactly it does is still kind of nebulous. more information and discussion about the role of the appview here.

how does this relate to the UK #

to repeat the intro:

thankfully, the way Bluesky has complied with this is through a request to the user's PDS. while bluesky's official PDSes implement the relevant identity check, users outside of the bluesky PDS can "MITM" the request and just return that the identity has been verified.

the way appviews are supposed to work is that the frontend component makes requests to the user's PDS, which may then proxy the request to another service to fetch more data. this is done because it's your PDS that holds your authentication (username and password), not the appview! appviews and other components actually validate tokens minted by your PDS (with your identity keys from the PLC).

the way this connects together is that when the bluesky frontend detects the user is from the UK (via their own geoip service https://bsky.app/ipcc, which is also used to decide which country-level moderators to enable for the account), the frontend will make a request to <pds>/xrpc/app.bsky.unspecced.getAgeAssuranceState.

you can bypass the check in other ways, since the check is done inside bluesky's own frontend; @mary.my.id made a list. the PDS way is the "most stable" one, since it requires no change to any clients and no patches to the mobile app. the article also goes into the legality, and why the way bluesky has implemented this is something to be "thankful" for: they could've done worse, in ways that would heavily challenge the decentralization of the network.

well, let's migrate PDSes then! #

if the bluesky PDS is requiring a government ID check, and we can bypass it by just saying it's fine on a custom PDS, then we should be able to migrate PDSes, right? given that atproto decouples your identity from your data storage, that is very much possible! the way the process works is, roughly:

  1. create a (deactivated) account on the new PDS reusing your existing DID
  2. export your repository from the old PDS as a CAR file and import it into the new one
  3. copy over your blobs (images) and app preferences
  4. sign and submit a PLC operation so your identity points at the new PDS
  5. activate the new account and deactivate the old one

but for a non-technical user that is not very friendly. you can carry out those operations automatically (via an all-in-one "migrate" command) or manually with the goat cli (more info here), but a specific tool caught my eye: ATP Airport, made by @knotbin.com to make that process more automated/user-friendly for someone that doesn't know what a CAR file is. so I've kept it in the back of my mind for a while, for the rare occurrence that I would need to migrate PDSes. it would be rare for me because I created my account on my own PDS and haven't migrated. but maybe. just in case...
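for vocabulary's sake, the whole migration boils down to a sequence of XRPC calls against both PDSes. a sketch of the order (the lexicon method names are real; auth tokens, the CAR/blob payloads, and the emailed PLC confirmation token are all omitted, so this is vocabulary, not a runnable client):

```python
# the rough shape of a PDS migration as XRPC calls; each entry is
# (which server the call goes to, lexicon method).
MIGRATION_STEPS = [
    ("new pds", "com.atproto.server.createAccount"),       # created deactivated, reusing your DID
    ("old pds", "com.atproto.sync.getRepo"),               # export your repo as a CAR file
    ("new pds", "com.atproto.repo.importRepo"),            # import it (the step the Airport got wrong)
    ("old pds", "com.atproto.sync.listBlobs"),             # then fetch + re-upload each blob
    ("new pds", "com.atproto.identity.getRecommendedDidCredentials"),
    ("old pds", "com.atproto.identity.requestPlcOperationSignature"),  # emails you a token
    ("old pds", "com.atproto.identity.signPlcOperation"),
    ("new pds", "com.atproto.identity.submitPlcOperation"),  # identity now points at the new pds
    ("new pds", "com.atproto.server.activateAccount"),
    ("old pds", "com.atproto.server.deactivateAccount"),
]

for server, method in MIGRATION_STEPS:
    print(f"{server}: {method}")
```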

and then the age check rolled out on bluesky for all UK users yesterday, july 24th.

the amount of people in my circles that run their own PDS is extremely low, so I was the obvious point of contact to be the target/"new PDS" in a migration process. I hadn't done a migration myself, so I decided to point one of them to the ATP Airport. since we were unsure about the entire process, the plan was:

  1. create a new account on my PDS (should be simple enough, I made some for myself and nagare, but I needed to test the check bypass and I'm very much not VPN'ing myself to inside the UK, lmao)
  2. migrate a throwaway account (to be created in bluesky's PDS) with the Airport
  3. migrate an alt account with the Airport
  4. migrate a main account with the Airport

step 1 was smooth, sadly we forgot step 2 for the first friend, step 3 was smooth, but step 4 failed with a weird error. then we proceeded to investigate the state of the alt account, and found all sorts of inconsistencies: missing records, signature validation failures.

that's when I started writing an immediate issue to get more eyes on it ASAP: migration failure caused account breakdown, as well as going on bluesky itself to find help. I'm not an atproto expert and needed as much help as I could get, both in understanding what happened and in recovering the account. that same friend suggested giving up on the alt and just creating a new one; I declined that from both a "this should never happen, it is useful for everyone to understand what happened" and a "you are now MY user, I HAVE to fix this 🧎‍♀️" perspective.

to make it clear: ATP Airport shouldn't have caused this, and @knotbin.com's response to it is exactly what should be done: migrations on the Airport were disabled as soon as they found out this happened, and will come back once the bug has been fixed.

after the relevant issue was made we started to investigate. we didn't understand how the migration process worked at the time; I'm giving you the rundown of it here so it can be used as vocabulary. after some evidence gathering we understood what happened: the Airport uploaded the CAR file, containing all user records, to the wrong account. this caused all records on the previous account to just be gone, and that threw me for a spin on how to recover from this with minimal disruption.

recovering from a CAR accident IN ANGER #

as soon as we noticed this we ran to get the CAR files that were still available on the bluesky PDS, the ones that weren't overwritten by the Airport, and we were able to get CAR files containing all records for both the alt and main accounts. then came the bigger question: how do we do this? @cinny.bun.how chimed in to help with a pointer and @notnite.com helped me understand it:

i think u need to rm the from the pds db altogether which is pretty involved db surgery (the pds impl can only really deactivate and tombstone accounts afaik) and then re-migrate (you can grab known-good repos from the source pds right? and then importRepo again via atpairport i guess)

so u want to delete rows from tables referencing the did from repo_root, account, actor, and the dirs ./data/actors/*/<did>, ./data/blocks/<did>

would definitely take a full backup of the pds data dir beforehand in case something goes wrong

that made me inspect how the PDS is structured and how we could do this with minimal disruption. the main thing here is that I DO NOT want to delete the account through the normal APIs: that would trigger a delete at the identity/plc level, which would stream that identity deletion through the entire network, possibly bricking the account. there may be ways of recovering from that, but I did not want to get near them

what bun's suggesting here is to trick my PDS into thinking the account never really existed in the first place, even though the PLC has the identity as hosted by my PDS. this would then let us re-create the account and re-import the data, fixing the alt.

but then I started to realize: it was only the repository that was completely overwritten by the wrong CAR file. if we could re-import it without causing disruption to how the PDS behaves with the outside world, it would be perfect: no account deletion OR running Airport again required. so I started digging. IMPORTANT NOTE: I was running ghcr.io/bluesky-social/pds:0.4.136, so it's possible the PDS data structures differ on other versions

the architecture of the official PDS implementation by bluesky heavily relies on sqlite (good!) and keeps per-account sqlite databases for scalability reasons (good!). it's structured a bit like this (i keep my blobs inside the data dir, but you can put them elsewhere):

/o/pds# ll data
total 9.0M
-rw-r--r--. 1 root root 216K Jul 24 00:39 account.sqlite
drwxr-xr-x. 9 root root 4.0K Jul 24 19:53 actors/
drwxr-xr-x. 9 root root 4.0K Jul 24 20:02 blobs/
-rw-r--r--. 1 root root  36K Jul 24 00:36 did_cache.sqlite
-rw-r--r--. 1 root root 3.9M Jul 24 00:36 sequencer.sqlite

the account.sqlite file contains the important tables mentioned by @cinny.bun.how:

sqlite> .tables
account                device                 kysely_migration_lock
account_device         device_account         refresh_token
actor                  email_token            repo_root
app_password           invite_code            token
authorization_request  invite_code_use        used_refresh_token
authorized_client      kysely_migration

one interesting thing is that most of these tables are not actually related to atproto record data; most of them cover high-level operations only relevant to the PDS (authentication, oauth, accounts, etc). repo_root, however, IS related to the record data (in this case, the repository)! its structure is very simple:

sqlite> .schema repo_root
CREATE TABLE IF NOT EXISTS "repo_root" ("did" varchar primary key, "cid" varchar not null, "rev" varchar not null, "indexedAt" varchar not null);

and since it's just one row per user, the rows are also simple:

did:plc:ghmhveudel6es5chzycsi2hi|bafyreic3r44uhvlho6rv4fs4x5p6wf3kvtu2wm56aszfarjkjx5fpciqqa|3luqlbtdr422y|2024-10-29T01:18:53.243Z

if I wanted to trick my PDS into thinking the account has no repository, I would definitely have to edit this table.
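on a throwaway copy of account.sqlite, the row-level half of the surgery bun described could look like this (forget_account is a hypothetical helper of mine; the ./data/actors and ./data/blocks directories still need removing separately, the PDS should be stopped, and you want a full backup of the data dir first):

```python
import sqlite3

def forget_account(db_path: str, did: str) -> None:
    """Remove every row referencing `did` from a *copy* of account.sqlite,
    so the PDS thinks the account never existed. Sketch only: run against
    a backup, with the PDS stopped; the on-disk actor/blob dirs for the
    did still have to be removed by hand."""
    con = sqlite3.connect(db_path)
    with con:  # one transaction for all three deletes
        con.execute("DELETE FROM repo_root WHERE did = ?", (did,))
        con.execute("DELETE FROM account WHERE did = ?", (did,))
        con.execute("DELETE FROM actor WHERE did = ?", (did,))
    con.close()
```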

did_cache.sqlite, as the name says, contains a cache of the DID document for a given DID identifier (currently in atproto there's did:plc and did:web, but I won't go too deep into that). since the broken account was already migrated, we don't need to edit anything there

sequencer.sqlite's structure is more interesting: it has only a repo_seq table:

sqlite> select * from repo_seq limit 10;
1|did:plc:ghmhveudel6es5chzycsi2hi|identity|cdidx did:plc:ghmhveudel6es5chzycsi2hifhandleuluna.pds.bsky.ln4.net|0|2024-10-29T01:18:53.245Z
2|did:plc:ghmhveudel6es5chzycsi2hi|account|cdidx did:plc:ghmhveudel6es5chzycsi2hifactive|0|2024-10-29T01:18:53.275Z

and to me that looks like the "data stream" that the PDS exposes to the world in atproto! this is important, because editing it directly would "change the world" outside of the PDS. I was able to get a hold of @cinny.bun.how through a contact, and bun said to not touch or remove sequence data. so I didn't.

that completes the "global" PDS sql data, but more important is the actors/ folder. inside each account's directory you can see two files, a key and a store.sqlite:

/o/pds# ll data/actors/e5/did:plc:fme5yoqbc4vwdrex7qh6qb7w/
total 104K
-rw-r--r--. 1 root root   32 Jul 24 19:53 key
-rw-r--r--. 1 root root 100K Jul 24 20:21 store.sqlite

from what I understood about atproto and records being authenticated, I figured the key is important to keep stable and so shouldn't be edited, but store.sqlite contains tables that very much look like a "deserialized" version of the repository. you don't see a CAR file, because CAR files are just a way to send data around, not the actual storage (storage requires more than just the raw data, since you want indices for fast lookups, etc):

sqlite> .tables
account_pref           kysely_migration       record_blob
backlink               kysely_migration_lock  repo_block
blob                   record                 repo_root

that (and @cinny.bun.how's comment) suggested that I should delete the entire actor database, which makes sense to me! so a plan materialized in my mind:

  1. take a full backup of the PDS data dir
  2. remove the broken account's repo_root row from account.sqlite
  3. delete the account's actor store (the store.sqlite under data/actors/)
  4. re-import the known-good CAR file from the bluesky PDS

we proceeded to run with the reimport, but that failed:

pds-1  | {"level":50,"time":1753317966334,"pid":8,"hostname":"pleroomba","name":"xrpc-server","err":{"type":"Error","message":"ENOENT: no such file or directory, open '/opt/pds/data/actors/5a/did:plc:g.../key'","stack":"Error: ENOENT: no such file or directory, open '/opt/pds/data/actors/5a/did:plc:g.../key'","errno":-2,"code":"ENOENT","syscall":"open","path":"/opt/pds/data/actors/5a/did:plc:g.../key"},"msg":"unhandled exception in xrpc method com.atproto.repo.importRepo"}

that suggested to me that something in the PDS expected the repository and its keypair to exist, so I put in the work to pull the key file out of that "PDS backup", as well as taking a .schema dump out of sqlite and reconstructing an empty store.sqlite that only has the migration-related metadata, so that the PDS doesn't attempt to create tables that already exist. after that was done...
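the store.sqlite reconstruction can be sketched like this (rebuild_empty_store is a hypothetical helper of mine: it replays the schema from the old database but carries over only kysely's migration bookkeeping, so the PDS neither re-runs migrations nor finds any stale record data; verify against your own PDS version before trusting it):

```python
import sqlite3

def rebuild_empty_store(old_store: str, new_store: str) -> None:
    """Recreate store.sqlite with the same schema but no record data,
    keeping only the kysely migration metadata. Sketch only: run against
    copies, with the PDS stopped."""
    src = sqlite3.connect(old_store)
    dst = sqlite3.connect(new_store)
    with dst:
        # replay the schema (tables + indexes) from the old database,
        # skipping sqlite's internal objects
        for (sql,) in src.execute(
            "SELECT sql FROM sqlite_master "
            "WHERE sql IS NOT NULL AND name NOT LIKE 'sqlite_%'"
        ):
            dst.execute(sql)
        # carry over only the migration bookkeeping, nothing else
        for table in ("kysely_migration", "kysely_migration_lock"):
            rows = src.execute(f"SELECT * FROM {table}").fetchall()
            if rows:
                placeholders = ",".join("?" * len(rows[0]))
                dst.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    src.close()
    dst.close()
```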

a discord screenshot of me importing the CAR file, seeing that it works, and being very happy that it worked

but we had a new error! https://pdsls.dev (an incredible tool for inspecting data like this) was reporting the following error on the records we were fetching: Invalid record: signature verification failed. it took us an hour to understand why it failed and how to fix it, going through multiple resources and github repos (many thanks @notnite.com!), writing scripts that say "yep this private key is this public key lol", because we knew the keys all matched between PDS and PLC, but the signature just didn't verify. after more time spent on this, @notnite.com had an idea:

okay uh
very stupid idea
lyna just make a post
[...]
(on luna's PDS)
what I want to see happen is the repository gets updated with a new commit
and that means that it gets resigned by the PDS key

and that worked. the new record was created, and all the previous old records passed verification as well. I have no idea how merkle trees work, or why that worked, but making a new record (post, like, follow, anything) is now part of my migration process.

a discord screenshot of me saying "FUCK ALL OOF THIS" after successfully putting everything together

what's next? #

this whole thing took around 4 hours between the migrations, debugging, and fixing, until we called it done. today (2025-07-24) I successfully migrated another friend out of a bluesky PDS into mine with a manual flow using the goat CLI, which gave me a lot of insight into the entire process, and I can't recommend this guide enough: https://whtwnd.com/bnewbold.net/3l5ii332pf32u

@knotbin.com from the ATP Airport has acknowledged the issue and I can't thank them enough for being cooperative! the overall atproto developer ecosystem is very diverse (which is a whole other article I could write), and as someone that dealt with activitypub, where the non-cooperation of Mastodon on various issues burned me on federated systems, it's all a breath of fresh air, and they're being a part of it! it is important to remember that this could've happened with ANY migration tool, including the official goat tool!

you can see effective updates to this article from @knotbin.com on the issue itself: https://github.com/knotbin/airport/issues/6

there's a PR by @cinny.bun.how adding safeguards to the PDS to prevent a CAR import from going to the wrong account; it's in progress at time of writing: https://github.com/bluesky-social/atproto/pull/4067
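the core idea of such a safeguard is small once the CAR's root commit is decoded, since atproto commit objects carry a did field. a hypothetical sketch of the idea (not the PR's actual implementation; decoding the CAR itself is out of scope here):

```python
def check_import_target(target_did: str, root_commit: dict) -> None:
    """Refuse a CAR import whose repo belongs to a different account.
    `root_commit` is the already-decoded root commit object of the CAR
    file; atproto commit objects carry a `did` field we can compare
    against the account the import was requested for."""
    commit_did = root_commit.get("did")
    if commit_did != target_did:
        raise ValueError(
            f"CAR belongs to {commit_did!r}, refusing to import into {target_did!r}"
        )
```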

very thankful to (in shuffled order):