atproto pds migration IN ANGER
#author_luna #atproto #sysadmin-notes #blogpost
the context #
it's not obvious to everyone but I'm on bluesky (same handle as the website!). I've been going through it and exploring its design since around feb/march iirc, with some projects:
- bsky dot (I'll fix it soon I swear)
- nagare: linked here as discover feed fever dream
this also means a subset of my friends are on bluesky, and a subset of those are inside (thunder sfx) The United Kingdom (thunder sfx). Laurens Hof puts the "current happening" related to UK users pretty well in his Bluesky Report:
Bluesky has announced it is rolling out an age verification system in order to comply with the UK’s Online Safety Act. Users in the UK will be asked to verify their age, using a variety of options. Bluesky uses Epic’s Kids Web Service for age verification, which allows users to verify via an ID scan, payment card verification or face scan. If users in the UK choose not to verify their age they can still use the Bluesky app, and only adult content as well as the DMs will be made inaccessible. Bluesky is implementing the system as a compliance with the Online Safety Act, which requires all platforms that contain adult content and can potentially be accessed by children in the UK to implement a “highly effective age assurance” system. This part of the law goes in effect on July 25th, and non-compliance risks a fine of £18 million. Bluesky PBC is implementing this age verification system in their own apps only, and other Bluesky clients have their own responsibility to implement such an age verification system. Other clients have not yet announced to be implementing an age verification system, meaning that users in the UK who do not want to share their information can sidestep this barrier by using another client to access the network.
I'm not from the UK, but I take a pretty strong stance against sharing my personal identity data in that way, and some friends of mine share the same vision. thankfully, the way Bluesky has complied with this is through a request to the user's PDS. while bluesky's official PDSes implement the relevant identity check, users outside of the bluesky PDS can "MITM" the request and just return that the identity has been verified.
quick atproto primer #
there are various ways of explaining how the AT protocol works (I'll have to write a shorter one eventually), but I recommend ATProto for distributed systems engineers.
okay, so, in the AT protocol there are 4 main "roles" in the network. these roles are taken by specific servers:
- identity
- data storage
- data aggregation
- user-facing frontend
identity is done in atproto via the DID scheme, and most users are on the DID PLC scheme. at a VERY high level, the did plc service is a public centralized blockchain. it's not truly a blockchain in the cryptocurrency scheme of things because the chains are per-user, not global. it's centralized for efficiency purposes, but auditable, since all events in the plc are signed and verifiable against public keys. you can look at my own plc data at https://plc.directory/did:plc:ghmhveudel6es5chzycsi2hi, where you can see that the identity document links to a "pds", which is described next
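as a quick illustration, pulling that document apart yourself is one curl away (a sketch assuming you have curl and jq around; the service entry shape is standard across did:plc documents):

# fetch the DID document and extract where the identity says its PDS lives
curl -s https://plc.directory/did:plc:ghmhveudel6es5chzycsi2hi |
  jq '.service[] | select(.type == "AtprotoPersonalDataServer") | .serviceEndpoint'
# prints the PDS endpoint the document currently points at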
data storage in atproto is done via the PDS (Personal Data Server). it holds all the records that you make (authenticated via the plc key; this pile of records is also called a repository, or just repo), as well as those of every other user on the PDS. this lets the PDS have extremely low operational overhead per user (say, a user has a direct cost of only ~20MB of storage total inside the PDS, though someone who really likes making memes may be in the hundreds of megabytes or a gigabyte), and as such there are thousands of them on the internet. you can even see mine! right now it's pds.bsky.ln4.net.
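a sketch of poking at one (com.atproto.server.describeServer is a real, unauthenticated lexicon; what exactly comes back I'm reciting from memory, so treat the comment as approximate):

# any atproto PDS answers this without auth
curl -s https://pds.bsky.ln4.net/xrpc/com.atproto.server.describeServer
# returns, among other things, the handle domains the PDS hands out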
data aggregation in atproto is desirable because, with thousands of PDSes, you need some way to aggregate them all into a computationally efficient entity in the network that you can just plug yourself into and get the... firehose of data (pun intended). in the network that's also called a "relay", and its whole job is discovering new PDSes (either by the PDS asking to be crawled by contacting the relay (it's an env var you set on the PDS), or via accounts referencing each other through likes, posts, etc) and connecting to them to produce a realtime data stream. with this component you would be able to keep a full clone of all public data on the network (but that's not required for the relay anymore! see sync 1.1), and my own vibe-check estimates would put it in the <10TB range as of 2025-07.
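to make both directions of that handshake concrete, here's a sketch (bsky.network is the main relay; PDS_CRAWLERS is the env var on the reference PDS implementation; websocat is just one way to peek at a websocket):

# what the PDS effectively does when PDS_CRAWLERS is set: ask the relay to crawl it
curl -s -X POST https://bsky.network/xrpc/com.atproto.sync.requestCrawl \
  -H "Content-Type: application/json" -d '{"hostname": "pds.bsky.ln4.net"}'
# and what a consumer does: subscribe to the firehose (frames are binary CBOR,
# so this dumps gibberish, but it shows the plumbing)
websocat "wss://bsky.network/xrpc/com.atproto.sync.subscribeRepos?cursor=0"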
user-facing frontend in atproto is required because you need some way to actually interact with the network. atproto is a neutral protocol; bluesky is an application developed on top of the protocol. the identity, pds, and relay do not care what kind of application you're developing, only that there's some authenticated data that has to be shipped in real time to somewhere, so it's synced across everyone else that does care. currently this component is called the "appview" and it's still kind of nebulous what it does. more information and discussion about the role of the appview here.
how does this relate to the UK #
to repeat the intro:
thankfully, the way Bluesky has complied with this is through a request to the user's PDS. while bluesky's official PDSes implement the relevant identity check, users outside of the bluesky PDS can "MITM" the request and just return that the identity has been verified.
the way appviews are supposed to work is that the frontend component makes requests to the user's PDS, which then may proxy the request to another service to fetch more data. this is done because it's your PDS that holds authentication for you (user x password), not the appview! appviews and other components actually validate tokens minted by your PDS (with your identity keys from the PLC).
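a sketch of what that proxying looks like on the wire (the atproto-proxy header and the bluesky appview's service DID are real; the token and DID are placeholders):

# hit YOUR pds; it forwards the call to the service named in atproto-proxy
curl -s "https://pds.bsky.ln4.net/xrpc/app.bsky.actor.getProfile?actor=$DID" \
  -H "Authorization: Bearer $ACCESS_JWT" \
  -H "atproto-proxy: did:web:api.bsky.app#bsky_appview"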
the way this connects together is that when the bluesky frontend detects the user is from the UK (via their own geoip service https://bsky.app/ipcc, which is also used to decide which country-level moderators to enable for the account), the frontend will make a request to <pds>/xrpc/app.bsky.unspecced.getAgeAssuranceState.
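which means a cooperative self-hosted PDS can just answer "all good". a minimal sketch of the exchange, with the response shape assumed from the app.bsky.unspecced.getAgeAssuranceState lexicon (I'm reciting the status values from memory):

# what the frontend asks your PDS
curl -s "https://pds.bsky.ln4.net/xrpc/app.bsky.unspecced.getAgeAssuranceState" \
  -H "Authorization: Bearer $ACCESS_JWT"
# a non-bluesky PDS is free to reply with something like:
# {"lastInitiatedAt": "2025-07-24T00:00:00Z", "status": "assured"}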
you can bypass the check in other ways, since the check is done inside bluesky's own frontend; @mary.my.id made a list. the PDS way is the "most stable" one, since it requires no changes to any clients and no patches to the mobile app. the article also goes into the legality, and into why the way bluesky implemented this is "thankful": they could've done worse, in ways that would heavily challenge the decentralization of the network.
well, let's migrate PDSes then! #
if the bluesky PDS is requiring a government ID check, and we can bypass it by just saying it's fine on a custom PDS, then we should be able to migrate PDSes, right? given that atproto decouples your identity from your data storage, that is very much possible! the way the process works is as follows (a raw-XRPC sketch of the whole flow comes right after the list):
- configure a new account on a target PDS (henceforth the "new PDS")
  - this account contains the "reference" to an existing did:plc:..., so the PDS understands that this account is unfinished and it should not attempt to create a new identity for it.
- transfer your data from the old to the new PDS
  - all user records via the CAR file
  - private account preferences (a json; contains accounts you've muted, labelers you subscribe to, etc)
  - all user blobs (images, videos). since blobs are content-addressed, the "id" of a blob is globally unique/stable
- update your identity to point to the new PDS
- activate the account on the new PDS, deactivate the account on the old PDS
  - this is done because deactivation leads to deletion after a couple of days; deleting immediately wouldn't be welcome
- tell the appview that the identity was updated
  - for bluesky, this is done by logging out then logging back in, since that'll make the appview refetch the did:plc and see that your PDS changed
  - you'll also have to make a new record so that a commit signed by the new key is inserted at the front of the record log
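compressed into the raw XRPC calls involved, the flow looks roughly like this. a sketch, not a runbook: the lexicon names are real, the $VARS are placeholders, and error handling plus the service-auth dance are elided:

# 1. create the "shell" account on the new PDS with your existing did
#    (com.atproto.server.createAccount, authenticated with a service-auth
#    token minted by the old PDS via com.atproto.server.getServiceAuth)
# 2. move the repository:
curl -s "$OLD_PDS/xrpc/com.atproto.sync.getRepo?did=$DID" -o repo.car
curl -s -X POST "$NEW_PDS/xrpc/com.atproto.repo.importRepo" \
  -H "Authorization: Bearer $NEW_TOKEN" \
  -H "Content-Type: application/vnd.ipld.car" --data-binary @repo.car
# 3. re-upload every blob: com.atproto.sync.listBlobs + com.atproto.sync.getBlob
#    on the old side, com.atproto.repo.uploadBlob on the new side
# 4. copy preferences: app.bsky.actor.getPreferences -> putPreferences
# 5. re-point the identity: com.atproto.identity.requestPlcOperationSignature,
#    then signPlcOperation with the emailed token, then submitPlcOperation
# 6. com.atproto.server.activateAccount on the new PDS, then
#    com.atproto.server.deactivateAccount on the old one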
but for a non-technical user that is not very friendly. you can carry out those operations automatically (via an all-in-one "migrate" command) or manually with the goat cli, more info here. but a specific tool caught my eye: the ATP Airport, made by @knotbin.com to make that process more automated/user-friendly for someone who doesn't know what a CAR file is. I've kept it in the back of my mind for a while for the rare occurrence that I would need to migrate PDSes. it would be rare for me because I created my account on my own PDS and have never migrated. but maybe. just in case...
and then the age check rolled out on bluesky for all UK users yesterday, july 24th.
the number of people in my circles that run their own PDS is extremely low, so I was the obvious point of contact to be the target/"new PDS" in a migration process. I hadn't done a migration myself, so I decided to point one of them to the ATP Airport. since we were unsure about the entire process, the plan was:
- create a new account on my PDS (should be simple enough, I made some for myself and nagare, but I needed to test the check bypass and I'm very much not VPN'ing myself to inside the UK, lmao)
- migrate a throwaway account (to be created in bluesky's PDS) with the Airport
- migrate an alt account with the Airport
- migrate a main account with the Airport
step 1 was smooth, sadly we forgot step 2 for the first friend, step 3 was smooth, but step 4 failed with a weird error. then we proceeded to investigate the state of the alt account, and found all sorts of inconsistencies: missing records and signature verification failures.
that's when I started writing an issue to get more eyes on it ASAP: migration failure caused account breakdown, as well as going on bluesky to ask for help. I'm not an atproto expert and needed as much help as I could get, both in understanding what happened and in recovering the account. that same friend suggested giving up on the alt and just creating a new one; I declined, from both a "this should never happen, it is useful for everyone to understand what happened" perspective and a "you are now MY user, I HAVE to fix this 🧎♀️" one.
to make it clear: ATP Airport shouldn't have caused this, and @knotbin.com's approach to it is exactly what should be done: migrations on the Airport were disabled as soon as they found out this happened, and they will come back once the bug has been fixed.
after the relevant issue was made, we started to investigate. we didn't understand how the migration process worked at the time; I gave you the rundown of it above so it can be used as vocabulary here. after some evidence gathering we understood what happened: the Airport uploaded the CAR file, containing all user records, to the wrong account. this caused all records on the previous account to just be gone, and it sent me into a spin over how to recover from this with minimal disruption.
recovering from a CAR accident IN ANGER #
as soon as we noticed this, we ran to get the CAR files that were still available on the bluesky PDS, the ones that weren't overwritten by the Airport, and we were able to get CAR files containing all records for both the alt and main accounts. then came the bigger question: how do we do this? @cinny.bun.how chimed in to help with a pointer and @notnite.com helped me understand it:
i think u need to rm the from the pds db altogether which is pretty involved db surgery (the pds impl can only really deactivate and tombstone accounts afaik) and then re-migrate (you can grab known-good repos from the source pds right? and then importRepo again via atpairport i guess)
so u want to delete rows from tables referencing the did from repo_root, account, actor, and the dirs ./data/actors/*/<did>, ./data/blocks/<did>
would definitely take a full backup of the pds data dir beforehand in case something goes wrong
that made me inspect how the PDS is structured and how we could do this with minimal disruption. the main thing here is that I DO NOT want to delete the account through normal APIs: that's going to trigger a delete at the identity/plc level, which would stream that identity deletion through the entire network, possibly bricking the account. there may be ways of recovering from that, but I did not want to get near them.
what bun's suggesting here is to trick my PDS into thinking the account never really existed in the first place, even though the PLC has the identity as hosted by my PDS. this would then let us re-create the account and re-import the data, fixing the alt.
but that's when I started to realize: it was only the repository that was completely overwritten by the wrong CAR file. if we could re-import it without causing disruption to how the PDS behaves with the outside world, it would be perfect: no account deletion OR running Airport again required. so I started digging. IMPORTANT NOTE: I was running ghcr.io/bluesky-social/pds:0.4.136, so it's possible your PDS data structure changed since then.
the architecture of the official PDS implementation by bluesky heavily relies on sqlite (good!) and keeps per-account sqlite databases for scalability reasons (good!). it's structured a bit like this (I keep my blobs inside the data dir, but you can put them somewhere separate):
/o/pds# ll data
total 9.0M
-rw-r--r--. 1 root root 216K Jul 24 00:39 account.sqlite
drwxr-xr-x. 9 root root 4.0K Jul 24 19:53 actors/
drwxr-xr-x. 9 root root 4.0K Jul 24 20:02 blobs/
-rw-r--r--. 1 root root 36K Jul 24 00:36 did_cache.sqlite
-rw-r--r--. 1 root root 3.9M Jul 24 00:36 sequencer.sqlite
the account.sqlite file contains the important key tables mentioned by @cinny.bun.how:
sqlite> .tables
account device kysely_migration_lock
account_device device_account refresh_token
actor email_token repo_root
app_password invite_code token
authorization_request invite_code_use used_refresh_token
authorized_client kysely_migration
one interesting thing about these tables is that most of them are not actually related to the atproto record data: most handle high-level operations that are only relevant to the PDS (like authentication, oauth, accounts, etc). the exception is repo_root, which IS related to the record data (in this case, the repository)! its structure is very simple:
sqlite> .schema repo_root
CREATE TABLE IF NOT EXISTS "repo_root" ("did" varchar primary key, "cid" varchar not null, "rev" varchar not null, "indexedAt" varchar not null);
and since it's just one row per user, the rows are also simple:
did:plc:ghmhveudel6es5chzycsi2hi|bafyreic3r44uhvlho6rv4fs4x5p6wf3kvtu2wm56aszfarjkjx5fpciqqa|3luqlbtdr422y|2024-10-29T01:18:53.243Z
if I want to trick my PDS into thinking the account has no repository, I definitely would have to edit this table.
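as sqlite3 one-liners, the surgery that the plan below ends up doing would look like this (the did is truncated like everywhere else in this post; take a backup first):

# inspect the row for the broken account before touching anything
sqlite3 data/account.sqlite "SELECT * FROM repo_root WHERE did = 'did:plc:g...';"
# the eventual removal, making the PDS forget the repository exists
sqlite3 data/account.sqlite "DELETE FROM repo_root WHERE did = 'did:plc:g...';"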
the did_cache.sqlite, as its name says, contains a cache of the DID document for a given DID identifier (currently on atproto there's did:plc and did:web, but I won't go too deep into that). since the broken account was already migrated, we didn't need to edit anything there.
the sequencer.sqlite's structure is more interesting: only a repo_seq table:
sqlite> select * from repo_seq limit 10;
1|did:plc:ghmhveudel6es5chzycsi2hi|identity|cdidx did:plc:ghmhveudel6es5chzycsi2hifhandleuluna.pds.bsky.ln4.net|0|2024-10-29T01:18:53.245Z
2|did:plc:ghmhveudel6es5chzycsi2hi|account|cdidx did:plc:ghmhveudel6es5chzycsi2hifactive|0|2024-10-29T01:18:53.275Z
and to me that looks like the "data stream" that the PDS exposes to the world in atproto! this is important, because editing this directly would "change the world" outside of the PDS. I was able to get a hold of @cinny.bun.how through a contact and bun said to not touch or remove sequence data. so I didn't.
that completes the "global" PDS sql data, but more important is the actors/ folder. you can see two files in it, a key and a store.sqlite:
/o/pds# ll data/actors/e5/did:plc:fme5yoqbc4vwdrex7qh6qb7w/
total 104K
-rw-r--r--. 1 root root 32 Jul 24 19:53 key
-rw-r--r--. 1 root root 100K Jul 24 20:21 store.sqlite
from what I understood about atproto, and the fact that records are authenticated, I figured the key file is important to keep stable, so I shouldn't edit it. but store.sqlite contains tables that very much look like a "deserialized" version of the repository. I don't see the CAR file, but that's because CAR files are just a way to send data around, not the actual storage (storage requires more than just the raw data, since you want indices for fast lookup, etc):
sqlite> .tables
account_pref kysely_migration record_blob
backlink kysely_migration_lock repo_block
blob record repo_root
that (and @cinny.bun.how's comment) suggested to me that I should delete the entire actor database, which makes sense to me! so a plan materialized in my mind:
- get CAR files, which we already did
  - NOTE: you can get the CAR file from a deactivated account on the bluesky PDS by reactivating it, then fetching it. I believe that won't cause a PLC update and it'll stay on the new PDS. I can be mistaken, though! (spelled out as raw calls right after this list)
- stop my PDS
- back up my PDS
  - turns out that's around 131MB, that's pretty cheap
- run, in account.sqlite: DELETE FROM repo_root WHERE did = 'did:plc:g...';
- run, in shell:
mkdir broken_alt_repo/
mv data/actors/5a/did:plc:g.../key broken_alt_repo/
mv data/actors/5a/did:plc:g.../store.sqlite broken_alt_repo/
rmdir data/actors/5a/did:plc:g.../
  - (I didn't have to delete the files; moving them outside of PDS access is the equivalent operation to a removal. I didn't want to just rm -rf ..., too dangerous, especially on data surgery like this)
- restart my PDS
- import the CAR file we got from the bluesky PDS on the first steps into my PDS
- see if we need to fix any references to blobs
  - in atproto, records just hold references to the hash of a blob, so the PDS has some cleanup work to remove blobs once nothing references them anymore. we were worried that as soon as the repository got surgically removed, the PDS would remove the relevant blobs from the account
  - that didn't happen! (maybe because I was too fast? not sure!) plus we had the backup, which contained everything, so I could've just loaded that folder back
- ...profit? account is recovered?
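the CAR-grabbing trick from the first step of the plan, spelled out as raw calls (a sketch: the lexicon names are real, the host and token are placeholders):

# reactivate the old account so its repository is fetchable again
curl -s -X POST "https://old-pds.example/xrpc/com.atproto.server.activateAccount" \
  -H "Authorization: Bearer $OLD_TOKEN"
# grab the known-good repository as a CAR file
curl -s "https://old-pds.example/xrpc/com.atproto.sync.getRepo?did=did:plc:g..." \
  -o known-good.car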
we proceeded to run with the reimport, but that failed:
pds-1 | {"level":50,"time":1753317966334,"pid":8,"hostname":"pleroomba","name":"xrpc-server","err":{"type":"Error","message":"ENOENT: no such file or directory, open '/opt/pds/data/actors/5a/did:plc:g.../key'","stack":"Error: ENOENT: no such file or directory, open '/opt/pds/data/actors/5a/did:plc:g.../key'","errno":-2,"code":"ENOENT","syscall":"open","path":"/opt/pds/data/actors/5a/did:plc:g.../key"},"msg":"unhandled exception in xrpc method com.atproto.repo.importRepo"}
that suggested to me that something in the PDS expected the repository and its keypair to exist, so I put in the work of pulling the key file out of that "PDS backup", as well as taking a .schema out of sqlite and reconstructing an empty store.sqlite file that only has the migration-related metadata, so that the PDS doesn't attempt to create tables that already exist. after that was done...
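roughly, that reconstruction looked like this (a from-memory sketch assuming my paths; kysely_migration is the bookkeeping table that the PDS's migration runner checks):

# dump the table definitions (DDL only) from the pre-surgery copy
sqlite3 broken_alt_repo/store.sqlite .schema > schema.sql
# build a fresh, empty store with the same tables
mkdir -p data/actors/5a/did:plc:g.../
sqlite3 data/actors/5a/did:plc:g.../store.sqlite < schema.sql
# carry over the migration bookkeeping rows so the PDS doesn't try to
# re-create tables that already exist
sqlite3 broken_alt_repo/store.sqlite \
  ".mode insert kysely_migration" "SELECT * FROM kysely_migration;" \
  | sqlite3 data/actors/5a/did:plc:g.../store.sqlite
# and put the signing key back where importRepo expects it
cp broken_alt_repo/key data/actors/5a/did:plc:g.../key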
but we had a new error! https://pdsls.dev (an incredible tool for inspecting data like this) was reporting the following error on the records we were fetching: Invalid record: signature verification failed. it took us an hour to understand why that failed and how to fix it, going through multiple resources and github repos (many thanks @notnite.com!), writing scripts that say "yep this private key is this public key lol", because we knew the keys all matched between the PDS and the PLC, but the signature just didn't verify. after more time spent on this, @notnite.com had an idea:
okay uh
very stupid idea
lyna just make a post
[...]
(on luna's PDS)
what I want to see happen is the repository gets updated with a new commit
and that means that it gets resigned by the PDS key
and that worked. the new record was created, and all the previous old records passed verification as well. I have no idea how merkle trees work, or why that worked, but making a new record (post, like, follow, anything) is now part of my migration process.
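for reference, the "just make a post" fix in raw XRPC form (com.atproto.repo.createRecord is the real lexicon; the token, did, and text are placeholders). any record type works, since the point is just forcing the PDS to sign a fresh commit:

curl -s -X POST "https://pds.bsky.ln4.net/xrpc/com.atproto.repo.createRecord" \
  -H "Authorization: Bearer $ACCESS_JWT" -H "Content-Type: application/json" \
  -d '{
    "repo": "did:plc:g...",
    "collection": "app.bsky.feed.post",
    "record": {
      "$type": "app.bsky.feed.post",
      "text": "resigning my repo, do not mind me",
      "createdAt": "2025-07-24T00:00:00Z"
    }
  }'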
what's next? #
this whole thing took around 4 hours, between the migrations, debugging, and fixing, until we called it done. today (2025-07-24), I successfully migrated another friend out of a bluesky PDS and into mine with a manual flow using the goat CLI, which has given me a lot of insight into the entire process, and I can't recommend this guide enough: https://whtwnd.com/bnewbold.net/3l5ii332pf32u
@knotbin.com from the ATP Airport has acknowledged the issue, and I can't thank them enough for being cooperative! the overall atproto developer ecosystem is very diverse (which is a whole other article I could write), and as someone that used to deal with activitypub, where the non-cooperation of Mastodon on various issues burned me on federated systems, this is all a breath of fresh air, and they're being a part of it! it is important to remember that this could've happened with ANY migration tool, including the official goat tool!
you can see effective updates to this article from @knotbin.com on the issue itself: https://github.com/knotbin/airport/issues/6
there's a PR by @cinny.bun.how on adding safeguards to the PDS to prevent any CAR import from going to the wrong account, it's in progress at time of writing: https://github.com/bluesky-social/atproto/pull/4067
very thankful to (in shuffled order):
- @knotbin.com, for making ATP Airport and cooperating on the issue
- @madomagi.bsky.social, for helping on manual flow validation with the goat CLI
- @eule.replika.gay, for being british and patient with me as I was the one that was with her on migration
- @cinny.bun.how, for giving pointers on doing PDS surgery
- @retr0.id, for pointing to the manual migration flow
- @notnite.com, for having way too many fucking contacts and helping me out on sanity-checking my own plans