the Donut algorithm

#author_luna #bluesky #atproto #recommendation_algorithms

%at=2025-09-01T23:05:27.151Z

this is an article related to nagare, my general-purpose bluesky feed side-project, previously described in discover feed fever dream. in this article I talk about what I've been working on for the past month: trying to solve the "low following" problem. from the previous article:

in this graph, i plot the amount of follows a user has against the amount of problems they're seeing in their basic likes-from-follows feed. at the low end of the spectrum (10 to 100 follows), the problems are shaped like "not enough activity, feed becomes stale", but at higher follow counts like 1000+ the problems are differently shaped, mostly related to having so much activity i don't think i would be able to hold storage for that type of tester.

I was able to find a solution to the "high following" problem and got two testers who can validate the feed's effectiveness. I myself am on the "low following" part of the graph (as are some others), so I want to find systems that can target both groups and bring them toward the "center" of the graph.

for high following accounts (as in, they follow a bunch of people) it was "simple":

the specifics of the function are related to how various parts of nagare do scoring (for posts, specifically), so going too deep into that wouldn't be useful to others (especially because those models aren't even finished), but the main idea is:

the low following problem #

genuine brainworms on it. I have multiple notes, some made in VR, many more made in my notebook, on how to solve and implement this. at the core of all these ideas is building something that can extrapolate the "aware social graph" from what a given user likes. in general what I've seen from Twitter's algorithmic behavior is that even a singular like spawns quite "drastic" changes to how the feed personalization is done, and I wanted to implement that because just treating "likes-from-follows" as-is is not enough (as shown in the graph above!)

various ideas, roughly in the order I had them:

and then kara joined the tester group and heard of my woes. the last one, "spidercluster", is the one I got the most progress on, even downloading data and managing to plot out... something that looks like the output of a clustering algorithm (NOTE: this screenshot is from experimental work):

upon hearing my woes, his words were:

my intuition (read:unsubstantiated guess) is that you'd get good results from sampling a cluster from the outside, under the assumptions that:

so you'd get small clusters of creators who may or may not engage with each other, surrounded by a large ring of similar people who like those creators "frequently"

idk how your graphs are built so I can't make something up for those, but for this (in case it's similar) would probably be something like this:

when feed user likes post:
  // identifying cluster via sampling
  create a histogram
  get users for all likes of post
  for each user:
    get latest N likes
    add authors of those posts to histogram
  select top P% users from histogram
  // post selection from cluster, do what you want, just making something up
  select latest M posts from each
  shuffle those into the feed
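
to make kara's sampling step a bit more concrete, here's a rough Python sketch of the pseudocode above. this is not nagare's actual code: fetch_likers_of_post and fetch_recent_likes are hypothetical stand-ins for whatever scrape/API layer you have.

from collections import Counter

def sample_cluster(post_uri, fetch_likers_of_post, fetch_recent_likes,
                   n_likes=20, top_fraction=0.05):
    # histogram of "who authored the things that co-likers of this post like"
    histogram = Counter()
    for liker_did in fetch_likers_of_post(post_uri):
        # latest N likes from each user that liked the post the feed user liked
        for liked in fetch_recent_likes(liker_did, limit=n_likes):
            histogram[liked["author_did"]] += 1
    # keep the top P% most co-liked authors as the sampled cluster
    cutoff = max(1, int(len(histogram) * top_fraction))
    return [did for did, _count in histogram.most_common(cutoff)]

the post selection step (latest M posts from each of those authors, shuffled into the feed) would then run on whatever this returns.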

after a couple of days he got a proof of concept working: see source here. I saw that something there worked, so I proceeded to implement it into the nagare codebase.

I expect the graph to look basically like this, ideally (lines are likes)

and since that diagram kind of looks like a donut with the outer ring being everyone that likes stuff, the Donut Algorithm was properly titled and created.


The Donut Algorithm #

at a high level, this is how it works. from a given seed user DID (the feed user is the seed user in my case, but I could run this on other seed users):

it feels simple in description, but its implementation took weeks. the main problem with this algorithm as-is is that I need to make it realtime-aware! users change their tastes over time and like different things, which means the candidate user sets the algorithm ends up with may change over time, and I would prefer not to continuously re-scrape various servers every time one of my testers likes something. for this purpose there's the firehose, which I can attach to after scraping, but that adds another issue that relates back to the previous article: storage use.
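
as a sketch of what the realtime side could look like: the consumer below uses jetstream (the JSON mirror of the firehose) instead of the raw com.atproto.sync.subscribeRepos stream, because it can filter by collection server-side and needs no CAR/CBOR decoding. the endpoint and message shape are my assumptions from jetstream's documentation, and TRACKED_DIDS stands in for the candidate user sets the donut algorithm produces.

import asyncio
import json

import websockets

TRACKED_DIDS = {"did:plc:example1", "did:plc:example2"}  # hypothetical candidate set

async def watch_likes():
    # ask jetstream for like records only
    url = ("wss://jetstream2.us-east.bsky.network/subscribe"
           "?wantedCollections=app.bsky.feed.like")
    async with websockets.connect(url) as ws:
        async for raw in ws:
            event = json.loads(raw)
            if event.get("kind") != "commit":
                continue
            commit = event["commit"]
            if commit.get("operation") != "create" or event["did"] not in TRACKED_DIDS:
                continue
            # a fresh like from someone in the candidate sets: update their
            # slice of the stored data instead of re-scraping their PDS
            liked_uri = commit["record"]["subject"]["uri"]
            print(event["did"], "liked", liked_uri)

asyncio.run(watch_likes())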

the feed runs on a tiny VPS. it doesn't have a lot of storage (I want to migrate to another VPS, but the problem would remain). that means the system needs to know which data to evict from the database over time (I'm intentionally not scraping all seed user interactions, nor scraping more than 20 per co-interactor, and I also want to be aware of their tastes changing over time). building a storage layer that can evict data quickly is a problem I have multiple pages in my notebook for.

the solution I've landed on for this is reference counting (refcounting). in the schema for the system that runs the donut algorithm, I have something like this:

CREATE TABLE IF NOT EXISTS "records" (
	"at_uri" TEXT NOT NULL PRIMARY KEY,
	"did" TEXT NOT NULL,
	"collection" TEXT NOT NULL, -- app.bsky.feed.like, app.bsky.feed.post, etc...
	"rkey" TEXT NOT NULL,
	"data" TEXT NOT NULL,
	"created_at" INTEGER,
	"record_type" TEXT
);
CREATE UNIQUE INDEX "records_did_collection_rkey_index" ON "records" ("did", "collection", "rkey");
-- reference counting. FROM at_uri TO ref_uri (at_uri->ref_uri)
CREATE TABLE IF NOT EXISTS "record_refs" (
	"at_uri" TEXT NOT NULL,
	"ref_uri" TEXT NOT NULL,
	PRIMARY KEY ("at_uri","ref_uri")
);
-- keep track of what or who i've scraped
CREATE TABLE IF NOT EXISTS "scrape_state" (
	"scrape_type" TEXT, -- "user" (on pds) or "post" (on constellation)
	"entity" TEXT, -- did:... or at://
	"scraped_at" INTEGER NOT NULL,
	PRIMARY KEY ("scrape_type","entity")
);
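
for illustration, writing a scraped record plus its reference edge into that schema could look roughly like this (a sqlite3 sketch; the function and parameter names are mine, not nagare's):

import json
import sqlite3
import time

def store_scraped_record(db: sqlite3.Connection, at_uri, did, collection, rkey,
                         record, record_type, parent_uri=None):
    # created_at here is just scrape time; the real thing likely stores the
    # record's own timestamp
    db.execute(
        "INSERT OR IGNORE INTO records (at_uri, did, collection, rkey, data, created_at, record_type) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (at_uri, did, collection, rkey, json.dumps(record), int(time.time()), record_type))
    if parent_uri is not None:
        # the edge goes FROM the record that caused the scrape TO the newly
        # scraped record (the direction matters for eviction, see below)
        db.execute(
            "INSERT OR IGNORE INTO record_refs (at_uri, ref_uri) VALUES (?, ?)",
            (parent_uri, at_uri))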

as I scrape records, they get different record_types:

this is done because, since I only keep the top-100 likes from the seed user, once a new L1 record (a like from the seed user) comes in, I want to evict the oldest one and have a guarantee that the entire chain dependent on that oldest L1 gets deleted too! there has to be a clear path from "L1 gets deleted" to "L3 gets deleted if there are no references anymore", and I don't want a solution that requires loading the entire graph in memory just to know what to remove, because the VPS only has 4GB of RAM! (there are other services on that VPS!). here's what the "reference" graph looks like:

I took three different shots at implementing something that works, and each time I had various bugs, most of them data consistency issues (say, U1 referencing L2, which should not be possible) that would cause data degradation over time (deleting too much), and so the sets of users would simply break.

(sidenote: you may think that, since likes reference posts, this should be modeled as L2->P1 rather than P1->L2. it isn't, because the reference direction represents "data dependency from the feed user". doing it the other way around would involve creating special cases for how P1 deletions are handled; keeping everything flowing in a single direction simplifies the code)
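
to make the eviction concrete, here's a minimal sketch of that forward cascade against the schema above (again sqlite3, again not nagare's actual code):

import sqlite3

def delete_record_cascade(db: sqlite3.Connection, at_uri: str) -> None:
    # remember which records this one was keeping alive
    children = [row[0] for row in db.execute(
        "SELECT ref_uri FROM record_refs WHERE at_uri = ?", (at_uri,))]

    # drop the record itself and its outgoing reference edges
    db.execute("DELETE FROM record_refs WHERE at_uri = ?", (at_uri,))
    db.execute("DELETE FROM records WHERE at_uri = ?", (at_uri,))

    # any child with no incoming edges left is no longer reachable from the
    # feed user's data, so it gets deleted too, and so on down the chain
    for child in children:
        (refcount,) = db.execute(
            "SELECT COUNT(*) FROM record_refs WHERE ref_uri = ?", (child,)).fetchone()
        if refcount == 0:
            delete_record_cascade(db, child)

with this, evicting the oldest seed-user like is just "find the oldest L1 row, hand its at_uri to the cascade".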

integration #

from the first idea shown at the start of the article, the "virtual follows" system was not very good at expanding follow sets, but the Donut algorithm can provide something akin to virtual follows by turning it into "likes-from-virtual-follows". the candidate likers become filters for the posts made by the candidate authors, guaranteeing there is "social proof" that a specific post can be recommended, instead of recommending everything-all-at-once.

once the two sets are made (and you can resync them every time the seed user likes something, though at least in my implementation I need to keep tracking everyone for their L3 records), I can send them over to another system whose job is to keep track of those likes. it's not true "likes-from-virtual-follows" because I'm not taking all of their likes, only the ones made on posts by the candidate authors set.
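
a minimal sketch of that filter (names are mine): a like only counts if the liker is in the candidate likers set and the liked post was authored by someone in the candidate authors set.

from collections import defaultdict

def collect_social_proof(like_events, candidate_likers, candidate_authors):
    # like_events: iterable of (liker_did, post_uri, post_author_did)
    proof = defaultdict(int)
    for liker_did, post_uri, post_author_did in like_events:
        if liker_did not in candidate_likers:
            continue  # not one of the "virtual follows"
        if post_author_did not in candidate_authors:
            continue  # only posts from the cluster's creators count
        proof[post_uri] += 1
    return proof  # posts with proof > 0 have "social proof" and are eligible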

did it work? #

for the most part, tester reception has been okay (subjectively). part of my work was also adding A/B testing capabilities so testers can feel what the new system does and decide if they prefer it over pure likes-from-follows. it wasn't a unanimous dislike (like it was with the first iteration of virtual follows), which is progress to me!

there are various other things to work on before finalizing it, but I'm happy with it overall, and I should probably refactor it for a fourth time because I think I still have some data consistency bugs. areas of improvement:


in the end, a good general-purpose feed is hard. absurdly hard. every single issue or feature spawns more issues or features, in a fractal of Things that a feed should be aware of. a feed's interface is simple, just get some posts and give them to the user, right? right????