death by a million geese bites
in which one rando on the internet designs a general purpose feed

#author_luna #recommendation_algorithms #bluesky
this is an article about nagare. for context, check out the previous ones:
it's been a while! things have been busy outside of the project, but I think writing this article helps me keep track of what I did in the past two months.
there are two main projects I've worked on:
- re-ranking
- scoring
 
the recommendation pipeline #
nagare is heavily inspired by the twitter algorithm, though of course not with the same components, since the goal of nagare is to be extremely low-cost. this article provides a good starting diagram for understanding how recommenders CAN work: https://medium.com/understanding-recommenders/how-platform-recommenders-work-15e260d9a15a. from it you can get a pipeline:

if you had to assign credit to each component on the pipeline, here's how it goes:
- all items are held by the thousands of AT protocol PDSes
- moderation is done by Bluesky PBC via the moderator services. I did some digging when I was exploring the protocol here: moderators
- candidate generation: nagare
- ranking: nagare
- re-ranking: nagare
- showing posts to users: a mixture of nagare and Bluesky PBC (via the actual bluesky app client and servers running https://bsky.app)
 
inside nagare, the simplified architecture looks like this (per user):

the Agent is a process that listens to every firehose message, filters out invalid posts via the likes-from-follows filter (see previous articles for other ways to do candidate generation), and runs the primary scoring models (ranking and scoring are interchangeable terms here; I'll use "scoring").
each post that passes the filters from the Agent gets scored and submitted into the post heap which is just a huge sqlite table, simplified schema:
```sql
CREATE TABLE posts(
  at_uri TEXT PRIMARY KEY,
  score DECIMAL,
  time_decay_modifier DECIMAL
);
CREATE INDEX "posts_score_time_decay_index" ON "posts" (score * time_decay_modifier);
```
once a user asks for posts via the app.bsky.feed.getFeed XRPC (in the bluesky app), the Bluesky app server requests posts from nagare via the app.bsky.feed.getFeedSkeleton XRPC. more information on the AT protocol side here
to serve that feed skeleton, nagare creates a virtual component called a View. that View fetches posts from the heap, applies some basic re-ranking methods (explained later), and offers the post URIs back to the bluesky app server, which then hydrates the URIs into fully fledged posts to show in the UI.
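to make that concrete, here's a minimal sketch (not nagare's actual code) of what a View's heap fetch could look like against the simplified schema above, assuming a plain sqlite3 connection; the function name is made up:

```python
import sqlite3

def fetch_heap_page(db: sqlite3.Connection, limit: int = 50):
    # pull the highest-ranked posts, ordered by score times its time decay
    # (the same expression the index in the schema covers)
    return db.execute(
        """
        SELECT at_uri, score * time_decay_modifier AS effective_score
        FROM posts
        ORDER BY effective_score DESC
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
```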
after tackling candidate generation in the Donut algorithm, I decided my next step was to experiment with the later components of the pipeline, as those were created in the early days of the project at prototype quality, just to prove I could get everything running.
re-ranking #
the job of the re-ranker is simple:
- fetch 50 or so posts from the heap, sorted by the highest score (multiplied by some time decay)
- apply some filtering so that there are some basic guarantees about how the feed looks
- keep going
 
I had to create this very early on because of one problem that was immediate: too many posts from a single user. say someone goes on a like spree and likes 100 posts from one account. with the current algorithm, all 100 posts would be submitted into the heap (with varying scores, but let's assume they all score highest) and you (as the user) would have to scroll through 100 posts to see anything else. that's not very good UX.
in turn I had to write a very simple policy to keep the feed under control: max_per_author, which as the name implies just limits the number of posts each author can have in every page of 50 posts. it looks simple enough, but one important side-effect is that the bluesky client can request 50 posts (via the limit parameter), nagare fetches 50 posts from the heap, and due to the re-ranking policies it returns fewer than 50. while the bluesky API allows feeds to return fewer posts than were requested and the bluesky client processes that without issue, it leads to weird behavior in the UI (and weird behavior in my mind: if you as an API client asked for 50 and I have way more than 50 lying around, I sure as hell want to provide 50).
so, if you're doing anything like this, make sure that when you can't get the number of posts the client asked for, you keep fetching from whatever data store you have beyond the initial amount. in the case of nagare the main data store for this (the post heap) is literally in the same process as everything else, so I can just keep fetching from it without any issue (I call this "page extension" but you can call it whatever else), with some upper bound (for example 100 posts) before actually giving up and returning fewer than the wanted amount.
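here's a minimal sketch of how max_per_author plus page extension could fit together. it's illustrative only: the fetch_candidates(offset, limit) helper, the post.author attribute and the exact cap values are assumptions, not nagare's real code. the max_per_reply_root policy described next works the same way, just keyed on a canonical URI instead of the author.

```python
from collections import Counter

MAX_PER_AUTHOR = 3    # same cap as the v1 max_per_author policy
HARD_FETCH_CAP = 100  # upper bound before giving up and returning a short page

def build_page(fetch_candidates, limit: int = 50) -> list:
    page = []
    per_author = Counter()
    offset = 0
    while len(page) < limit and offset < HARD_FETCH_CAP:
        # "page extension": keep pulling candidates past the initial batch
        batch = fetch_candidates(offset=offset, limit=limit)
        if not batch:
            break
        offset += len(batch)
        for post in batch:
            # the max_per_author policy: cap how often each author shows up per page
            if per_author[post.author] >= MAX_PER_AUTHOR:
                continue
            per_author[post.author] += 1
            page.append(post)
            if len(page) >= limit:
                break
    return page
```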
there are other policies that I made, like max_per_reply_root, which assigns a "canonical post URI" to each post (for example the root of a thread, or the quote post it's embedding) to also prevent these kinds of collapse. there isn't anything super technical to say about these... and I thought that was kind of the problem. these algorithms are way too simple to just work... right? could I design something better?
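for completeness, a rough sketch of how that canonical post URI could be derived; the reply_root and quoted_uri field names are my guesses, not nagare's actual schema:

```python
def canonical_uri(post) -> str:
    # replies collapse into their thread root, quote posts collapse into
    # whatever they quote, everything else is its own canonical post
    if getattr(post, "reply_root", None):
        return post.reply_root
    if getattr(post, "quoted_uri", None):
        return post.quoted_uri
    return post.at_uri
```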
from the Donut algorithm I mentioned something like this on the last part of the article:
- tweak how the feed attempts to decrease "reader stamina exhaustion"
 
- say, someone sees 3 posts from the same person in rapid succession (because the authorship of a post is quite a strong signal in my post score model)
 - having that go over and over for various other users kills reader stamina
 - depending on the type of poster (no callout this time) you can reach those "failure modes" of how the feed works
 - I would prefer if users enjoyed using the feed. so I'll have to find a better way to prevent the feed from getting stale
 
I attempted to find a way to model "stamina" that I wanted to be more mathematically grounded than raw max_per_author, by doing the following in python pseudocode:
```python
stamina = 1.0
page = []
posts = [post, post, post]  # from the heap

# start the algorithm with something or else it'd crash
page.append(posts.pop(0))

while posts and len(page) < 50:
    staminas_per_candidate = []
    for post in posts:
        staminas_per_candidate.append(compute_stamina(page[-1], stamina, post))
    # find the post that consumes the least amount of "stamina"
    # (compute_stamina returns the remaining stamina, so the highest value wins)
    best_post_index = max(enumerate(staminas_per_candidate), key=lambda x: x[1])[0]
    stamina = staminas_per_candidate[best_post_index]
    page.append(posts.pop(best_post_index))  # consume the post and continue
```
then the definition of how "stamina" actually decreases, something like this:
```python
def compute_stamina(current_post, stamina, post):
    cost = 0.1
    if post.author == current_post.author:
        cost += 0.3
    if post.reply_root == current_post.reply_root:
        cost += 0.2
    return stamina - cost
```
I wanted to do this because whatever I had before wasn't enough, so I ran some A/B tests over the course of 7 days, with a tester sample size of 16 and a total vote count of 190. for labeling:
- v1 is the current system (max_per_author=3 and max_per_reply_root=3 policies)
- v2 is random shuffling of the initial 50 posts
- v3 is actual stamina modeling
 

well... that didn't work as it should. multiple A/B tests were run, plus overhauls of how A/B testing works (instead of just going by manual vibes I made a whole separate tool to help people vote between two separate feeds). the actual rounds I did were the following:
- v1 vs v2 (part 1)
- v3 vs v1 (part 2)
- v1 vs v3 (part 3)
- v2 vs v3 (part 4)
there are things I could try to do:
- @twd.moe helpfully analyzed the feeds and noticed the stamina model reduced the number of unique authors in the feed by half. how the hell?
 - 50% fewer unique authors! wow!
 
- @ackwell.au suggested to also add "who liked the post" as a signal in the stamina function to increase diversity. doing it inside the stamina function (which "successfully fails", compared to a direct filter which won't let posts pass) should be better than a hypothetical max_by_liker policy, because there are accounts with low following counts which don't have a lot of likers to begin with; adding that as a filter policy would break the feed constantly and require page extension every time the user requests the feed.
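for the record, here's a speculative sketch of what folding that suggestion into the stamina function might look like; nagare doesn't do this (yet), and the post.liker attribute is an assumption:

```python
def compute_stamina_with_liker(current_post, stamina, post):
    cost = 0.1
    if post.author == current_post.author:
        cost += 0.3
    if post.reply_root == current_post.reply_root:
        cost += 0.2
    # ackwell's suggestion: seeing the same liker over and over also tires
    # the reader out, but only as a soft cost instead of a hard filter
    if post.liker == current_post.liker:
        cost += 0.1
    return stamina - cost
```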
but what I decided on was to work on the scoring model. in general, the re-ranker only works if there are good scores to begin with, and the current model I had was not good.
scoring #
like the re-ranking story, the scoring model started in the same way: a simple design to make something that works (because I wanted the ability to simulate Show Less). it's hard to make something that "works" because the problem with a general scoring model that learns with the user is that the user's signals are just Like or Dislike, while the signals nagare has to work with are:
- who you follow
- who liked the post
- who made the post
- quote reposts
- etc.
 
this all turns into an asymmetry problem that's shaped like this:

for non-general feeds (say, those made by people using skyfeed or graze), the work of content curation is left to an actual sentient being that constantly tweaks what gets scored higher or lower (or removes accounts, etc). nagare's initial goals are to be general-purpose and personalized, which means everything I make should be as hands-off as possible from my side (save for actual bugs in the mechanisms themselves).
what I did for starters was something like this:
- to initialize, fetch the first 100 likes from the user
- for each post that needs to interact with the scoring model, split it into 3 components:
 - author id
 - liker id (who liked it, AKA one of your Real Follows or Virtual Follows (see the Donut algorithm for virtual follows))
 - words (split by whitespace, normalize to lowercase, then apply stop word filtering)
- when a post is liked, for each of those components (or keys), increase its value by a fixed amount. for example:
```python
authors[author_id] += 0.1
likers[liker_id] += 0.1
for word in words:
    word_scores[word] += 0.1
```
- when a post is disliked (via Reports or via the Show Less UI which was opened to custom feeds a while ago), do the same thing but decrease the values
- when a post needs to be scored, fetch from all three and do some basic weighting: raw_score = 10 * author_score + 5 * liker_score + word_score
- map that into the 0-1 range, with 0.5 being the middle point, using a sigmoid function: normalized_score = 1 / (1 + math.exp(-raw_score / 5))
- done
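put together, the whole v1 "model" is just a handful of dictionaries. here's a condensed sketch of the steps above (not nagare's actual code; the stop word list, the attribute names, and treating word_score as a sum over the post's words are simplifications on my part):

```python
import math
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to"}  # placeholder list

authors = defaultdict(float)
likers = defaultdict(float)
word_scores = defaultdict(float)

def post_keys(post):
    words = [w for w in post.text.lower().split() if w not in STOP_WORDS]
    return post.author_id, post.liker_id, words

def learn(post, liked: bool):
    # likes push the post's keys up by a fixed amount, Show Less pushes them down
    delta = 0.1 if liked else -0.1
    author_id, liker_id, words = post_keys(post)
    authors[author_id] += delta
    likers[liker_id] += delta
    for word in words:
        word_scores[word] += delta

def score(post) -> float:
    author_id, liker_id, words = post_keys(post)
    raw_score = (
        10 * authors[author_id]
        + 5 * likers[liker_id]
        + sum(word_scores[w] for w in words)
    )
    # squash into the 0-1 range with 0.5 as the neutral midpoint
    return 1 / (1 + math.exp(-raw_score / 5))
```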
 
 
this... works, on a surface level. it does find, over time, accounts you like posts from or which of your follows is a good curator of stuff (by their likes), as well as which words you dislike being mentioned in posts. the issue with this model happens over time though: I haven't included any way for the "like signal" to be evicted out of the model (the model here is just a key-value store, it's nothing else). so over time the "model" has the combined data of thousands of likes with no way for a like made a month or two ago to actually decay down into nothing (because people's tastes change over time!). this leads to author scores being in the hundreds or thousands! completely impossible for a singular Show Less to actually show anything less about that user!
this also leads to other problems mentioned in the Donut algorithm:
- better post scoring model
 
- it's easy to reward hack (requiring things like social proofs from likes into the entire system) with hashtags or just words in general. posts with more words get scored higher!
 - it's hard to represent concepts someone dislikes with it (it does work on downranking users, but it can't reliably model "likes user's posts but only in specific subjects")
 - the model starts with just your top 100 likes, but keeps storing more likes continuously instead of evicting old likes. this contributes to the model becoming "stale" over time, etc
 - this is the part where philpax begs for me to run an embedding model on the entire world. then use cosine similarity. I can't dedicate GPUs to run embedding models, but there could be something to be done here.
 
in the end, I wanted a model with a different design than what I had. I tried making different versions of the same algorithm with different weightings, or with different ways to split words (like the RAKE algorithm), but nothing really worked and the ranking looked no different from noise:

the second thing I wanted to do was a proper v2 model built on embeddings as @philpax.me begs me to. I did some experiments and it turns out that the VPS I'm using can run embeddings at 30 per second! currently nagare processes at peak just 4 posts a second (after the firehose filtering) so we can easily have an embedding service running in the background. running batch jobs to rescore all posts (to add embeddings to begin with) would be more of a challenge, but it's fine. I did some initial tests with the v2 model on my data and saw promising results, so I decided to implement it all for testing.
the v2 model works like this:
- build the "model" (run on every new like or Show Less, or startup too):
- from the last 300 likes, compute embeddings for all of the posts referenced by those likes
 - from all the Show Less made by the user, do the same thing
 - with the like embeddings, average everything together to create the "positive preference vector", a 384-dimensional vector that should encode the direction of things that the user likes
 - same for the Show Less'ed posts, creating the "negative preference vector"
 
- to infer the "model" on a new post, do this:
 - compute the embedding for the post
 - compute the cosine similarity between the post embedding and each preference vector
 - score = (2 + positive_similarity - 2 * negative_similarity) / 3
 - scores should be in the 0-1 range, starting at 0.5
 - it's an intentional choice to weight the distance to the negative preference more heavily; it makes the model much more sensitive to what the user dislikes (a code sketch of the whole flow is below)
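to make the flow concrete, here's a minimal numpy sketch of the v2 model as described above. embed() stands in for whatever 384-dimensional embedding model runs on the VPS, and the final clamp is my addition to keep scores inside the stated 0-1 range:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """placeholder for the real embedding model (returns a 384-dim vector)"""
    raise NotImplementedError

def preference_vector(texts: list[str]) -> np.ndarray:
    # average the embeddings of every post in the set (likes or Show Less)
    return np.mean([embed(t) for t in texts], axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_post(text: str, positive: np.ndarray, negative: np.ndarray) -> float:
    post_vec = embed(text)
    positive_similarity = cosine(post_vec, positive)
    negative_similarity = cosine(post_vec, negative)
    score = (2 + positive_similarity - 2 * negative_similarity) / 3
    # clamp to the 0-1 range; the raw formula can drift slightly outside it
    return float(np.clip(score, 0.0, 1.0))

# building the "model":
# positive = preference_vector(last_300_liked_post_texts)
# negative = preference_vector(show_less_post_texts)
```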
 
 
well, does that work? I don't know yet! the A/B tests for this have been running for a while, and while it works in some areas, it doesn't work in others.
- it's still possible to reward hack on hashtags because I don't really do much filtering on the input text
- for example you can't just lowercase text: "Turkey" (country) vs "turkey" (bird)
 
 - the model is unaware of authors and likers, this is both a good and a bad thing.
- for specific accounts like @chiitan.love, which exhibit extremely erratic/automated like behavior (more here), it does help. I call this "chiitan-resistance" (because I had two cases where someone had feed issues and we debugged it down to chiitan's likes)
 - but it doesn't help to "add yuri in the feed" (a genuine thing that I've been trying to tackle -- how do you even find all the cute yuri art in the first place?!???!?), as the model only works in the text dimension, not images
- should I use CLIP for images? that sounds computationally expensive... etc
 
 - what other signals could I add?
 
 - it doesn't work at all for accounts with low following counts, for those users it's more of a shuffle of the same things rather than actually providing a meaningful change
- the Donut algorithm's target was expanding the follow set to help on the low-following-count problem, so a scoring model being mostly irrelevant to those accounts makes perfect sense if the Donut algorithm wasn't able to find many candidate authors or active candidate likers
 
 - sometimes all the content that exists is about the Main Character of the Day which some testers just don't care about and would not want to see, but since that's everything nagare knows (due to candidate filtering), that's all it can score (it's the same shape as the aforementioned low-following-count problem)
 
I didn't want to wait for the scoring model experiments to finish before writing the article since I don't know how long those will take (and what other models I could make)
future of the project #
nothing is easy, everything is hard. I'm sure projects/products like Skyfeed and Graze have their own sorts of issues trying to run what amounts to hundreds/thousands of VMs (in their own language) that filter posts, apply logic, run models, etc, all at massive scale. I'm not here to degrade their services by saying nagare is better (hell, nothing that I've done has been proven to work at the hundred- or thousand-user mark), but it is orthogonal both in value and in the types of problems. the nagare project's purpose is to make something better than bluesky's Discover feed for myself. what I learned is that trying to do that is absurdly hard without the raw engineering (time and money) involved, and I don't even want to think about how the real Discover is architected (probably extremely differently from what I do).
this puts the project in a weird state for its future. currently it's all research, which recently hasn't been going super well, which hits the motivation part of my brain (as well as general inactivity by others, since they have other things in life to do, which is valid, though it puts me in a spot where I don't know if what I'm doing works or not; weird subjective results leave me frustrated), and it will soon cost more money. for the past few months, nagare has been running completely free thanks to upcloud's generosity on a free trial, but the credits run out in january. after that the project will cost 18USD/mo to support the ~20 accounts for the testers, which is not bad, plus I was able to optimize various code paths so I think I can fit even more users onto the single vps.
it does put things in context though: I've put hundreds of my own hours over the past few months into putting something together, and while it works right now and I have testers that are absurdly happy with the feed compared to Discover, are there any major aspirations for the project? right now, the future paths are too risky for me to take.
- run it publicly
- it'd be impossible for me to do this for free. @spacecowboy17.bsky.social has been working on their own For You feed which is completely free, running out of their own house, and that approach could work for me too! I have the compute in my house, but it's a problem of expectations
 - a feed is a Live Service and Live Services imply operations. which also imply angry users when things go down, and making it not free with some hoster somewhere would increase those expectations tenfold. it would become a second job for me which I can't take for my own mental sanity
 - if it's paid, I'd have to create a company, interface with payment providers, do taxes. it's so much.
- doing it via Patreon may help here since any kind of monetization for nagare would have to be a subscription because it's by definition a live service
 
 
 - well, open source it, should be obvious then right?
- open sourcing months of work for free that is extremely valuable to others just doesn't sit right with me given what I've been through. you can call me selfish, but I've had my work resold to others; since then, what I open source are the things that are useless to everyone but me. this is very much the same feeling expressed by Xe Iaso in "Open Source" is Broken:
 - "The existing leech culture of "Open Source" being a pool of free labor makes it hard for me to want to have my side projects be actually useful like that unless you pay me."
 - the article was made in 2021. it's been 4 years. nothing has changed.
 - I'll keep sharing my techniques in public through these articles and posts on my bluesky account, and whoever wants to take the ideas and make their own thing should run off and do it, but it won't be on my codebase ready to go (the codebase at the moment is over 14KLOC)
 
 - open source, but not open source? (B2B edition)
- AKA a "paid support contract" for companies that want to use nagare in a commercial manner.
 - it's unclear if any company would actually be interested
 - I predict it would be less workload than having direct user interactions, so it's a good option, but there's no guarantee that a company wouldn't just... copy it all internally and use it without a care in the world, which leaves this option in doubt
 
 - keep it private
- that's what I started with, and what will keep going moving forward, since all the other options show themselves to be pretty disastrous for me. there will be a time when I call the project done, and once that happens maybe my opinions will change, but until then things stay as-is
 - sometimes I open up tester spots and accept whoever has the right vibes, sometimes it becomes a tight fit on the VPS but it works out!
 
 
see you on the next article, whenever that comes! maybe a functional scoring model will come out of it! maybe the yuri vector will finally be found! the only way is through.