cloning people for fun

#author_luna #ai #fine_tuning #large_language_models

see petoriagpt for the first edition of this experiment

generated by me with novelai diffusion anime v3, 15 minutes on prompting because i really should pick up a nai subscription so that i stop being so worried about my anlas balance running out

after the "success" of the first model, the chat got the idea to create models that would attempt to speak as a specific person, rather than simply completing the next 5 messages of chat given the last 10

a lot has happened since petoriagpt, but most importantly, tooling to finetune locally became easily usable. the v1 of this project used RunPod and a bunch of hacks to make a full finetune of gpt-j; nowadays even llama.cpp can train lora finetunes


it's quite simple when you break it down:

i trained around 41 models in total, using various ways to clean up and extend small datasets for finetunes, as well as various prompt formats.
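as a rough idea of what "cleaning up" a small dataset means here, this sketch turns a raw chat export into jsonl training rows: dropping bot commands and exact duplicates. the `nick: message` format, the sample data, and the `clean` helper are all illustrative, not the exact pipeline used:

```python
import json

# hypothetical raw log: (author, content) tuples from a chat export
raw = [
    ("luna", "hello"),
    ("bot", "!play despacito"),  # bot traffic, should not be learned
    ("luna", "hello"),           # exact duplicate
    ("friend", "Hi there."),
]

def clean(rows, bots=frozenset({"bot"})):
    seen, out = set(), []
    for author, content in rows:
        # skip bot accounts and empty messages
        if author in bots or not content.strip():
            continue
        # drop exact duplicates, which encourage repetition
        key = (author, content)
        if key in seen:
            continue
        seen.add(key)
        out.append({"text": f"{author}: {content}"})
    return out

for row in clean(raw):
    print(json.dumps(row))
```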

here's the loss graph because it's pretty:

here's the power usage graph because it's pretty (i have no idea how much power i've used in total, and i don't think i care; inferencing the loras over a day or two would use more power than the training itself):

training notes #

this project started in 2023-10, but only really brought any results in the past week, due to a funny bug.

when the project began i was experimenting with various finetunes that could run on my RTX 3060 12GB, from qlora on open_llama_3b to lora+gptq on llama-7b; however, those runs had very unstable loss graphs

4th iteration of my model

i noticed however that some models were converging somewhere:

yet when attempting to put the lora file from axolotl into text-generation-webui for inferencing, i saw very bad results, as if the model wasn't working at all

this stopped me in my tracks for two months, since i didn't know what to do. i went all the way to bumping to 16 epochs with large lora_r and lora_alpha to force the model to overfit on the data, and saw the loss curve go all the way down to 0...
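for reference, those knobs live in the axolotl yml. a sketch of the overfit-on-purpose config (the values here are illustrative, not my exact files):

```yaml
base_model: openlm-research/open_llama_3b_v2
adapter: qlora
lora_r: 256        # bumped way up to force overfitting
lora_alpha: 512
lora_dropout: 0.0
num_epochs: 16     # far more than needed for a real run
micro_batch_size: 2
```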

and yet the model didn't work. until i had one hail mary and ran the lora with accelerate launch -m axolotl.cli.inference. it worked, yet text-generation-webui didn't. why is that?

so i attempted to write a small script that uses hf transformers and peft to run the model. it didn't work either.

so i began copying code from axolotl into the small script, mirroring what axolotl.cli.inference did, even though many of those pieces are only relevant for training:

until i print()ed the arguments to model instantiation.

so, turns out that openlm released a v2 of their model.


i was accidentally training on v2 (because axolotl updated their yml files to use it) and inferencing on v1 (because i downloaded open llama 3b when it was first released, to test it out with, before this project even started). axolotl.cli.inference worked because it used the same model i used for training, that's all.
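a sanity check that would have caught this: peft writes the base model it was trained against into the lora's adapter_config.json, under `base_model_name_or_path`. comparing that against the model loaded at inference time flags the mismatch. a minimal sketch (the helper and the simulated directory are illustrative):

```python
import json
import os
import tempfile

def check_base_model(adapter_dir: str, inference_base: str) -> bool:
    # peft records the training base model in the adapter's config
    with open(os.path.join(adapter_dir, "adapter_config.json")) as f:
        cfg = json.load(f)
    return cfg.get("base_model_name_or_path", "") == inference_base

# simulate the mismatch from this post: trained on v2, inferencing on v1
d = tempfile.mkdtemp()
with open(os.path.join(d, "adapter_config.json"), "w") as f:
    json.dump({"base_model_name_or_path": "openlm-research/open_llama_3b_v2"}, f)

print(check_base_model(d, "openlm-research/open_llama_3b"))     # False: mismatch
print(check_base_model(d, "openlm-research/open_llama_3b_v2"))  # True
```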

i spent some two days banging my head and copy-pasting model code, all for what turned out to be the simplest fix in the books.

i've said it before at various workplaces and i'll say it again: the longer you spend debugging something, the simpler the solution will turn out to be, and the more stupid you'll feel about it.

how to inference #

as mentioned before, i use text-generation-webui for inferencing. since it uses gradio for its ui, it has an "api" (which is what the gradio frontend uses to talk to the server). it's definitely not stable, being tied to the gradio version and the code used to write the lora page. here's my code to set a lora, for example (you can find out how to do it by just opening the network inspector on the lora page):

```python
import json
import logging
import os

import websockets

log = logging.getLogger(__name__)

async def apply_lora(lora_name):
    async with websockets.connect(
        os.environ["OOGA_GRADIO_ENDPOINT"] + "/queue/join"
    ) as ws:
        log.info("applying lora %r", lora_name)
        # join the gradio queue for the lora-apply function
        await ws.send(json.dumps({"fn_index": 292, "session_hash": "asdf"}))
        await ws.send(json.dumps({
            "data": [[lora_name]],
            "event_data": None,
            "fn_index": 292,
            "session_hash": "asdf",
        }))
        while True:
            msg = json.loads(await ws.recv())
            log.debug("received %r", msg)
            typ = msg["msg"]
            if typ == "process_starts":
                log.info("applying...")
            elif typ == "process_generating":
                log.info("applying......")
            elif typ == "process_completed":
                log.info("applied!")
                break
            else:
                log.info("unexpected event %r, ignoring", typ)
```

it doesn't take too long to load. most of the time is spent fetching the messages from discord to assemble model input, which also has its tricks to make model output better, such as splicing in chat history from another "good" chat (one that isn't overrun by model output), since repetition is a large problem with these models.
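the splicing trick can be sketched like this: prepend recent lines from a "good" donor chat to the target chat's recent messages, so the prompt isn't dominated by the model's own past output. the `nick: message` format, the function, and the line budget are all illustrative:

```python
def build_prompt(target_history, donor_history, max_lines=20):
    # take the tail of each history, donor lines first, so the model
    # sees varied human writing before the chat it has to continue
    half = max_lines // 2
    spliced = donor_history[-half:] + target_history[-half:]
    return "\n".join(spliced)

donor = ["alice: lol", "bob: that movie was great"]
target = ["carol: hi", "carol-PL: hi hi hi", "dave: anyone here?"]
print(build_prompt(target, donor))
```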

i attempted to use the "multi-lora" inferencing solutions such as punica or lorax, since those could let me run inference on all the models and, say, reply with the one that assigns the highest token probability to its own name, compared to the rest (at the moment the choice is random or fixed), however
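the selection rule itself is simple: given each model's log-probability of emitting its own nickname as the next token (which a multi-lora server could score in one batch), reply as the most "eager" persona. the numbers and names here are made up:

```python
# hypothetical per-model log-probabilities of each model emitting
# its own nickname next, as scored by a multi-lora server
name_logprobs = {"alice": -2.3, "bob": -0.4, "carol": -5.1}

def pick_speaker(logprobs):
    # higher log-probability = more likely to "want" to speak;
    # this replaces picking a persona at random
    return max(logprobs, key=logprobs.get)

print(pick_speaker(name_logprobs))  # bob
```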

results #

a lot of people enjoyed the fucked up shit the models said, and the models faithfully copied the writing style of the people i trained them on (some are rapid-fire, all-lowercase zoomers; others Capitalize the first letter of their sentences and use "I" instead of "i"; etc)

the models weren't "instruction-tuned", but you could use the translation capabilities of the base model to invoke a "translator" out of latent space, which is what happened with a finetune of a friend of ours, who speaks english and japanese (-PL is the suffix for model output):

she attempted to get the other models to produce japanese translations, but without success. a pretty interesting result

basically, the project is a success, and most future advances in the quality of the models boil down to possible work on dataset curation. i tried to use lilac for some of it, but only after installing it did i find out i needed to install 30 other things; i got too tired and just prayed that bumping num_epochs for small datasets would work.