cloning people for fun
#author_luna #ai #fine_tuning #large_language_models
see petoriagpt for the first edition of this experiment
generated by me with novelai diffusion anime v3; 15 minutes spent on prompting because i really should pick up a nai subscription so that i stop being so worried about my anlas balance running out
after the "success" of the first model, it was within the chat's brain to create models that would attempt to speak as someone, rather than simply completing the next 5 messages from chat, given the last 10
a lot of things have happened since petoriagpt, but most importantly, tooling to finetune locally became easily usable. the v1 of this project used RunPod and a bunch of hacks to make a full finetune of gpt-j; nowadays even llama.cpp can train lora finetunes
credits
- notnite for greenlighting a full scrape of the chat
- then, everyone who consented to me creating datasets for the purpose of cloning them (the datasets are private, so i won't share them)
- lumi for her RTX3090, used for training
- my RTX3060, used for training before the 3090
- as mentioned before, axolotl
- Jensen Huang for making one of the worst pieces of software to exist in the linux ecosystem
- i think i need to use someone's docker images next time
it's quite simple when you break it down:
- gather dataset on a channel where a bunch of people talk together
- create very simple scripts that turn that data into a "focused" dataset
  - e.g. `chat.json` gets turned into `luna.jsonl`, `intel.jsonl`, `cool.jsonl` for each person (see the sketch after this list)
- train qlora on those focused datasets
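a minimal sketch of that split step, with a bunch of assumptions baked in: here `chat.json` is a flat list of `{"author": ..., "content": ...}` objects and each sample is a few messages of context plus the target person's reply, rendered completion-style (the real export format and the formats i actually trained on varied):

```python
# minimal sketch of the chat.json -> per-person jsonl split.
# assumptions: chat.json is a list of {"author": ..., "content": ...} objects,
# and each sample is "context + that person's reply" rendered as plain text.
import json
from collections import defaultdict

CONTEXT = 10  # how many previous messages to keep as context (arbitrary)

with open("chat.json", encoding="utf-8") as f:
    messages = json.load(f)

samples = defaultdict(list)
for i, msg in enumerate(messages):
    history = messages[max(0, i - CONTEXT):i]
    lines = [f'{m["author"]}: {m["content"]}' for m in history]
    lines.append(f'{msg["author"]}: {msg["content"]}')
    samples[msg["author"]].append({"text": "\n".join(lines)})

for author, rows in samples.items():
    with open(f"{author}.jsonl", "w", encoding="utf-8") as out:
        for row in rows:
            out.write(json.dumps(row, ensure_ascii=False) + "\n")
```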
i trained around 41 models in total, using various ways to clean up and extend small datasets for finetunes, as well as various dataset formats.
here's the loss graph because it's pretty:
here's the power usage graph because it's pretty (i have no idea how much power i've used in total, i don't think i care though, inferencing the loras over a day or two would lead to more power used than training itself):
training notes #
this project started in 2023-10, but only really brought any results in the past week, due to a funny bug.
when the project began i was experimenting with various finetunes that could run on my RTX3060 12GB, from qlora on open_llama_3b to lora+gptq on llama-7b. however, those runs had very unstable loss graphs
i noticed however that some models were converging somewhere:
yet when attempting to put the lora file from axolotl into text-generation-webui for inferencing, i saw very bad results, as if the model didn't work at all
this stopped me in my tracks for two months, since i didn't know what to do. i went all the way to bumping to 16 epochs with large `lora_r` and `lora_alpha` to force the model to overfit on the data, saw the loss curve go all the way down to 0... and yet the model didn't work. until i had one hail mary and ran the lora with `accelerate launch -m axolotl.cli.inference`: it worked, yet text-generation-webui didn't. why is that?
so i attempted to write a small script that uses hf transformers and peft to run the model. it didn't work.
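for reference, the skeleton of that kind of script is roughly this (the base model name and adapter path here are placeholders, not necessarily what i used):

```python
# rough skeleton of a transformers + peft lora inference script
# (base model name and adapter path are placeholders)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "openlm-research/open_llama_3b"  # whatever the lora was trained on
adapter = "./lora-out"                  # wherever axolotl wrote the adapter

tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter)
model.eval()

prompt = "luna: hello\n"
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tok.decode(out[0], skip_special_tokens=True))
```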
so i began copying code from axolotl into the small script, mirroring what `axolotl.cli.inference` does, even though many of those settings are only relevant for training:

- lora configs match
- `print(model)` matches
- `is_trainable=False` makes it work better? not really
- `prepare_model_for_kbit_training`? doesn't work
- `tie_weights()` doesn't work
- there's a bunch of tensor dtype patching, copied, doesn't work
- `model.gradient_checkpointing_enable()` didn't work
- axolotl uses flash attention, that requires more dtype patching, copied that, didn't work
- `model.config.use_cache = False` didn't work

until i `print()`ed the arguments to model instantiation.
so, turns out that openlm released a v2 of their model.
i was accidentally training on v2 (because axolotl updated their yml files to use it) and inferencing on v1 (because i downloaded open llama 3b when it was first released to test it out with, before this project even started). `axolotl.cli.inference` worked because it used the same model i used for training, that's all.
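in hindsight, a cheap way to catch this class of mistake: the adapter itself records which base model it was trained against, so just compare that with what you're about to load (paths here are placeholders):

```python
# sanity check: does the adapter's recorded base model match what i'm loading?
from peft import PeftConfig

adapter = "./lora-out"  # placeholder path
base_for_inference = "openlm-research/open_llama_3b_v2"

cfg = PeftConfig.from_pretrained(adapter)
print("adapter was trained on:", cfg.base_model_name_or_path)
if cfg.base_model_name_or_path != base_for_inference:
    raise SystemExit("base model mismatch, this lora will produce garbage")
```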
i spent some two days banging my head and copypasting model code, all for it to be the simplest fix in the books.
i've said it before throughout workplaces and i'll say it again: the longer you debug something, the simpler the solution will turn out to be, and the more stupid you'll feel about it.
how to inference #
as mentioned before, i use text-generation-webui for inferencing. since it uses gradio for its ui, it has an "api" (which is what the gradio frontend uses to talk to the server). it's definitely not stable, as it depends on the gradio version and the code that was used to write the lora page. here's my code to set a lora, for example (you can find out how to do it by just opening up the network inspector on the lora page):
```python
# runs inside an async function; websockets, json, os and a `log` logger are
# set up elsewhere. fn_index 292 is whatever the lora apply handler maps to in
# my gradio version (found via the network inspector).
async with websockets.connect(
    os.environ["OOGA_GRADIO_ENDPOINT"] + "/queue/join"
) as ws:
    log.info("applying lora %r", lora_name)
    # register with the gradio queue for that fn_index
    await ws.send(json.dumps({"fn_index": 292, "session_hash": "asdf"}))
    # send the actual event data (which lora to apply)
    await ws.send(
        json.dumps(
            {
                "data": [[lora_name]],
                "event_data": None,
                "fn_index": 292,
                "session_hash": "asdf",
            }
        )
    )
    while True:
        msg = await ws.recv()
        log.info("received %r", msg)
        msg = json.loads(msg)
        typ = msg["msg"]
        if typ == "process_starts":
            log.info("applying...")
        elif typ == "process_generating":
            log.info("applying......")
        elif typ == "process_completed":
            log.info("applied!")
            break
        else:
            log.info("unexpected event %r, ignoring", typ)
```
it doesn't take too long to load a lora. most of the time is spent fetching the messages from discord to assemble model input, which also has its tricks to make model output better, such as splicing in chat history from another "good" chat (one that isn't overrun by model output), since repetition is a large problem with these models.
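the splicing itself is nothing fancy; here's a sketch of the idea, where the message rendering, slice sizes and character budget are all made up for illustration:

```python
# sketch of the input assembly trick: seed the prompt with lines from a "good"
# chat, then append the live channel's recent messages, then cue the target
# speaker. all sizes and the rendering format are arbitrary for illustration.
def build_prompt(good_history: list[dict], recent: list[dict],
                 speaker: str, budget_chars: int = 4000) -> str:
    def render(msgs: list[dict]) -> list[str]:
        return [f'{m["author"]}: {m["content"]}' for m in msgs]

    lines = render(good_history[-20:]) + render(recent[-10:])
    prompt = "\n".join(lines) + f"\n{speaker}:"
    # crude left-truncation to stay inside the context window
    return prompt[-budget_chars:]
```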
i attempted to use "multi-lora" inferencing solutions such as punica or lorax, as those could let me run inference on all the models and, say, reply with the one that gives the highest token probability to its own name compared to the rest (at the moment it's random or fixed; there's a sketch of that selection idea after this list). however:
- punica needs model conversion to its own format, which put me off running it at first
- lorax doesn't need conversion to its own format, but
  - it converts to safetensors automatically for some reason
  - i attempted to run it locally, without the docker image, because i wanted to peek into what makes something like this tick, with all its dependencies
  - that was painful. i was not able to get lorax to run properly because of juggling cuda/torch/etc into the right versions
  - even though the documentation for local install says it doesn't need flash-attention or vllm, it will yell about them at runtime, and then you get surprised when you can't load a lora onto a model even though it's already serving requests. that's kind of confusing
  - rant over
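for the record, the selection idea i wanted out of multi-lora serving looks something like the sketch below, ignoring the efficiency part and just switching peft adapters on a single base model (the base model name, adapter paths and prompt format are placeholders):

```python
# sketch: pick which persona should reply by scoring, under each persona's
# lora, how likely the model thinks that persona's name is the next speaker.
# base model name, adapter paths and prompt format are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "openlm-research/open_llama_3b_v2"
ADAPTERS = {"luna": "./loras/luna", "intel": "./loras/intel", "cool": "./loras/cool"}

tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.float16, device_map="auto"
)

# load every lora once, then switch between them with set_adapter
names = list(ADAPTERS)
model = PeftModel.from_pretrained(model, ADAPTERS[names[0]], adapter_name=names[0])
for name in names[1:]:
    model.load_adapter(ADAPTERS[name], adapter_name=name)

def own_name_logprob(chat_prompt: str, speaker: str) -> float:
    """sum of log-probs of "\\n{speaker}:" right after the chat history,
    scored with that speaker's own lora active."""
    model.set_adapter(speaker)
    prefix = tok(chat_prompt, return_tensors="pt").input_ids.to(model.device)
    name_ids = tok(f"\n{speaker}:", add_special_tokens=False,
                   return_tensors="pt").input_ids.to(model.device)
    input_ids = torch.cat([prefix, name_ids], dim=1)
    with torch.no_grad():
        logprobs = torch.log_softmax(model(input_ids).logits, dim=-1)
    total = 0.0
    for i in range(name_ids.shape[1]):
        pos = prefix.shape[1] + i  # position of the i-th name token
        total += logprobs[0, pos - 1, input_ids[0, pos]].item()  # logits at pos-1 predict pos
    return total

history = "luna: hello\nintel: hi\n"  # assembled chat history (placeholder)
best = max(names, key=lambda n: own_name_logprob(history, n))
print("replying as:", best)
```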
results #
a lot of people enjoyed the fucked up shit the models said, and the models faithfully copied the writing style from the people i trained on (some are rapid fire, all lower case zoomers, others Capitalize the first letter of their sentences and use "I" instead of "i", etc)
the models weren't "instruction-tuned", but you could use the translation capabilities of a model to invoke a "translator" out of latent space, which is what happened with a finetune on a friend of ours. she is an english and japanese speaker (`-PL` is the suffix for model output):
she attempted to get the other models to invoke japanese translation but without success. pretty interesting result
basically, the project is a success, and most further advances in the quality of the models boil down to possible future work on dataset curation. i tried to use lilac for some of it, but only after installing it did i find out i needed to install 30 other things; i got too tired and just prayed that bumping `num_epochs` for small datasets would work.