About AI / ML Projects from Eclectic Beams

Motivatron 5000, Positron, Negatron, etc. are AI projects made by me, Jakob Anderson. They grew out of a branch of my fiction-generation project, Eclectic Beams.


Idea

I saw motivational text art on Pinterest, you know the ones. I wondered if I could automate their creation, using my fiction-generation workflow from Eclectic Beams as a starting point. First, I'd have to generate text phrases in their style.


Process

Dataset

I hand-typed and scraped around 1,000 motivational phrases and proverbs from these Pinterest motivational-quote images, and from other sources across the interwebs, and put them into a text file, one phrase per line. I then wrapped each line in a prefix and suffix delimiter, so I could tell my generator where these shorter phrases start and stop, unlike the open-ended text that the fiction generation required.
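In sketch form, that wrapping step looks something like the following. The filenames are illustrative, and the <|startoftext|>/<|endoftext|> tokens are the conventional GPT-2 fine-tuning delimiters, standing in for whatever the project actually used:

```python
# Minimal sketch: wrap each phrase in start/stop delimiters.
# Filenames and the <|startoftext|>/<|endoftext|> tokens are assumptions,
# not necessarily the exact ones used in the project.
with open("phrases_raw.txt", encoding="utf-8") as f:
    phrases = [line.strip() for line in f if line.strip()]

with open("phrases_delimited.txt", "w", encoding="utf-8") as f:
    for phrase in phrases:
        # The delimiters tell the generator where a short phrase starts and stops.
        f.write(f"<|startoftext|>{phrase}<|endoftext|>\n")
```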

Training the model

I fed that dataset into a Google Colab GPU notebook to fine-tune a GPT-2 124M model on the phrases. I trained it for a week, off and on. It was giving me good sample output after around 10,000 iterations, so I started training a larger, more complex GPT-2 774M model. Over about a month, I got it up to 200,000 iterations, and it was producing pretty good samples at default temperature settings. It had reached an average loss of 0.01 around 90,000 iterations, but I badly needed to rewrite my plagiarism filter for this and the rest of my text generation, so I kept the model training every day while I did that.
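I won't reproduce the notebook here, but a minimal sketch of this kind of Colab fine-tuning, using minimaxir's gpt-2-simple library as a stand-in for the actual tooling, looks like:

```python
# Sketch of Colab-style GPT-2 fine-tuning with gpt-2-simple
# (an assumption; the original notebook's tooling isn't specified).
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="124M")  # later runs used the 774M model

sess = gpt2.start_tf_sess()
gpt2.finetune(
    sess,
    dataset="phrases_delimited.txt",  # hypothetical filename from the step above
    model_name="124M",
    steps=10000,            # later runs pushed well past this
    run_name="motivatron",  # hypothetical run name
    sample_every=500,       # print sample output periodically
    save_every=1000,        # checkpoint regularly
)
```

finetune resumes from the latest checkpoint by default, which suits the off-and-on, train-a-bit-every-day cadence described above.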

Generating text

I have a Python script that picks random settings, within safe ranges, for the generator's temperature, top_k, and top_p parameters. I made a simple bash script that ran this script in a loop, hundreds of times per day, to generate samples. I examined those samples, found tighter ranges to randomize the parameters within, and continued. Meanwhile, my rewritten plagiarism filter was completed. A sketch of the sampling step is below.
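Something along these lines, again assuming gpt-2-simple; the ranges shown are illustrative stand-ins, not the project's tuned values:

```python
# Sketch of randomized sampling settings. The ranges are illustrative
# guesses at "safe" values, not the tuned ones from the project.
import random
import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name="motivatron")  # hypothetical run name

settings = {
    "temperature": random.uniform(0.7, 1.0),
    "top_k": random.choice([0, 40, 80]),   # 0 disables top_k filtering
    "top_p": random.uniform(0.8, 0.95),
}

samples = gpt2.generate(
    sess,
    run_name="motivatron",
    prefix="<|startoftext|>",
    truncate="<|endoftext|>",  # stop at the suffix delimiter
    nsamples=20,
    return_as_list=True,
    **settings,
)
```

The bash wrapper is then just a loop that calls a script like this over and over.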

Profanity filtering

Because GPT-2 and the models fine-tuned from it are black boxes, I've had some issues with the generator producing explicit text, which I'd rather not deal with, so I run a simple ML profanity-filter model at the end of the generator to weed out unsightly phrasing. Perhaps later I'll just flag these for a special mode, so the grownups can laugh at them away from any child's eyes.
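A minimal sketch of that gate, assuming an off-the-shelf ML classifier like profanity-check in place of whichever model is actually wired in:

```python
# Sketch of an end-of-pipeline profanity gate. The filter model isn't
# named in the post; profanity-check (a small ML classifier) is assumed here.
from profanity_check import predict_prob

def keep_clean(phrases, threshold=0.5):
    """Drop phrases the classifier scores as likely profane."""
    scores = predict_prob(phrases)
    return [p for p, score in zip(phrases, scores) if score < threshold]

clean = keep_clean(["Reach for the stars.", "Fall seven times, stand up eight."])
```

Flagging instead of dropping (for that grownups-only mode) would just mean returning both lists instead of discarding the profane one.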

Plagiarism filtering

The odds of this model generating phrases already contained in the dataset were pretty high at these short text lengths: about 90% of phrases were exact copies at a temperature of 0.7, and it wasn't much better at a higher, more unhinged temperature. Even when the model generated new and intriguing phrases, they always felt like a copy, so I had to be able to check for this automatically; pasting each phrase into "Find" in my editor against the dataset files was very dull and time-consuming.

I built a Rust framework around frizensami's plagiarism-basic library, so I can run plagiarism detection quickly in-memory from Rust code, instead of passing a directory to its CLI. It uses sliding windows of configurable text length: I filter my shorter texts with a window of 3-5 words, and my longer texts with a window of 5-10 words.
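The real filter is Rust code wrapping plagiarism-basic, and I won't guess at that library's API here, but the sliding-window idea itself is simple enough to sketch in Python:

```python
# Illustrative sketch of sliding-window (n-gram) overlap detection.
# The real filter is Rust wrapping frizensami's plagiarism-basic;
# this is the technique, not that library's API.
def ngrams(text, n):
    """All n-word windows in a text, as a set of word tuples."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_plagiarized(candidate, dataset_phrases, window_sizes=(3, 4, 5)):
    """Flag a candidate if any n-word window also appears in the dataset."""
    for n in window_sizes:
        windows = ngrams(candidate, n)
        if any(windows & ngrams(phrase, n) for phrase in dataset_phrases):
            return True
    return False
```

In practice you'd precompute the dataset's windows once rather than rebuilding them per candidate; doing exactly that, fast and in-memory, is the point of the Rust wrapper.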