There is a French version of this tutorial: Parler aux robots.
This document is a tutorial/checklist for installing, training, and using GPT-2.
We are going to be generating text with a bot. The field of computer science that figures out how to teach computers to interpret and generate text is Natural Language Processing. Natural Language Processing dates from the earliest research in computer science, and many of the earliest theoretical texts from those very beginnings (cf. Computing Machinery and Intelligence) are about teaching computers to understand human language and use it to communicate with us.
Given this long history, there are many different historical methods for generating text with a computer. An overview of the most important techniques can be helpful when making art and design projects that generate text. Sometimes, for example, a neural network is overkill and a simpler, more straightforward approach would serve the project better.
An excellent overview of the history of “generative” artworks and the various techniques used to build them is Philippe Pasquier’s online course Generative Art and Computational Creativity.
Another important tool used for text generation is Daniel Howe's RiTa library, which includes tutorials and examples for integrating it into a P5 or Processing project. It makes significant use of Markov chains to generate text: cf. Tutorial: Generating with RiTa n-grams. RiTa is often a good starting point when learning to make generative works of art.
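To give a concrete sense of what a Markov chain does (this is a generic illustration in plain Python, not RiTa's actual API), here is a minimal sketch that learns which word follows which in a sample text and then strings together a new sentence:

```python
import random
from collections import defaultdict

text = "the cat sat on the mat and the cat ate the fish on the mat"
words = text.split()

# Build a table mapping each word to the words that can follow it (a 2-gram model).
follows = defaultdict(list)
for current, nxt in zip(words, words[1:]):
    follows[current].append(nxt)

# Generate new text by repeatedly choosing a random continuation.
word = random.choice(words)
output = [word]
for _ in range(12):
    if not follows[word]:
        break
    word = random.choice(follows[word])
    output.append(word)

print(" ".join(output))
```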
The type of text transformer we will be using for this tutorial is called "GPT", which stands for "Generative Pre-trained Transformer". "Generative" means, as the name might suggest, that the computer will be generating text output for us. "Pre-trained" describes how a large amount of previously captured text has already been used to create a trained model, which we will then modify to generate new text in the style we want. And finally, "transformer" describes the neural network architecture itself that makes all this possible.
Dale Markowitz has a really good video from August 2021 describing briefly what a transformer “is”. It is probably your best starting point: Transformers, explained: Understand the model behind GPT, BERT, and T5
The original paper that invented this approach is probably hard to read for most people reading this tutorial, but here it is if you are curious: Attention Is All You Need (December 2017). This paper was a collaboration between Google (and more specifically, Google Brain) and the University of Toronto, i.e. the place where Geoffrey Hinton teaches and does research. If you want a data science analysis of why this work was important, and why it replaced the techniques that came before it, here is a somewhat more digestible presentation on its importance: LSTM is dead. Long Live Transformers!.
Let's talk to some online bots written with GPT-2.
GPT-2 is trained with a general dataset of knowledge and human conversations. This makes for boring conversations. Let's teach it to speak in a specific style, manner, or syntax.
- Download Visual Studio Code for your machine (Mac, PC, Linux)
- Open VS Code
- Open the "Extensions" tab and verify that the Jupyter extension is installed
- Create a folder for your project and open the folder in VS Code.
- Create a file named `train.ipynb` in that folder
- This should open the Jupyter input interface. For example, you should see new buttons titled `+Code`, `+Markdown`, and `>Run All`
- This interface is now ready to receive Python commands (see below). Each command is entered step by step into your Jupyter interface. The results of each command are displayed directly in this window, after which you can enter the next command.
Probably the most important part of retraining is preparing the dataset, well before the actual neural network training itself. This process resembles the art of curation more than that of writing: you will mostly avoid writing new material during this phase, especially given the massive amount of text that you will need to feed to GPT in order to make it compelling and, textually speaking, diverse. A very short text file won't train GPT in any significant way, so it is simply impractical in most cases to write the text yourself. Instead you will need to find a collection of texts that, in their diversity, adhere to the style or personality of whatever it is you are trying to create.
For example, in Simon Senn's and Tammara Leites' performance dSimon, the duo used a huge collection of several years of Simon Senn's personal emails to train his GPT double. This created a disturbingly familiar, and yet consistently surreal, textual double that eerily mimicked Simon's own writings, musings, and interpersonal communications, while simultaneously producing a lot of interesting, absurd proclamations. For the result to work on a poetic level, this required thousands and thousands of lines of text, not just a few emails selected here and there from Simon's epistolary life.
The type of text you need to prepare is raw text, in `.txt` format, which simply means that the text has no extraneous fonts, formatting, color, or size, and not even indications of bold or italic. It is just a collection of raw letters from A to Z, numbers, punctuation, and various accented characters. Special note: for now, we will be working entirely in English. Even though GPT is not technically limited to English, the original training was done in English and so will not simply transfer easily over to another language during the retraining process. Perhaps you will have better luck than we did, but be forewarned.
There are many places where you can find raw blocks of text to source your training dataset. A good starting point is Project Gutenberg, if for example you wish to teach your bot to speak like Virginia Woolf.
Texts found at Project Gutenberg and other sources online will contain a lot of extraneous legal disclaimers, punctuation, and other undesired textual noise that will enter into your training if you do not remove it. So once you have selected your text sources and copied them into a single `.txt` file, you will need to clean up your text.
We suggest using VS Code to edit your dataset. It has simple 'find', 'replace', and 'replace all' functions, much like many other text and code editors. For simple pattern matching cf. VS Code, Basic Editing, Find and Replace. Use this `find` + `replace all` method to clean your giant block of text of all the extraneous spaces and punctuation that you do not want to enter into GPT.
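If your source text is very large, you can also do this kind of cleanup with a short Python script instead of by hand. Here is a minimal sketch, assuming your collected text is in a file called `raw.txt` and you want to write the cleaned version to `shakespeare.txt` (both file names are only examples):

```python
# Minimal cleanup sketch. The file names and replacements are examples only;
# adapt them to whatever noise appears in your own sources.
with open("raw.txt", encoding="utf-8") as f:
    text = f.read()

# Simple find / replace-all cleanup, like in VS Code.
text = text.replace("\t", " ")   # tabs become spaces
text = text.replace("_", "")     # Project Gutenberg sometimes marks italics with underscores

# Collapse runs of spaces into a single space.
while "  " in text:
    text = text.replace("  ", " ")

with open("shakespeare.txt", "w", encoding="utf-8") as f:
    f.write(text)
```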
If you want to fall down a deep rabbit hole of text-wrangling awesomeness, this simple `find`/`replace` tool has a little button with a star that will activate a crazy-looking but sophisticated and powerful text-parsing language called Regular Expressions. Regular Expressions are amazing, but also look like the unfortunate result of some random computer error: they are famously very difficult to read. Thankfully, VS Code gives you some visual pointers on what your regular expressions will do to your text before you commit, and you will also find numerous tutorials online to teach you this nutso language (a short Python example follows the list below):
- RegexOne: Learn Regular Expressions with simple, interactive exercises
- Regex 101
- RegexLearn
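The same kind of regular expressions you type into VS Code can also be run from Python with the built-in `re` module. A small sketch, where the patterns are only examples of the kind of noise you might want to remove:

```python
# Regular-expression cleanup sketch using Python's built-in re module.
import re

with open("shakespeare.txt", encoding="utf-8") as f:
    text = f.read()

text = re.sub(r"\[[^\]]*\]", "", text)   # remove anything in square brackets, e.g. [Illustration]
text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
text = re.sub(r"\n{3,}", "\n\n", text)   # allow at most one blank line in a row

with open("shakespeare.txt", "w", encoding="utf-8") as f:
    f.write(text)
```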
Now that your text is ready for training, let us load it into our GPT pretrained network (the P of GPT stands for “pretrained”) and start retraining with it.
We will use the `Jupyter` environment of VS Code, which should automatically activate whenever you open a file ending in the `.ipynb` format. Either create a new file named `train.ipynb` or simply use the file provided with this document inside the `Jupyter` folder.
If you are writing your own notebook from scratch, you will need to click on the `+Code` button in order to create a new code block. This first block of code is:
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
The above code installs the `gpt-2-simple` library on your system if it is not already there. We also need to `import` the `gpt_2_simple` and `datetime` libraries.
Now you can press the button inside of the `[ ]` brackets, which will execute this part of the code. It will take a little bit of time...
When this has finished executing, we check whether you have an Nvidia graphics card that could accelerate the training process.
!nvidia-smi
Press Play to see the result. If you are on a Mac, there is little to no chance of this `nvidia-smi` command finding anything.
After that we will download the smallest model of GPT-2:
gpt2.download_gpt2(model_name="124M")
Press Play again and wait for the model to download to your computer.
Now we will actually train GPT. First, we need to create a training session that we will place in a variable named `sess`. Then we need to tell GPT the name of our text file. This file should be in the same folder as our `train.ipynb` file. Finally, we use this filename to start the training process, which needs some extra information, for example how many training steps to take and how often to print out a sample of its current results.
sess = gpt2.start_tf_sess()      # create a TensorFlow session for training
file_name = "shakespeare.txt"    # your cleaned-up training text

gpt2.finetune(sess,
              dataset=file_name,       # the raw text file to train on
              model_name='124M',       # the small GPT-2 model downloaded above
              steps=1000,              # total number of training steps
              restore_from='fresh',    # start from the original GPT-2 weights
              run_name='run1',         # name of the folder where checkpoints are saved
              print_every=100,         # print progress every 100 steps
              sample_every=200,        # print a sample text every 200 steps
              save_every=500           # save a checkpoint every 500 steps
              )
Depending on your computer, this can take several hours.
Now that we have trained our bot with its own style, let's start talking to it. To test it out yourself, use the `generate()` command:
gpt2.generate(sess)
We probably want more control over our output, and also to define a "prompt", i.e. the opening words and characters that GPT will attempt to continue:
gpt2.generate(sess,
              length=250,        # number of tokens to generate
              temperature=0.7,   # lower = more predictable, higher = more surprising
              prefix="JULIET:",  # the prompt that the generated text will continue
              nsamples=1,        # how many samples to generate
              batch_size=1       # how many samples to generate in parallel
              )
(direct communication with the bot on the command line)
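One possible way to do this, sketched here as an assumption rather than taken from the original notebook: load the fine-tuned checkpoint in a separate script and wrap `gpt2.generate()` in an input loop. The file name `talk.py` and the prompt format are arbitrary choices; `run1` matches the `run_name` used during training above.

```python
# talk.py -- a minimal sketch for chatting with the fine-tuned model from the
# command line. Assumes the checkpoint folder created with run_name='run1'
# sits next to this script.
import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name='run1')   # load the fine-tuned weights

while True:
    prompt = input("You: ")
    if not prompt:                      # press Enter on an empty line to quit
        break
    reply = gpt2.generate(sess,
                          run_name='run1',
                          prefix=prompt,
                          length=100,
                          temperature=0.7,
                          nsamples=1,
                          batch_size=1,
                          return_as_list=True)[0]
    print("Bot:", reply)
```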
(create a quick server in Python, speak to it from a webpage)
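A minimal sketch of such a server, assuming Flask (installed separately with `pip install flask`, not part of gpt-2-simple itself); the route name and port are arbitrary choices:

```python
# server.py -- a minimal sketch of a local web server around the fine-tuned model.
import gpt_2_simple as gpt2
from flask import Flask, request, jsonify

app = Flask(__name__)
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name='run1')   # load the fine-tuned weights once, at startup

@app.route("/generate")
def generate():
    prompt = request.args.get("prefix", "JULIET:")   # e.g. /generate?prefix=ROMEO:
    text = gpt2.generate(sess,
                         run_name='run1',
                         prefix=prompt,
                         length=100,
                         return_as_list=True)[0]
    return jsonify({"text": text})

if __name__ == "__main__":
    # threaded=False keeps requests in the same thread as the TensorFlow session.
    app.run(port=5000, threaded=False)
```

A webpage could then request http://localhost:5000/generate?prefix=JULIET: and display the returned text; if the page is served from a different origin, you may also need to add CORS headers.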
Different ways of speaking to your bot.
- Webpage
- Processing
- P5
- Unity
- Twine
- Godot
- Max/MSP
This tutorial repackages Max Woolf's Train a GPT-2 Text-Generating Model w/ GPU Google Colab notebook and their gpt-2-simple training code.
- GPT-2 Neural Network Poetry
- How To Make Custom AI-Generated Text With GPT-2
- How to Build a Twitter Text-Generating AI Bot With GPT-2
This tutorial was created by the Pool Numérique and the Master Media Design (HEAD–Genève), for the Virtual Beings semester project (2021-22).