Spread the love

While I’ve been experimenting with LLMs through Hugging Face, I didn’t have a reason to train my own models. Everything changes so regularly that I haven’t yet found what fits my ideas well enough, and I still haven’t decided what I will end up using. This post just shares some of the starting steps I’ve used to play around with models.

My experimentation started in the early days. I went through tons of models, and methods of using them, and I must share that I finally found a stable setup I love working with. I’m using LM Studio 0.2.19 at the time of this post, and it has been very capable of running models on my somewhat limited laptop. It uses my graphics card to the max, whereas originally, running via WSL2, I had issues with the onboard graphics.

The broader tutorial with more explanation is available on YouTube: Ingenium Academy – Generating AI With Hugging Face. When I saw it a day or two ago, I felt it gave a nicely ordered plan for this blog post.

Note: I trimmed some redundant text. You don’t need array lists of hundreds of tokens, for instance, right?

Machine Specs

First training run (single epoch): about 1h 08min.

Second training run (three epochs): about 3h 53min.

  • Acer Nitro 5
  • Windows 11
  • 11th Gen Intel(R) Core(TM) i5-11400H @ 2.70 GHz (12 CPUs)
  • 32 GB DDR4-2933 RAM
  • WSL2 Ubuntu 22.04.3
  • Nvidia GeForce RTX 2050 4 GB Laptop GPU

I should point out you can do this faster through Google Colab, for instance; I just prefer to do things on my own, on my personal hardware, as limiting as that may be.

The Plan Today

Today, without being in full tutorial mode, I will share the steps I took for random experimentation, and the little code I roughly wrote up. The idea of this post is to show others how easy some of it is. Sure, I used several strange methods at times; this is just the tip of the iceberg for how easy it is to do things with your own LLM.

Reminder: Everything is changing at a rapid pace, so you should familiarise yourself with at least some of it so you aren’t in the dark about it in the future.

Getting Started: Exploring an existing dataset

To start, I used my Ubuntu 22.04.3 instance under WSL.


You can always find a tutorial for setting up your WSL environment.

sudo apt install python3 python-is-python3 python3-pip ipython3 -y
pip3 install transformers datasets evaluate 'transformers[torch]'

Using the above on a fresh WSL instance worked; I hope I haven’t forgotten anything else I added along the way.

In python we can load a dataset:

from datasets import load_dataset
dataset = load_dataset('microsoft/LCC_csharp')
Dataset Structure dictionaries

As you can see, most models are trained on a large dataset that comes with a validation split and a test split alongside the training rows. You can even take a look at it a little:

The dataset[‘train’][0] row

As you can tell, a row is a fair bit larger than you might expect. I share this so you can see that we can work with larger sets of data to create a dataset for ourselves. Then, to split these sets, we use simple code:

split_set = dataset['train'].train_test_split(train_size=0.8, seed=1234)
A split ‘train’ dataset

As you can see, working with your own datasets should be quick and easy once they are loaded. This is the first thing to get comfortable with before training, since if you can’t work with datasets you’ll be a little stuck. Think of it as cutting sections of rows in Excel and putting them on a different sheet, except the sheets feed your LLM.

Build My Own Reuters Dataset

I do all my work in WSL on a separate drive, so some steps are just quicker through the Windows GUI. On Windows, I pulled the Reuters-21578 dataset, opened it in WinRAR, and extracted the tar.gz file. Then in WSL I un-tarred it:

tar -xzvf reuters21578.tar.gz
The un-tarred files

The important files, which the tutorial uses, are the SGM files. From these we build our own set of Reuters articles, organised into the dataset structure we learned above.

from bs4 import BeautifulSoup
import json

TRAIN_PCT = 0.8
VALID_PCT = 0.1

reuters_articles = []

for i in range(22):
    # load data from reut2-000.sgm through reut2-021.sgm
    with open(f'./reut2-{i:03d}.sgm', 'r', encoding='latin-1') as file:
        soup = BeautifulSoup(file, 'html.parser')

    # collect the title and body of every <REUTERS> element
    for reuters in soup.find_all('reuters'):
        title = reuters.title.string if reuters.title else ''
        body = reuters.body.string if reuters.body else ''
        reuters_articles.append({'title': title, 'body': body})

# split data 80/10/10
train_end = int(len(reuters_articles) * TRAIN_PCT)
valid_end = int(len(reuters_articles) * (TRAIN_PCT + VALID_PCT))
train_articles = reuters_articles[:train_end]
valid_articles = reuters_articles[train_end:valid_end]
test_articles = reuters_articles[valid_end:]

# fn to save articles as JSONL
def save_json(data, filename):
    with open(filename, 'w') as file:
        for article in data:
            file.write(json.dumps(article) + '\n')

# save the data
save_json(train_articles, 'train.jsonl')
save_json(valid_articles, 'valid.jsonl')
save_json(test_articles, 'test.jsonl')
Experimentally looked at the data structures.

I then saw the need to move over to jupyter notebook. It would make things easier for jumping back and forth as I played with several models and datasets.

vim ~/.bashrc

Then at the end of the file, add the line:

alias jupyter-notebook="~/.local/bin/jupyter-notebook --no-browser"

Then, after saving and exiting, back in the WSL shell:

source ~/.bashrc
jupyter-notebook

In your browser, go to the URL Jupyter prints in the terminal (usually http://localhost:8888 with a token).

Create your first Python 3 (ipykernel) notebook.

Using the load.py data created above, I used the notebook values below, to create the dataset edg3/reuters_articles through jupyter.

As a note for those who do not know: on Hugging Face, go to your profile, click Edit Profile, then open Access Tokens. That token is what is used for the login step.

Voila, that is my first dataset

Build Your First Tokenizer

No idea why I wanted to use an ‘s’ instead of a ‘z’. I just wasn’t paying enough attention.

So the steps are simple enough for creating a tokenizer with the dataset above. As a note, the tutorial didn’t use 128k as the vocabulary size; I just happened to feel like it. This is the look of the tokenizers:

Tokenizer output

The easy way to understand it: a tokenizer turns text into the tokens the LLM works through, and grouping letters into tokens makes mapping to answers quicker. That being said, using the jupyter below:
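The notebook was again a screenshot, so as a hedged stand-in, here is an offline sketch of training a WordPiece tokenizer (the family BERT uses) with the tokenizers library on a made-up toy corpus; the real run trained from my Reuters dataset with the 128k vocabulary:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Build an untrained WordPiece tokenizer
tokenizer = Tokenizer(models.WordPiece(unk_token='[UNK]'))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# 128_000 matches the vocab size I picked; the toy corpus won't reach it
trainer = trainers.WordPieceTrainer(
    vocab_size=128_000,
    special_tokens=['[UNK]', '[CLS]', '[SEP]', '[PAD]', '[MASK]'],
)
corpus = [
    'Oil prices rose sharply in early trading.',
    'The company reported higher quarterly earnings.',
    'Grain exports fell after the drought.',
]
tokenizer.train_from_iterator(corpus, trainer)

# Tokenize a sentence with the freshly trained vocabulary
out = tokenizer.encode('Oil prices rose sharply')
print(out.tokens)
```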

With that, we have edg3/bert

The tokenizer model created using my edg3/reuters_articles

As a note, we can now load it with AutoTokenizer:

Loaded tokenizer off HuggingFace direct

Also we can then use it for more tokenization:

Tokenizer.tokenize output

While I will experiment with it more in the future, the next fun experimental step is further training of an existing model.

Fine-Tuning An LLM

As a base starter, I hid warnings about future removals. What happens below is we load a tokenizer and the Seq2Seq model facebook/bart-large-cnn, then fetch the samsum dataset of casual conversations. You can see that, used without training, it answers with a random snippet from the sample.

At the comment # Training starts here we set up the soon-to-be-deprecated training_args and trainer. At num_train_epochs=1, like the tutorial, it took around 1 hour and wasn’t successful; the answers were incorrect. I jumped to num_train_epochs=3, and while it took almost 4 hours on my laptop (gaming was involved), the results came out far better.

This is where I can admit, with ease, that I messed the data up a little. However, it’s working, somewhat, and I figured I would leave it there for today. One note on the training: my GPU showed this:

GPU Memory Used

Since I did play some games, that’s likely part of why it took 4 hours. As another note, on Hugging Face the single-epoch model also gave far more incorrect results.

Test 1 on Hugging Face

With that, my first trained model is up on Hugging Face: edg3/bart-cnn-samsum-finetuned. It got 4 downloads before I even finalised the secondary training, which I find amusing.

The model after training


There will always be tons more to learn, and do, with LLMs and AI these days. I recommend everyone interacts and experiments with them. Things will get broader, better, and stronger in the foreseeable future.