While I’ve been experimenting with LLMs through Hugging Face, I haven’t had a reason to start my own models. Everything changes so regularly that I haven’t yet found what fits my ideas well enough, and I still haven’t decided what I’ll end up using. This post just shares some of the starting steps I’ve used to play around with models.
My experimentation started in the early days. I went through tons of models, and methods of using them, and I must share that I’ve finally found a stable setup I love working with. I’m using LM Studio 0.2.19 at the time of this post, and it has been super capable of running models on my slightly limited laptop. It has been able to use my graphics card to the max, whereas when I originally ran things through WSL2 I had issues with the onboard graphics.
The broader tutorial, with more explanation, is available on YouTube: Ingenium Academy – Generating AI With Hugging Face. When I saw it a day or two ago, I felt it gave a lovely, ordered plan for this blog post.
Note: I trimmed some redundant output text. You don’t need to see arrays of hundreds of token IDs, for instance, right?
Machine Specs
First model training run (single epoch): roughly 1h 08min.
Second model training run (three epochs): roughly 3h 53min.
- Acer Nitro 5
- Windows 11
- 11th Gen Intel(R) Core(TM) i5-11400H @ 2.70GHz (12 CPUs), ~2.7GHz
- 32GB DDR4 2933MHz RAM
- WSL2 Ubuntu 22.04.3
- Nvidia GeForce RTX 2050 4GB Laptop GPU
I should point out you can do this faster through Google Colab, for instance; I just prefer to do things on my own, on my personal hardware, as limiting as that may be.
The Plan Today
Today, without being too much in tutorial mode, I will share the steps I took for random experimentation, and the little code I roughly wrote up. The idea for this post is to let others know how easy some of it is. Sure, I used several strange methods at times; this is just the tip of the iceberg for how easily you can do things with your own LLM.
Reminder: Everything is changing at a rapid pace, so you should familiarise yourself with at least some of it so you aren’t in the dark about it in the future.
Getting Started: Exploring an existing dataset
To start, I used my Ubuntu 22.04.3 instance under WSL.
wsl
You can always find a tutorial for setting up your WSL environment.
sudo apt install python3 python-is-python3 python3-pip ipython3 -y
pip3 install transformers datasets evaluate transformers[torch]
python
Using the above on a fresh WSL instance worked; I hope I haven’t forgotten some other package I added along the way.
In Python we can load a dataset:
from datasets import load_dataset
dataset = load_dataset('microsoft/LCC_csharp')
As you can see, like most datasets used for model work, it is large and comes with validation and test splits alongside the training rows. You can even take a look at it a little:
dataset['train'][0]
As you can tell, it’s a fair bit larger than you might expect. I share this because it shows we can work with large sets of data when building a dataset of our own. Splitting a set is then a single line of code:
split_set = dataset['train'].train_test_split(train_size=0.8, seed=1234)
As you can see, working with your own datasets should be quick and easy once they’re loaded. This is the first thing to get comfortable with for training, because if you can’t work with datasets you’ll be a little stuck. Think of it as cutting sections of rows in Excel and putting them on a different sheet, except the sheets feed your LLM instead.
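For anyone curious what that split hands back, here is a minimal sketch (purely illustrative; the printed values are whatever your run produces):
# split_set is a DatasetDict with fresh 'train' and 'test' keys
print(split_set)              # shows features and num_rows for each split
print(split_set['train'][0])  # rows keep the same structure as before
print(len(split_set['test'])) # the held-out 20% lives under 'test'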
Build My Own Reuters Dataset
I do all my work in WSL on a separate drive, so some steps are just quicker through the Windows GUI. On Windows, I pulled the Reuters-21578 dataset, opened it in WinRAR and extracted the .tar.gz file. Then in my WSL I un-tarred it:
tar -xzvf reuters21578.tar.gz
The important files, which the tutorial uses, are the SGM files. From them we build a Reuters article set of our own, organised into the same dataset structure we saw above.
from bs4 import BeautifulSoup
import json

reuters_articles = []
print('loading...')
for i in range(22):
    if i < 10:
        i = f"0{i}"
    # load data from each SGM file (reut2-000.sgm to reut2-021.sgm)
    with open(f'./reut2-0{i}.sgm', 'r', encoding='latin-1') as file:
        soup = BeautifulSoup(file, 'html.parser')
    # pull the title and body out of each <reuters> element
    articles = []
    for reuters in soup.find_all('reuters'):
        title = reuters.title.string if reuters.title else ''
        body = reuters.body.string if reuters.body else ''
        #print('{title:"',title,'", body:"', body, '"}') # broken json, mind you
        articles.append({'title': title, 'body': body})
    reuters_articles.extend(articles)

TRAIN_PCT, VALID_PCT = 0.8, 0.1
print('splitting...')
# split data
train_articles = reuters_articles[:int(len(reuters_articles)*TRAIN_PCT)]
valid_articles = reuters_articles[int(len(reuters_articles)*TRAIN_PCT): int(len(reuters_articles)*(TRAIN_PCT + VALID_PCT))]
test_articles = reuters_articles[int(len(reuters_articles)*(TRAIN_PCT + VALID_PCT)):]

# fn to save articles as JSONL
def save_json(data, filename):
    with open(filename, 'w') as file:
        for article in data:
            file.write(json.dumps(article) + '\n')

print('writing...')
# save the data
save_json(train_articles, 'train.jsonl')
save_json(valid_articles, 'valid.jsonl')
save_json(test_articles, 'test.jsonl')
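A quick sanity check I’d suggest at this point (not something my original script did): count the lines in the JSONL files and make sure they roughly match an 80/10/10 split of the ~21,000 articles.
wc -l train.jsonl valid.jsonl test.jsonl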
I then saw the need to move over to a Jupyter notebook; it would make it easier to jump back and forth as I played with several models and datasets.
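In case Jupyter isn’t installed in your WSL environment yet (I don’t remember exactly when I installed mine), a user-level pip install is enough; it drops the jupyter-notebook script into ~/.local/bin, which is what the alias below points at:
pip3 install notebook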
vim ~/.bashrc
Then at the end of the file, add the line:
alias jupyter-notebook="~/.local/bin/jupyter-notebook --no-browser"
Then, after saving, back in the WSL shell:
source ~/.bashrc
jupyter-notebook
In your browser go to http://127.0.0.1:8888/tree
Using the data created by the load.py script above, I used the notebook cells below to create the dataset edg3/reuters_articles through Jupyter.
from datasets import load_dataset
from huggingface_hub import notebook_login
from bs4 import BeautifulSoup
import json
# Load
data_files = {'train': 'train.jsonl', 'validation': 'valid.jsonl', 'test': 'test.jsonl'}
dataset = load_dataset('json', data_files=data_files)
dataset
DatasetDict({
    train: Dataset({
        features: ['title', 'body'],
        num_rows: 17262
    })
    validation: Dataset({
        features: ['title', 'body'],
        num_rows: 2158
    })
    test: Dataset({
        features: ['title', 'body'],
        num_rows: 2158
    })
})
notebook_login()
VBox(children=(HTML(value='<center> <img\nsrc=http...
!git config --global credential.helper store
dataset.push_to_hub('reuters_articles')
Uploading the dataset shards: 0%| | 0/1 [00:00<?, ?it/s]
Creating parquet from Arrow format: 0%| | 0/18 [00:00<?, ?ba/s]
Uploading the dataset shards: 0%| | 0/1 [00:00<?, ?it/s]
Creating parquet from Arrow format: 0%| | 0/3 [00:00<?, ?ba/s]
Uploading the dataset shards: 0%| | 0/1 [00:00<?, ?it/s]
Creating parquet from Arrow format: 0%| | 0/3 [00:00<?, ?ba/s]
CommitInfo(commit_url='http...
As a note for those who don’t know: on Hugging Face, go to your profile, click Edit Profile, then open Access Tokens. A token from there is what the login above uses.
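If you’d rather not paste the token into the notebook widget, logging in from the shell with the Hugging Face CLI does the same job and stores the token for later; this is just an alternative, not what I used above:
huggingface-cli login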
Build Your First Tokenizer
No idea why I wanted to use an ‘s’ instead of a ‘z’. I just wasn’t paying enough attention.
The steps for creating a tokenizer with the dataset above are simple enough. As a note, the tutorial didn’t use 128k as the vocabulary size; I just happened to feel like it. The easy way to understand a tokenizer is that it turns text into tokens for the LLM to work through; grouping characters into common chunks gives the model a quicker, more compact mapping between the text and what it actually processes. That being said, here’s a small illustration of the idea, followed by the Jupyter steps I used:
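This is just illustrative, not a cell from my notebook; the exact pieces will differ, but you can see rarer words getting split into subword chunks marked with ##:
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('google-bert/bert-base-cased')
print(tok.tokenize('Zondervan cut its stake to 3.8 pct'))
# prints something like ['Z', '##ond', '##er', '##van', 'cut', 'its', 'stake', ...]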
!pip install --force-reinstall numpy==1.24.0
Defaulting to user installation because normal site-packages is not writeable Collecting numpy==1.24.0 Downloading numpy-1.24.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.3/17.3 MB 2.4 MB/s eta 0:00:0000:0100:01 Installing collected packages: numpy Attempting uninstall: numpy Found existing installation: numpy 1.26.4 Uninstalling numpy-1.26.4: Successfully uninstalled numpy-1.26.4 Successfully installed numpy-1.24.0
!pip3 install transformers datasets torch
Defaulting to user installation because normal site-packages is not writeable Requirement already satisfied: transformers in /home/edg3/.local/lib/python3.10/site-packages (4.39.3) Requirement already ...
from datasets import load_dataset
dataset = load_dataset('edg3/reuters_articles')
dataset
DatasetDict({
    train: Dataset({
        features: ['title', 'body'],
        num_rows: 17262
    })
    validation: Dataset({
        features: ['title', 'body'],
        num_rows: 2158
    })
    test: Dataset({
        features: ['title', 'body'],
        num_rows: 2158
    })
})
def create_full_article_col(eg):
    return {'full_article': f"TITLE:{eg['title']}\n\nBODY:{eg['body']}"}
dataset = dataset.map(create_full_article_col)
dataset
Map: 0%| | 0/17262 [00:00<?, ? examples/s]
Map: 0%| | 0/2158 [00:00<?, ? examples/s]
Map: 0%| | 0/2158 [00:00<?, ? examples/s]
DatasetDict({
    train: Dataset({
        features: ['title', 'body', 'full_article'],
        num_rows: 17262
    })
    validation: Dataset({
        features: ['title', 'body', 'full_article'],
        num_rows: 2158
    })
    test: Dataset({
        features: ['title', 'body', 'full_article'],
        num_rows: 2158
    })
})
# Build the tokeniser training corpus as batches of 1,000 articles
training_corpus = (
    dataset['train'][i:i+1000]['full_article']
    for i in range(0, len(dataset['train']), 1000)
)
from transformers import AutoTokenizer
old_tokeniser = AutoTokenizer.from_pretrained('google-bert/bert-base-cased') # start from google-bert/bert-base-cased, which has a decent English base
tokenizer_config.json: 0%| | 0.00/49.0 [00:00<?, ?B/s]
config.json: 0%| | 0.00/570 [00:00<?, ?B/s]
vocab.txt: 0%| | 0.00/213k [00:00<?, ?B/s]
tokenizer.json: 0%| | 0.00/436k [00:00<?, ?B/s]
example1 = dataset['test'][7]['full_article']
example1
'TITLE:GROUP CUTS ZONDERVAN<ZOND.O> STAKE TO 3.8 PCT\n\nBODY:One of several investor groups\nformerly associated with London investor Christopher Moran in\nhis unsuccessful bid to take over Zondervan Corp last year,\nsaid it cut its stake in the company to less than five pct.\n In a filing with the Securities and Exchange Commission,\nthe group, led by investors Lawrence Altschul and James\nApostolakis, said it cut its Zondervan stake to 157,500 shares,\nor 3.8 pct of the total, from 246,500 shares, or 5.9 pct.\n The group, which earlier this month said in an SEC filing\nit wanted join with other groups to maximize share values, said\nit sold 89,000 shares between June 9 and 15 for 1.5 mln dlrs.\n The group had joined with the Moran group, which last year\nassembled a combined 44 pct stake in Zondervan during its\nunsuccessful takeover try.\n Last month, the Moran group broke up and splintered into\nvarious factions. Moran himself withdrew from the takeover\neffort and last reported his personal stake at 4.8 pct.\n A group led by Miwok Capital Corp, a California broker with\na 10.6 pct stake, and another one led by Minneapolis\nstockbroker Jeffrey Wendel with 2.6 pct, have both made recent\nSEC filings saying they are seeking agreements with other\nparties who may want to seek control of the company.\n Reuter\n\x03'
#old_tokeniser.tokenize(example1)
# Retrain the google-bert tokeniser on the Reuters articles; it will take a little time
new_tokeniser = old_tokeniser.train_new_from_iterator(training_corpus, 128000) # vocab size of 128k
#new_tokeniser.tokenize(example1)
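A quick comparison worth doing here (a small sketch, not a cell from my original notebook): count how many tokens each tokeniser produces for the same article. The retrained one should generally need fewer tokens for Reuters-style text, since its vocabulary was built from it.
print(len(old_tokeniser.tokenize(example1)))  # stock bert-base-cased tokeniser
print(len(new_tokeniser.tokenize(example1)))  # tokeniser retrained on the Reuters corpus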
from huggingface_hub import notebook_login
notebook_login()
VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…
new_tokeniser.push_to_hub('bert-base-cased-reuters-tokenizer')
CommitInfo(commit_url='https...
from transformers import AutoTokenizer
tokeniser = AutoTokenizer.from_pretrained('edg3/bert-base-cased-reuters-tokenizer')
tokenizer_config.json: 0%| | 0.00/1.18k [00:00<?, ?B/s]
vocab.txt: 0%| | 0.00/599k [00:00<?, ?B/s]
tokenizer.json: 0%| | 0.00/1.85M [00:00<?, ?B/s]
special_tokens_map.json: 0%| | 0.00/125 [00:00<?, ?B/s]
tokeniser.tokenize(example1)
['TITLE', ':', 'GROUP', 'CUTS', 'ZONDERVAN', ... 'Reuter']
With that, we have edg3/bert-base-cased-reuters-tokenizer up on Hugging Face. As shown above, we can load it again with AutoTokenizer and use it to tokenize whatever text we like.
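For example, turning an article into model-ready input IDs with the new tokeniser is a one-liner; this is just a sketch, and the max_length of 128 is an arbitrary choice for illustration:
enc = tokeniser(example1, truncation=True, max_length=128, return_tensors='pt')
print(enc['input_ids'].shape)  # (1, token_count), capped at 128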
While I will experiment with it more in the future, the next fun experimental step is giving a model some additional training.
Fine-Tuning An LLM
As a starting point, I hid the warnings about future removals. What happens below: we load the tokenizer and Seq2Seq model facebook/bart-large-cnn, then grab the samsum dataset of casual conversations paired with summaries. You can see that, used without any extra training, the model answers with a near-verbatim snippet of the sample conversation.
At the comment # Training starts here, actually, we set up training_args and the trainer (the bits throwing the future-deprecation warnings I hid earlier). With num_train_epochs=1, like the tutorial, it took around an hour and wasn’t successful; the answers were incorrect. I jumped to num_train_epochs=3, and while that took almost 4 hours on my laptop (gaming was involved), the results came out noticeably better.
# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)
#!pip install transformers datasets evaluate transformers[torch]
# Since I'm running through WSL, can't go: Runtime > Change runtime type > T4-Gpu
# => The model I choose to train will be slow
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
'''facebook/bart-large-cnn'''
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn')
model = AutoModelForSeq2SeqLM.from_pretrained('facebook/bart-large-cnn')
from datasets import load_dataset
dataset = load_dataset('samsum')
dataset
DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})
sample = dataset['test'][2]['dialogue']
label = dataset['test'][2]['summary']
def gen_summary(input, llm):
    input_prompt = f"""
    Summarize the following conversation.
    {input}
    Summary:
    """
    # note: this tokenizes the global `sample` rather than the input_prompt built above,
    # so calling it with a different conversation still summarises `sample`
    input_ids = tokenizer(sample, return_tensors='pt')
    tokenizer_output = llm.generate(input_ids['input_ids'], min_length=20, max_length=200)
    output = tokenizer.decode(tokenizer_output[0], skip_special_tokens=True)
    return output
output = gen_summary(sample, llm=model)
print('Sample')
print(sample)
print('--------------------')
print('Model generated summary:')
print(output)
print('Correct summary:')
print(label)
Sample
Lenny: Babe, can you help me with something?
Bob: Sure, what's up?
Lenny: Which one should I pick?
Bob: Send me photos
Lenny: <file_photo>
Lenny: <file_photo>
Lenny: <file_photo>
Bob: I like the first ones best
Lenny: But I already have purple trousers. Does it make sense to have two pairs?
Bob: I have four black pairs :D :D
Lenny: yeah, but shouldn't I pick a different color?
Bob: what matters is what you'll give you the most outfit options
Lenny: So I guess I'll buy the first or the third pair then
Bob: Pick the best quality then
Lenny: ur right, thx
Bob: no prob :)
--------------------
Model generated summary:
Lenny: Babe, can you help me with something? Bob: Sure, what's up? Lenny: Which one should I pick?Bob: Send me photos.
Correct summary:
Lenny can't decide which trousers to buy. Bob advised Lenny on that topic. Lenny goes with Bob's advice to pick the trousers that are of best quality.
# It isn't "learning the dialogue"
# Prepare our data set for training, first
def tokenize_inputs(eg):
    start_prompt = "Summarize the following conversation.\n\n"
    end_prompt = "\n\nSummary: "
    prompt = [start_prompt + dialogue + end_prompt for dialogue in eg['dialogue']]
    eg['input_ids'] = tokenizer(prompt, padding='max_length', truncation=True, return_tensors='pt').input_ids
    eg['labels'] = tokenizer(eg['summary'], padding='max_length', truncation=True, return_tensors='pt').input_ids
    return eg
tokenizer.pad_token = tokenizer.eos_token
tokenized_datasets = dataset.map(tokenize_inputs, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id','dialogue','summary'])
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)
Map: 0%| | 0/819 [00:00<?, ? examples/s]
Filter: 0%| | 0/819 [00:00<?, ? examples/s]
# note, try shuffle and select
#print(tokenized_datasets)
#tokenized_datasets['train'][0].keys()
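The shuffle-and-select idea from the note above would look roughly like this (a sketch with arbitrary sample sizes; for the run described here I stuck with the modulo filter):
# keep a small random subset instead of every 100th row
small_train = tokenized_datasets['train'].shuffle(seed=42).select(range(100))
small_valid = tokenized_datasets['validation'].shuffle(seed=42).select(range(8))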
from huggingface_hub import notebook_login
notebook_login()
VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…
# Training starts here, actually
from transformers import TrainingArguments, Trainer
from accelerate import DataLoaderConfiguration, Accelerator
dataloader_config = DataLoaderConfiguration(
    dispatch_batches=None,
    split_batches=False,
    even_batches=True,
    use_seedable_sampler=True
)
training_args = TrainingArguments(
    output_dir='./bart-cnn-samsum-finetuned',       # local dir
    hub_model_id='edg3/bart-cnn-samsum-finetuned',  # identity on the Hub
    learning_rate=1e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    auto_find_batch_size=True,
    evaluation_strategy='epoch',
    logging_steps=10
)
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)
trainer.train()
Epoch | Training Loss | Validation Loss |
---|---|---|
1 | 0.098100 | 0.136044 |
2 | 0.100900 | 0.133031 |
3 | 0.095700 | 0.132983 |
TrainOutput(global_step=111, training_loss=0.24325278555756216, metrics={'train_runtime': 14105.8681, 'train_samples_per_second': 0.031, 'train_steps_per_second': 0.008, 'total_flos': 962194443337728.0, 'train_loss': 0.24325278555756216, 'epoch': 3.0})
# First training: 1h 08m odd
# - Acer Nitro 5
# - Windows 11
# - 11th Gen Intel(R) Core(TM) i5-11400H @ 2.70GHz (12 CPUs), ~2.7GHz
# - 32GB DDR4 2933MHz
# - through WSL Ubuntu 22.04.3
# - used Nvidia GeForce RTX 3050 Laptop GPU
# - Intel(R) UHD Graphics - unused, disabled as much as possible; the default setup that used it at the start was super slow by comparison
trainer.push_to_hub()
CommitInfo(commit_url='https...
# Test
sample1 = """
summarise this conversation:
Eric: MACHINE!
Rob: That''s so gr8!
Eric: I know! And shows how Americans see Russian ;)
Rob: And it''s really funny!
Eric: I know! I especially like the train part!
Rob: Hahaha! No one talks to the machine like that!
Eric: Is this his only stand-up?
Rob: Idk. I''ll check.
Eric: Sure.
Rob: Turns out no! There are some of his stand-ups on youtube.
Eric: Gr8! I''ll watch them now!
Rob: Me too!
Eric: MACHINE!
Rob: MACHINE! Eric: TTYL?
Rob: Sure :)
Summary:
"""
loaded_model = AutoModelForSeq2SeqLM.from_pretrained('edg3/bart-cnn-samsum-finetuned')
output = gen_summary(sample1, llm=loaded_model)
print('Sample')
print(sample1)
print('--------------------')
print('Model generated summary:')
print(output)
print('Correct summary:')
print(label)
Sample

summarise this conversation:
Eric: MACHINE!
Rob: That''s so gr8!
Eric: I know! And shows how Americans see Russian ;)
Rob: And it''s really funny!
Eric: I know! I especially like the train part!
Rob: Hahaha! No one talks to the machine like that!
Eric: Is this his only stand-up?
Rob: Idk. I''ll check.
Eric: Sure.
Rob: Turns out no! There are some of his stand-ups on youtube.
Eric: Gr8! I''ll watch them now!
Rob: Me too!
Eric: MACHINE!
Rob: MACHINE! Eric: TTYL?
Rob: Sure :)
Summary:

--------------------
Model generated summary:
Lenny asks Bob to help him pick out a pair of purple trousers. Bob says he has four black pairs. Lenny asks him to send him photos of them.
Correct summary:
Lenny can't decide which trousers to buy. Bob advised Lenny on that topic. Lenny goes with Bob's advice to pick the trousers that are of best quality.
# training 3 times worked
This is where I can share, with ease, that I messed things up a little; among other things, as the comment in gen_summary notes, the test above still ends up summarising the original sample rather than the new conversation. However, it’s working somewhat, and I figured I’d leave it there for today. One note on the training: I did play some games on the GPU during the run, which is likely the reason it took almost 4 hours. As another note, the model from the first single-epoch training, also on Hugging Face, gave far more incorrect results.
Then, with that, my first trained model is up on Hugging Face: edg3/bart-cnn-samsum-finetuned. It got 4 downloads, which I find amusing, since they came before I finalised the second round of training.
Afterthoughts
There will always be tons more to learn and do with LLMs and AI these days. I recommend everyone interact and experiment with the field as a whole; things will only get broader, better, and stronger in the foreseeable future.