
As part of getting back into the right headspace, I needed to decide on decent side projects to keep myself busy. After some looking and debating, I'm moving through the edX TinyML course at a steady pace, and I picked up the Humble Bundle courses on GPT in Engineering, API, and programming.

Sure, I felt like kickstarting my game development hobby as well, which the bundle also touches on, but the starting focus is to expand my knowledge.

Introducing edg3/GPT-systems, a repository to track my experiments, and their results, in the AI/ML space.

I must admit, a bunch of my experiments are still missing from this project, but I will definitely add them over time as I play around with new ideas in this space of computing. This post just gives a broad overview of what I have up already, plus notes on ideas I am working through.

00 General Experiments

First, in 00-llama.ipynb, you can see how to set up llama.cpp, load TheBloke/Llama-2-13B-chat-GGML, and get an answer to a question. I chose this 9.7 GB model since I don't feel it helps to go up to the 16 GB+ models when showing others little snippets. With the prompt "Write a linear regression in python" it gave the following response:

To write a linear regression in Python, you can use the scikit-learn library. Here is an example of how to do this:
```
from sklearn.linear_model import LinearRegression
import pandas as pd

# Load your dataset into a Pandas DataFrame
df = pd.read_csv('your_data.csv')

# Create a linear regression object and fit the data
reg = LinearRegression().fit(df[['x1', 'x2']], df['y'])

# Print the coefficients
print(reg.coef_)

# Predict on new data
new_data = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6]})
preds = reg.predict(new_data)
```
This code will load your dataset into a Pandas DataFrame, create a linear regression object and fit the data using the `fit()` method. It will then print the coefficients of the linear regression and use those coefficients to make predictions on new data.

Please note that this is just an example, you may need to modify it to suit your specific needs
TheBloke/Llama-2-13B-chat-GGML

Suffice it to say, I ignore the fact that some code answers need additional information, for instance a pip install, when sharing them here on my blog, since what I share is meant to stay broad public knowledge. So, my apologies if what I ask, and the results I show off, aren't in your area of expertise.

In 01-extract-paragraphs-in-format.ipynb I took a free public PDF and converted it to a JSON structure I could use to train an ML model. My actual training didn't use the multimeter documentation from this experiment, but as this shows, I still need to decide what kind of language data I want to train on in the future.
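Notebook details aside, the core of that conversion can be sketched like this (a minimal sketch, assuming the text has already been extracted from the PDF, e.g. with pypdf; the helper name and JSON shape are my own for illustration, not the notebook's):

```python
import json

def paragraphs_to_jsonl(text: str) -> str:
    # split extracted PDF text on blank lines and emit one JSON object per paragraph
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return "\n".join(json.dumps({"text": p}) for p in paragraphs)

sample = "Measuring voltage.\n\nSet the dial to V.\n\n"
print(paragraphs_to_jsonl(sample))
```

Each output line is then a self-contained training example, which is what makes the .jsonl format convenient for the later scripts.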

Then, in 02-scikit-prep-training-data.py, I load the RipTutorial C# ebook and organise the data into train_lines, test_lines, and val_lines. Once that is ready, I load GPT2 and run a training loop that evaluates on the validation set in val_lines, then evaluate on the test set at the end. The longest part was the training loop: roughly 4 to 5 hours on my laptop with its GeForce 3050 Laptop GPU. Finally, it writes a sample.safetensor to test with.
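For context, the 574/180/144 line counts in the console output can be reproduced with scikit-learn's train_test_split (a sketch; the script's exact parameters and random state may differ):

```python
from sklearn.model_selection import train_test_split

lines = [f"line {i}" for i in range(898)]  # stand-in for the 898 extracted lines

# hold out 20% for the test set, then 20% of the remainder for validation
train_lines, test_lines = train_test_split(lines, test_size=0.2, random_state=42)
train_lines, val_lines = train_test_split(train_lines, test_size=0.2, random_state=42)

print(len(train_lines), len(test_lines), len(val_lines))  # 574 180 144
```

Splitting validation out of the training portion, rather than the full set, is what gives the slightly uneven 64/20/16 ratio.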

You will see that I believe training needs to go way further than the 5 epochs I used. The final console output at the end of training was:

```
Loading PDF
Printing  898  lines to sampledata.jsonl
Number of lines in training set: 574
Number of lines in test set: 180
Number of lines in validation set: 144
Loading GPT2
- loading gpt 2 head model
- loading gpt 2 tokenizer model
- done loading gpt 2
Specifying params
Training model for  5  epochs, using AdamW
    [at Epoch  0 / 5 ]
Epoch 1/5 - Average Loss: 2.2074966033299765
    [at Epoch  1 / 5 ]
Epoch 2/5 - Average Loss: 1.805796856681506
    [at Epoch  2 / 5 ]
Epoch 3/5 - Average Loss: 1.723598135014375
    [at Epoch  3 / 5 ]
Epoch 4/5 - Average Loss: 1.6694742176267836
    [at Epoch  4 / 5 ]
Epoch 5/5 - Average Loss: 1.6284758713510301
Validation Loss: 1.5808773504363165
Test Loss: 1.6046571938887886
```

02-scikit-prep-training-data.py

With the losses this high, I didn't expect the model to give super accurate answers. As you can see, the initial 02a-test.py shows clearly that it needs way more training time:

```
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

Generated response: Make a data type in latest C# that has get and set for different data types for lost and found objects.

public class DataType { public string lost; }

public class FindDataType() : string { get; set; }

public class FindDataType() : string { get; set; }
```

02a-test.py

That isn't really an issue; the point is to experiment with training one's own model from scratch. Just remember that it will need far more training time to become more accurate, and that training only helps up to a point.
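"Up to a point" can be made concrete with a simple early-stopping check on the validation loss (a sketch; this helper is hypothetical, not part of my scripts):

```python
def should_stop(val_losses, patience=2):
    # stop once the best validation loss is `patience` or more epochs old
    best_epoch = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience

print(should_stop([2.21, 1.81, 1.72, 1.67, 1.63]))  # still improving -> False
print(should_stop([2.21, 1.81, 1.72, 1.74, 1.75]))  # two worse epochs -> True
```

Running a check like this after each epoch lets training continue well past 5 epochs while still bailing out once the validation loss stops improving.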

The other thing to consider is that there are plenty of models better than GPT2; I know that. All this intends to show is how easily one can start experimenting on one's own.

01 OCR Experiment

This sample might be the most advantageous for others. The plan is simple: convert my own till slips to JSON data that my software can use directly. Using MiniCPM-Llama3-V-2_5-int4, it takes a simple descriptive prompt:

```
ProcessImage(
  "000sample.jpg",
  "convert this image to a json structure answer, summarise all content " +
  "that is either a store_name, date_time, total, and items with name, " +
  "quantity, and price, ignoring anything with names 'cash' or 'credit' " +
  "in the image it's a till slip with the store name which doesn't need " +
  "ownership details, the payment amount, and a json array with each " +
  "line item purchased with focus on name of item, the quantity received" +
  ", and the price of the listed item, while ignoring items that are " +
  "just additional business info as well, or have price '0.00'"
)
```

To give a simple answer:

```
{
  "store_name": "KFC South Africa Copper Moon Trading As KFC Rosebank",
  "date_time": "Nov'01 18:12:12",
  "total": 49.90,
  "items": [
    {
      "name": "1 Chicken LunchBox",
      "quantity": 1,
      "price": 49.90
    },
    {
      "name": "1 Small Lbox M1 S/SF",
      "quantity": 1,
      "price": 0.00
    }
  ]
}
```

My only reason for not using it for fun yet is the time taken: I'm not happy that it took 14 minutes and 44 seconds to process this one slip. I experimented with image scales, and with making the text darker, but it definitely needs work.
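For reference, the kind of preprocessing I tried looks roughly like this (a sketch assuming Pillow; the function name and parameters are mine for illustration, not from the repo):

```python
from PIL import Image, ImageEnhance, ImageOps

def preprocess_slip(img, max_width=1024, contrast=1.8):
    # grayscale, boost contrast (the "darker text" idea), and cap the width
    img = ImageOps.grayscale(img)
    img = ImageEnhance.Contrast(img).enhance(contrast)
    if img.width > max_width:
        scale = max_width / img.width
        img = img.resize((max_width, int(img.height * scale)))
    return img

slip = Image.new("RGB", (2048, 4096), "white")  # stand-in for 000sample.jpg
print(preprocess_slip(slip).size)  # (1024, 2048)
```

Smaller, higher-contrast images cut the vision model's input size, which is the main lever I've found so far for the runtime.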

Given that one can use dynamic objects in C#, it should fit well into my fun personal MAUI project: send an image to an API and get the response in this format. I'm just not comfy with how long I'd have to wait for the response. I'm thinking I could still use it, just not as quickly as I initially wanted.
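On the receiving side, parsing that structure is straightforward; a Python sketch (the C#/MAUI version would deserialise into a dynamic or a record type instead, and the sample values here are abbreviated):

```python
import json
from dataclasses import dataclass

@dataclass
class LineItem:
    name: str
    quantity: int
    price: float

def parse_slip(raw: str):
    # pull the top-level fields and turn each item into a typed record
    data = json.loads(raw)
    items = [LineItem(**item) for item in data["items"]]
    return data["store_name"], data["total"], items

raw = json.dumps({
    "store_name": "KFC Rosebank",
    "date_time": "Nov'01 18:12:12",
    "total": 49.90,
    "items": [{"name": "1 Chicken LunchBox", "quantity": 1, "price": 49.90}],
})
store, total, items = parse_slip(raw)
print(store, total, items[0].name)
```

Keeping the prompt's field names stable is what makes this safe: the parsing side only works if the model keeps answering in the same shape.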

02 Course

This is not content from edX, or the GPT bundle mentioned above, but rather the free, public course from Hugging Face. You can start at step 1 here. The idea is that, when I have time, I will slowly add snippets from its sections: first to show my progress, then to show the experiments that come from it as well.

It's always better to broaden one's knowledge in this massive field that's taking the world by storm at the moment. The course is super descriptive and explains it all well. Take a look if you're interested.

03 WebApi

Here, I'm slowly working towards an AI API for myself, so that I can send an API request to a Raspberry Pi 5 running my own LLM when at home. There's no grand plan; it's just so that when I have questions I'd like to look up, I can ask an LLM instead of searching for them.

This would include an app on my phone that tracks what I ask and can even suggest answers I already got for similar questions when I'm remote. More along the lines of a smart way to get all the ducks in my head in a row when I think of interesting things.
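The "suggest answers for similar questions" part could start as simply as fuzzy matching against past questions (a sketch using the standard library's difflib; the history entries are made up):

```python
import difflib

history = {
    "how do i train gpt2 on my own text?": "see 02-scikit-prep-training-data.py",
    "which model runs on a raspberry pi 5?": "a small quantised gguf model",
}

def suggest(question: str, cutoff: float = 0.6):
    # return the stored answer for the closest past question, if any is close enough
    match = difflib.get_close_matches(question.lower(), list(history), n=1, cutoff=cutoff)
    return history[match[0]] if match else None

print(suggest("How do I train GPT2 on my own text"))
```

A proper version would use embeddings rather than string similarity, but this is enough to avoid re-asking the LLM the same thing twice.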

At present, the issue is that I'm struggling to load the gguf. I have other Python scripts that work; it's just that the revamped version I'm thinking of has this small blocking point.