As part of my lunchtime LLM research for the day, I saw a Reddit comment mention a small LLM that does speedy OCR, which I wanted to try out: AIDC-AI/Ovis2-1B.
I only hit one small delay today: I had to reinstall all the usual pip requirements on my machine, as I had lost them recently. Most of the work went into choosing the prompt for the output data, and it still needs tweaking to capture the individual item costs.
It's worth noting that my original OCR approach was slow, taking over 14 minutes just for the image processing. With no downscaling of the random till slip image I found, this model only takes around 7 to 8 seconds. That already feels good enough to automate my own expense tracking, just for fun!
OCR Experiment
Without sharing all the details, as they’re available on my repo, the simple slip image I used was:

And with only minor changes to the sample code from the Hugging Face Ovis2-1B repo, it was quick to test out with the Jupyter Notebook sections below. Note that, as is my habit, I keep my own offline copy of the model, as I don't want to wait for it to download each time I'd like to reuse it.
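For reference, a minimal sketch of how such an offline copy can be grabbed, assuming the huggingface_hub package is installed; the local folder name Ovis2-1B is just my own choice:
from huggingface_hub import snapshot_download
# one-off download of the model files into a local folder for offline reuse
snapshot_download(repo_id="AIDC-AI/Ovis2-1B", local_dir="Ovis2-1B")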
import torch
from PIL import Image
from transformers import AutoModelForCausalLM
# load model
model = AutoModelForCausalLM.from_pretrained("Ovis2-1B",
                                             torch_dtype=torch.bfloat16,
                                             multimodal_max_length=32768,
                                             trust_remote_code=True).cuda()
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()
# single-image input
image_path = 'example.jpg'
images = [Image.open(image_path)]
max_partition = 9
text = '''
Give the json data format of the slip text inside the image;
give the name of the store the logo shows, the name of the branch separate,
a list of items purchased that has the cost to the right of the row,
and the final total the slip shows.'''
query = f'<image>\n{text}'
# format conversation
prompt, input_ids, pixel_values = model.preprocess_inputs(query, images, max_partition=max_partition)
attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
input_ids = input_ids.unsqueeze(0).to(device=model.device)
attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
if pixel_values is not None:
    pixel_values = pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)
pixel_values = [pixel_values]
# generate output
with torch.inference_mode():
    gen_kwargs = dict(
        max_new_tokens=1024,
        do_sample=False,
        top_p=None,
        top_k=None,
        temperature=None,
        repetition_penalty=None,
        eos_token_id=model.generation_config.eos_token_id,
        pad_token_id=text_tokenizer.pad_token_id,
        use_cache=True
    )
    output_ids = model.generate(input_ids, pixel_values=pixel_values, attention_mask=attention_mask, **gen_kwargs)[0]
    output = text_tokenizer.decode(output_ids, skip_special_tokens=True)
    print(f'{output}')
{"storeName": "Pick n Pay", "branchName": "Mini Market Oakdene", "itemsPurchased": ["SOUP PACK", "ALMONDS SLICED 100GR", "5% Pensioners Discount5.0%"], "total": 54.13}Sample Output
It took around 1 minute 49 seconds to load the model, with my WSL instance using only 2.491 GB of RAM. The OCR then took 6 seconds to run on the image, which itself loaded in a couple of milliseconds. I'm happy with it, so I will soon be making an API I can run to convert my till slips into data for my personal budget tracking software. It will make things easier!
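As a rough idea of where that could head, here is a minimal sketch of an endpoint, assuming FastAPI; the run_ocr helper is hypothetical and stands in for the notebook code above, and parse_slip is the sketch from earlier:
from fastapi import FastAPI, UploadFile
from PIL import Image

app = FastAPI()

@app.post('/slip')
async def slip_to_json(file: UploadFile):
    # read the uploaded till slip and run the same OCR prompt as in the notebook
    image = Image.open(file.file)
    output = run_ocr(image)  # hypothetical wrapper around the notebook code above
    return parse_slip(output)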
I’ll decide how I want to do that in the future.