For my next LLM experiment, I decided to dabble in OCR: converting images to text quickly. Over the weekend I spent a little time experimenting with MiniCPM-Llama3-V-2_5-int4. The idea is to pull the details from my purchase slips and add them to the personal finance system I built. It has tons of other uses as well, though.

As a note, everything shared below assumes you will adjust the prompt text to get your own desired result. I also assume you already know how to experiment with LLMs, since that has been shared a lot, so you can easily find out more about how to use them.

The idea for today is to share how easy it is to build your own helpful assistants. This time it’s a glance at OCR as a whole, used to get descriptive, structured answers from images. I tried to keep the slips private, and I’m still adjusting sizes as well as prompts, so treat this as exploratory sharing.

The Experiments

First off, if you’ve already set up and used Jupyter Notebook before, make sure you have the correct libraries installed in your environment. Then clone the MiniCPM-Llama3 repo. I used the smaller int4 version for a little speed.

!pip install Pillow==10.1.0
!pip install torch==2.1.2
!pip install torchvision==0.16.2
!pip install transformers==4.40.0
!pip install sentencepiece==0.1.99
!pip install accelerate==0.30.1
!pip install bitsandbytes==0.43.1
!git clone https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4

Note, the idea behind cloning is to keep the full 11.4 GB repository on your machine. Only the ~6 GB model is loaded into memory from it, but that is still ~6 GB to load every time you want to start it up again.

Then we load MiniCPM Llama3:

# Load model
import torch
from PIL import Image, ExifTags
from transformers import AutoModel, AutoTokenizer
import time

model = AutoModel.from_pretrained('MiniCPM-Llama3-V-2_5-int4', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('MiniCPM-Llama3-V-2_5-int4', trust_remote_code=True)
model.eval()

For ease, I created a function that takes a file name and a prompt. While experimenting I liked to watch some additional information, so it prints the original width and height first. Some of my test images loaded at -90°, so if the width is greater than the height the image is rotated 90°. Unless fixed_prop is set, the image is then scaled to 600px wide, and the proportion and new size are printed. Finally, the MiniCPM-Llama3 model is given the question for the image, and the result comes out along with the time taken.

def ProcessImage(img, question, fixed_prop=False):
    """Ask the model a question about one image, then print the answer and the time taken."""
    start_time = time.time()
    image = Image.open(img).convert('RGB')
    print('w:', image.width, ',h:', image.height)
    # Slips are portrait, so a wider-than-tall image is assumed to have loaded sideways.
    if image.width > image.height:
        image = image.transpose(Image.ROTATE_90)
    if not fixed_prop:
        # Scale down to 600px wide, keeping the aspect ratio.
        proportion = image.width / 600
        print(proportion)
        # round() instead of int() so float error cannot turn 600 into 599.
        smaller_size = (round(image.width / proportion), round(image.height / proportion))
        print(smaller_size)
        image = image.resize(smaller_size)
    msgs = [{'role': 'user', 'content': question}]
    res = model.chat(
        image=image,
        msgs=msgs,
        tokenizer=tokenizer,
        sampling=True,
        temperature=0.7
    )
    print(res)
    # Report the elapsed wall-clock time for this image.
    end_time = time.time()
    time_diff = end_time - start_time
    minutes, seconds = divmod(time_diff, 60)
    print("[{0} min, {1} sec]".format(int(minutes), int(seconds)))
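The size calculation inside the function can also be looked at on its own. Here is a minimal sketch of just that arithmetic, with no image library involved. The names scaled_size, is_landscape, and target_width are hypothetical, chosen for illustration, and the height is computed directly from the target width rather than through the intermediate proportion.

```python
def is_landscape(width, height):
    """True when the image is wider than it is tall (the rotation trigger)."""
    return width > height


def scaled_size(width, height, target_width=600):
    """Return (new_width, new_height) with the width fixed at target_width.

    The aspect ratio is kept, and the height is rounded to the nearest pixel.
    """
    if width <= 0 or height <= 0 or target_width <= 0:
        raise ValueError("width, height and target_width must be positive")
    new_height = round(height * target_width / width)
    return (target_width, new_height)


if __name__ == "__main__":
    # The largest photo size mentioned in this post, scaled to 600px wide.
    print(scaled_size(1884, 4080))
```

Passing target_width=300 gives the smaller size used in the earlier runs, so both settings can be compared without touching the rest of the function.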

The reason for experimenting with this idea is simple: I originally wanted my laptop to automate inputs for my personal budgeting system. Yes, look at a slip, get all the relevant information from it, and add it to my budgeting automatically. I’m somewhat undecided, since the fastest processing after loading for X min Y seconds, was over 3 minutes, at full size photos. Scaling them down to 300px wide and highlighting the dark areas was quicker, but gave fewer accurate results. I chose 600px width and re-ran them all for what I share here today.

This is ignoring cases where the detail was too choppy. I used crumpled slips from my purchases, and it hallucinated a little too often. Yes, for a slip with 1 item on it, a Capuchino Grande, it shared random milk, eggs, and more, with random prices. It isn’t foolproof enough yet, but it’s already useful.

The prompt that got me the most accurate answers in JSON format is the one used for this first example:

001a.jpg

As you can tell, it’s a simple slip for a simple purchase. While some images have been edited a little for this post, the originals were used for the tests I ran.

ProcessImage(
  "001b.jpg",
  "convert this image to a json structure answer, summarise all content " +
  "that is either a store_name, date_time, total, and items with name, " +
  "quantity, and price, ignoring anything with names 'cash' or 'credit' " +
  "in the image it's a till slip with the store name which doesn't need " +
  "ownership details, the payment amount, and a json array with each " +
  "line item purchased with focus on name of item, the quantity received" +
  ", and the price of the listed item, while ignoring items that are " +
  "just additional business info as well, or have price '0.00'"
)

This prompt took 1 min 55 seconds and gave the result:

{
  "store_name": "Cape Union Mart",
  "date_time": "1...7",
  "total": 399.00,
  "items": [
    {
      "name": "Customer card",
      "quantity": 1,
      "price": 0.00
    },
    {
      "name": "Chair Blue STD",
      "quantity": 1,
      "price": 399.00
    }
  ]
}

As you can tell, even though I specified not to include items priced at R0.00, it still added “Customer card” to the items for the sale. Never mind how I segmented a portion of the photo for this example. At 300px wide, 4 out of 5 responses were decent, with minor issues. Moving to 600px wide, only 1 out of 5 results was accurate and correct for my till slips, taking around 4 minutes per slip. For instance, I never knew Lay’s Chips cost 120g, or that 18.99 is a date, never mind the euro costs.
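Since the model does not always obey the instruction to skip R0.00 items, one option is to clean up the JSON afterwards. The sketch below is only an illustration. The function name clean_slip and the total_matches_items flag are hypothetical, and it assumes that each item’s price is the full line amount, so the prices should add up to the total.

```python
import json


def clean_slip(raw_json):
    """Parse the model's JSON answer, drop zero-priced items, and sanity-check the total."""
    slip = json.loads(raw_json)
    # Keep only line items with a non-zero price (drops things like "Customer card").
    kept = [item for item in slip.get("items", [])
            if float(item.get("price", 0)) != 0.0]
    slip["items"] = kept
    # Assumption: "price" is the line amount, so the kept prices should sum to the total.
    items_sum = sum(float(item["price"]) for item in kept)
    slip["total_matches_items"] = abs(items_sum - float(slip.get("total", 0))) < 0.005
    return slip


if __name__ == "__main__":
    raw = """
    {"store_name": "Cape Union Mart", "date_time": "1...7", "total": 399.00,
     "items": [{"name": "Customer card", "quantity": 1, "price": 0.00},
               {"name": "Chair Blue STD", "quantity": 1, "price": 399.00}]}
    """
    print(json.dumps(clean_slip(raw), indent=2))
```

A slip where total_matches_items comes back False is a cheap signal that the answer may contain hallucinated items or prices and is worth a second look before it goes into a budget.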

Suffice it to say that with a larger image it can hallucinate way more. The 1 out of 5 that was disastrously inaccurate at 300px might be down to the dark tabletop background in the photo I took; the rest were taken with a light, whitened tabletop behind the slips. I’ll need to test with a lighter background in the photos, as well as improve the dark highlighting I originally wanted to use to emphasise the text.

Next up, to make this more interesting, I decided to get some reviews on images I’ve already shown on my blog. Please note, I left all responses as is, and there are still a few hallucinations I’ve noticed in the reviews. The 300px test was similar to the 600px results, so the 600px results are what I share below. The prompt stayed the same for each, just the image changes:

ProcessImage(
  "e001.jpg",
  "give the most adventurous magiazine review style paragraph describing " +
  "the scene, pay attention to the layout of the image, the hue and " +
  "saturation, as well as how things line up. Score the image on a scale " +
  "of 1 to 10 with 1 being lowest, and 10 being the best score, on the " +
  "quality of the photo itself."
)
e001.jpg
The image presents a tranquil scene of a rustic garden area, likely part of a café or restaurant named "Bank Be Sturwer" and "Cocoblo". The wooden signpost, with its weathered appearance, stands as a welcoming beacon amidst the lush greenery. The arrangement of the signs is such that they are visible to someone approaching from the left side of the frame, inviting them into this serene setting. The overcast sky adds a soft light to the scene, highlighting the natural colors of the foliage and the earthy tones of the wooden structures. This image scores an 8 on the quality scale; while it captures the essence of the location well, the perspective is slightly tilted, and the image could benefit from a more centered composition for optimal clarity.
MiniCPM-Llama3-V-2_5-int4 review of e001.jpg
e002.jpg
The image captures a moment of serene beauty in the wild. A giraffe, with its distinctive spotted coat and long neck, stands majestically against the backdrop of a clear blue sky. The animal's head is lowered, perhaps grazing on the lush greenery that fills the background. The scene is framed by a wire fence, adding an element of rustic charm to the composition. The colors are vivid and rich, with the giraffe's brown spots contrasting sharply against the green foliage. The positioning of the giraffe in the foreground gives depth to the image, while the clear sky provides a sense of vastness. Overall, this photo would score an 8 on a scale of 1 to 10 for its quality, as it effectively captures the essence of wildlife in a visually appealing manner.
MiniCPM-Llama3-V-2_5-int4 review of e002.jpg
e003.jpg
In the verdant heart of a woodland, where sunlight filters through a tapestry of green leaves, a solitary figure emerges. A young moose, its coat as dark as a night sky, peers out from behind the sturdy trunk of an ancient tree. The moose's large ears and soft eyes convey a sense of youthful curiosity and wonder. The surrounding foliage, a mix of vibrant greens and earthy browns, provides a natural backdrop that underscores the tranquility of this wilderness moment. The composition is balanced, with the moose centrally placed and the tree trunk serving as a vertical anchor in the frame. The lighting suggests it's a bright day, casting shadows that add depth to the scene. This image would score a 9 on the quality scale, with the only deduction being for the slightly overexposed areas that lose some detail.
MiniCPM-Llama3-V-2_5-int4 review of e003.jpg
e004.jpg
The image presents a striking tableau of avian life, set against the stark backdrop of a winter's day. Two birds, their feathers a stark contrast of white and black, are perched in a tree that is devoid of leaves, suggesting a season of dormancy. The birds' positioning, one atop the other, creates a sense of depth and perspective. Their beaks are open wide, perhaps in mid-song or call, adding a dynamic element to the otherwise still scene. The sky, visible through the bare branches, is clear and bright, indicating good weather conditions. The photo composition is well-balanced, with the birds serving as the focal point amidst the intricate network of branches. The color palette is limited yet effective, relying on natural tones to convey the image's mood. Considering the clarity, composition, and lighting, this image would score an 8 out of 10 for its quality, capturing a moment of wildlife in a visually appealing manner.
MiniCPM-Llama3-V-2_5-int4 review of e004.jpg
e005.jpg
The image captures a moment of repose in the wild, where a leopard's majestic form is sprawled out on a bed of dry grass. The animal's spotted coat stands out vividly against the muted tones of its surroundings. The perspective of the photo is from above and slightly to one side, providing a comprehensive view of the leopard's anatomy and the intricate pattern of its fur. The color palette is rich yet earthy, with the leopard's spots providing a stark contrast to the otherwise subdued hues. The leopard's head is turned away from the camera, adding an air of mystery to the scene. This image would score a 9 out of 10 for its quality, given the clarity of detail, the depth of field, and the overall composition that draws the viewer into the narrative of wildlife photography.
MiniCPM-Llama3-V-2_5-int4 review of e005.jpg
e006.jpg
In the vast expanse of a bright, azure sky, streaked with wisps of white cloud, the landscape unfolds in a tableau of nature's tranquility. A zebra, the star of this pastoral scene, stands in profile, its gaze directed towards the right, as if contemplating the serenity around it. The grassy field, a canvas of green hues, provides a soft contrast to the zebra's distinctive black and white stripes. Scattered trees dot the landscape, their leaves a vibrant echo of the field's verdant tones. The composition is balanced, with the zebra positioned on the left, drawing the viewer's eye across the frame. The image captures the essence of a peaceful moment in the wild, scored at 9 for its quality, with the slight misalignment of the horizon subtly detracting from an otherwise perfect photograph.
MiniCPM-Llama3-V-2_5-int4 review of e006.jpg
e007.jpg
The scene unfolds like a page from an adventurous tale, where the wild and serene coexist. A majestic cow stands tall in the foreground, its dark form contrasting against the vibrant green of the grassy field. The cow's stance is one of quiet dominance, as if it were the guardian of this pastoral landscape. In the background, a smaller creature, possibly a calf, adds to the narrative of life and growth. The colors are rich and saturated, with the greens of the trees and grass standing out against the earthy tones of the dirt road. The composition is balanced, with the cow drawing the eye into the image, while the road and trees lead the viewer's gaze deeper into the scene. The sky above is a canvas of blue, dotted with clouds that add depth and dimension to the setting. This image scores an 8 on the quality scale, with minor imperfections like the thin wire line adding character to the scene.
MiniCPM-Llama3-V-2_5-int4 review of e007.jpg

As you can tell, my prompt might have been asking for a little too much detail. The images above took around 3 minutes 15 seconds each, on average, to process for these reviews.
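To compare settings across runs, it helps to collect the per-image times instead of only printing them. This is a small sketch with hypothetical helper names (format_duration, average_duration), and the durations in the usage example are made-up placeholder values, not measurements from these tests.

```python
def format_duration(seconds):
    """Format a duration in the same "[M min, S sec]" style the function prints."""
    minutes, secs = divmod(int(seconds), 60)
    return "[{0} min, {1} sec]".format(minutes, secs)


def average_duration(durations):
    """Return the mean of a list of durations given in seconds."""
    if not durations:
        raise ValueError("no durations recorded")
    return sum(durations) / len(durations)


if __name__ == "__main__":
    # Placeholder values in seconds, for illustration only.
    recorded = [180, 210]
    print(format_duration(average_duration(recorded)))
```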

With fixed_prop set to True, so that the original photo sizes were kept, the slip JSON output came to 1.5 out of 5 correct results, at around 4 minutes each. Note, this means feeding in images of up to 1884x4080px, rotated where needed, and it shared things like me ordering TEA at R3.70 at CAFE NINJA. The scaled-down images had higher reliability and accuracy.

With all my tests being from crumpled paper slips, I’ll soon run them again with the images scaled down to 300px wide, with highlighted dark text on white paper, and better rotation. I’m sure I’ve missed adjustments that could help a ton, but it’s clear to me that giving the model thousands more pixels to process can lead to tons of hallucinations.