Quite unusually for me, this essay started its life as a scribble in a notebook rather than something I typed into a markdown editor. A few weeks ago, Tiago Forte made a video suggesting that people can use GPT-4 to capture their handwritten notes digitally. I've been looking for a "smart" OCR that can process my terribly scratchy, spidery handwriting for many years but none have quite cut the mustard. I thought, why not give it a go? To my absolute surprise, GPT did a reasonable job of parsing my scrawling and capturing text. I was seriously impressed.

Handwriting OCR has now gone from a fun toy and conversion piece to something I can actually, seriously use in my workflow. So then, Is GPT the best option I've got for this task or are there any FOSS tools and models that I could use instead?

An Experiment

In my experiment I use the following system prompt with each model:

Please transcribe these handwritten notes for me. Please use markdown formatting. Do not apply line wrapping to any paragraphs. Do try to capture headings and subheadings as well as any formatting such as bold or italic. Omit any text that has been scribbled out. Try your best to understand the writing and produce a first draft. If anything is unclear, follow up with questions at the end.

and then proceed to send each of them this photo of my notes:

A page in a notebook with handwritten text which is the first draft of this very post. It starts
My lovely scrawly hand writing displa

The main flaw in my experiment is its scale. Ideally I would use a large number of handwritten documents with different pens, ink colours and photographed under different lighting conditions to see how this affects the model output. However, I didn't have hours and hours to spare manually transcribing my own notes.

However, I would say that the photo is pretty representative of a page of notes in my typical handwriting. It is written in my bujo/notebook that I carry around with me and my fountain pen that I use most of the time. I wouldn't expect results to vary wildly for me specifically. However dear reader: your mileage may vary.

What I'm Looking For

I want to understand which model is best able to understand my handwriting without misinterpreting it. Bonus points for also providing sensible styling/formatting and being robust to slightly noisy images (e.g. the next page in the notebook being slightly visible in the image).

Ideal Outcomes

What I hope to find is an open source/local model that I could run which would provide high quality handwriting OCR that I can run on my own computer. This would give me peace of mind about not having to send my innermost thoughts to OpenAI for transcription and would be a budget option. Also, from an energy and water usage point of view, it would be great to be able to do this stuff with a small model that can run on a single consumer-grade GPU like I have in my computer.


GPT-4V

GPT-4V did the best job of transcribing my notes, making the fewest mistakes interpretting my handwriting and almost perfectly following my instructions regarding formatting. It did make a few errors. For example, renaming Tiago Forte as Rafe Furst and referring to LLAVA as "Clara".

Read GPT-4 Full Response

Gemini

Google Gemini did a pretty poor job compared to GPT-4V. It ignored my instructions not to wrap lines in markdown and it also attempted to read the first word of each line from the second page of my notes which is visible in the image. It made a complete hash of my work and also misread a large chunk of the words. I used the model in chat mode and even when I asked it to "try again but this time don't try to read the next page", it ignored me and spat out more or less the same thing.

Read Full Gemini Output

Claude Opus

The quality of Claude's response was much closer to that of GPT-4V's. There were a couple of very prominent spelling/misinterpretations that made me immediately suspicious in general the output is quite faithful to the writing. However, Claude did say it would provide markdown output and then completely fail to do that.

Read full Claude Output

Local Model: LLAVA

One of the biggest advances in Local LLMs in the last 12 months has been LLAVA which provides reasonable performance on a number of multi-modal benchmarks. Earlier versions of LLAVA were trained using images and by having GPT-4 expand the text descriptors. These images that were originally provided by humans. However, they didn't yet have access to opt-4.5k so training was done on descriptors that were often second-hand and pretty surprising that this worked at all tbh - but it did!

LLAVA 1.6 is trained on data with more in-depth annotations including LAION GPT-4V which, incidentally, I can't find any formal evaluations of. This seems to improve the model's ability to reason about the contents of images. They also train using TextVQA which aims to train models to be able to reason about and perform question answering for text documents. It is likely that elements of both of these datasets help improve the model's OCR ability.

I set up LLAVA to run locally using llama.cpp and tried running some experiments to see how well it picked up my handwriting (instructions here). Although LLAVA 1.6 fared a lot better than earlier versions, it didn't do an amazingly good job. It also got stuck in a generation loop a few times, repeatedly outputting the same mad ramblings until terminated.

Read LLAVA Output Here

The Classic option - CRAFT-TrOCR

HF Space Repo

CRAFT-TrOCR is an early example of a transformer-based large multi-modal model. According to TrOCR's abstract:

we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation

The CRAFT component is use to detect regions of the image that may contain text and TrOCR is used to recognise the text. However, TrOCR is quite small for a large language model, the largest version trained in the paper has 550M parameters - around the same size as BERT and around 40 times smaller than GPT-3.5 is believed to be.

I played with this pipeline a couple of years ago and I remembered it being reasonably good at certain context but only if I write in my very best handwriting all the time. And I don't do that because when I am taking notes in a meeting or something, I am thinking about the meeting, not my handwriting.

Although the HF space is broken, I was able to make some local changes (published here to get it running on my PC). This is not a generative model with a prompt. It is a pipeline with a single purpose: to recognise and output text. I simply uploaded the photo of my note and hit submit.

However, I found that the output was pretty disappointing and nonsensical. I wonder if the limitation is the model size, the architecture or even the image preprocessing and segmentation that is going on (which is "automatic" in more recent LMMs).

Read CRAFT-TROCR Output

Conclusion: The Best Option Right Now

Right now, GPT-4V seems to be the best option for handwriting OCR. I have created a custom GPT which has a tuned variation on the system prompt used above in my experiment which people can use here. Using GPT to do this work makes copying my analogue thoughts into my digital garden a breeze.

However, I am very much on the lookout for a new open source model that will do this job well. What could change in this space? A version of LLaVa that has been fine-tuned on handwriting OCR task could emerge or perhaps we'll see an all together new model. The rate of open source model development is phenomenal so I am sure we will see progress in this space soon. Until then, I'm stuck paying OpenAI for their model.