Systems designed to turn handwriting into text typically work best on pages with a consistent layout, such as a single column unbroken by drawings, diagrams, or extraneous symbols. A new system removes that requirement.
What’s new: Sumeet Singh and Sergey Karayev of Turnitin, a company that detects plagiarism, created a general-purpose image-to-sequence model that converts handwriting into text regardless of its layout and elements such as sketches, equations, and scratched-out deletions.
Key insight: Handwriting recognition systems typically use separate models to segment pages into blocks of words and turn the writing into text. Neural networks allow an end-to-end approach. Convolutional neural networks are good at processing images, and transformers are good at extracting information from sequences. A CNN can create representations of text in an image, and a transformer can turn those representations into text.
How it works: The system feeds page images through an encoder based on a 34-layer ResNet, then passes the encoder's output to a transformer-based decoder that produces the text.
- The researchers trained the system on five datasets, including the IAM database of handwritten forms and Free Form Answers, which comprises scans of STEM-test answers including equations, tables, and drawings.
- They augmented IAM by collaging words and lines at random and generated synthetic data by superimposing text from Wikipedia in various fonts and sizes on different background colors. In addition, they augmented examples by adding noise and changing brightness, contrast, scale, and rotation at random.
- The data didn’t include labels for sketches, equations, and scratched-out deletions, so the system learned to ignore them. The variety of layouts encouraged the system to learn to transcribe text regardless of other elements.
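The photometric part of the augmentation described above (random noise, brightness, and contrast) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' pipeline: the function name `augment`, the parameter ranges, and the noise level are assumptions chosen for the example, and scale and rotation are left out for brevity.

```python
import random


def augment(image, rng):
    """Return a copy of a grayscale image with random contrast, brightness, and noise.

    `image` is a list of rows of pixel values in [0, 255].
    All ranges below are illustrative assumptions, not values from the paper.
    """
    contrast = rng.uniform(0.8, 1.2)        # assumed contrast range
    brightness = rng.uniform(-20.0, 20.0)   # assumed brightness shift
    noise_std = 5.0                         # assumed Gaussian noise level

    augmented = []
    for row in image:
        new_row = []
        for pixel in row:
            # Scale around mid-gray for contrast, then shift for brightness.
            value = (pixel - 128.0) * contrast + 128.0 + brightness
            # Add independent Gaussian noise to each pixel.
            value += rng.gauss(0.0, noise_std)
            # Clamp back into the valid pixel range.
            new_row.append(min(255, max(0, round(value))))
        augmented.append(new_row)
    return augmented


rng = random.Random(0)  # fixed seed so the example is repeatable
page = [[255, 255, 0], [255, 0, 255]]
augmented_page = augment(page, rng)
```

Drawing fresh parameters for every training example means the model rarely sees exactly the same pixels twice, which is the usual motivation for this kind of augmentation.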
Results: On IAM, the authors’ system achieved a character error rate of 6.3 percent, while an LSTM designed for 2D input achieved 7.9 percent. On Free Form Answers, the system achieved a character error rate of 7.6 percent. Among Microsoft’s Cognitive Services, Google’s Cloud Vision, and Mathpix, the best achieved 14.4 percent.
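For readers unfamiliar with the metric: character error rate is commonly defined as the edit (Levenshtein) distance between the predicted and reference text, divided by the length of the reference. The sketch below assumes that common definition. The article does not say exactly how the authors normalized their scores, and the function names are ours.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by reference length (assumed normalization)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One extra "t" in an 11-character word is one error out of 11 characters.
cer = character_error_rate("handwriting", "handwritting")
```

Under this definition, a rate of 6.3 percent corresponds to roughly one character-level mistake for every 16 characters of reference text.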
Why it matters: End-to-end approaches to deep learning have been overhyped. But, given the large amount of data, including easily synthesized data, available for handwriting recognition, this task is an excellent candidate for end-to-end learning.
We’re thinking: But can it decipher your doctor’s scrawl?