Optical Character Recognition (OCR) is a technology that transforms images of text into editable and searchable text. With the increasing digitization of documents, from contracts to receipts and reports, OCR has become an essential tool for automating the organization and analysis of information. However, standard OCR models often face limitations when dealing with unusual text formats, visual noise, and other context-specific variations.
In this article, we will explore how to perform fine-tuning of an OCR model, adapting it to overcome these limitations and meet the specific needs of your application. You will learn how to prepare a custom dataset by manually annotating the characters present in your images and how to train a tuned OCR model to provide more accurate and efficient results.
What is OCR and why is it important?
OCR stands for Optical Character Recognition. This technology is capable of converting images containing text into a readable digital format, facilitating the organization and storage of information.
With OCR, you can process and analyze long documents, contracts, and reports automatically, making what was once a manual and slow task more efficient.
What are the steps for implementing OCR?
OCR involves four main steps:
Image Extraction: This involves acquiring image data and converting it into binary files. Input data can include photos of documents, scanned PDFs, or images of signs. However, machine learning models require data in a specific format for training. Therefore, after collecting the images, they should be converted into binary format.
Pre-processing: Pre-processing applies a set of computer vision techniques aimed at improving the recognition model's performance. Techniques include deskewing text that is rotated relative to the horizontal, highlighting text contours, smoothing background noise, and changing the image's color scale (a minimal code sketch appears after these steps).
Character Recognition: The OCR algorithm reads each character in the image, recognizes morphological features, and compares the result with a list of possible characters, returning the one that shows the highest similarity.
Post-processing: After character recognition, the algorithm performs post-processing. This phase starts with error checking and correction, adjusting poorly recognized characters using linguistic contexts and natural language models. Then, the recognized characters are organized into a structured format, segmenting the text into lines, words, and paragraphs to preserve the original document’s structure.
Additionally, post-processing includes normalizing data, adjusting text format to specific standards, such as removing extra spaces and correcting punctuation. Finally, the processed text is converted into a readable and usable format, ready to be integrated into automated systems or databases. Thus, post-processing transforms raw data into valuable information, ensuring maximum utility and accuracy of OCR results.
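To make the pre-processing step above concrete, here is a minimal sketch using OpenCV. The function name and the specific technique choices (grayscale conversion, median blur, adaptive thresholding) are illustrative assumptions rather than a prescribed pipeline; deskewing rotated text would be an additional step on top of this.

import cv2

def preprocess_for_ocr(image_path: str):
    """Hypothetical pre-processing pipeline for an OCR input image."""
    img = cv2.imread(image_path)
    # Change the color scale: convert to grayscale.
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Smooth background noise with a light median blur.
    denoised = cv2.medianBlur(gray, 3)
    # Highlight text against the background with adaptive thresholding.
    binary = cv2.adaptiveThreshold(
        denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 31, 15)
    return binary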
What are the limitations of standard OCR models?
Although OCR improves process efficiency, the wide variety of document formats in use today makes processing challenging. Other limitations include:
- Limited support for certain languages and orthographies.
- Dependence on the quality of the extracted image.
- Lack of contextual knowledge of the image.
- Background noise in the image.
One way to overcome these limitations is to develop a custom model from the standard model for the specific application. The technique of using a pre-trained model and adapting it for a specific purpose is called fine-tuning. We will discuss how to fine-tune an OCR model below.
How to perform fine-tuning of an OCR model?
Fine-tuning an OCR model involves two main steps: preparing the dataset and training the model. Let’s delve into each of them now.
First step of fine-tuning an OCR model: Dataset Preparation
Preparing the dataset aims to transform the images into a format that the algorithm can process. This step starts with cropping the region of interest in the images, i.e., removing areas where no characters are present.
To do this, we developed the following set of functions:
import cv2
import numpy as np
import easyocr

# Assumes a previously initialized EasyOCR reader, e.g.:
reader = easyocr.Reader(['en'])

def detect_text_bounding_box(img, output_folder: str = ''):
    """
    Detects text in the image using EasyOCR.

    Args:
        img: The image in which to detect text.
        output_folder (str, optional): The directory to save intermediate images. Default is an empty string.

    Returns:
        list: A list of polygon points representing the detected text.
    """
    cimg = img.copy()
    bbox_list, polygon_list = reader.detect(img)
    polygon_list = polygon_list[0]
    bbox_list = bbox_list[0]
    # EasyOCR returns axis-aligned boxes as [x_min, x_max, y_min, y_max];
    # convert each one into a four-point polygon.
    for bbox in bbox_list:
        x1, x2, y1, y2 = bbox
        polygon = [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]
        polygon_list.append(polygon)
    return [np.rint(polygon).astype(int) for polygon in polygon_list]

def rearrange_src_pts(box, w_rect, h_rect):
    """Reorders the box corners so the warped crop comes out wider than tall."""
    bl, tl, tr, br = box
    if w_rect < h_rect:
        w_rect, h_rect = h_rect, w_rect
        box = [br, bl, tl, tr]
    src_pts = np.array(box, dtype="float32")
    return src_pts, w_rect, h_rect

def simple_warp_rectangle(img, points, output_folder: str = ''):
    """Crops the minimal rotated rectangle around the given points via a perspective warp."""
    cimg = img.copy()
    rect = cv2.minAreaRect(points)
    box = cv2.boxPoints(rect)
    box = np.rint(box).astype(int)  # np.int0 was removed in NumPy 2.0
    width = int(rect[1][0])
    height = int(rect[1][1])
    src_pts, width, height = rearrange_src_pts(box, width, height)
    dst_pts = np.array([[0, height - 1],
                        [0, 0],
                        [width - 1, 0],
                        [width - 1, height - 1]], dtype="float32")
    M = cv2.getPerspectiveTransform(src_pts, dst_pts)
    warped_img = cv2.warpPerspective(cimg, M, (width, height))
    return warped_img
Basically, the detect_text_bounding_box function takes an image as input and returns the coordinates of the polygons surrounding the regions containing characters. From these coordinates, we use the simple_warp_rectangle function to crop the image down to only the region of interest. By the end of this step, you will have cropped sections of images to be used for training the model.
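Putting the two functions together, a hypothetical usage loop might look like this (the input and output paths are placeholders):

import cv2

img = cv2.imread("documents/sample_page.png")  # placeholder input path
polygons = detect_text_bounding_box(img)
for i, polygon in enumerate(polygons):
    crop = simple_warp_rectangle(img, polygon)
    cv2.imwrite(f"crops/word_{i}.png", crop)   # placeholder output folder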
With the cropped images, we can start annotating the data. This is the process of manually writing out which characters are present in each image. For this, we use the IPython and ipywidgets libraries and the display_data function. The display_data function creates a prompt where you can view each image and type the corresponding set of characters present in it.
import IPython.display as ipd
import ipywidgets as widgets
from IPython.display import Image

def display_data(data):
    """Displays each image with a text widget for typing its label; returns the widgets."""
    label_dict = {}
    for _, row in data.iterrows():
        img_path = row["path"]
        label = row["label"]
        ipd.display(Image(filename=img_path))
        word_input = widgets.Text(value=label, placeholder='Type something', description='Word:', disabled=False)
        ipd.display(word_input)
        label_dict[f"{img_path}"] = word_input  # Store the widget so its value can be read after the cell runs.
    return label_dict
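For example, assuming the cropped images are tracked in a pandas DataFrame with path and label columns (a hypothetical layout), annotation could proceed like this:

import pandas as pd

# Hypothetical DataFrame of crops to annotate; labels start empty.
df = pd.DataFrame({
    "path": ["crops/word_0.png", "crops/word_1.png"],
    "label": ["", ""],
})
label_dict = display_data(df)

# After typing the labels into the widgets, run a later cell to collect them:
df["label"] = [label_dict[p].value for p in df["path"]]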
Finally, you should split the annotated dataset into training and testing sets.
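A minimal way to do this split, assuming the annotations live in a DataFrame like the one above and the folder layout expected by the training configuration shown later (all_data/en_train_filtered and all_data/en_val), is:

import os
import pandas as pd
from sklearn.model_selection import train_test_split

# An 80/20 split is a common starting point.
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

# EasyOCR's trainer reads a labels.csv with 'filename' and 'words' columns
# from each dataset folder (the image files themselves must also be copied
# there). The folder names below match the YAML config used later.
for split_df, folder in [(train_df, "all_data/en_train_filtered"),
                         (val_df, "all_data/en_val")]:
    os.makedirs(folder, exist_ok=True)
    labels = pd.DataFrame({
        "filename": split_df["path"].map(os.path.basename),
        "words": split_df["label"],
    })
    labels.to_csv(os.path.join(folder, "labels.csv"), index=False)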
Training the OCR Model
To train the OCR model, it is highly recommended to use a GPU processing environment due to the computational intensity involved. Google Colab is an excellent free option that offers this capability.
The first step is to clone the EasyOCR library repository using the git clone command. After cloning, you need to change the working directory to the trainer folder inside the cloned repository. This ensures that we are in the correct context to run the training scripts.
!git clone https://github.com/JaidedAI/EasyOCR.git {path/to/save}
%cd {path/to/save}/trainer
import os
# Get the current working directory
current_working_directory = os.getcwd()
print(current_working_directory)
import os
import torch.backends.cudnn as cudnn
import yaml
from train import train
from utils import AttrDict
import pandas as pd
The next step is to define the get_config function. It reads a YAML file containing all the necessary configurations for training, including model parameters, data paths, and other specific settings. The function also prepares the set of characters the model should recognize based on the provided training data.
def get_config(file_path):
    with open(file_path, 'r', encoding="utf8") as stream:
        opt = yaml.safe_load(stream)
    opt = AttrDict(opt)
    if opt.lang_char == 'None':
        # Build the character set from the characters that actually
        # appear in the training labels.
        characters = ''
        for data in opt['select_data'].split('-'):
            csv_path = os.path.join(opt['train_data'], data, 'labels.csv')
            df = pd.read_csv(csv_path, sep='^([^,]+),', engine='python',
                             usecols=['filename', 'words'], keep_default_na=False)
            all_char = ''.join(df['words'])
            characters += ''.join(set(all_char))
        characters = sorted(set(characters))
        opt.character = ''.join(characters)
    else:
        # Otherwise, use the character set declared in the YAML file.
        opt.character = opt.number + opt.symbol + opt.lang_char
    os.makedirs(f'./saved_models/{opt.experiment_name}', exist_ok=True)
    return opt
%%writefile config_files/custom_model.yaml
number: '0123456789'
symbol: "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~ €"
lang_char: 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
experiment_name: 'en_filtered'
train_data: 'all_data'
valid_data: 'all_data/en_val'
manualSeed: 1111
workers: 2
batch_size: 16 # 32
num_iter: 3000
valInterval: 1000
saved_model: '' #'saved_models/en_filtered/iter_300000.pth'
FT: False
optim: False # default is Adadelta
lr: 1.
beta1: 0.9
rho: 0.95
eps: 0.00000001
grad_clip: 5
#Data processing
select_data: 'en_train_filtered' # this is dataset folder in train_data
batch_ratio: '1'
total_data_usage_ratio: 1.0
batch_max_length: 34
imgH: 64
imgW: 600
rgb: False
sensitive: True
PAD: True
contrast_adjust: 0.0
data_filtering_off: False
# Model Architecture
Transformation: 'None'
FeatureExtraction: 'VGG'
SequenceModeling: 'BiLSTM'
Prediction: 'CTC'
num_fiducial: 20
input_channel: 1
output_channel: 256
hidden_size: 256
decode: 'greedy'
new_prediction: False
freeze_FeatureFxtraction: False # spelling matches the key name expected by EasyOCR's train.py
freeze_SequenceModeling: False
The YAML file defines several important parameters: the characters the model should recognize (number, symbol, lang_char), the experiment name (experiment_name), paths for training and validation data (train_data, valid_data), optimization and training settings, as well as details about the model architecture.
To fine-tune a pre-trained OCR model, you should specify the path to the model in the saved_model field. Pre-trained models for different languages are available on the EasyOCR model hub page. With the configuration file ready, we can start training the model. To do this, load the configuration and call the train function:
config_filename = 'custom_model'
path_config_file = f"{path/to/save}/trainer/config_files/{config_filename}.yaml"
opt = get_config(path_config_file)
train(opt, amp=False)
Model Usage
After training, you need to download the support files and configure them with the same values used during the training setup. These support files include a YAML file and a customized Python script, which should be copied to the correct EasyOCR directories:
!cp /support_files/custom_example.yaml /root/.EasyOCR/user_network/{custom_model_name}.yaml
!cp /support_files/custom_example.py /root/.EasyOCR/user_network/{custom_model_name}.py
!cp {path/to/save}/trainer/saved_models/{experiment_name}/best_accuracy.pth /root/.EasyOCR/model/{custom_model_name}.pth
Finally, to use the trained model, initialize an EasyOCR reader with the customized model and recognize text in new images:
custom_reader = easyocr.Reader(['en'], gpu=True, recog_network='custom_model')
custom_results = custom_reader.recognize(img)
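Note that recognize runs the recognition stage only, so it expects an already cropped text image. For end-to-end detection plus recognition on a full page, the same reader's readtext method can be used; here is a short sketch (the image path is a placeholder):

# readtext runs detection and recognition end to end on a full page.
results = custom_reader.readtext('documents/sample_page.png')  # placeholder path
for bbox, text, confidence in results:
    print(f"{text} ({confidence:.2f})")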
Transform your organization’s efficiency with an OCR model!
If your organization aims to improve document processing efficiency, an OCR model might be the ideal solution. With OCR, you can convert images containing text into a readable digital format, simplifying information organization and storage.
BIX offers customized OCR solutions to meet your specific needs. Click the banner below and contact us to find out how we can help increase your organization’s efficiency and productivity!