Ad Code

Data Science labs blog

Image Table to DataFrame using Python OCR

 

Image Table to DataFrame using Python OCR


Ever had an image of a table and wanted to get the data into your DataFrame? well, I have the article for you!

Lidor ES

·

In this article I will be mainly using the python library called pytesseract you can read more about this library here.

This solution is for borderless tables, tables with partially borders and fully bordered tables (generally, all tables)!

Python provides several libraries that can get textual data from an image (for example opencvpytesseract and more), each library got its ways and approaches.

Download Tesseract

In this article in order to get the images I will use the library called Pillow (PIL) and not opencv, but you can use either it wont affect the solution.

A little background

pytesseract is a python library that wraps Google optical character recognition (OCR) called Tesseract if you’ll follow the link you’ll be able to read much more in-depth about the tesseract engine.
But generally what it does is as follows:

  • Preprocessing of the Image
  • Text Localization
  • Character Segmentation
  • Character Recognition
  • Post Processing

Let’s Dive In

Ok so I have an image with some sort of table on it (keep in mind, the image has only the table on it regardless of its borders condition) and I want to get its content to my DataFrame, I’ll use this image:

If you want to use deep learning in order to detect the tables in your images you can read my other articles here

Now let’s install the following libraries:

!sudo apt install tesseract-ocr — yes
!pip install pytesseract

Let’s import all the libraries that will be used in this article

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import PIL
import pytesseract
import os

If you have non-English text data in your tables you will need to download the suitable tesseract data file from this link.
After you’ve downloaded it, put it in your tessdata_dir folder, if you set it to be a custom folder like me you’ll also need the following line:

os.environ['TESSDATA_PREFIX'] = '/full/path/for/your/folder/tessdata_dir/'

To check what languages that pytesseract detected used the following print statement print(pytesseract.get_languages()).

Now let’s set the languages we will use and our config for the pytesseract library:

special_config = '--psm 12 --oem 1'
languages_ = "eng" # For multiple language use like "eng+rus+swe" and so on

If you want to read more about the config options, please refer to this link.
The above special_config worked best for my personal needs.

Now let’s load the image:

image_path = "/path/of/the/image/image name.png" # Any image format will do
img_pl=PIL.Image.open(image_path) # You can use opencv for that option too

Now let’s use the function pytesseract.image_to_data(), this function returns verbose data including boxes, confidences, line, and page numbers:

data = pytesseract.image_to_data(
img_pl,
lang=languages_,
output_type='data.frame',
config=special_config)

Now let’s “optimize” the DataFrame so it will hold only data that is important, I will apply the following:

  • Take only the columns: lefttopwidthtext
  • Sum the columns left and width to create a new column left+width
  • Sort according to top, reindex columns, and drop None values
def optimizeDf(old_df: pd.DataFrame) -> pd.DataFrame:
df = old_df[["left", "top", "width", "text"]]
df['left+width'] = df['left'] + df['width']
df = df.sort_values(by=['top'], ascending=True)
df = df.groupby(['top', 'left+width'], sort=False)['text'].sum().unstack('left+width')
df = df.reindex(sorted(df.columns), axis=1).dropna(how='all').dropna(axis='columns', how='all')
df = df.fillna('')
return df

data_imp_sort = optimizeDf(data)

Now we have the data in our DataFrame called data_imp_sort, if you followed so far you’ll notice that some of the columns and rows are split (for example a cell has the value of “Happy Birthday Jim” but now you have 3 columns one for each of the words), from what I saw the difference between the columns or the rows are up to 10 pixels so I’ll use that as a threshold but you can change it as you like:

The below functions are merging the columns and rows respectively:

def mergeDfColumns(old_df: pd.DataFrame, threshold: int = 10, rotations: int = 5) -> pd.DataFrame:
df = old_df.copy()
for j in range(0, rotations):
new_columns = {}
old_columns = df.columns
i = 0
while i < len(old_columns):
if i < len(old_columns) - 1:
# If the difference between consecutive column names is less than the threshold
if any(old_columns[i+1] == old_columns[i] + x for x in range(1, threshold)):
new_col = df[old_columns[i]].astype(str) + df[old_columns[i+1]].astype(str)
new_columns[old_columns[i+1]] = new_col
i = i + 1
else: # If the difference between consecutive column names is greater than or equal to the threshold
new_columns[old_columns[i]] = df[old_columns[i]]
else: # If the current column is the last column
new_columns[old_columns[i]] = df[old_columns[i]]
i += 1
df = pd.DataFrame.from_dict(new_columns).replace('', np.nan).dropna(axis='columns', how='all')
return df.replace(np.nan, '')

def mergeDfRows(old_df: pd.DataFrame, threshold: int = 10) -> pd.DataFrame:
new_df = old_df.iloc[:1]
for i in range(1, len(old_df)):
# If the difference between consecutive index values is less than the threshold
if abs(old_df.index[i] - old_df.index[i - 1]) < threshold:
new_df.iloc[-1] = new_df.iloc[-1].astype(str) + old_df.iloc[i].astype(str)
else: # If the difference is greater than the threshold, append the current row
new_df = new_df.append(old_df.iloc[i])
return new_df.reset_index(drop=True)

df_new_col = mergeDfColumns(data_imp_sort)
merged_row_df = mergeDfRows(df_new_col)

Now that the columns and rows are merged according to the threshold, in some of the cases we will still have one or more of the following:

  • Empty rows and/or columns that hold an empty value (not None but still empty, like an empty string)
  • Columns that hold only the value of | with or without empty cells (sometimes if the inner borders are not thick enough it may recognize it as a character)

The following function takes care of these scenarios, if you have any additional scenarios you can easily customize the function

def clean_df(df):
# Remove columns with all cells holding the same value and its length is 0 or 1
df = df.loc[:, (df != df.iloc[0]).any()]
# Remove rows with empty cells or cells with only the '|' symbol
df = df[(df != '|') & (df != '') & (pd.notnull(df))]
# Remove columns with only empty cells
df = df.dropna(axis=1, how='all')
return df.fillna('')

cleaned_df = cleanDf(merged_row_df.copy())

The resulting DataFrame will be as follows:

Note that it’s still not perfect, for example, the column 450 and 481 should be together, however, the location of the columns has a big difference for some reason, maybe with different config options or some other computations it will be able to be solved, or perhaps a different threshold but if the threshold will be increased it might affect the previous column as well, also under the current config it isn't suitable for distinguishing between a dot and a comma (other configs can handle it).

Reactions

Post a Comment

0 Comments