How to extract Image Table to DataFrame using Python OCR
In this article I will be mainly using the python library called pytesseract
you can read more about this library here.
This solution is for borderless tables, tables with partially borders and fully bordered tables (generally, all tables)!
Python provides several libraries that can get textual data from an image (for example opencv
, pytesseract
and more), each library got its ways and approaches.
In this article in order to get the images I will use the library called Pillow (PIL) and not
opencv
, but you can use either it wont affect the solution.
A little background
pytesseract
is a python library that wraps Google optical character recognition (OCR) called Tesseract if you’ll follow the link you’ll be able to read much more in-depth about the tesseract engine.
But generally what it does is as follows:
- Preprocessing of the Image
- Text Localization
- Character Segmentation
- Character Recognition
- Post Processing
Let’s Dive In
Ok so I have an image with some sort of table on it (keep in mind, the image has only the table on it regardless of its borders condition) and I want to get its content to my DataFrame, I’ll use this image:
If you want to use deep learning in order to detect the tables in your images you can read my other articles here
Now let’s install the following libraries:
!sudo apt install tesseract-ocr — yes
!pip install pytesseract
Let’s import all the libraries that will be used in this article
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import PIL
import pytesseract
import os
If you have non-English text data in your tables you will need to download the suitable tesseract data file from this link.
After you’ve downloaded it, put it in your tessdata_dir folder, if you set it to be a custom folder like me you’ll also need the following line:
os.environ['TESSDATA_PREFIX'] = '/full/path/for/your/folder/tessdata_dir/'
To check what languages that pytesseract
detected used the following print statement print(pytesseract.get_languages())
.
Now let’s set the languages we will use and our config
for the pytesseract
library:
special_config = '--psm 12 --oem 1'
languages_ = "eng" # For multiple language use like "eng+rus+swe" and so on
If you want to read more about the config options, please refer to this link.
The abovespecial_config
worked best for my personal needs.
Now let’s load the image:
image_path = "/path/of/the/image/image name.png" # Any image format will do
img_pl=PIL.Image.open(image_path) # You can use opencv for that option too
Now let’s use the function pytesseract.image_to_data()
, this function returns verbose data including boxes, confidences, line, and page numbers:
data = pytesseract.image_to_data(
img_pl,
lang=languages_,
output_type='data.frame',
config=special_config)
Now let’s “optimize” the DataFrame so it will hold only data that is important, I will apply the following:
- Take only the columns:
left
,top
,width
,text
- Sum the columns
left
andwidth
to create a new columnleft+width
- Sort according to
top
, reindex columns, and dropNone
values
def optimizeDf(old_df: pd.DataFrame) -> pd.DataFrame:
df = old_df[["left", "top", "width", "text"]]
df['left+width'] = df['left'] + df['width']
df = df.sort_values(by=['top'], ascending=True)
df = df.groupby(['top', 'left+width'], sort=False)['text'].sum().unstack('left+width')
df = df.reindex(sorted(df.columns), axis=1).dropna(how='all').dropna(axis='columns', how='all')
df = df.fillna('')
return df
data_imp_sort = optimizeDf(data)
Now we have the data in our DataFrame called data_imp_sort
, if you followed so far you’ll notice that some of the columns and rows are split (for example a cell has the value of “Happy Birthday Jim” but now you have 3 columns one for each of the words), from what I saw the difference between the columns or the rows are up to 10 pixels so I’ll use that as a threshold but you can change it as you like:
The below functions are merging the columns and rows respectively:
def mergeDfColumns(old_df: pd.DataFrame, threshold: int = 10, rotations: int = 5) -> pd.DataFrame:
df = old_df.copy()
for j in range(0, rotations):
new_columns = {}
old_columns = df.columns
i = 0
while i < len(old_columns):
if i < len(old_columns) - 1:
# If the difference between consecutive column names is less than the threshold
if any(old_columns[i+1] == old_columns[i] + x for x in range(1, threshold)):
new_col = df[old_columns[i]].astype(str) + df[old_columns[i+1]].astype(str)
new_columns[old_columns[i+1]] = new_col
i = i + 1
else: # If the difference between consecutive column names is greater than or equal to the threshold
new_columns[old_columns[i]] = df[old_columns[i]]
else: # If the current column is the last column
new_columns[old_columns[i]] = df[old_columns[i]]
i += 1
df = pd.DataFrame.from_dict(new_columns).replace('', np.nan).dropna(axis='columns', how='all')
return df.replace(np.nan, '')
def mergeDfRows(old_df: pd.DataFrame, threshold: int = 10) -> pd.DataFrame:
new_df = old_df.iloc[:1]
for i in range(1, len(old_df)):
# If the difference between consecutive index values is less than the threshold
if abs(old_df.index[i] - old_df.index[i - 1]) < threshold:
new_df.iloc[-1] = new_df.iloc[-1].astype(str) + old_df.iloc[i].astype(str)
else: # If the difference is greater than the threshold, append the current row
new_df = new_df.append(old_df.iloc[i])
return new_df.reset_index(drop=True)
df_new_col = mergeDfColumns(data_imp_sort)
merged_row_df = mergeDfRows(df_new_col)
Now that the columns and rows are merged according to the threshold, in some of the cases we will still have one or more of the following:
- Empty rows and/or columns that hold an empty value (not
None
but still empty, like an empty string) - Columns that hold only the value of | with or without empty cells (sometimes if the inner borders are not thick enough it may recognize it as a character)
The following function takes care of these scenarios, if you have any additional scenarios you can easily customize the function
def clean_df(df):
# Remove columns with all cells holding the same value and its length is 0 or 1
df = df.loc[:, (df != df.iloc[0]).any()]
# Remove rows with empty cells or cells with only the '|' symbol
df = df[(df != '|') & (df != '') & (pd.notnull(df))]
# Remove columns with only empty cells
df = df.dropna(axis=1, how='all')
return df.fillna('')
cleaned_df = cleanDf(merged_row_df.copy())
The resulting DataFrame will be as follows:
Note that it’s still not perfect, for example, the column 450
and 481
should be together, however, the location of the columns has a big difference for some reason, maybe with different config options or some other computations it will be able to be solved, or perhaps a different threshold but if the threshold will be increased it might affect the previous column as well, also under the current config it isn't suitable for distinguishing between a dot and a comma (other configs can handle it).
0 Comments