Extracting Text from PDF Tables

Preparation

Download and install Python from https://www.python.org/downloads

Open a command prompt and run:

pip install pdfminer

This installs the PDFMiner Python library for working with PDF files.

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner can report the exact location of text on a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It also has an extensible PDF parser that can be used for purposes other than text analysis.

https://pypi.python.org/pypi/pdfminer/
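
As a rough illustration of the location data mentioned above, the sketch below walks the layout of each page and yields the bounding box of every text box. It assumes a PDFMiner release that ships the pdfminer.pdfpage module (the pdfminer.six fork does as well). The helper names iter_text_boxes and describe_box are made up for this example and are not part of PDFMiner.

```python
def iter_text_boxes(pdf_path):
    """Yield (page_number, bbox, text) for every text box in the PDF.

    bbox is (x0, y0, x1, y1) in PDF coordinates, whose origin is normally
    the bottom-left corner of the page.
    """
    # Imported here so the module can be loaded even without PDFMiner installed.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    device = PDFPageAggregator(resource_manager, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, device)

    with open(pdf_path, "rb") as pdf_file:
        for page_number, page in enumerate(PDFPage.get_pages(pdf_file), start=1):
            interpreter.process_page(page)
            layout = device.get_result()  # layout tree of the page just processed
            for element in layout:
                if isinstance(element, LTTextBox):
                    yield page_number, element.bbox, element.get_text()


def describe_box(page_number, bbox, text):
    """Format one result of iter_text_boxes() as a single readable line."""
    x0, y0, x1, y1 = bbox
    return "page {} x0={:.1f} y0={:.1f} x1={:.1f} y1={:.1f} text={!r}".format(
        page_number, x0, y0, x1, y1, text.strip()
    )
```

Looping over iter_text_boxes("CDE-Report1.pdf") and printing describe_box(*box) for each result gives one line per text box, which is usually enough to see where the table columns sit on the page.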

Extracting data from PDF tables

With the help of Stack Overflow, we created the pdftable2csv.py Python script, which can be downloaded from here

This command converts the tables in a PDF file into a CSV file:

python C:\Python27\Scripts\pdftable2csv.py 2 CDE-Report1.pdf CDE-Report1.csv

Here, “2” is the distance multiplier beyond which a character is considered part of a new word/column/block. Usually, 1.5 works quite well.
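
The downloadable script is not reproduced here, so the following is only a sketch of how such a multiplier could work. It assumes that the gap before each character is compared with the average gap on the same row. The function name split_row, the (x, text) input format and the sample row are all hypothetical.

```python
def split_row(chars, multiplier=1.5):
    """Split one row of (x, text) characters into cells.

    A character starts a new cell when its distance from the previous
    character is greater than multiplier * the average distance between
    neighbouring characters on this row.
    """
    chars = sorted(chars)  # left to right by x position
    if not chars:
        return []
    if len(chars) == 1:
        return [chars[0][1]]

    gaps = [right[0] - left[0] for left, right in zip(chars, chars[1:])]
    average_gap = sum(gaps) / float(len(gaps))

    cells = [chars[0][1]]
    for gap, (_, text) in zip(gaps, chars[1:]):
        if gap > average_gap * multiplier:
            cells.append(text)   # large gap: begin a new column
        else:
            cells[-1] += text    # small gap: same word/column
    return cells


# Hypothetical row: characters 6 points apart, with one 42-point jump.
SAMPLE_ROW = [(0, "N"), (6, "a"), (12, "m"), (18, "e"),
              (60, "A"), (66, "g"), (72, "e")]
```

In the sample row the average gap is 12, so a multiplier of 1.5 or 2 splits the row at the 42-point jump, while a multiplier of 4 keeps everything in one cell. Lower values therefore produce more columns, higher values fewer.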

Running Python scripts from a package

This is done by using the “External application” Package Action.

Dealing with Encrypted PDF files

QPDF.exe can help here. It writes a decrypted copy of the file:

qpdf --password= --decrypt test.pdf test1.pdf
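
If the decryption step needs to run from Python before the conversion, the command above can be wrapped roughly as follows. This is a sketch that assumes qpdf is on the PATH; the function names build_qpdf_command and decrypt_pdf are invented for this example.

```python
import subprocess


def build_qpdf_command(source_pdf, target_pdf, password=""):
    """Return the qpdf argument list that writes a decrypted copy."""
    return ["qpdf", "--password=" + password, "--decrypt", source_pdf, target_pdf]


def decrypt_pdf(source_pdf, target_pdf, password=""):
    """Run qpdf and raise if it reports an error.

    qpdf exits with 0 on success and 3 when it finished with warnings only;
    both are accepted here.
    """
    exit_code = subprocess.call(build_qpdf_command(source_pdf, target_pdf, password))
    if exit_code not in (0, 3):
        raise RuntimeError("qpdf failed with exit code {}".format(exit_code))
```

Calling decrypt_pdf("test.pdf", "test1.pdf") mirrors the command line above, with an empty password.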

Last updated: March 14, 2023