Converting PDF to Text using Tesseract

Convert the pdf file to a tiff file

Tesseract does not read pdf files directly, so the file must first be converted to a tiff. This can be done using ghostscript. Also, because tesseract, at least in older versions, cannot process multi-page tiffs, we want each page of the pdf to be its own tiff file. The command to do this is:

$ gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -sOutputFile=scan_%d.tif {input.pdf}

In this example, -sOutputFile gives the name pattern for the output files. The %d in the pattern is replaced with the page number, so ghostscript creates a separate, sequentially numbered file for each page (scan_1.tif, scan_2.tif, and so on). Replace {input.pdf} with your source, multi-page pdf file.
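
If you run this conversion often, it can be wrapped in a small shell function. The following is only a sketch: the function name pdf_to_tiffs, its arguments, and the 300 dpi default are my own assumptions, not part of ghostscript. The -r option sets the rendering resolution, and a higher value than the device default may help OCR accuracy.

```shell
# Split a PDF into one TIFF per page: <prefix>_1.tif, <prefix>_2.tif, ...
# pdf_to_tiffs, its arguments, and the 300 dpi default are illustrative choices.
pdf_to_tiffs() {
    pdf=$1
    prefix=${2:-scan}
    dpi=${3:-300}

    # Fail early with a readable message if the input is missing.
    if [ ! -f "$pdf" ]; then
        echo "pdf_to_tiffs: no such file: $pdf" >&2
        return 1
    fi

    # -r sets the rendering resolution; %d is replaced by the page number.
    gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r"$dpi" \
       -sOutputFile="${prefix}_%d.tif" "$pdf"
}
```

Once the function is defined, "pdf_to_tiffs input.pdf" would write scan_1.tif, scan_2.tif, and so on into the current directory, and "pdf_to_tiffs input.pdf page 150" would write page_1.tif, page_2.tif, and so on at 150 dpi.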

Perform the OCR to convert your file to text

Each file must be independently converted to txt. This can be done simply with the following command:

$ tesseract scan_1.tif scan_1

Tesseract will automatically append .txt to the file name, so the result of the above command would be a file named scan_1.txt containing the text from scan_1.tif.
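
Rather than typing the command once per page, a loop can run it over every tiff. This is a minimal sketch that assumes the files follow the scan_N.tif naming from the previous step; ocr_pages is a name I made up for illustration.

```shell
# Run tesseract on every <prefix>_N.tif in the current directory.
# ocr_pages is a hypothetical helper name.
ocr_pages() {
    prefix=${1:-scan}
    for tif in "${prefix}"_*.tif; do
        # If nothing matches, the pattern is left unexpanded; skip it.
        [ -e "$tif" ] || continue
        # Strip the .tif suffix; tesseract appends .txt itself.
        tesseract "$tif" "${tif%.tif}" || return 1
    done
}
```

Running ocr_pages with no arguments processes scan_1.tif, scan_2.tif, and so on, and stops at the first page where tesseract reports an error.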

Combine the text files into one

When you are all done, you can combine the files into one. Suppose, for this example, that you have three txt files, titled scan_1.txt through scan_3.txt. You can combine them all into one result file by doing the following:

$ cat scan_1.txt > result.txt
$ cat scan_2.txt >> result.txt
$ cat scan_3.txt >> result.txt

Notice that the first command uses the redirect ‘>’ operator, which overwrites result.txt with a new file. The remaining commands use ‘>>’, which appends their output to result.txt.
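
For more than a few pages, the same thing can be done in a loop. One thing to watch for: a wildcard such as scan_*.txt is expanded in alphabetical order, so scan_10.txt would come before scan_2.txt. The sketch below counts upward instead, which keeps the pages in order. The function name combine_pages and its default arguments are my own choices for illustration.

```shell
# Concatenate <prefix>_1.txt, <prefix>_2.txt, ... in page order into one file.
# combine_pages is a hypothetical helper name.
combine_pages() {
    prefix=${1:-scan}
    out=${2:-result.txt}

    # Create the output file, or empty it if it already exists.
    : > "$out"

    # Append pages until the next numbered file does not exist.
    n=1
    while [ -e "${prefix}_${n}.txt" ]; do
        cat "${prefix}_${n}.txt" >> "$out"
        n=$((n + 1))
    done
}
```

Calling combine_pages with no arguments gathers scan_1.txt, scan_2.txt, and so on into result.txt. It assumes the numbering has no gaps, since it stops at the first missing page.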

Combine it all together into a script

As an alternative to the manual approach above, I have written a simple script to automate this task. The script takes the input file as a command line argument and produces result.txt, overwriting any existing result.txt file.

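#!/bin/sh
# $1 is the first argument: the input pdf file

# remove any existing result.txt (-f: no error if it is missing)
rm -f result.txt

# convert the pdf to a group of tiffs, one per page
gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -sOutputFile=scan_%d.tif "$1"

# process the pages in order until the next numbered tiff does not exist
i=1
while [ -e "scan_$i.tif" ]
do
    tesseract "scan_$i.tif" "scan_$i"

    # add the text to the result.txt file
    cat "scan_$i.txt" >> result.txt

    # clean up the intermediate files for this page
    rm "scan_$i.txt" "scan_$i.tif"
    i=$(( i + 1 ))
done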

This script will leave only one .txt file containing all of the text from the original pdf in unformatted form.
