Sunday, September 13, 2015

How to Convert a PDF File to Editable Text Using the Command Line in Linux

There are various reasons why you might want to convert a PDF file to editable text. Maybe you need to revise an old document and all you have is the PDF version of it. Converting PDF files in Windows is easy, but what if you’re using Linux?
No worries. We’ll show you how to easily convert PDF files to editable text using a command-line tool called pdftotext, which is part of the “poppler-utils” package. This tool may already be installed. To check if pdftotext is installed on Ubuntu or another Debian-based system, press “Ctrl + Alt + T” to open a terminal window. Type the following command at the prompt and press “Enter”.
dpkg -s poppler-utils
NOTE: When we say to type something in this article and there are quotes around the text, DO NOT type the quotes, unless we specify otherwise.
If pdftotext is not installed, type the following command at the prompt and press “Enter”.
sudo apt-get install poppler-utils
Type your password when prompted and press “Enter”.
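If you would rather not read through the dpkg output, a small check like the one below can tell you whether the command is available. This is only a sketch, assuming a POSIX-style shell such as bash; the function name “have_pdftotext” is our own and is not part of poppler-utils.

```shell
# Hypothetical helper: succeeds if a command named pdftotext is on the PATH.
have_pdftotext() {
    command -v pdftotext >/dev/null 2>&1
}

if have_pdftotext; then
    echo "pdftotext is installed"
else
    echo "pdftotext is not installed; run: sudo apt-get install poppler-utils"
fi
```

Unlike the dpkg command, this check does not depend on how the tool was installed, so it should also work on distributions that do not use dpkg.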
There are several tools available in the poppler-utils package for converting PDF to different formats, manipulating PDF files, and extracting information from files.
The following is the basic command for converting a PDF file to an editable text file. Press “Ctrl + Alt + T” to open a Terminal window, type the command at the prompt, and press “Enter”.
pdftotext /home/lori/Documents/Sample.pdf /home/lori/Documents/Sample.txt
Change the two paths to match the location and name of your original PDF file and the location and name you want for the resulting text file.
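If the text file should sit next to the PDF, you can avoid typing the path twice by letting the shell build the second path from the first. The following is a minimal sketch, assuming a POSIX-style shell; the variable names “pdf” and “txt” are our own, and the path is the example path from this article.

```shell
# Example path from this article; change it to point at your own PDF.
pdf="/home/lori/Documents/Sample.pdf"

# Strip the trailing ".pdf" and add ".txt" to get the output path.
txt="${pdf%.pdf}.txt"
echo "Will write to: $txt"

# Only run the conversion if the PDF actually exists.
if [ -f "$pdf" ]; then
    pdftotext "$pdf" "$txt"
fi
```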
The text file is created and can be opened just as you would open any other text file in Linux.
The converted text may have line breaks in places you don’t want, because a line break is inserted after every line of text in the PDF file.
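If those extra line breaks get in the way, one possible cleanup is to join the lines of each paragraph back together with awk. This is a rough sketch, not a feature of pdftotext: the function name “unwrap_paragraphs” is our own, and it assumes paragraphs in the converted file are separated by blank lines, which may not hold for every PDF.

```shell
# Hypothetical helper: join the lines of each paragraph into one line.
# Paragraphs are assumed to be separated by one or more blank lines.
unwrap_paragraphs() {
    awk 'BEGIN { RS = ""; ORS = "\n\n" } { gsub(/\n/, " "); print }' "$@"
}

# Example (file names are placeholders):
# unwrap_paragraphs Sample.txt > Sample_unwrapped.txt
```

Setting RS to an empty string puts awk in paragraph mode, so each blank-line-separated block is handled as one record and the line breaks inside it are replaced with spaces.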
You can preserve the layout of your document (headers, footers, paging, etc.) from the original PDF file in the converted text file using the “-layout” flag.
pdftotext -layout /home/lori/Documents/Sample.pdf /home/lori/Documents/Sample.txt
If you only want to convert a range of pages in a PDF file, use the “-f” and “-l” (a lowercase “L”) flags to specify the first and last pages in the range you want to convert.
pdftotext -f 5 -l 9 /home/lori/Documents/Sample.pdf /home/lori/Documents/Sample.txt
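The same two flags can pull out a single page if you give both the same number. Here is a small sketch of that idea, assuming a POSIX-style shell; the function name “extract_page” and the “_page5.txt” naming scheme are our own choices, not something pdftotext provides.

```shell
# Hypothetical helper: save one page of a PDF to its own text file.
# Usage: extract_page input.pdf page_number
extract_page() {
    pdftotext -f "$2" -l "$2" "$1" "${1%.pdf}_page$2.txt"
}

# Example: writes page 5 to /home/lori/Documents/Sample_page5.txt
# extract_page /home/lori/Documents/Sample.pdf 5
```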
To convert a PDF file that’s protected and encrypted with an owner password, use the “-opw” flag (the first character in the flag is a lowercase letter “O”, not a zero).
pdftotext -opw 'password' /home/lori/Documents/Sample.pdf /home/lori/Documents/Sample.txt
Change “password” to the one used to protect the original PDF file being converted. Make sure there are single quotes, not double, around “password”, so the shell does not interpret special characters such as “$” or “!” in it.
If the PDF file is protected and encrypted with a user password, use the “-upw” flag instead of the “-opw” flag. The rest of the command is the same.
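As an illustration, the user-password version of the command can be wrapped in a small function so the password and paths are passed in as arguments. This is only a sketch, assuming a POSIX-style shell; the function name “convert_with_user_password” is our own.

```shell
# Hypothetical helper: convert a PDF that is protected with a user password.
# Usage: convert_with_user_password 'password' input.pdf output.txt
convert_with_user_password() {
    pdftotext -upw "$1" "$2" "$3"
}

# Example, using the sample paths from this article:
# convert_with_user_password 'password' /home/lori/Documents/Sample.pdf /home/lori/Documents/Sample.txt
```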
You can also specify the type of end-of-line character that is applied to the converted text. This is especially useful if you plan to access the file on a different operating system like Windows or Mac. To do this, use the “-eol” flag (the middle character in the flag is a lowercase letter “O”, not a zero) followed by a space and the type of end-of-line character you want to use (“unix”, “dos”, or “mac”).
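For instance, a conversion meant for a Windows machine might look like the sketch below, which passes “dos” to the “-eol” flag. The function name “convert_for_windows” is our own, and writing the text file next to the PDF is just one reasonable choice.

```shell
# Hypothetical helper: convert a PDF to text with DOS/Windows line endings.
# The text file is written next to the PDF, with a .txt extension.
convert_for_windows() {
    pdftotext -eol dos "$1" "${1%.pdf}.txt"
}

# Example, using the sample path from this article:
# convert_for_windows /home/lori/Documents/Sample.pdf
```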
NOTE: If you don’t specify a filename for the text file, pdftotext automatically uses the base of the PDF filename and adds the “.txt” extension. For example, “file.pdf” will be converted to “file.txt”. If the text file is specified as “-”, the converted text is sent to stdout, which means the text is displayed in the Terminal window and not saved to a file.
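Sending the text to stdout is handy when you want to feed it straight into another command instead of saving it. The sketch below pipes the converted text into grep to search a PDF for a word, ignoring case. The function name “pdf_grep” is our own, and it assumes grep is installed, as it is on most Linux systems.

```shell
# Hypothetical helper: print the lines of a PDF's text that match a pattern.
# Usage: pdf_grep pattern input.pdf
pdf_grep() {
    pdftotext "$2" - | grep -i -- "$1"
}

# Example, using the sample path from this article:
# pdf_grep 'invoice' /home/lori/Documents/Sample.pdf
```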
To close the Terminal window, click the “X” button in the upper-left corner.
For more information about the pdftotext command, type “man pdftotext” at the prompt in a Terminal window.