Contents Extracter for PDF

What is Contents Extracter for PDF?

Contents Extracter for PDF is a Java library with a command-line tool that extracts characters, paths, and images from a PDF file.

Extract text
Extract real glyph bounds of characters
Extract bounding boxes of characters
Extract position and size of images
Extract images
Extract simple paths such as lines
Create image of each page

Release Notes

2017/03/15 v1.0.3

- Fixed bugs
- The tool uses the PDF coordinate system.
- Renamed the attribute "size" of the element "page" to "cropBox".
- Always use "." as the decimal point, regardless of locale.

2016/12/13 v1.0.2
- Fixed bugs
- Added the page size of each page of the PDF file to the XML file.
- Added a vertical flag for each character to the XML file.

2016/11/18 v1.0.1
- Released the first version of ContentsExtracter

Download

【Latest version】

【Archive】

License

Contents Extracter for PDF is published under the Apache License v2.0. It also includes several libraries (Apache Commons CLI, Apache Commons IO) published under the Apache License v2.0.

How to use (Command Line Tool)

The library ships with a command-line tool, so you can use it without writing a program.
The options are as follows:

-i,--input <input pdf file> : input PDF file (required)
-o,--output <output directory> : output directory
-s,--start <start page> : start page
-e,--end <end page> : end page
-d,--dpi <DPI> : DPI of the rendered and preview image files
-x,--xml : create an XML file
-r,--rawImage : extract the images embedded in the PDF file
-c,--createRenderingImage : create an image of each page of the PDF file

To show usage:

java -jar contentsExtracter.jar -h

To create rendered image files for all pages:

java -jar contentsExtracter.jar -i sample.pdf -c

To create an XML file and preview images for pages 1 to 5:

java -jar contentsExtracter.jar -i sample.pdf -s 1 -e 5 -x -p

To create an XML file and preview images for pages 1 to 5 at a given DPI:

java -jar contentsExtracter.jar -i sample.pdf -s 1 -e 5 -x -p -d 300

To create an XML file and preview images in a specific output directory:

java -jar contentsExtracter.jar -i sample.pdf -o ~/Desktop -x -p
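The tool reports positions in the PDF coordinate system, while the image files are pixel rasters at the chosen DPI. The following is a minimal sketch of how a box could be mapped from one to the other. It assumes the default PDF user space (72 units per inch, origin at the bottom-left, y axis pointing up), a crop box that starts at (0, 0), an unrotated page, and a renderer that scales by DPI / 72. The class `PdfToPixel` and its parameters are hypothetical helpers and are not part of ContentsExtracter.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper for illustration; not part of ContentsExtracter. */
final class PdfToPixel {
    /** Default PDF user space has 72 units per inch. */
    static final double POINTS_PER_INCH = 72.0;

    /**
     * Maps a box given in PDF coordinates (lower-left corner x/y, in points)
     * to pixel coordinates (upper-left corner, y axis pointing down).
     * Assumes the crop box starts at (0, 0) and the page is not rotated.
     *
     * @param pageHeightPt height of the page's crop box in points
     * @param dpi          DPI that was used to render the page image
     */
    static Rectangle2D.Double toPixels(double x, double y, double width, double height,
                                       double pageHeightPt, double dpi) {
        double scale = dpi / POINTS_PER_INCH;
        double px = x * scale;
        // Flip the y axis: distance of the box's top edge from the top of the page.
        double py = (pageHeightPt - (y + height)) * scale;
        return new Rectangle2D.Double(px, py, width * scale, height * scale);
    }

    public static void main(String[] args) {
        // A 100 x 20 pt box with its lower-left corner at (72, 700)
        // on a 792 pt tall page, rendered at 300 DPI.
        Rectangle2D.Double box = toPixels(72, 700, 100, 20, 792, 300);
        System.out.println(box);
    }
}
```

If the crop box of a page does not start at (0, 0), or the page is rotated, the offset and rotation would have to be applied before the scaling step.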

How to use (as Java Library)

You can use the tool in your Java project together with PDFBox.

Add "contentsExtracter-src.jar" to your Java project.
ContentsExtracter depends on Apache PDFBox, Apache Commons CLI, Apache Commons IO and Apache Commons Lang.
You need to add each library to your Java project.
Create an instance of the class "ContentsExtracter":

ContentsExtracter contentsExtracter = new ContentsExtracter("sample.pdf", 1, 5);

Get the contents:

List<List<Element>> pages = contentsExtracter.get();

BufferedImage[] images = contentsExtracter.getImages();
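Once the images are available as a `BufferedImage` array, they can be written to disk with the JDK's `ImageIO`. The sketch below assumes the array holds one image per extracted page, in page order. The class `PageImageWriter`, the method `writePngs`, and the file name pattern are hypothetical and are not part of ContentsExtracter. The `main` method uses a blank image as a stand-in for the result of `contentsExtracter.getImages()` so that the example runs without the library.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.imageio.ImageIO;

/** Hypothetical helper for illustration; not part of ContentsExtracter. */
final class PageImageWriter {

    /**
     * Writes each image as "page-NNN.png" into outputDir and returns the paths.
     *
     * @param firstPage page number of images[0], used only for file names
     */
    static List<Path> writePngs(BufferedImage[] images, Path outputDir, int firstPage)
            throws IOException {
        Files.createDirectories(outputDir);
        List<Path> written = new ArrayList<>();
        for (int i = 0; i < images.length; i++) {
            String name = String.format(Locale.ROOT, "page-%03d.png", firstPage + i);
            Path target = outputDir.resolve(name);
            // ImageIO.write returns false when no suitable writer is found.
            if (!ImageIO.write(images[i], "png", target.toFile())) {
                throw new IOException("No PNG writer available for " + target);
            }
            written.add(target);
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for: BufferedImage[] images = contentsExtracter.getImages();
        BufferedImage[] images = { new BufferedImage(10, 10, BufferedImage.TYPE_INT_RGB) };
        Path dir = Files.createTempDirectory("pages");
        System.out.println(writePngs(images, dir, 1));
    }
}
```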

FAQ

A memory error occurs.

The Java VM heap is too small.
Use the following VM options to set the heap size:

-Xms[Memory size]
-Xmx[Memory size]
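To check whether a heap setting is actually picked up by the Java VM, a small program can print the maximum heap the VM reports. This is a minimal sketch; the class name `HeapInfo` is hypothetical and is not part of ContentsExtracter. It could be run with the same option that will be used for the tool, for example `java -Xmx2g HeapInfo.java`.

```java
/** Hypothetical helper for illustration; not part of ContentsExtracter. */
final class HeapInfo {

    /** Maximum heap the VM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // The reported value can be somewhat lower than the -Xmx setting,
        // depending on the VM and the garbage collector in use.
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

The heap options go before `-jar` on the command line, for example `java -Xmx2g -jar contentsExtracter.jar -i sample.pdf -c`.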