Contents Extracter for PDF

What is Contents Extracter for PDF?
Contents Extracter for PDF is a Java library with a command line tool that extracts characters, paths, and images from a PDF file.

...
  • Extract text
  • Extract real glyph bounds of characters
  • Extract bounding boxes of characters
  • Extract position and size of images
  • Extract images
  • Extract simple Paths such as lines
  • Create image of each page
Release Notes
  • 2017/03/15 v1.0.3
    • Fixed bugs
    • The tool now uses the PDF coordinate system.
    • The attribute "size" of the element "page" was renamed to "cropBox".
    • "." is now used as the decimal point in all locales.
  • 2016/12/13 v1.0.2
    • Fixed bugs
    • Added the page size of each page of the PDF file to the XML file.
    • Added a vertical flag for each character to the XML file.
  • 2016/11/18 v1.0.1
    • Released the first version of ContentsExtracter
License
Contents Extracter for PDF is published under the Apache License v2.0. It also includes several libraries (Apache Commons CLI, Apache Commons IO) published under the Apache License v2.0.
How to use (Command Line Tool)
The library ships with a command line tool, so you can use it without writing any code.
The options are as follows:
  • -i,--input <input pdf file> : input PDF file (required)
  • -o,--output <output directory> : output directory
  • -s,--start <start page> : start page
  • -e,--end <end page> : end page
  • -d,--dpi <DPI> : DPI of the rendered and preview image files
  • -x,--xml : create an XML file
  • -r,--rawImage : extract the images embedded in the PDF file
  • -c,--createRenderingImage : create an image of each page of the PDF file
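
As a rough guide to choosing a value for -d: PDF page sizes are expressed in points (1/72 inch by default), so a page rendered at a given DPI is about points × DPI / 72 pixels on each side. The sketch below only illustrates that arithmetic. The class and method names are hypothetical and not part of ContentsExtracter, and the tool's own rounding may differ.

```java
/** Hypothetical helper: estimates a rendered image size from a page size in points. */
class DpiMath {
    /** PDF user space defaults to 72 units (points) per inch. */
    static final double POINTS_PER_INCH = 72.0;

    /** Converts a length in points to pixels at the given DPI, rounded to the nearest pixel. */
    static int pointsToPixels(double points, double dpi) {
        return (int) Math.round(points * dpi / POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        // An A4 page is approximately 595 x 842 points.
        // At 300 DPI that is roughly 2479 x 3508 pixels.
        System.out.println(pointsToPixels(595, 300) + " x " + pointsToPixels(842, 300));
    }
}
```

At 72 DPI one point maps to one pixel, so doubling the DPI doubles each side and quadruples the memory needed per page image.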


To show usage:
java -jar contentsExtracter.jar -h
To create rendering image files of all pages:
java -jar contentsExtracter.jar -i sample.pdf -c
To create an XML file and preview images for pages 1 to 5:
java -jar contentsExtracter.jar -i sample.pdf -s 1 -e 5 -x -p
To create an XML file and preview images for pages 1 to 5 at a given DPI:
java -jar contentsExtracter.jar -i sample.pdf -s 1 -e 5 -x -p -d 300
To create an XML file and preview images in a specific output directory:
java -jar contentsExtracter.jar -i sample.pdf -o ~/Desktop -x -p
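
If you want to drive the command line tool from another Java program (for example, to process many files in a batch), one possible approach is to launch it with ProcessBuilder. This is a minimal sketch, not part of ContentsExtracter: the class name ExtracterLauncher and the method buildCommand are hypothetical, and the jar and PDF paths are assumed to be in the working directory. It uses only the options listed above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical wrapper that launches the command line tool from Java. */
class ExtracterLauncher {
    /** Builds a command that creates an XML file for the given page range. */
    static List<String> buildCommand(String jarPath, String inputPdf, int startPage, int endPage) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-jar");
        command.add(jarPath);
        command.add("-i");
        command.add(inputPdf);
        command.add("-s");
        command.add(Integer.toString(startPage));
        command.add("-e");
        command.add(Integer.toString(endPage));
        command.add("-x"); // create an XML file
        return command;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Assumes contentsExtracter.jar and sample.pdf exist in the working directory.
        List<String> command = buildCommand("contentsExtracter.jar", "sample.pdf", 1, 5);
        // inheritIO() forwards the tool's output to this program's console.
        Process process = new ProcessBuilder(command).inheritIO().start();
        System.out.println("Exit code: " + process.waitFor());
    }
}
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.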
How to use (as Java Library)
You can use the tool in your Java project together with PDFBox.
  1. Add "contentsExtracter-src.jar" to your Java project.
  2. ContentsExtracter depends on Apache PDFBox, Apache Commons CLI, Apache Commons IO, and Apache Commons Lang.
    You need to add each library to your Java project.
  3. Create an instance of the class "ContentsExtracter":
    ContentsExtracter contentsExtracter = new ContentsExtracter("sample.pdf", 1, 5);
  4. Get the contents:
    List<List<Element>> pages = contentsExtracter.get();
    The contents of a PDF file are represented by the "Element" class. "Element" is an abstract class with three subclasses: Image, CharacterData, and PathElement. To read each piece of content, cast the Element instance to the appropriate subclass (Image, CharacterData, or PathElement) and call its "get" methods.
    To get an image of each page:
    BufferedImage[] images = contentsExtracter.getImages();
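
The cast-by-subclass step described above might look like the following sketch. It is self-contained so that it runs without the library: the nested Element, Image, CharacterData, and PathElement types are stand-ins written for this example, and the "text" field is hypothetical. The real classes expose their own "get" methods, which are not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of the cast-by-subclass pattern; the nested types are stand-ins, not the library's classes. */
class ElementDispatchSketch {
    // Stand-ins for the library's types, for illustration only.
    static abstract class Element {}
    static class Image extends Element {}
    static class PathElement extends Element {}
    static class CharacterData extends Element {
        final String text; // hypothetical field
        CharacterData(String text) { this.text = text; }
    }

    /** Collects the text of one page by casting each Element to its concrete subclass. */
    static String pageText(List<Element> page) {
        StringBuilder sb = new StringBuilder();
        for (Element element : page) {
            if (element instanceof CharacterData) {
                sb.append(((CharacterData) element).text);
            }
            // Image and PathElement instances would be cast and handled the same way.
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // In real code this list would come from contentsExtracter.get().
        List<List<Element>> pages = new ArrayList<>();
        pages.add(Arrays.<Element>asList(
                new CharacterData("P"), new Image(), new CharacterData("DF"), new PathElement()));
        for (List<Element> page : pages) {
            System.out.println(pageText(page));
        }
    }
}
```

Checking the type with instanceof before casting avoids a ClassCastException when a page mixes characters, images, and paths.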
FAQ
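A memory error occurs.
The Java VM heap is too small.
Use the following VM options to set the heap size, placing them before -jar (for example, java -Xmx2g -jar contentsExtracter.jar -i sample.pdf -x):
  • -Xms[memory size] : initial heap size (for example, -Xms512m)
  • -Xmx[memory size] : maximum heap size (for example, -Xmx2g)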
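
To confirm that a heap option is actually being picked up, a small check like the one below can help. It is a generic sketch using the standard Runtime API; the class name HeapCheck is hypothetical and not part of ContentsExtracter.

```java
/** Prints the maximum heap size that the running JVM will attempt to use. */
class HeapCheck {
    /** Returns the JVM's maximum heap size in mebibytes. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Run it with the same option you plan to pass to the tool, for example java -Xmx2g HeapCheck, and compare the reported value with what you requested. The reported figure can be somewhat lower than the requested size, depending on the JVM.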

Copyright 2016 Fujiyoshi Lab, Ibaraki University.