Package org.apache.pdfbox.pdfparser
Class PDFParser
- java.lang.Object
-
- org.apache.pdfbox.pdfparser.BaseParser
-
- org.apache.pdfbox.pdfparser.PDFParser
-
- Direct Known Subclasses:
NonSequentialPDFParser
public class PDFParser extends BaseParser
This class will handle the parsing of the PDF document.
- Version:
- $Revision: 1.53 $
- Author:
- Ben Litchfield
-
-
Field Summary
Fields (Modifier and Type, Field, Description):
- protected boolean isFDFDocment
- protected XrefTrailerResolver xrefTrailerResolver: Collects all xref/trailer objects and resolves them into a single object using the startxref reference.
Fields inherited from class org.apache.pdfbox.pdfparser.BaseParser
DEF, document, ENDOBJ, ENDSTREAM, forceParsing, pdfSource, PROP_PUSHBACK_SIZE
-
-
Constructor Summary
Constructors (Constructor, Description):
- PDFParser(java.io.InputStream input): Constructor.
- PDFParser(java.io.InputStream input, RandomAccess rafi): Constructor to allow control over RandomAccessFile.
- PDFParser(java.io.InputStream input, RandomAccess rafi, boolean force): Constructor to allow control over RandomAccessFile.
-
Method Summary
Methods (Modifier and Type, Method, Description):
- void clearResources(): Release all used resources.
- COSDocument getDocument(): This will get the document that was parsed.
- FDFDocument getFDFDocument(): This will get the FDF document that was parsed.
- PDDocument getPDDocument(): This will get the PD document that was parsed.
- protected boolean isContinueOnError(java.lang.Exception e): Returns true if parsing should be continued.
- void parse(): This will parse the stream and populate the COSDocument object.
- protected void parseHeader()
- protected boolean parseStartXref(): This will parse the startxref section from the stream.
- protected boolean parseTrailer(): This will parse the trailer from the stream and add it to the state.
- void parseXrefStream(COSStream stream, long objByteOffset): Fills XrefTrailerResolver with data of given stream.
- void parseXrefStream(COSStream stream, long objByteOffset, boolean isStandalone): Fills XrefTrailerResolver with data of given stream.
- protected boolean parseXrefTable(long startByteOffset): This will parse the xref table from the stream and add it to the state. The XrefTable contents are ignored.
- protected void readVersionInTrailer(COSDictionary parsedTrailer): The document catalog can also have a /Version parameter which overrides the version specified in the header if, and only if, it is greater.
- void setTempDirectory(java.io.File tmpDir): This is the directory where pdfbox will create a temporary file for storing the PDF document stream.
Methods inherited from class org.apache.pdfbox.pdfparser.BaseParser
isClosing, isClosing, isEndOfName, isEOL, isEOL, isWhitespace, isWhitespace, parseBoolean, parseCOSArray, parseCOSDictionary, parseCOSName, parseCOSStream, parseCOSString, parseCOSString, parseDirObject, readExpectedString, readGenerationNumber, readInt, readLine, readLong, readObjectNumber, readString, readString, readStringNumber, readUntilEndStream, setDocument, skipSpaces
-
-
-
-
Field Detail
-
isFDFDocment
protected boolean isFDFDocment
-
xrefTrailerResolver
protected XrefTrailerResolver xrefTrailerResolver
Collects all xref/trailer objects and resolves them into a single object using the startxref reference.
-
-
Constructor Detail
-
PDFParser
public PDFParser(java.io.InputStream input) throws java.io.IOException
Constructor.
- Parameters:
input - The input stream that contains the PDF document.
- Throws:
java.io.IOException - If there is an error initializing the stream.
-
PDFParser
public PDFParser(java.io.InputStream input, RandomAccess rafi) throws java.io.IOException
Constructor to allow control over RandomAccessFile.
- Parameters:
input - The input stream that contains the PDF document.
rafi - The RandomAccessFile to be used in the internal COSDocument.
- Throws:
java.io.IOException - If there is an error initializing the stream.
-
PDFParser
public PDFParser(java.io.InputStream input, RandomAccess rafi, boolean force) throws java.io.IOException
Constructor to allow control over RandomAccessFile. Also enables the parser to skip corrupt objects in an attempt to force parsing.
- Parameters:
input - The input stream that contains the PDF document.
rafi - The RandomAccessFile to be used in the internal COSDocument.
force - When true, the parser will skip corrupt PDF objects and continue parsing at the next object in the file.
- Throws:
java.io.IOException - If there is an error initializing the stream.
-
-
Method Detail
-
setTempDirectory
public void setTempDirectory(java.io.File tmpDir)
This is the directory in which pdfbox creates a temporary file for storing the PDF document stream. By default, this directory is the value of the system property java.io.tmpdir.
- Parameters:
tmpDir - The directory in which to create the scratch files needed to store PDF document streams.
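The default described above can be sketched without the library. This is a minimal illustration, not the pdfbox implementation: the class ScratchDirSketch and the method resolveScratchDir are hypothetical names, and treating a null argument as "use the default" is an assumption made for the example.

```java
import java.io.File;

// Hypothetical helper, not part of pdfbox: shows how a scratch directory
// could fall back to the java.io.tmpdir system property.
class ScratchDirSketch {

    // Returns the explicitly configured directory, or the JVM default
    // temporary directory when none was configured (assumed behaviour).
    static File resolveScratchDir(File tmpDir) {
        if (tmpDir != null) {
            return tmpDir;
        }
        return new File(System.getProperty("java.io.tmpdir"));
    }
}
```

With the real parser, the equivalent step would be a call to setTempDirectory(java.io.File) before parse().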
-
isContinueOnError
protected boolean isContinueOnError(java.lang.Exception e)
Returns true if parsing should be continued. By default, forceParsing is returned. This can be overridden to add application-specific handling (for example, to stop parsing when the number of exceptions thrown exceeds a certain number).
- Parameters:
e - The exception, if available. Can be null if no exception is available.
- Returns:
true if parsing can be continued, otherwise false
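The error-limit idea mentioned above can be sketched as a standalone policy class. The names ErrorBudgetSketch and maxErrors are hypothetical, and the class only mirrors the documented signature; in a real application the same logic would sit in a PDFParser subclass that overrides isContinueOnError.

```java
// Hypothetical stand-in for a PDFParser subclass: continues parsing only
// while forced parsing is enabled and the error budget is not exceeded.
class ErrorBudgetSketch {
    private final boolean forceParsing; // mirrors the inherited forceParsing flag
    private final int maxErrors;        // assumed application-specific limit
    private int errorCount;

    ErrorBudgetSketch(boolean forceParsing, int maxErrors) {
        this.forceParsing = forceParsing;
        this.maxErrors = maxErrors;
    }

    // Each call counts as one error, even when e is null, because the
    // parser only asks this question after something went wrong.
    boolean isContinueOnError(Exception e) {
        errorCount++;
        return forceParsing && errorCount <= maxErrors;
    }
}
```

Counting calls rather than non-null exceptions is a design choice for this sketch; an application could equally inspect the exception type before deciding.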
-
parse
public void parse() throws java.io.IOException
This will parse the stream and populate the COSDocument object. This will close the stream when it is done parsing.
- Throws:
java.io.IOException - If there is an error reading from the stream or corrupt data is found.
-
parseHeader
protected void parseHeader() throws java.io.IOException
- Throws:
java.io.IOException
-
getDocument
public COSDocument getDocument() throws java.io.IOException
This will get the document that was parsed. parse() must be called before this is called. When you are done with this document you must call close() on it to release resources.
- Returns:
The document that was parsed.
- Throws:
java.io.IOException - If there is an error getting the document.
-
getPDDocument
public PDDocument getPDDocument() throws java.io.IOException
This will get the PD document that was parsed. When you are done with this document you must call close() on it to release resources.
- Returns:
The document at the PD layer.
- Throws:
java.io.IOException - If there is an error getting the document.
-
getFDFDocument
public FDFDocument getFDFDocument() throws java.io.IOException
This will get the FDF document that was parsed. When you are done with this document you must call close() on it to release resources.
- Returns:
The FDF document that was parsed.
- Throws:
java.io.IOException - If there is an error getting the document.
-
parseStartXref
protected boolean parseStartXref() throws java.io.IOException
This will parse the startxref section from the stream. The startxref value is ignored.
- Returns:
false on parsing error
- Throws:
java.io.IOException - If an IO error occurs.
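For context, a startxref section in a PDF file consists of the keyword startxref, the byte offset of the most recent cross-reference section, and the %%EOF marker. The sketch below only illustrates that layout. It is not how this method is implemented (the method is documented as ignoring the value), and StartXrefSketch and findStartXref are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the startxref layout, not pdfbox code.
class StartXrefSketch {
    // Keyword, whitespace, decimal byte offset, whitespace, end-of-file marker.
    private static final Pattern START_XREF =
            Pattern.compile("startxref\\s+(\\d{1,18})\\s+%%EOF");

    // Returns the offset of the last startxref section found in the text,
    // or -1 when no well-formed section is present.
    static long findStartXref(String tail) {
        Matcher m = START_XREF.matcher(tail);
        long offset = -1;
        while (m.find()) {
            offset = Long.parseLong(m.group(1));
        }
        return offset;
    }
}
```

The last match wins because an incrementally updated file can contain several startxref sections, and the final one describes the current state of the document.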
-
parseXrefTable
protected boolean parseXrefTable(long startByteOffset) throws java.io.IOException
This will parse the xref table from the stream and add it to the state. The XrefTable contents are ignored.
- Parameters:
startByteOffset - the offset to start at
- Returns:
false on parsing error
- Throws:
java.io.IOException - If an IO error occurs.
-
parseTrailer
protected boolean parseTrailer() throws java.io.IOException
This will parse the trailer from the stream and add it to the state.
- Returns:
false on parsing error
- Throws:
java.io.IOException - If an IO error occurs.
-
readVersionInTrailer
protected void readVersionInTrailer(COSDictionary parsedTrailer)
The document catalog can also have a /Version parameter which overrides the version specified in the header if, and only if, it is greater.
- Parameters:
parsedTrailer - the parsed catalog in the trailer
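The override rule stated above can be written down as a small comparison. This is a sketch of the rule only: VersionOverrideSketch and effectiveVersion are hypothetical names, and representing versions as float values (with null meaning "no /Version entry") is an assumption made for the example.

```java
// Hypothetical illustration of the "/Version overrides only if greater" rule.
class VersionOverrideSketch {

    // headerVersion: version read from the file header, for example 1.4f.
    // catalogVersion: version from the catalog's /Version entry, or null
    // when the catalog has no such entry.
    static float effectiveVersion(float headerVersion, Float catalogVersion) {
        if (catalogVersion != null && catalogVersion > headerVersion) {
            return catalogVersion;
        }
        return headerVersion;
    }
}
```

For instance, a header version of 1.4 with a catalog /Version of 1.6 yields 1.6, while a header version of 1.6 with a catalog /Version of 1.4 stays at 1.6.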
-
parseXrefStream
public void parseXrefStream(COSStream stream, long objByteOffset) throws java.io.IOException
Fills XrefTrailerResolver with data of given stream. Stream must be of type XRef.
- Parameters:
stream - the stream to be read
objByteOffset - the offset to start at
- Throws:
java.io.IOException - if there is an error parsing the stream
-
parseXrefStream
public void parseXrefStream(COSStream stream, long objByteOffset, boolean isStandalone) throws java.io.IOException
Fills XrefTrailerResolver with data of given stream. Stream must be of type XRef.
- Parameters:
stream - the stream to be read
objByteOffset - the offset to start at
isStandalone - should be set to true if the stream is not part of a hybrid xref table
- Throws:
java.io.IOException - if there is an error parsing the stream
-
clearResources
public void clearResources()
Release all used resources.
- Overrides:
clearResources in class BaseParser
-
-