PoDoFo  0.9.4
PoDoFo::PdfParser Class Reference

#include <PdfParser.h>

Inherits PoDoFo::PdfTokenizer.

Public Member Functions

 PdfParser (PdfVecObjects *pVecObjects)
 
 PdfParser (PdfVecObjects *pVecObjects, const char *pszFilename, bool bLoadOnDemand=true)
 
 PdfParser (PdfVecObjects *pVecObjects, const char *pBuffer, long lLen, bool bLoadOnDemand=true)
 
 PdfParser (PdfVecObjects *pVecObjects, const PdfRefCountedInputDevice &rDevice, bool bLoadOnDemand=true)
 
virtual ~PdfParser ()
 
void ParseFile (const char *pszFilename, bool bLoadOnDemand=true)
 
void ParseFile (const char *pBuffer, long lLen, bool bLoadOnDemand=true)
 
void ParseFile (const PdfRefCountedInputDevice &rDevice, bool bLoadOnDemand=true)
 
bool QuickEncryptedCheck (const char *pszFilename)
 
int GetNumberOfIncrementalUpdates () const
 
const PdfVecObjects * GetObjects () const
 
EPdfVersion GetPdfVersion () const
 
const char * GetPdfVersionString () const
 
const PdfObject * GetTrailer () const
 
bool GetLoadOnDemand () const
 
bool IsLinearized () const
 
size_t GetFileSize () const
 
bool GetEncrypted () const
 
const PdfEncrypt * GetEncrypt () const
 
PdfEncrypt * TakeEncrypt ()
 
void SetPassword (const std::string &sPassword)
 
bool IsStrictParsing () const
 
void SetStrictParsing (bool bStrict)
 
bool GetIgnoreBrokenObjects ()
 
void SetIgnoreBrokenObjects (bool bBroken)
 
- Public Member Functions inherited from PoDoFo::PdfTokenizer
virtual bool GetNextToken (const char *&pszToken, EPdfTokenType *peType=NULL)
 
bool IsNextToken (const char *pszToken)
 
pdf_long GetNextNumber ()
 
void GetNextVariant (PdfVariant &rVariant, PdfEncrypt *pEncrypt)
 

Static Public Member Functions

static long GetMaxObjectCount ()
 
static void SetMaxObjectCount (long nMaxObjects)
 
- Static Public Member Functions inherited from PoDoFo::PdfTokenizer
static PODOFO_NOTHROW bool IsWhitespace (const unsigned char ch)
 
static PODOFO_NOTHROW bool IsDelimiter (const unsigned char ch)
 
static PODOFO_NOTHROW bool IsRegular (const unsigned char ch)
 
static PODOFO_NOTHROW bool IsPrintable (const unsigned char ch)
 
static PODOFO_NOTHROW int GetHexValue (const unsigned char ch)
 

Protected Member Functions

void FindToken (const char *pszToken, const long lRange)
 
void FindToken2 (const char *pszToken, const long lRange, size_t searchEnd)
 
void ReadDocumentStructure ()
 
void HasLinearizationDict ()
 
void MergeTrailer (const PdfObject *pTrailer)
 
void ReadTrailer ()
 
void ReadXRef (pdf_long *pXRefOffset)
 
void ReadXRefContents (pdf_long lOffset, bool bPositionAtEnd=false)
 
void ReadXRefSubsection (pdf_int64 &nFirstObject, pdf_int64 &nNumObjects)
 
void ReadXRefStreamContents (pdf_long lOffset, bool bReadOnlyTrailer)
 
void ReadObjects ()
 
void ReadObjectsInternal ()
 
void ReadObjectFromStream (int nObjNo, int nIndex)
 
bool IsPdfFile ()
 
void CheckEOFMarker ()
 
- Protected Member Functions inherited from PoDoFo::PdfTokenizer
void GetNextVariant (const char *pszToken, EPdfTokenType eType, PdfVariant &rVariant, PdfEncrypt *pEncrypt)
 
EPdfDataType DetermineDataType (const char *pszToken, EPdfTokenType eType, PdfVariant &rVariant)
 
void ReadDictionary (PdfVariant &rVariant, PdfEncrypt *pEncrypt)
 
void ReadArray (PdfVariant &rVariant, PdfEncrypt *pEncrypt)
 
void ReadString (PdfVariant &rVariant, PdfEncrypt *pEncrypt)
 
void ReadHexString (PdfVariant &rVariant, PdfEncrypt *pEncrypt)
 
void ReadName (PdfVariant &rVariant)
 
void QuequeToken (const char *pszToken, EPdfTokenType eType)
 

Additional Inherited Members

- Static Public Attributes inherited from PoDoFo::PdfTokenizer
static const unsigned int HEX_NOT_FOUND = std::numeric_limits<unsigned int>::max()
 

Detailed Description

PdfParser reads a PDF file into memory. The file can be modified in memory and written back using the PdfWriter class. Most PDF features are supported.

Constructor & Destructor Documentation

PoDoFo::PdfParser::PdfParser ( PdfVecObjects *  pVecObjects)

Create a new PdfParser object. You have to open a PDF file using ParseFile later.

Parameters
pVecObjects  vector to write the parsed PdfObjects to
See also
ParseFile

PoDoFo::PdfParser::PdfParser ( PdfVecObjects *  pVecObjects,
const char *  pszFilename,
bool  bLoadOnDemand = true 
)

Create a new PdfParser object, open a PDF file, and parse it into memory.

Parameters
pVecObjects  vector to write the parsed PdfObjects to
pszFilename  filename of the file which is going to be parsed
bLoadOnDemand  If true, each object is read from the file when it is first accessed; this is faster if you do not need the complete PDF file in memory. If false, all objects are read immediately.

This might throw a PdfError( ePdfError_InvalidPassword ) exception if a password is required to read this PDF. Call SetPassword() with the correct password in this case.

See also
SetPassword

PoDoFo::PdfParser::PdfParser ( PdfVecObjects *  pVecObjects,
const char *  pBuffer,
long  lLen,
bool  bLoadOnDemand = true 
)

Create a new PdfParser object and parse a PDF file held in a memory buffer.

Parameters
pVecObjects  vector to write the parsed PdfObjects to
pBuffer  buffer containing a PDF file in memory
lLen  length of the buffer containing the PDF file
bLoadOnDemand  If true, each object is read from the file when it is first accessed; this is faster if you do not need the complete PDF file in memory. If false, all objects are read immediately.

This might throw a PdfError( ePdfError_InvalidPassword ) exception if a password is required to read this PDF. Call SetPassword() with the correct password in this case.

See also
SetPassword

PoDoFo::PdfParser::PdfParser ( PdfVecObjects *  pVecObjects,
const PdfRefCountedInputDevice &  rDevice,
bool  bLoadOnDemand = true 
)

Create a new PdfParser object and parse a PDF file read from an input device.

Parameters
pVecObjects  vector to write the parsed PdfObjects to
rDevice  read from this PdfRefCountedInputDevice
bLoadOnDemand  If true, each object is read from the file when it is first accessed; this is faster if you do not need the complete PDF file in memory. If false, all objects are read immediately.

This might throw a PdfError( ePdfError_InvalidPassword ) exception if a password is required to read this PDF. Call SetPassword() with the correct password in this case.

See also
SetPassword

PoDoFo::PdfParser::~PdfParser ( )
virtual

Delete the PdfParser and all PdfObjects.

Member Function Documentation

void PoDoFo::PdfParser::CheckEOFMarker ( )
protected

Checks for the existence of the %%EOF marker at the end of the file. When strict mode is off, it also attempts to set up the parser to ignore any garbage after the last %%EOF marker. Raises an error if there is a problem with the marker.
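
The check described above can be illustrated with a small standalone sketch. It is not PoDoFo's implementation: FindEndOfPdf is a hypothetical helper, and treating anything but whitespace after the marker as an error in strict mode is an assumption made for the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of PoDoFo: returns the offset just past the
// last "%%EOF" marker in `data`, or std::string::npos on error.
// Assumption for this sketch: in strict mode only whitespace may follow
// the marker; in lenient mode trailing garbage is simply ignored.
std::size_t FindEndOfPdf(const std::string& data, bool strict)
{
    const std::string marker = "%%EOF";
    const std::size_t pos = data.rfind(marker);   // last occurrence
    if (pos == std::string::npos)
        return std::string::npos;                 // no marker at all

    const std::size_t end = pos + marker.size();
    if (strict &&
        data.find_first_not_of(" \t\r\n", end) != std::string::npos)
        return std::string::npos;                 // garbage after the marker
    return end;                                   // bytes past `end` are ignored
}
```

A caller in lenient mode would treat the returned offset as the logical end of the file and search for the trailer before it.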

void PoDoFo::PdfParser::FindToken ( const char *  pszToken,
const long  lRange 
)
protected

Searches backwards from the end of the file and tries to find a token. The file position is then set right after the token.

Parameters
pszToken  a token to find
lRange  range in bytes in which to search, beginning at the end of the file
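
As a rough illustration of a backward token search limited to a byte range, the sketch below works on an in-memory string instead of a file. FindTokenFromEnd is a hypothetical name, and the way the range limit is applied is an assumption, not PoDoFo's code.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a backward token search; not PoDoFo's code.
// Looks for the last occurrence of `token` that starts within the final
// `range` bytes of `data` and returns the offset right after it,
// or std::string::npos if the token is not found in that window.
std::size_t FindTokenFromEnd(const std::string& data,
                             const std::string& token,
                             std::size_t range)
{
    // First byte that belongs to the search window.
    const std::size_t windowStart = data.size() - std::min(range, data.size());
    const std::size_t pos = data.rfind(token);
    if (pos == std::string::npos || pos < windowStart)
        return std::string::npos;
    return pos + token.size();   // "positioned right after the token"
}
```

Searching only a bounded window keeps the cost independent of the file size, which matters when the token of interest (for example startxref) is expected near the end.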

void PoDoFo::PdfParser::FindToken2 ( const char *  pszToken,
const long  lRange,
size_t  searchEnd 
)
protected

Searches backwards from the specified position of the file and tries to find a token. The file position is then set right after the token.

Parameters
pszToken  a token to find
lRange  range in bytes in which to search, beginning at the specified position of the file
searchEnd  position in the file at which the backward search begins

const PdfEncrypt* PoDoFo::PdfParser::GetEncrypt ( ) const
inline
Returns
the parser's encryption object, or NULL if the read PDF file was not encrypted

bool PoDoFo::PdfParser::GetEncrypted ( ) const
inline
Returns
true if the parsed PDF file is encrypted

size_t PoDoFo::PdfParser::GetFileSize ( ) const
inline
Returns
the length of the file

bool PoDoFo::PdfParser::GetIgnoreBrokenObjects ( )
inline
Returns
true if broken objects are ignored while parsing

bool PoDoFo::PdfParser::GetLoadOnDemand ( ) const
inline
Returns
true if this PdfParser loads objects on demand, at the time each is first accessed; false if all objects are loaded immediately. The constructors and ParseFile overloads that take bLoadOnDemand default it to true.

long PoDoFo::PdfParser::GetMaxObjectCount ( )
inline static
Returns
maximum object count to read (default is LONG_MAX which means no limit)

int PoDoFo::PdfParser::GetNumberOfIncrementalUpdates ( ) const
inline

Retrieve the number of incremental updates that have been applied to the last parsed PDF file.

0 means no update has been applied.

Returns
the number of incremental updates to the parsed PDF.

const PdfVecObjects * PoDoFo::PdfParser::GetObjects ( ) const
inline

Get a pointer to the sorted internal objects vector.

Returns
the internal objects vector.

EPdfVersion PoDoFo::PdfParser::GetPdfVersion ( ) const
inline

Get the file format version of the PDF.

Returns
the file format version as enum

const char * PoDoFo::PdfParser::GetPdfVersionString ( ) const

Get the file format version of the PDF.

Returns
the file format version as string

const PdfObject * PoDoFo::PdfParser::GetTrailer ( ) const
inline

Get the trailer dictionary which can be written unmodified to a pdf file.

void PoDoFo::PdfParser::HasLinearizationDict ( )
protected

Checks whether this PDF is linearized. Initializes the linearization dictionary on success.

bool PoDoFo::PdfParser::IsLinearized ( ) const
inline
Returns
whether the parsed document contains linearization tables

bool PoDoFo::PdfParser::IsPdfFile ( )
protected

Checks the magic number at the start of the PDF file and sets the m_ePdfVersion member to the version of the PDF file.

Returns
true if this is a PDF file, otherwise false
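
For readers unfamiliar with the PDF magic number, the following standalone sketch shows one way to recognise a "%PDF-M.m" header and extract the version text. ReadPdfHeaderVersion is a hypothetical helper; PoDoFo maps the version onto EPdfVersion rather than returning a string.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, not PoDoFo's implementation: extracts "M.m" from a
// "%PDF-M.m" header at the very start of `data`. Returns an empty string
// when the magic number is missing or the version is malformed.
std::string ReadPdfHeaderVersion(const std::string& data)
{
    const std::string magic = "%PDF-";
    if (data.compare(0, magic.size(), magic) != 0)
        return "";                                // not a PDF header
    const std::size_t v = magic.size();
    if (data.size() < v + 3)
        return "";                                // header is truncated
    const bool wellFormed =
        std::isdigit(static_cast<unsigned char>(data[v])) &&
        data[v + 1] == '.' &&
        std::isdigit(static_cast<unsigned char>(data[v + 2]));
    return wellFormed ? data.substr(v, 3) : "";
}
```

The sketch only accepts a single-digit major and minor version, which is an assumption that holds for the versions defined so far.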

bool PoDoFo::PdfParser::IsStrictParsing ( ) const
inline
Returns
true if strict parsing mode is enabled
See also
SetStrictParsing

void PoDoFo::PdfParser::MergeTrailer ( const PdfObject *  pTrailer)
protected

Merge the information of this trailer object into the parser's main trailer object.

Parameters
pTrailer  take the keys to merge from this dictionary.

void PoDoFo::PdfParser::ParseFile ( const char *  pszFilename,
bool  bLoadOnDemand = true 
)

Open a PDF file and parse it.

Parameters
pszFilename  filename of the file which is going to be parsed
bLoadOnDemand  If true, each object is read from the file when it is first accessed; this is faster if you do not need the complete PDF file in memory. If false, all objects are read immediately.

This might throw a PdfError( ePdfError_InvalidPassword ) exception if a password is required to read this PDF. Call SetPassword() with the correct password in this case.

See also
SetPassword

void PoDoFo::PdfParser::ParseFile ( const char *  pBuffer,
long  lLen,
bool  bLoadOnDemand = true 
)

Open a PDF file and parse it.

Parameters
pBuffer  buffer containing a PDF file in memory
lLen  length of the buffer containing the PDF file
bLoadOnDemand  If true, each object is read from the file when it is first accessed; this is faster if you do not need the complete PDF file in memory. If false, all objects are read immediately.

This might throw a PdfError( ePdfError_InvalidPassword ) exception if a password is required to read this PDF. Call SetPassword() with the correct password in this case.

See also
SetPassword

void PoDoFo::PdfParser::ParseFile ( const PdfRefCountedInputDevice &  rDevice,
bool  bLoadOnDemand = true 
)

Open a PDF file and parse it.

Parameters
rDevice  the input device to read from
bLoadOnDemand  If true, each object is read from the file when it is first accessed; this is faster if you do not need the complete PDF file in memory. If false, all objects are read immediately.

This might throw a PdfError( ePdfError_InvalidPassword ) exception if a password is required to read this PDF. Call SetPassword() with the correct password in this case.

See also
SetPassword

bool PoDoFo::PdfParser::QuickEncryptedCheck ( const char *  pszFilename)

Quick method to detect secured PDF files, i.e. a PDF with an /Encrypt key in the trailer dictionary.

Returns
true if document is secured, false otherwise

void PoDoFo::PdfParser::ReadDocumentStructure ( )
protected

Reads the xref sections and the trailers of the file into memory in the correct order, taking linearized PDF files into account.

void PoDoFo::PdfParser::ReadObjectFromStream ( int  nObjNo,
int  nIndex 
)
protected

Read the object with index nIndex from the object stream nObjNo and push it on the objects vector m_vecOffsets.

All objects are read from this stream and the stream object is freed from memory. Further calls that try to read from the same stream simply do nothing.

Parameters
nObjNo  object number of the stream object
nIndex  index of the object which should be parsed

void PoDoFo::PdfParser::ReadObjects ( )
protected

Reads all objects from the pdf into memory from the offsets listed in m_vecOffsets.

If required, an encryption object is set up first.

The actual reading happens in ReadObjectsInternal(), either when no encryption is required or once a correct encryption object has been initialized from SetPassword.

void PoDoFo::PdfParser::ReadObjectsInternal ( )
protected

Reads all objects from the pdf into memory from the offsets listed in m_vecOffsets.

Requires a correctly set up PdfEncrypt object with the correct password.

This method is called from ReadObjects or SetPassword.

See also
ReadObjects
SetPassword

void PoDoFo::PdfParser::ReadTrailer ( )
protected

Read the trailer dictionary at the end of the file.

void PoDoFo::PdfParser::ReadXRef ( pdf_long *  pXRefOffset)
protected

Looks for a startxref entry at the current file position and saves its byte offset to pXRefOffset.

Parameters
pXRefOffset  store the byte offset of the xref section into this variable.
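
A startxref entry consists of the keyword followed by the byte offset of the xref section. The sketch below reads such an entry from a string that begins at the keyword; ReadStartXRef is a hypothetical helper used only for illustration and does not reflect how PoDoFo tokenizes the file.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper, not PoDoFo's code: `text` is assumed to begin at a
// "startxref" keyword. On success the byte offset that follows the keyword
// is stored in *offset and true is returned.
bool ReadStartXRef(const std::string& text, long* offset)
{
    const std::string keyword = "startxref";
    if (text.compare(0, keyword.size(), keyword) != 0)
        return false;                              // keyword missing

    const char* begin = text.c_str() + keyword.size();
    char* end = nullptr;
    const long value = std::strtol(begin, &end, 10);  // skips leading whitespace
    if (end == begin || value < 0)
        return false;                              // no usable number
    *offset = value;
    return true;
}
```

The output parameter mirrors the pXRefOffset convention documented above; a caller would pass the stored offset on to the code that reads the xref section.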

void PoDoFo::PdfParser::ReadXRefContents ( pdf_long  lOffset,
bool  bPositionAtEnd = false 
)
protected

Reads the xref table from a PDF file. If there is no xref table, ReadXRefStreamContents() is called.

Parameters
lOffset  read the table from this offset
bPositionAtEnd  if true the xref table is not read, but the file stream is positioned directly after the table, which allows reading a following trailer dictionary.

void PoDoFo::PdfParser::ReadXRefStreamContents ( pdf_long  lOffset,
bool  bReadOnlyTrailer 
)
protected

Reads the contents of an xref stream object.

Parameters
lOffset  read the stream from this offset
bReadOnlyTrailer  if true, only the trailer is read; the contents of the xref stream are not parsed

void PoDoFo::PdfParser::ReadXRefSubsection ( pdf_int64 &  nFirstObject,
pdf_int64 &  nNumObjects 
)
protected

Reads an xref subsection.

Throws ePdfError_NoXref if the number of objects read was not the number specified by the subsection header (as passed in nNumObjects).

Parameters
nFirstObject  object number of the first object
nNumObjects  how many objects should be read from this section
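
To make the subsection layout concrete, the sketch below parses a classic cross-reference subsection: a header line with the first object number and the entry count, followed by one entry per object (offset, generation, and 'n' or 'f'). XRefEntry and ParseXRefSubsection are hypothetical names, and std::runtime_error stands in for PoDoFo's ePdfError_NoXref.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// One cross-reference entry: byte offset, generation number, and whether
// the object is in use ('n') or free ('f').
struct XRefEntry {
    long long offset;
    int generation;
    bool inUse;
};

// Hypothetical sketch, not PoDoFo's code: parses a subsection such as
//   "0 2\n0000000000 65535 f \n0000000017 00000 n \n"
// and throws if fewer entries are present than the header announces.
std::vector<XRefEntry> ParseXRefSubsection(const std::string& text,
                                           long long& firstObject,
                                           long long& numObjects)
{
    std::istringstream in(text);
    if (!(in >> firstObject >> numObjects) || firstObject < 0 || numObjects < 0)
        throw std::runtime_error("bad xref subsection header");

    std::vector<XRefEntry> entries;
    for (long long i = 0; i < numObjects; ++i) {
        XRefEntry e{};
        char type = 0;
        if (!(in >> e.offset >> e.generation >> type) ||
            (type != 'n' && type != 'f'))
            throw std::runtime_error("xref subsection is shorter than declared");
        e.inUse = (type == 'n');
        entries.push_back(e);
    }
    return entries;
}
```

Entry i of the result describes object number firstObject + i. The sketch relies on whitespace-separated fields rather than the fixed 20-byte entry width, which is a simplification.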

void PoDoFo::PdfParser::SetIgnoreBrokenObjects ( bool  bBroken)
inline

Specify if the parser should ignore broken objects, i.e. XRef entries that do not point to valid objects.

Default is to not ignore broken objects and throw an exception if one is found.

Parameters
bBroken  if true broken objects will be ignored

void PoDoFo::PdfParser::SetMaxObjectCount ( long  nMaxObjects)
inline static

Specify the maximum number of objects the parser should read. An exception is thrown if the document contains more objects than this. Use this to avoid problems with very large documents with millions of objects, which use 500MB of working set and spend 15 mins in Load() before throwing an out of memory exception.

Parameters
nMaxObjects  set max number of objects

void PoDoFo::PdfParser::SetPassword ( const std::string &  sPassword)

If you try to open an encrypted PDF file, which requires a password to open, PoDoFo will throw a PdfError( ePdfError_InvalidPassword ) exception.

If you got such an exception, you have to set a password which should be used for opening the PDF.

The usual way will be to ask the user for the password and set the password using this method.

PdfParser will immediately continue to read the PDF file.

Parameters
sPassword  a user or owner password which can be used to open an encrypted PDF file. If the password is invalid, a PdfError( ePdfError_InvalidPassword ) exception is thrown.
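
The parse, catch, ask, retry flow described above can be sketched without linking against PoDoFo. Everything named here is an assumption for illustration: InvalidPasswordError stands in for PdfError( ePdfError_InvalidPassword ), and the Parser template parameter is any type offering ParseFile(const char*) and SetPassword(const std::string&) with the documented semantics. With PoDoFo itself one would presumably catch PdfError and inspect its error code instead.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for PdfError( ePdfError_InvalidPassword ).
struct InvalidPasswordError : std::runtime_error {
    InvalidPasswordError() : std::runtime_error("invalid password") {}
};

// Generic over the parser type so the sketch compiles without PoDoFo.
// Returns true once the document has been parsed, false if every
// password attempt was rejected.
template <typename Parser>
bool ParseWithPassword(Parser& parser, const char* filename,
                       const std::function<std::string()>& askPassword,
                       int maxAttempts = 3)
{
    try {
        parser.ParseFile(filename);
        return true;                          // no password was required
    } catch (const InvalidPasswordError&) {
        // fall through to the password prompt
    }
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        try {
            parser.SetPassword(askPassword()); // parsing continues on success
            return true;
        } catch (const InvalidPasswordError&) {
            // wrong password, ask again
        }
    }
    return false;
}
```

Passing the prompt as a callback keeps the retry logic independent of how the password is obtained, whether from a dialog, a command-line option, or a test fixture.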

void PoDoFo::PdfParser::SetStrictParsing ( bool  bStrict)
inline

Enable/disable strict parsing mode. Strict parsing is by default disabled.

If you enable strict parsing, PoDoFo will reject a few more common PDF defects. Note that PoDoFo's parser is already very strict by default and does not recover from, for example, wrong XREF tables.

Parameters
bStrict  new setting for strict parsing mode.

PdfEncrypt * PoDoFo::PdfParser::TakeEncrypt ( )
inline

Gives the encryption object from the parser. The internal handle will be set to NULL and the ownership of the object is given to the caller.

Only call this if you need access to the encryption object before deleting the parser.

Returns
the parser's encryption object, or NULL if the read PDF file was not encrypted.