SetaPDF-Signer API - Caching

Parsing and handling PDF documents can be an expensive task in terms of CPU time.

To avoid repeating the same standard tasks for a single document, the parser class offers a caching mechanism that reduces the overhead and avoids reparsing the same PDF document several times.

The parser simply saves serialized data in the filesystem and loads it back when needed. This data can be used by any SetaPDF API. For example, if the SetaPDF-Merger API creates the cache data, the SetaPDF-Stamper API can benefit from it.

For this reason, the cache mechanism is controlled through static methods of the SetaPDF_Parser class. Calls to these methods change static variables in their method contexts, so the changes do not depend on a single object instance but apply to all instances of a parser object. (We used static variables for compatibility with PHP 4.)

There are two parts that the parser can cache:

1. The Xref Table

This is a kind of table of contents of a PDF document. It holds information about all objects in the document and their byte offsets in the file. Documents often contain several hundred or several thousand entries in that table. Furthermore, a PDF document can contain more than one xref table, which results from several updates to the document (incremental updates). All of these tables have to be processed to get the final state of the document. By caching this data, the parser doesn't have to reparse the xref tables from the document.
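To illustrate why several tables have to be processed, the following is a minimal sketch that reads two classic xref sections and merges them so that entries from the later update win. It is not the parser's actual implementation: the function names `parseXrefSection()` and `mergeXrefTables()` and the sample offsets are hypothetical, and trailers and cross-reference streams are ignored.

```php
<?php
// Parse one classic (non-stream) xref section into [objectNumber => entry].
function parseXrefSection(string $section): array
{
    $entries = [];
    $objectNumber = 0;
    foreach (preg_split('/\r\n|\r|\n/', trim($section)) as $line) {
        $line = trim($line);
        if ($line === '' || $line === 'xref') {
            continue;
        }
        if (preg_match('/^(\d+) (\d+)$/', $line, $m)) {
            // Subsection header: first object number and number of entries.
            $objectNumber = (int) $m[1];
        } elseif (preg_match('/^(\d{10}) (\d{5}) ([nf])$/', $line, $m)) {
            // Entry: 10-digit byte offset, 5-digit generation, n (in use) or f (free).
            $entries[$objectNumber++] = [
                'offset'     => (int) $m[1],
                'generation' => (int) $m[2],
                'inUse'      => $m[3] === 'n',
            ];
        }
    }
    return $entries;
}

// Merge tables given oldest first, so entries of later updates override earlier ones.
function mergeXrefTables(array $tablesOldestFirst): array
{
    $merged = [];
    foreach ($tablesOldestFirst as $table) {
        $merged = $table + $merged; // array union: the left operand wins on equal keys
    }
    ksort($merged);
    return $merged;
}

// Made-up sample data: an original table and one incremental update of object 2.
$original = "xref\n0 3\n0000000000 65535 f \n0000000017 00000 n \n0000000081 00000 n \n";
$update   = "xref\n2 1\n0000000420 00000 n \n";

$xref = mergeXrefTables([parseXrefSection($original), parseXrefSection($update)]);
```

In this sketch `$xref` is the merged result that a cache would store, so that neither section has to be read again on the next run.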

2. Objects

Each entry in the xref table described above points to an object representing specific data, such as images, fonts or pages. To read such an object, the parser has to go to the byte offset known from the xref table and parse the object token by token. This process needs many string comparisons and runs recursively until the object has been read completely.

The parser can cache the objects it has read and use the cached versions the next time they are needed. In that case no file-position change or string parsing takes place; the data is simply unserialized from the cache.
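The idea can be sketched with PHP's own serialize() and unserialize() functions. This is only an illustration under stated assumptions: the helper `cachedObject()`, the file naming scheme and the stand-in parse callback are hypothetical and do not reflect the cache layout the parser really uses.

```php
<?php
// Hypothetical helper: return the cached object, or parse it once and store it.
function cachedObject(string $cacheDir, string $fileId, int $objectNumber, callable $parse)
{
    $dir = $cacheDir . '/' . $fileId;
    $cacheFile = $dir . '/obj_' . $objectNumber . '.cache';

    if (is_file($cacheFile)) {
        // Cache hit: no seeking and no tokenizing, only unserialize().
        return unserialize(file_get_contents($cacheFile));
    }

    // Cache miss: take the expensive path, then store the result.
    $object = $parse($objectNumber);
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true);
    }
    file_put_contents($cacheFile, serialize($object));

    return $object;
}

$cacheDir = sys_get_temp_dir() . '/pdf-cache-demo-' . uniqid();
$parseCalls = 0;

// Stand-in for the real token-wise object parser.
$parse = function (int $objectNumber) use (&$parseCalls) {
    $parseCalls++;
    return ['type' => 'Page', 'number' => $objectNumber];
};

$first  = cachedObject($cacheDir, 'doc-1', 5, $parse); // parses and writes the cache file
$second = cachedObject($cacheDir, 'doc-1', 5, $parse); // served from the cache file
```

The second call never reaches the parse callback, which is the saving the object cache aims for.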

Usage

As already mentioned, the cache functionality is controlled by static methods of the SetaPDF_Parser class.

You can call the static methods right after including the desired API, such as the SetaPDF-Merger API.

First of all, you have to tell the API where you would like to save the cached data. Use the SetaPDF_Parser::cacheDir() method for this.
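A minimal sketch of this step could look as follows. It assumes that cacheDir() accepts the directory path as its only argument, which is not confirmed here; the directory name is hypothetical as well. The call is guarded so that the snippet also runs where no SetaPDF API has been included.

```php
<?php
// Hypothetical location; any directory writable by the PHP process should do.
$cacheDir = sys_get_temp_dir() . '/setapdf-cache';
if (!is_dir($cacheDir)) {
    mkdir($cacheDir, 0700, true);
}

// Assumption: cacheDir() takes the directory path as its only argument.
// Guarded, so nothing is called when the SetaPDF_Parser class is not loaded.
if (class_exists('SetaPDF_Parser')) {
    SetaPDF_Parser::cacheDir($cacheDir);
}
```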

Now you can activate the caching by calling the SetaPDF_Parser::cacheFlags() method with special flags. The flags are predefined as constants.

After this, the cache is active for all instances of any SetaPDF API.

Furthermore, you can do some fine-tuning:

Build the cache slowly

If you want the cache to be built piecemeal, you can use the SetaPDF_Parser::cacheNoOfObjectsPerInstance() method to define the maximum number of objects to cache in a single script instance. This way you can avoid performance peaks, because writing the cache also needs CPU time.
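The effect of such a limit can be pictured with a small write budget. The class `CacheWriteBudget` below is a hypothetical illustration of the principle, not part of any SetaPDF API.

```php
<?php
// Hypothetical sketch: cap how many cache entries a single script run may write.
final class CacheWriteBudget
{
    private $remaining;

    public function __construct(int $maxWrites)
    {
        $this->remaining = $maxWrites;
    }

    // Runs $write and returns true while budget is left; otherwise skips it.
    public function tryWrite(callable $write): bool
    {
        if ($this->remaining <= 0) {
            return false;
        }
        $write();
        $this->remaining--;
        return true;
    }
}

$budget  = new CacheWriteBudget(2);
$written = [];

foreach ([10, 11, 12, 13] as $objectNumber) {
    $budget->tryWrite(function () use (&$written, $objectNumber) {
        $written[] = $objectNumber; // stand-in for writing one cache file
    });
}
```

Only the first two objects are written in this run; with a persistent cache, later script runs would pick up the remaining ones, so the write cost is spread over several requests.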

How a file is identified and how you can control it

By default, the cache mechanism uses the md5_file() function to get a unique file identifier for the document. This file identifier is used as the directory name in the cache output directory. If you want to identify files in another way, you can define your own function or method, which will be called whenever a file identifier is needed, and register it with the SetaPDF_Parser::cacheHashFunction() method.

An example: Your documents are already managed in a database, and each record has a unique id related to the document's local path in your filesystem. As the ids are already known and unique, you should use them as file identifiers to avoid creating a hash with md5_file().

Furthermore, it becomes easier to manage the cache data: you can, for example, delete the cache data when the corresponding record in the database table is deleted or changed.

The passed argument is of the pseudo-type callback and will be used with the call_user_func() function.
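The database scenario might be sketched like this. The function `documentIdForPath()`, the paths and the ids are hypothetical, and it is an assumption that the callback receives the document's path and returns the identifier as a string. The registration call is guarded so that the snippet also runs where no SetaPDF API has been included.

```php
<?php
// Assumption: the callback receives the document's path and returns the identifier.
function documentIdForPath($path)
{
    // Hypothetical lookup table standing in for a database query: path => unique id.
    static $documentIds = [
        '/data/pdfs/contract.pdf' => 1042,
        '/data/pdfs/invoice.pdf'  => 1043,
    ];

    if (!isset($documentIds[$path])) {
        // Unknown document: fall back to a hash of the path string.
        return md5($path);
    }

    return 'doc-' . $documentIds[$path];
}

// A callback of this kind is invoked through call_user_func():
$fileId = call_user_func('documentIdForPath', '/data/pdfs/contract.pdf');

// Assumption: cacheHashFunction() accepts the callback as its only argument.
if (class_exists('SetaPDF_Parser')) {
    SetaPDF_Parser::cacheHashFunction('documentIdForPath');
}
```

Because the identifier is derived from the database id, the matching cache directory can be found and removed without touching the PDF file itself.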