NAME

    PDF::Data - Manipulate PDF files and objects as data structures

VERSION

    version v1.0.1

SYNOPSIS

      use PDF::Data;

DESCRIPTION

    This module can read and write PDF files, and represents PDF objects as
    data structures that can be readily manipulated.

METHODS

 new

      my $pdf = PDF::Data->new(-compress => 1, -minify => 1);

    Constructor to create an empty PDF::Data object instance. Any arguments
    passed to the constructor are treated as key/value pairs, and included
    in the $pdf hash object returned from the constructor. When the PDF
    file data is generated, this hash is written to the PDF file as the
    trailer dictionary. However, hash keys starting with "-" are ignored
    when writing the PDF file, as they are considered to be flags or
    metadata.

    For example, $pdf->{-compress} is a flag which controls whether or not
    streams will be compressed when generating PDF file data. This flag can
    be set in the constructor (as shown above), or set directly on the
    object.

    The $pdf->{-minify} flag controls whether or not to save space in the
    generated PDF file data by removing comments and extra whitespace from
    content streams. This flag can be used along with $pdf->{-compress} to
    make the generated PDF file data even smaller, but this transformation
    is not reversible.
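
    The treatment of keys starting with "-" described above can be
    pictured with a small standalone sketch. The trailer_keys() helper is
    hypothetical and is not part of PDF::Data; it only illustrates which
    keys of a hash would count as trailer dictionary entries under the
    rule stated above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of PDF::Data: list the keys of a hash that
# would count as trailer dictionary entries (those not starting with "-").
sub trailer_keys {
    my ($pdf) = @_;
    my @keys = sort grep { !/^-/ } keys %$pdf;
    return @keys;
}

# A plain hash standing in for a PDF::Data object.
my %pdf_like = (-compress => 1, -minify => 1, Root => {}, Info => {});
print join(' ', trailer_keys(\%pdf_like)), "\n";    # prints "Info Root"
```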

 clone

      my $pdf_clone = $pdf->clone;

    Deep copy the entire PDF::Data object.

 new_page

      my $page = $pdf->new_page;
      my $page = $pdf->new_page('LETTER');
      my $page = $pdf->new_page(8.5, 11);

    Create a new page object with the specified size (in inches).
    Alternatively, certain page sizes may be specified using one of the
    known keywords: "LETTER" for U.S. Letter size (8.5" x 11"), "LEGAL" for
    U.S. Legal size (8.5" x 14"), or "A0" through "A8" for ISO A-series
    paper sizes. The default page size is U.S. Letter size (8.5" x 11").
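
    As a rough illustration of the size arguments, the sketch below
    converts inches to PDF points (the default PDF user space unit is 1/72
    inch). The page_box_points() helper and its keyword table are
    hypothetical; how the module actually stores the page size is an
    assumption here, and the ISO A-series keywords are left out for
    brevity.

```perl
use strict;
use warnings;

# Assumed keyword table in inches (only the two U.S. sizes, for brevity).
my %SIZE_INCHES = (LETTER => [8.5, 11], LEGAL => [8.5, 14]);

# Hypothetical helper: turn new_page-style arguments into a box in points.
sub page_box_points {
    my ($x, $y) = @_;
    $x = 'LETTER' unless defined $x;                 # default page size
    ($x, $y) = @{ $SIZE_INCHES{$x} } if exists $SIZE_INCHES{$x};
    die "Unknown page size\n" unless defined $y && $x =~ /^[\d.]+$/;
    return [0, 0, $x * 72, $y * 72];                 # 72 points per inch
}

print "@{ page_box_points() }\n";           # 0 0 612 792
print "@{ page_box_points('LEGAL') }\n";    # 0 0 612 1008
print "@{ page_box_points(8.5, 11) }\n";    # 0 0 612 792
```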

 copy_page

      my $copied_page = $pdf->copy_page($page);

    Deep copy a single page object.

 append_page

      $page = $pdf->append_page($page);

    Append the specified page object to the end of the PDF page tree.

 read_pdf

      my $pdf = PDF::Data->read_pdf($file, %args);

    Read a PDF file and parse it with $pdf->parse_pdf(), returning a new
    object instance. Any streams compressed with the /FlateDecode filter
    will be automatically decompressed. Unless the $pdf->{-decompress} flag
    is set, the same streams will be automatically recompressed when
    generating PDF file data.

 parse_pdf

      my $pdf = PDF::Data->parse_pdf($data, %args);

    Used by $pdf->read_pdf() to parse the raw PDF file data and create a
    new object instance. This method can also be called directly instead of
    calling $pdf->read_pdf() if the PDF file data comes from a source other
    than a regular file.

 write_pdf

      $pdf->write_pdf($file, $time);

    Generate and write a new PDF file from the current state of the
    PDF::Data object.

    The optional $time parameter may be used to specify the modification
    timestamp to save in the PDF metadata and to set the file modification
    timestamp of the output file. If not defined, it defaults to the
    current time. If $time is defined but false (zero or empty string),
    this method will skip setting the modification time in the PDF
    metadata, and skip setting the timestamp on the output file.
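
    The three cases for $time (undefined, defined but false, and a true
    value) can be summarized with a short sketch. The effective_time()
    helper is hypothetical and only mirrors the rules described above; it
    is not taken from the module.

```perl
use strict;
use warnings;

# Hypothetical helper: resolve the $time argument as described above.
sub effective_time {
    my ($time) = @_;
    return time() unless defined $time;    # omitted: use the current time
    return undef  unless $time;            # 0 or "": set no timestamp
    return $time;                          # explicit epoch timestamp
}

for my $arg (undef, 0, '', 1700000000) {
    my $t = effective_time($arg);
    printf "%-12s => %s\n",
        defined $arg ? "'$arg'" : 'undef',
        defined $t   ? $t       : '(no timestamp)';
}
```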

 pdf_file_data

      my $pdf_file_data = $pdf->pdf_file_data($time);

    Generate PDF file data from the current state of the PDF data
    structure, suitable for writing to an output PDF file. This method is
    used by the $pdf->write_pdf() method to generate the raw string of
    bytes to be written to the output PDF file. This data can be directly
    used (e.g. as a MIME attachment) without the need to actually write a
    PDF file to disk.

    The optional $time parameter may be used to specify the modification
    timestamp to save in the PDF metadata. If not specified, it defaults to
    the current time. If a false value is specified, this method will skip
    setting the modification time in the PDF metadata.

 dump_pdf

      $pdf->dump_pdf($file, $mode);

    Dump the PDF internal structure and data for debugging. If the $mode
    parameter is "outline", dump only the PDF internal structure without
    the data.

 dump_outline

      $pdf->dump_outline($file);

    Dump an outline of the PDF internal structure for debugging. (This
    method simply calls the $pdf->dump_pdf() method with the $mode
    parameter specified as "outline".)

 merge_content_streams

      my $stream = $pdf->merge_content_streams($array_of_streams);

    Merge multiple content streams into a single content stream.

 find_bbox

      $pdf->find_bbox($content_stream, $new);

    Analyze a content stream to determine the correct bounding box for the
    content stream. The current implementation was purpose-built for a
    specific use case and should not be expected to work correctly for most
    content streams.

    The $content_stream parameter may be a stream object or a string
    containing the raw content stream data.

    The current algorithm breaks the content stream into lines, skips over
    various "neutral" lines and examines the coordinates specified for
    certain PDF drawing operators: "m" (moveto), "l" (lineto), "v"
    (curveto, initial point replicated), "y" (curveto, final point
    replicated), and "c" (curveto, all points specified).

    The minimum and maximum X and Y coordinates seen for these drawing
    operators are used to determine the bounding box (left, bottom, right,
    top) for the content stream. The bounding box and equivalent rectangle
    (left, bottom, width, height) are printed.

    If the $new boolean parameter is set, an updated content stream is
    generated with the coordinates adjusted to move the lower left corner
    of the bounding box to (0, 0). This would be better done by translating
    the transformation matrix.
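
    The general idea of the algorithm can be sketched as follows. This is
    a simplified, standalone approximation and not the module's code: the
    sketch_bbox() helper is hypothetical, it assumes one operator per line
    with plain decimal operands, and it silently skips every line that
    does not end in one of the five path operators.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical simplified bounding-box scan: collect the X and Y operands
# of the m, l, v, y and c operators and return (left, bottom, right, top).
sub sketch_bbox {
    my ($content) = @_;
    my (@xs, @ys);
    for my $line (split /\n/, $content) {
        # Numeric operands followed by a single path operator.
        next unless $line =~ /^\s*((?:-?\d*\.?\d+\s+)+)([mlvyc])\s*$/;
        my @coords = split ' ', $1;
        while (my ($x, $y) = splice @coords, 0, 2) {
            push @xs, $x;
            push @ys, $y;
        }
    }
    return unless @xs;
    return (min(@xs), min(@ys), max(@xs), max(@ys));
}

my $content = join "\n", 'q', '10 20 m', '110 20 l', '110 70 l', '10 70 l',
                         'h', 'S', 'Q';
my ($left, $bottom, $right, $top) = sketch_bbox($content);
print "bbox: $left $bottom $right $top\n";                       # 10 20 110 70
print "rect: $left $bottom ", $right - $left, ' ', $top - $bottom, "\n";
```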

 new_bbox

      $new_content = $pdf->new_bbox($content_stream);

    This method simply calls the $pdf->find_bbox() method above with $new
    set to 1.

 timestamp

      my $timestamp = $pdf->timestamp($time);
      my $now       = $pdf->timestamp;

    Generate a timestamp in PDF internal format. If $time is not specified,
    the current time is used.
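
    For orientation, PDF date strings have the general form
    D:YYYYMMDDHHmmSS followed by a time zone designator. The sketch below
    builds such a string in UTC using the core POSIX module. The
    pdf_timestamp_utc() helper is hypothetical; the exact string produced
    by $pdf->timestamp() (for example, whether it uses a local time zone
    offset) is not specified here and may differ.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: format an epoch time as a PDF date string in UTC.
sub pdf_timestamp_utc {
    my ($time) = @_;
    $time = time() unless defined $time;    # default to the current time
    return strftime('D:%Y%m%d%H%M%SZ', gmtime($time));
}

print pdf_timestamp_utc(0), "\n";    # D:19700101000000Z
print pdf_timestamp_utc(),  "\n";    # current time in the same format
```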

UTILITY METHODS

 round

      my @numbers = $pdf->round(@numbers);

    Round numeric values to 12 significant digits to avoid floating-point
    rounding error and remove trailing zeroes.
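
    One plausible way to obtain this behavior in Perl, shown only as a
    sketch and not necessarily how the module implements it, is to format
    each value with "%.12g" and convert the result back to a number. The
    round_sig12() name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: round each value to 12 significant digits. The "%g"
# conversion drops trailing zeroes; adding 0 turns the string back into a
# number.
sub round_sig12 {
    return map { sprintf('%.12g', $_) + 0 } @_;
}

my @rounded = round_sig12(0.1 + 0.2, 1 / 3, 72.000);
print "@rounded\n";    # 0.3 0.333333333333 72
```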

 concat_matrix

      my $matrix = $pdf->concat_matrix($transformation_matrix, $original_matrix);

    Concatenate a transformation matrix with an original matrix, returning
    a new matrix. This is for arrays of 6 elements representing standard
    3x3 transformation matrices as used by PostScript and PDF.
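
    A 6-element array [a b c d e f] stands for the 3x3 matrix with rows
    (a b 0), (c d 0) and (e f 1). The sketch below multiplies two such
    arrays, assuming the result is the transformation matrix multiplied by
    the original matrix (the convention used when a matrix is concatenated
    to the current transformation matrix in PDF). The concat_6() name is
    hypothetical, and the multiplication order used by the module is an
    assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: multiply two 6-element matrices, $t times $m.
sub concat_6 {
    my ($t, $m) = @_;
    my ($a1, $b1, $c1, $d1, $e1, $f1) = @$t;
    my ($a2, $b2, $c2, $d2, $e2, $f2) = @$m;
    return [
        $a1 * $a2 + $b1 * $c2,          $a1 * $b2 + $b1 * $d2,
        $c1 * $a2 + $d1 * $c2,          $c1 * $b2 + $d1 * $d2,
        $e1 * $a2 + $f1 * $c2 + $e2,    $e1 * $b2 + $f1 * $d2 + $f2,
    ];
}

# A translation by (10, 20) concatenated with a uniform scale of 2.
my $result = concat_6([1, 0, 0, 1, 10, 20], [2, 0, 0, 2, 0, 0]);
print "@$result\n";    # 2 0 0 2 20 40
```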

 invert_matrix

      my $inverse = $pdf->invert_matrix($matrix);

    Calculate the inverse of a matrix, if possible. Returns undef if the
    matrix is not invertible.
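
    For a 6-element matrix [a b c d e f], the inverse exists only when the
    determinant a*d - b*c is non-zero. The following standalone sketch
    applies the closed-form inverse; the invert_6() name is hypothetical
    and the module's own implementation may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper: invert a 6-element matrix, or return undef if the
# determinant is zero.
sub invert_6 {
    my ($matrix) = @_;
    my ($ma, $mb, $mc, $md, $me, $mf) = @$matrix;
    my $det = $ma * $md - $mb * $mc;
    return undef if $det == 0;
    return [
         $md / $det,    -$mb / $det,
        -$mc / $det,     $ma / $det,
        ($mc * $mf - $md * $me) / $det,
        ($mb * $me - $ma * $mf) / $det,
    ];
}

# Scale by (2, 4), then translate by (10, 20).
my $inverse = invert_6([2, 0, 0, 4, 10, 20]);
print "@$inverse\n";
print defined invert_6([1, 2, 2, 4, 0, 0]) ? "invertible\n" : "singular\n";
```

    In this example the inverse maps a point (x, y) to (0.5 * x - 5,
    0.25 * y - 5), which undoes the original transformation.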

 translate

      my $matrix = $pdf->translate($x, $y);

    Returns a 6-element transformation matrix representing translation of
    the origin to the specified coordinates.

 scale

      my $matrix = $pdf->scale($x, $y);

    Returns a 6-element transformation matrix representing scaling of the
    coordinate space by the specified horizontal and vertical scaling
    factors.

 rotate

      my $matrix = $pdf->rotate($angle);

    Returns a 6-element transformation matrix representing counterclockwise
    rotation of the coordinate system by the specified angle (in degrees).
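
    The three matrices returned by translate, scale and rotate have
    well-known standard forms in PDF and PostScript. The sketch below
    writes them out as standalone functions; the function names are
    hypothetical, and the values returned by the module may differ in
    rounding.

```perl
use strict;
use warnings;

# Hypothetical helpers returning the standard 6-element matrices.
sub translate_6 { my ($x, $y) = @_; return [1, 0, 0, 1, $x, $y] }
sub scale_6     { my ($x, $y) = @_; return [$x, 0, 0, $y, 0, 0] }

sub rotate_6 {
    my ($degrees) = @_;
    my $radians = $degrees * atan2(1, 1) / 45;    # atan2(1, 1) is pi/4
    my ($cos, $sin) = (cos($radians), sin($radians));
    return [$cos, $sin, -$sin, $cos, 0, 0];       # counterclockwise
}

print "@{ translate_6(10, 20) }\n";    # 1 0 0 1 10 20
print "@{ scale_6(2, 3) }\n";          # 2 0 0 3 0 0
printf "%.3f %.3f %.3f %.3f %d %d\n", @{ rotate_6(30) };
```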

INTERNAL METHODS

 validate

      $pdf->validate;

    Used by $pdf->new(), $pdf->parse_pdf() and $pdf->write_pdf() to
    validate some parts of the PDF structure. Currently, $pdf->validate()
    uses $pdf->validate_key() to verify that the document catalog and page
    tree root node exist and have the correct type, and that the page tree
    root node has no parent node. Then it calls $pdf->validate_page_tree()
    to validate the entire page tree.

    By default, validation errors are output as warnings, but the
    $pdf->{-validate} flag can be set to make the errors fatal.

 validate_page_tree

      my $count = $pdf->validate_page_tree($path, $page_tree_node);

    Used by $pdf->validate(), and called by itself recursively, to validate
    the PDF page tree and its subtrees. The $path parameter specifies the
    logical path from the root of the PDF::Data object to the page subtree,
    and the $page_tree_node parameter specifies the actual page tree node
    data structure represented by that logical path. $pdf->validate()
    initially calls $pdf->validate_page_tree() with "Root/Pages" for $path
    and $pdf->{Root}{Pages} for $page_tree_node.

    Each child of the page tree node (in $page_tree_node->{Kids}) should be
    another page tree node for a subtree or a single page node. In either
    case, the parameters used for the next method call will be "$path\[$i]"
    for $path (e.g. "Root/Pages[0][1]") and $page_tree_node->{Kids}[$i] for
    $page_tree_node (e.g. $pdf->{Root}{Pages}{Kids}[0]{Kids}[1]). These
    parameters are passed to either $pdf->validate_page_tree() recursively
    (if the child is a page tree node) or to $pdf->validate_page() (if the
    child is a page node).

    After validating the page tree, $pdf->validate_resources() will be
    called to validate the page tree's resources, if any.

    If the count of pages in the page tree is incorrect, it will be fixed.
    This method returns the total number of pages in the specified page
    tree.
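
    The recursive counting described above can be sketched on plain
    nested hashes. The count_pages() helper is hypothetical and leaves out
    all other validation; it also assumes that name values are stored as
    strings with a leading "/" (such as "/Pages"), in line with the
    "/Form" subtype mentioned for form XObjects.

```perl
use strict;
use warnings;

# Hypothetical helper: count the pages below a page tree node, building
# the logical path of each child and correcting the node's Count entry.
sub count_pages {
    my ($path, $node) = @_;
    my $count = 0;
    my $kids  = $node->{Kids} || [];
    for my $i (0 .. $#$kids) {
        my $kid      = $kids->[$i];
        my $kid_path = "$path\[$i]";
        if (($kid->{Type} // '') eq '/Pages') {
            $count += count_pages($kid_path, $kid);    # subtree
        }
        else {
            $count++;                                  # single page
        }
    }
    $node->{Count} = $count;    # replaces an incorrect count, if any
    return $count;
}

my $tree = {
    Type  => '/Pages',
    Count => 99,                # deliberately wrong
    Kids  => [
        { Type => '/Pages', Count => 2,
          Kids => [ { Type => '/Page' }, { Type => '/Page' } ] },
        { Type => '/Page' },
    ],
};
print count_pages('Root/Pages', $tree), "\n";    # 3
print $tree->{Count}, "\n";                      # 3
```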

 validate_page

      $pdf->validate_page($path, $page);

    Used by $pdf->validate_page_tree() to validate a single page of the
    PDF. The $path parameter specifies the logical path from the root of
    the PDF::Data object to the page, and the $page parameter specifies the
    actual page data structure represented by that logical path.

    This method will call $pdf->merge_content_streams() to merge the
    content streams into a single content stream (if $page->{Contents} is
    an array), then it will call $pdf->validate_content_stream() to
    validate the page's content stream.

    After validating the page, $pdf->validate_resources() will be called to
    validate the page's resources, if any.

 validate_resources

      $pdf->validate_resources($path, $resources);

    Used by $pdf->validate_page_tree(), $pdf->validate_page() and
    $pdf->validate_xobject() to validate associated resources. The $path
    parameter specifies the logical path from the root of the PDF::Data
    object to the resources, and the $resources parameter specifies the
    actual resources data structure represented by that logical path.

    This method will call $pdf->validate_xobjects() for
    $resources->{XObject}, if set.

 validate_xobjects

      $pdf->validate_xobjects($path, $xobjects);

    Used by $pdf->validate_resources() to validate form XObjects in the
    resources. The $path parameter specifies the logical path from the root
    of the PDF::Data object to the hash of form XObjects, and the $xobjects
    parameter specifies the actual hash of form XObjects represented by
    that logical path.

    This method simply loops across all the form XObjects in $xobjects and
    calls $pdf->validate_xobject() for each of them.

 validate_xobject

      $pdf->validate_xobject($path, $xobject);

    Used by $pdf->validate_xobjects() to validate a form XObject. The $path
    parameter specifies the logical path from the root of the PDF::Data
    object to the form XObject, and the $xobject parameter specifies the
    actual form XObject represented by that logical path.

    This method verifies that $xobject is a stream and $xobject->{Subtype}
    is "/Form", then calls $pdf->validate_content_stream() with $xobject to
    validate the form XObject content stream, then calls
    $pdf->validate_resources() to validate the form XObject's resources, if
    any.

 validate_content_stream

      $pdf->validate_content_stream($path, $stream);

    Used by $pdf->validate_page() and $pdf->validate_xobject() to validate
    a content stream. The $path parameter specifies the logical path from
    the root of the PDF::Data object to the content stream, and the $stream
    parameter specifies the actual content stream represented by that
    logical path.

    This method calls $pdf->parse_objects() to make sure that the content
    stream can be parsed. If the $pdf->{-minify} flag is set,
    $pdf->minify_content_stream() will be called with the array of parsed
    objects to minify the content stream.

 minify_content_stream

      $pdf->minify_content_stream($stream, $objects);

    Used by $pdf->validate_content_stream() to minify a content stream. The
    $stream parameter specifies the content stream to be modified, and the
    optional $objects parameter specifies a reference to an array of parsed
    objects as returned by $pdf->parse_objects().

    This method calls $pdf->parse_objects() to populate the $objects
    parameter if unspecified, then it calls $pdf->generate_content_stream()
    to generate a minimal content stream for the array of objects, with no
    comments and only the minimum amount of whitespace necessary to parse
    the content stream correctly. (Obviously, this means that this
    transformation is not reversible.)

    Currently, this method also performs a sanity check by running the
    replacement content stream through $pdf->parse_objects() and comparing
    the entire list of objects returned against the original list of
    objects to ensure that the replacement content stream is equivalent to
    the original content stream.

 generate_content_stream

      my $data = $pdf->generate_content_stream($objects);

    Used by $pdf->minify_content_stream() to generate a minimal content
    stream to replace the original content stream. The $objects parameter
    specifies a reference to an array of parsed objects as returned by
    $pdf->parse_objects(). These objects will be used to generate the new
    content stream.

    For each object in the array, this method will call an appropriate
    serialization method: $pdf->serialize_dictionary() for dictionary
    objects, $pdf->serialize_array() for array objects, or
    $pdf->serialize_object() for other objects. After serializing all the
    objects, the newly-generated content stream data is returned.

 serialize_dictionary

      $pdf->serialize_dictionary($stream, $hash);

    Used by $pdf->generate_content_stream(), $pdf->serialize_dictionary()
    (recursively) and $pdf->serialize_array() to serialize a hash as a
    dictionary object. The $stream parameter specifies a reference to a
    string containing the data for the new content stream being generated,
    and the $hash parameter specifies the hash reference to be serialized.

    This method will serialize all the key-value pairs of $hash, prefixing
    each key in the hash with "/" to serialize the key as a name object,
    and calling an appropriate serialization routine for each value in the
    hash: $pdf->serialize_dictionary() for dictionary objects (recursive
    call), $pdf->serialize_array() for array objects, or
    $pdf->serialize_object() for other objects.

 serialize_array

      $pdf->serialize_array($stream, $array);

    Used by $pdf->generate_content_stream(), $pdf->serialize_dictionary()
    and $pdf->serialize_array() (recursively) to serialize an array. The
    $stream parameter specifies a reference to a string containing the data
    for the new content stream being generated, and the $array parameter
    specifies the array reference to be serialized.

    This method will serialize all the array elements of $array, calling an
    appropriate serialization routine for each element of the array:
    $pdf->serialize_dictionary() for dictionary objects,
    $pdf->serialize_array() for array objects (recursive call), or
    $pdf->serialize_object() for other objects.

 serialize_object

      $pdf->serialize_object($stream, $object);

    Used by $pdf->generate_content_stream(), $pdf->serialize_dictionary()
    and $pdf->serialize_array() to serialize a simple object. The $stream
    parameter specifies a reference to a string containing the data for the
    new content stream being generated, and the $object parameter specifies
    the pre-serialized object to be serialized to the specified content
    stream data.

    This method performs the following steps in order: it strips leading
    and trailing whitespace from the pre-serialized object if the
    $pdf->{-minify} flag is set; it appends a newline to ${$stream} if
    appending the pre-serialized object would exceed 255 characters for the
    last line; it appends a space to ${$stream} if necessary to parse the
    object correctly; and finally it appends the pre-serialized object to
    ${$stream}.
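
    The line-length and spacing rules can be illustrated with a small
    standalone sketch. The append_token() helper is hypothetical: it always
    trims whitespace (ignoring the -minify flag), and it assumes a space is
    needed only when neither the end of the stream nor the start of the
    token is whitespace or a PDF delimiter character. The module's actual
    rules may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: append one pre-serialized token to the string
# referenced by $stream.
sub append_token {
    my ($stream, $token) = @_;
    $token =~ s/^\s+|\s+$//g;                        # trim whitespace
    my ($last_line) = $$stream =~ /([^\n]*)\z/;      # text after last newline
    my $delim = qr/[\s()<>\[\]{}\/%]/;               # whitespace or delimiter
    if (length($last_line) + 1 + length($token) > 255) {
        $$stream .= "\n";                            # start a new line
    }
    elsif (length $$stream && $$stream !~ /$delim\z/ && $token !~ /^$delim/) {
        $$stream .= ' ';                             # separator is required
    }
    $$stream .= $token;
    return;
}

my $data = '';
append_token(\$data, $_) for 'BT', '/F1', ' 12 ', 'Tf', '(Hello)', 'Tj', 'ET';
print "$data\n";    # BT/F1 12 Tf(Hello)Tj ET
```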

 validate_key

      $pdf->validate_key($hash, $key, $value, $label);

    Used by $pdf->validate() to validate specific hash key values.

 get_hash_node

      my $hash = $pdf->get_hash_node($path);

    Used by $pdf->validate_key() to get a hash node from the PDF structure
    by path.

 parse_objects

      my @objects = $pdf->parse_objects($objects, $data, $offset);

    Used by $pdf->parse_pdf() to parse PDF objects into Perl
    representations.

 parse_data

      my @objects = $pdf->parse_data($data);

    Uses $pdf->parse_objects() to parse PDF objects from standalone PDF
    data.

 filter_stream

      $pdf->filter_stream($stream);

    Used by $pdf->parse_objects() to inflate compressed streams.

 compress_stream

      $new_stream = $pdf->compress_stream($stream);

    Used by $pdf->write_object() to compress streams if enabled. This is
    controlled by the $pdf->{-compress} flag, which is set automatically
    when reading a PDF file with compressed streams, but must be set
    manually for PDF files created from scratch, either in the constructor
    arguments or after the fact.

 resolve_references

      $object = $pdf->resolve_references($objects, $object);

    Used by $pdf->parse_pdf() to replace parsed indirect object references
    with direct references to the objects in question.

 write_indirect_objects

      my $xrefs = $pdf->write_indirect_objects($pdf_file_data, $objects, $seen);

    Used by $pdf->write_pdf() to write all indirect objects to a string of
    new PDF file data.

 enumerate_indirect_objects

      $pdf->enumerate_indirect_objects($objects);

    Used by $pdf->write_indirect_objects() to identify which objects in the
    PDF data structure need to be indirect objects.

 enumerate_shared_objects

      $pdf->enumerate_shared_objects($objects, $seen, $ancestors, $object);

    Used by $pdf->enumerate_indirect_objects() to find objects which are
    already shared (referenced from multiple objects in the PDF data
    structure).
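
    Finding shared objects amounts to walking the data structure and
    noticing references that are reached more than once. The sketch below
    shows that idea on plain nested hashes and arrays, using the core
    Scalar::Util module. The find_shared() helper and its parameters are
    hypothetical and do not match the signature of the method above.

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr reftype);

# Hypothetical helper: record every reference that is reached more than
# once. Already-seen references are not walked again, which also keeps
# cycles (such as Parent links) from recursing forever.
sub find_shared {
    my ($object, $seen, $shared) = @_;
    return unless ref $object;
    my $id = refaddr $object;
    if ($seen->{$id}++) {
        $shared->{$id} = $object;
        return;
    }
    my $type     = reftype $object;
    my @children = $type eq 'HASH'  ? values %$object
                 : $type eq 'ARRAY' ? @$object
                 :                    ();
    find_shared($_, $seen, $shared) for @children;
    return;
}

my $font  = { Type => '/Font' };
my $pages = { Type => '/Pages', Kids => [] };
for my $n (1 .. 2) {
    my $page = { Type => '/Page', Parent => $pages, Font => $font };
    push @{ $pages->{Kids} }, $page;
}

my (%seen, %shared);
find_shared({ Root => { Pages => $pages } }, \%seen, \%shared);
print join(', ', sort map { $_->{Type} } values %shared), "\n";   # /Font, /Pages
```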

 add_indirect_objects

      $pdf->add_indirect_objects($objects, @objects);

    Used by $pdf->enumerate_indirect_objects() and
    $pdf->enumerate_shared_objects() to add objects to the list of indirect
    objects to be written out.

 write_object

      $pdf->write_object($pdf_file_data, $objects, $seen, $object, $indent);

    Used by $pdf->write_indirect_objects(), and called by itself
    recursively, to write direct objects out to the string of new PDF file
    data.

 dump_object

      my $output = $pdf->dump_object($object, $label, $seen, $indent, $mode);

    Used by $pdf->dump_pdf(), and called by itself recursively, to dump (or
    outline) the specified PDF object.