@Rourke101
Forked from hubgit/README.md
Created February 2, 2018 19:45
Revisions

  1. @hubgit hubgit revised this gist Jul 25, 2013. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions README.md
    @@ -12,7 +12,9 @@ Metadata in PDF files can be stored in at least two places:
    A PDF file contains a) objects and b) pointers to those objects.

    When information is added to a PDF file, it is appended to the end of the file and a pointer is added.

    When information is removed from a PDF file, the pointer is removed, but the actual data may not be removed.

    To remove previously-deleted data, the PDF file must be rebuilt.

    ## pdftk
  2. @hubgit hubgit revised this gist Jul 25, 2013. 1 changed file with 4 additions and 4 deletions.
    8 changes: 4 additions & 4 deletions README.md
    @@ -19,7 +19,7 @@ To remove previously-deleted data, the PDF file must be rebuilt.

    [pdftk][pdftk] can be used to update the Info Dictionary of a PDF file. See `pdftk-unset-info-dictionary-values.php` below for an example. As noted in [the pdftk documentation](http://www.pdflabs.com/docs/pdftk-man-page/), though, pdftk does not alter XMP metadata.

    [pdftk] http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
    [pdftk]: http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/

    ## exiftool

    @@ -30,15 +30,15 @@ To remove previously-deleted data, the PDF file must be rebuilt.

    `exiftool -all:all=` also removes the pointer to the Info Dictionary, but does not remove the dictionary data itself from the file.

    [exiftool] http://www.sno.phy.queensu.ca/~phil/exiftool/
    [exiftool]: http://www.sno.phy.queensu.ca/~phil/exiftool/

    ## qpdf

    [qpdf][qpdf] can be used to linearize PDF files (`qpdf --linearize $FILE`), which optimises them for fast web loading and removes any orphan data.

    [qpdf]: http://qpdf.sourceforge.net/

    ## Embedded objects

    After running qpdf, there may be new XMP metadata, as it extracts metadata from any embedded objects. To read the XMP tags of
    embedded objects, use `exiftool -extractEmbedded -all:all $FILE`.

    [qpdf] http://qpdf.sourceforge.net/
  3. @hubgit hubgit revised this gist Jul 25, 2013. 3 changed files with 43 additions and 1 deletion.
    42 changes: 42 additions & 0 deletions README.md
    @@ -1,2 +1,44 @@
    # Anonymising PDFs

    ## PDF metadata

    Metadata in PDF files can be stored in at least two places:

    * the Info Dictionary, a limited set of key/value pairs
    * XMP packets, which contain RDF statements expressed as XML

    ## PDF files

    A PDF file contains a) objects and b) pointers to those objects.

    When information is added to a PDF file, it is appended to the end of the file and a pointer is added.
    When information is removed from a PDF file, the pointer is removed, but the actual data may not be removed.
    To remove previously-deleted data, the PDF file must be rebuilt.

    ## pdftk

    [pdftk][pdftk] can be used to update the Info Dictionary of a PDF file. See `pdftk-unset-info-dictionary-values.php` below for an example. As noted in [the pdftk documentation](http://www.pdflabs.com/docs/pdftk-man-page/), though, pdftk does not alter XMP metadata.

    [pdftk] http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/

    ## exiftool

    [exiftool][exiftool] can be used to read/write XMP metadata from/to PDF files.

    * `exiftool -all:all` => read all the tags.
    * `exiftool -all:all=` => remove all the tags.

    `exiftool -all:all=` also removes the pointer to the Info Dictionary, but does not remove the dictionary data itself from the file.

    [exiftool] http://www.sno.phy.queensu.ca/~phil/exiftool/

    ## qpdf

    [qpdf][qpdf] can be used to linearize PDF files (`qpdf --linearize $FILE`), which optimises them for fast web loading and removes any orphan data.

    ## Embedded objects

    After running qpdf, there may be new XMP metadata, as it extracts metadata from any embedded objects. To read the XMP tags of
    embedded objects, use `exiftool -extractEmbedded -all:all $FILE`.

    [qpdf] http://qpdf.sourceforge.net/
    2 changes: 1 addition & 1 deletion remove-pdf-metadata.sh
    @@ -18,4 +18,4 @@ qpdf --linearize $FILE
    #cat $FILE | strings > $FILE.txt

    # read XMP from embedded objects in the modified PDF
    #exiftool -extractEmbedded-all:all $FILE
    #exiftool -extractEmbedded -all:all $FILE
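The README added in this revision notes that removing information from a PDF may only drop the pointer and leave the data in the file. One rough way to check for such leftovers is to search the raw bytes of the file for strings you expect to be gone. The sketch below assumes a `grep` that supports `-a` (GNU and BSD do); the function names `pdf_contains` and `report_leftovers` are hypothetical and not part of the gist.

```shell
# pdf_contains FILE STRING
# Succeeds if STRING occurs anywhere in the raw bytes of FILE.
# -a treats the binary PDF as text, -F matches the string literally.
pdf_contains() {
    grep -a -q -F -e "$2" -- "$1"
}

# report_leftovers FILE STRING...
# Prints every STRING that is still present in FILE.
# Returns non-zero if at least one of them was found.
report_leftovers() {
    local file="$1" needle status=0
    shift
    for needle in "$@"; do
        if pdf_contains "$file" "$needle"; then
            printf 'still present: %s\n' "$needle"
            status=1
        fi
    done
    return "$status"
}
```

For example, `report_leftovers example.pdf "Jane Doe"` could be run before and after rebuilding the file. A clean result is only a hint, not proof: PDF streams are often compressed, and text can be stored in other encodings, so a plain byte search will miss such data.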
  4. @hubgit hubgit revised this gist Jul 25, 2013. 2 changed files with 40 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions README.md
    @@ -0,0 +1,2 @@
    # Anonymising PDFs

    38 changes: 38 additions & 0 deletions pdftk-remove-info-dictionary.php
    @@ -0,0 +1,38 @@
    <?php

    $file = 'example.pdf';

    // get the current metadata
    $command = sprintf('pdftk %s dump_data', escapeshellarg($file));
    $output = array(); $return = null; exec($command, $output, $return);

    //print_r($output);

    if ($return) {
        throw new Exception('There was an error reading metadata from the PDF file');
    }

    // set any metadata values to null
    foreach ($output as $index => $line) {
        if (strpos($line, 'InfoValue:') === 0) {
            $output[$index] = 'InfoValue:';
        }
    }

    // write the updated metadata to a file
    $metadataFile = tempnam(sys_get_temp_dir(), 'pdf-meta-');
    file_put_contents($metadataFile, implode("\n", $output));

    // create a new PDF using the updated metadata
    $tmpFile = tempnam(sys_get_temp_dir(), 'pdf-tmp-');
    $command = sprintf('pdftk %s update_info %s output %s',
        escapeshellarg($file), escapeshellarg($metadataFile), escapeshellarg($tmpFile));
    $output = array(); $return = null; exec($command, $output, $return);

    if ($return) {
        throw new Exception('There was an error writing metadata to the PDF file');
    }

    // clean up the temporary files
    rename($tmpFile, $file);
    unlink($metadataFile);
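The PHP script in this revision dumps the metadata with `pdftk ... dump_data`, blanks every line that starts with `InfoValue:`, and feeds the result back through `update_info`. The same steps can be sketched in shell, matching the style of the gist's other scripts. The helper names `blank_info_values` and `pdftk_blank_info` are hypothetical. Unlike the PHP script, this sketch writes to a separate output file instead of replacing the input.

```shell
# Blank every InfoValue line in pdftk dump_data output (stdin -> stdout),
# leaving all other lines untouched.
blank_info_values() {
    sed 's/^InfoValue:.*$/InfoValue:/'
}

# pdftk_blank_info INPUT OUTPUT
# Dump the metadata of INPUT, blank the Info Dictionary values,
# and write a new PDF with the updated metadata to OUTPUT.
pdftk_blank_info() {
    local src="$1" dst="$2" raw meta rc=0
    raw=$(mktemp) || return 1
    meta=$(mktemp) || { rm -f "$raw"; return 1; }
    pdftk "$src" dump_data > "$raw" &&
        blank_info_values < "$raw" > "$meta" &&
        pdftk "$src" update_info "$meta" output "$dst" || rc=$?
    # clean up the temporary metadata files in every case
    rm -f "$raw" "$meta"
    return "$rc"
}
```

Usage would be along the lines of `pdftk_blank_info example.pdf example-clean.pdf`. As the README points out, pdftk does not alter XMP metadata, so this only covers the Info Dictionary.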
  5. @hubgit hubgit revised this gist Jul 25, 2013. 1 changed file with 3 additions and 0 deletions.
    3 changes: 3 additions & 0 deletions remove-pdf-metadata.sh
    @@ -16,3 +16,6 @@ qpdf --linearize $FILE

    # read all strings from the modified PDF
    #cat $FILE | strings > $FILE.txt

    # read XMP from embedded objects in the modified PDF
    #exiftool -extractEmbedded-all:all $FILE
  6. @hubgit hubgit revised this gist Jul 25, 2013. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion remove-pdf-metadata.sh
    @@ -15,4 +15,4 @@ qpdf --linearize $FILE
    #exiftool -all:all $FILE

    # read all strings from the modified PDF
    #cat example.pdf | strings > $FILE.txt
    #cat $FILE | strings > $FILE.txt
  7. @hubgit hubgit revised this gist Jul 25, 2013. 2 changed files with 18 additions and 19 deletions.
    18 changes: 18 additions & 0 deletions remove-pdf-metadata.sh
    @@ -0,0 +1,18 @@
    #!/bin/bash

    FILE=example.pdf

    # read tags from the original PDF
    #exiftool -all:all $FILE

    # remove tags (XMP + metadata) from the PDF
    exiftool -all:all= $FILE

    # linearize the file to remove orphan data
    qpdf --linearize $FILE

    # read XMP from the modified PDF
    #exiftool -all:all $FILE

    # read all strings from the modified PDF
    #cat example.pdf | strings > $FILE.txt
    19 changes: 0 additions & 19 deletions remove-xmp-metadata.sh
    @@ -1,19 +0,0 @@
    #!/bin/bash

    FILE=example.pdf

    # read XMP from the original PDF
    #rm $FILE.before.xmp
    #exiftool -tagsfromfile $FILE $FILE.before.xmp
    #cat $FILE.before.xmp

    # remove XMP from the PDF
    exiftool -all:all= $FILE

    # read XMP from the modified PDF
    #rm $FILE.after.xmp
    #exiftool -tagsfromfile $FILE $FILE.after.xmp
    #cat $FILE.after.xmp

    # read all strings from the modified PDF
    #cat example.pdf | strings > $FILE.txt
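Three details of `remove-pdf-metadata.sh` as committed may need adjusting before it runs cleanly:

- `$FILE` is unquoted, so paths containing spaces break.
- By default exiftool keeps a backup named `<file>_original`, which still holds the old metadata.
- To my knowledge qpdf expects an output file name in addition to the input.

The following is a minimal sketch under those assumptions, not the author's script. The function name `anonymise_pdf` is hypothetical, and any non-zero exit status from either tool (including qpdf's warning status) is treated as failure.

```shell
# anonymise_pdf FILE
# Strip all tags with exiftool, then rebuild FILE with qpdf so that
# orphaned data is dropped. FILE is replaced only if both steps succeed.
anonymise_pdf() {
    local file="$1" tmp
    tmp=$(mktemp "${file}.XXXXXX") || return 1

    # -overwrite_original stops exiftool from leaving a *_original backup
    # that would still contain the metadata being removed
    exiftool -all:all= -overwrite_original "$file" || { rm -f "$tmp"; return 1; }

    # write the linearized copy to a temporary file next to the input
    qpdf --linearize "$file" "$tmp" || { rm -f "$tmp"; return 1; }

    mv "$tmp" "$file"
}
```

Calling `anonymise_pdf "example.pdf"` rewrites the file in place. Writing the qpdf output to a temporary file in the same directory keeps the final `mv` on one filesystem.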
  8. @hubgit hubgit revised this gist Jul 25, 2013. 1 changed file with 5 additions and 2 deletions.
    7 changes: 5 additions & 2 deletions remove-xmp-metadata.sh
    @@ -4,13 +4,16 @@ FILE=example.pdf

    # read XMP from the original PDF
    #rm $FILE.before.xmp
    #exiftool -tagsfromfile $FILE $FILE.before.xmp;
    #exiftool -tagsfromfile $FILE $FILE.before.xmp
    #cat $FILE.before.xmp

    # remove XMP from the PDF
    exiftool -all:all= $FILE

    # read XMP from the modified PDF
    #rm $FILE.after.xmp
    #exiftool -tagsfromfile $FILE $FILE.after.xmp;
    #exiftool -tagsfromfile $FILE $FILE.after.xmp
    #cat $FILE.after.xmp

    # read all strings from the modified PDF
    #cat example.pdf | strings > $FILE.txt
  9. @hubgit hubgit created this gist Jul 25, 2013.
    16 changes: 16 additions & 0 deletions remove-xmp-metadata.sh
    @@ -0,0 +1,16 @@
    #!/bin/bash

    FILE=example.pdf

    # read XMP from the original PDF
    #rm $FILE.before.xmp
    #exiftool -tagsfromfile $FILE $FILE.before.xmp;
    #cat $FILE.before.xmp

    # remove XMP from the PDF
    exiftool -all:all= $FILE

    # read XMP from the modified PDF
    #rm $FILE.after.xmp
    #exiftool -tagsfromfile $FILE $FILE.after.xmp;
    #cat $FILE.after.xmp