Revisions
hubgit revised this gist
Jul 25, 2013. 1 changed file with 2 additions and 0 deletions.
@@ -12,7 +12,9 @@ Metadata in PDF files can be stored in at least two places:
A PDF file contains a) objects and b) pointers to those objects.
When information is added to a PDF file, it is appended to the end of the file and a pointer is added.
When information is removed from a PDF file, the pointer is removed, but the actual data may not be removed.
To remove previously-deleted data, the PDF file must be rebuilt.
## pdftk
hubgit revised this gist
Jul 25, 2013. 1 changed file with 4 additions and 4 deletions.
@@ -19,7 +19,7 @@ To remove previously-deleted data, the PDF file must be rebuilt.
[pdftk][pdftk] can be used to update the Info Dictionary of a PDF file. See `pdftk-unset-info-dictionary-values.php` below for an example.
As noted in [the pdftk documentation](http://www.pdflabs.com/docs/pdftk-man-page/), though, pdftk does not alter XMP metadata.
[pdftk]: http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
## exiftool
@@ -30,15 +30,15 @@ To remove previously-deleted data, the PDF file must be rebuilt.
`exiftool -all:all=` also removes the pointer to the Info Dictionary, but does not completely remove it.
[exiftool]: http://www.sno.phy.queensu.ca/~phil/exiftool/
## qpdf
[qpdf][qpdf] can be used to linearize PDF files (`qpdf --linearize $FILE`), which optimises them for fast web loading and removes any orphan data.
[qpdf]: http://qpdf.sourceforge.net/
## Embedded objects.
After running qpdf, there may be new XMP metadata, as it extracts metadata from any embedded objects.
To read the XMP tags of embedded objects, use `exiftool -extractEmbedded -all:all $FILE`.
hubgit revised this gist
Jul 25, 2013. 3 changed files with 43 additions and 1 deletion.
@@ -1,2 +1,44 @@
# Anonymising PDFs

## PDF metadata

Metadata in PDF files can be stored in at least two places:

* the Info Dictionary, a limited set of key/value pairs
* XMP packets, which contain RDF statements expressed as XML

## PDF files

A PDF file contains a) objects and b) pointers to those objects.
When information is added to a PDF file, it is appended to the end of the file and a pointer is added.
When information is removed from a PDF file, the pointer is removed, but the actual data may not be removed.
To remove previously-deleted data, the PDF file must be rebuilt.

## pdftk

[pdftk][pdftk] can be used to update the Info Dictionary of a PDF file. See `pdftk-unset-info-dictionary-values.php` below for an example.
As noted in [the pdftk documentation](http://www.pdflabs.com/docs/pdftk-man-page/), though, pdftk does not alter XMP metadata.

[pdftk] http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/

## exiftool

[exiftool][exiftool] can be used to read/write XMP metadata from/to PDF files.

* `exiftool -all:all` => read all the tags.
* `exiftool -all:all=` => remove all the tags.

`exiftool -all:all=` also removes the pointer to the Info Dictionary, but does not completely remove it.

[exiftool] http://www.sno.phy.queensu.ca/~phil/exiftool/

## qpdf

[qpdf][qpdf] can be used to linearize PDF files (`qpdf --linearize $FILE`), which optimises them for fast web loading and removes any orphan data.

## Embedded objects.

After running qpdf, there may be new XMP metadata, as it extracts metadata from any embedded objects.
To read the XMP tags of embedded objects, use `exiftool -extractEmbedded -all:all $FILE`.

[qpdf] http://qpdf.sourceforge.net/

File renamed without changes.

@@ -18,4 +18,4 @@ qpdf --linearize $FILE
#cat $FILE | strings > $FILE.txt

# read XMP from embedded objects in the modified PDF
#exiftool -extractEmbedded -all:all $FILE
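The README's point that removed information may stay in the file can be illustrated with a toy model. The sketch below is not a real PDF: it only mimics the append-on-change behaviour with plain text, and the author name and file layout are made up for illustration.

```shell
#!/bin/bash
# Toy model only: this is not a valid PDF, just text shaped like one.
demo=$(mktemp)

# Original revision: object 1 carries an author name (made-up value).
printf '%s\n' '%PDF-1.4' '1 0 obj' '<< /Author (Alice Example) >>' 'endobj' > "$demo"

# "Removing" the author appends a replacement object; the old bytes stay.
printf '%s\n' '1 0 obj' '<< /Author () >>' 'endobj' '%%EOF' >> "$demo"

# The supposedly deleted value is still readable in the raw file.
leftover=$(grep -a -c 'Alice Example' "$demo" || true)
echo "occurrences of removed value: $leftover"

# A rebuild keeps only the newest copy of each object. It is faked by hand
# here to show what a rebuilt file would no longer contain.
rebuilt=$(mktemp)
printf '%s\n' '%PDF-1.4' '1 0 obj' '<< /Author () >>' 'endobj' '%%EOF' > "$rebuilt"
after=$(grep -a -c 'Alice Example' "$rebuilt" || true)
echo "occurrences after rebuild: $after"

rm -f "$demo" "$rebuilt"
```

On a real file, the same kind of raw search (the gist uses `strings`) is one way to check whether a value survived the clean-up.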
hubgit revised this gist
Jul 25, 2013. 2 changed files with 40 additions and 0 deletions.
@@ -0,0 +1,2 @@
# Anonymising PDFs

@@ -0,0 +1,38 @@
<?php
$file = 'example.pdf';

// get the current metadata
$command = sprintf('pdftk %s dump_data', escapeshellarg($file));
$output = array();
$return = null;
exec($command, $output, $return);
//print_r($output);
if ($return) {
  throw new Exception('There was an error reading metadata from the PDF file');
}

// set any metadata values to null
foreach ($output as $index => $line) {
  if (strpos($line, 'InfoValue:') === 0) {
    $output[$index] = 'InfoValue:';
  }
}

// write the updated metadata to a file
$metadataFile = tempnam(sys_get_temp_dir(), 'pdf-meta-');
file_put_contents($metadataFile, implode("\n", $output));

// create a new PDF using the updated metadata
$tmpFile = tempnam(sys_get_temp_dir(), 'pdf-tmp-');
$command = sprintf('pdftk %s update_info %s output %s', escapeshellarg($file), escapeshellarg($metadataFile), escapeshellarg($tmpFile));
$output = array();
$return = null;
exec($command, $output, $return);

if ($return) {
  throw new Exception('There was an error writing metadata to the PDF file');
}

// clean up the temporary files
rename($tmpFile, $file);
unlink($metadataFile);
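The blanking step in the PHP script above (every line starting with `InfoValue:` is reduced to the bare key) can also be sketched in shell with `sed`. The sample text below is made up to resemble `pdftk ... dump_data` output and is an assumption, not captured from a real run; no pdftk call is made.

```shell
#!/bin/bash
# Made-up sample shaped like `pdftk example.pdf dump_data` output.
sample='InfoBegin
InfoKey: Author
InfoValue: Alice Example
InfoBegin
InfoKey: Title
InfoValue: Draft report
NumberOfPages: 3'

# Blank every line that starts with "InfoValue:"; other lines pass through unchanged.
blanked=$(printf '%s\n' "$sample" | sed 's/^InfoValue:.*/InfoValue:/')
printf '%s\n' "$blanked"
```

In the script above, the equivalent blanked text is what gets written to the temporary metadata file and passed to `pdftk ... update_info`.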
hubgit revised this gist
Jul 25, 2013. 1 changed file with 3 additions and 0 deletions.
@@ -16,3 +16,6 @@ qpdf --linearize $FILE

# read all strings from the modified PDF
#cat $FILE | strings > $FILE.txt

# read XMP from embedded objects in the modified PDF
#exiftool -extractEmbedded-all:all $FILE
hubgit revised this gist
Jul 25, 2013. 1 changed file with 1 addition and 1 deletion.
@@ -15,4 +15,4 @@ qpdf --linearize $FILE
#exiftool -all:all $FILE

# read all strings from the modified PDF
#cat $FILE | strings > $FILE.txt
hubgit revised this gist
Jul 25, 2013. 2 changed files with 18 additions and 19 deletions.
@@ -0,0 +1,18 @@
#!/bin/bash

FILE=example.pdf

# read tags from the original PDF
#exiftool -all:all $FILE

# remove tags (XMP + metadata) from the PDF
exiftool -all:all= $FILE

# linearize the file to remove orphan data
qpdf --linearize $FILE

# read XMP from the modified PDF
#exiftool -all:all $FILE

# read all strings from the modified PDF
#cat example.pdf | strings > $FILE.txt

@@ -1,19 +0,0 @@
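The script above passes a single path to qpdf and edits the PDF in place. A minimal sketch of the same two steps that works on a copy and writes to a separate output file follows, on the assumption that the installed qpdf expects both an input and an output path. The names `clean_pdf` and `DRY_RUN` and the `.stripped.pdf` suffix are hypothetical and introduced only for illustration.

```shell
#!/bin/bash
# Sketch only: clean_pdf, DRY_RUN and the .stripped.pdf suffix are
# hypothetical names, not part of the gist.
clean_pdf() {
  in="$1"
  out="$2"
  tmp="${out%.pdf}.stripped.pdf"

  # With DRY_RUN=1 the commands are printed instead of executed.
  run=""
  if [ "${DRY_RUN:-0}" = "1" ]; then run="echo"; fi

  # Work on a copy so the original PDF is left untouched.
  $run cp -- "$in" "$tmp" || return 1
  # Remove all tags; -overwrite_original skips exiftool's "_original" backup.
  $run exiftool -all:all= -overwrite_original "$tmp" || return 1
  # Rewrite the file so data orphaned by the previous step is dropped.
  $run qpdf --linearize "$tmp" "$out" || return 1
  # Remove the intermediate copy.
  $run rm -f -- "$tmp"
}
```

Running `DRY_RUN=1 clean_pdf example.pdf example.clean.pdf` prints the four commands without touching any file, which is a cheap way to check the paths before a real run.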
hubgit revised this gist
Jul 25, 2013. 1 changed file with 5 additions and 2 deletions.
@@ -4,13 +4,16 @@ FILE=example.pdf

# read XMP from the original PDF
#rm $FILE.before.xmp
#exiftool -tagsfromfile $FILE $FILE.before.xmp
#cat $FILE.before.xmp

# remove XMP from the PDF
exiftool -all:all= $FILE

# read XMP from the modified PDF
#rm $FILE.after.xmp
#exiftool -tagsfromfile $FILE $FILE.after.xmp
#cat $FILE.after.xmp

# read all strings from the modified PDF
#cat example.pdf | strings > $FILE.txt
hubgit created this gist
Jul 25, 2013.
@@ -0,0 +1,16 @@
#!/bin/bash

FILE=example.pdf

# read XMP from the original PDF
#rm $FILE.before.xmp
#exiftool -tagsfromfile $FILE $FILE.before.xmp;
#cat $FILE.before.xmp

# remove XMP from the PDF
exiftool -all:all= $FILE

# read XMP from the modified PDF
#rm $FILE.after.xmp
#exiftool -tagsfromfile $FILE $FILE.after.xmp;
#cat $FILE.after.xmp