Deterministic PDF

Summary

Can a PDF be created in a 'deterministic' way, so that generating it twice from the same input yields output files that are 'byte-for-byte identical' (i.e. files that would have the same hash value)?

Why or why not?
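
As a rough illustration of what 'byte-for-byte identical' means in practice, the sketch below hashes two files with SHA-256 and compares the digests. The function names and the choice of SHA-256 are assumptions made for this example, not part of any PDF tool.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a, path_b):
    """True if both files hash to the same value (hypothetical helper)."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

Two runs of a PDF generator would count as deterministic under this test only if `files_identical` returns True for their outputs.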

Other initial thoughts

  • Even if the 'format' / specification says this is possible, only some tools may actually be able to accomplish it.
  • If it is not possible, can the content of a PDF (all PDFs or only a specific subset of versions / types) be exported and 'normalized' so that duplicates can be identified and similar versions can be easily compared for substantive differences (ideally automatically / without human intervention)?
    • Regardless, a simple desktop utility could be written to graphically show, for two or more PDF files (six or fewer is probably a practical limit from a UX perspective), the metadata (internal dates, content IDs and other 'automatic' metadata, filesystem metadata such as dates and size, and user-entered metadata such as titles and creator names) and other derived / calculated info (e.g. 'content hashes', number of pages / words / images, page sizes / orientations, image resolutions, and so on).
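
The non-graphical core of such a utility might be sketched as follows, using only the Python standard library. This is a rough heuristic, not a PDF parser: it scans the raw bytes with regular expressions, so it only sees uncompressed, unencrypted dictionaries, takes the first match for each key, and does not handle escaped or nested parentheses in strings. All function names and the chosen key list are assumptions made for this example.

```python
import hashlib
import os
import re

# Information-dictionary keys to look for (literal-string values only).
INFO_KEYS = (b"Title", b"Author", b"Creator", b"Producer", b"CreationDate", b"ModDate")


def summarize_pdf_bytes(data):
    """Collect a few comparable facts from raw PDF bytes (heuristic)."""
    summary = {
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        # Crude page count: matches /Type /Page but not /Type /Pages.
        "page_objects": len(re.findall(rb"/Type\s*/Page(?![A-Za-z])", data)),
    }
    for key in INFO_KEYS:
        match = re.search(rb"/" + key + rb"\s*\(([^)]*)\)", data)
        summary[key.decode()] = match.group(1).decode("latin-1") if match else None
    return summary


def summarize_pdf_file(path):
    """Same as above, plus filesystem metadata for a file on disk."""
    with open(path, "rb") as handle:
        summary = summarize_pdf_bytes(handle.read())
    summary["mtime"] = os.stat(path).st_mtime
    return summary


def differing_fields(summaries):
    """Names of the fields whose values are not the same in every summary."""
    return [key for key in summaries[0] if len({s.get(key) for s in summaries}) > 1]
```

A front end could call `summarize_pdf_file` on each selected file and highlight the columns returned by `differing_fields`. A real tool would more likely rely on a proper PDF library to read compressed objects and XMP metadata.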

Details

Variable data

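  • Compression algorithms and settings (the same content can be compressed to different bytes)
  • PDF metadata / internal IDs (e.g. creation and modification dates, the file identifier in the trailer)
  • Order of dictionary items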
  • ... ?
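
A minimal sketch of how two of the items above could be handled or demonstrated, assuming plain (uncompressed, unencrypted) dictionaries and literal-string dates. The masked bytes are meant only as input to a hash: the substitutions shift byte offsets, so the result is no longer a valid PDF. Dates and IDs inside XMP metadata streams are not touched, and dictionary ordering is not addressed. The function names are hypothetical.

```python
import hashlib
import re
import zlib

# (pattern, replacement) pairs for values that typically change on every export.
_VOLATILE_PATTERNS = (
    # /CreationDate (D:...) and /ModDate (D:...) in the information dictionary.
    (re.compile(rb"/(CreationDate|ModDate)\s*\([^)]*\)"), rb"/\1 ()"),
    # File identifier in the trailer: /ID [<hex> <hex>].
    (re.compile(rb"/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]"), rb"/ID [<> <>]"),
)


def mask_volatile_fields(data):
    """Blank out dates and the file identifier so they do not affect a hash."""
    for pattern, replacement in _VOLATILE_PATTERNS:
        data = pattern.sub(replacement, data)
    return data


def normalized_hash(data):
    """SHA-256 of the bytes after masking the volatile fields."""
    return hashlib.sha256(mask_volatile_fields(data)).hexdigest()


def same_payload_different_bytes(content):
    """Show that one payload can have two different compressed encodings."""
    fast = zlib.compress(content, 1)   # fastest setting
    best = zlib.compress(content, 9)   # strongest setting
    return fast != best and zlib.decompress(fast) == zlib.decompress(best)
```

Two exports that differ only in their dates and file identifier would then share a `normalized_hash`, while `same_payload_different_bytes` illustrates why compressed streams may need to be decompressed before any comparison.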

See also

References

Notes

External links