pdf.py

Timestamp:

06/22/08 01:38:04 (16 years ago)

Author:

jerome

Message:

Did some work to improve the PDF parser: a very fast method (26 times
faster than the original one) that doesn't work with some "strange"
documents like the PCL developers' guide, and a slow method, which
extracts objects from PDF documents and correctly handles object
versioning (more cleanup work is needed).
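
The fast method described above amounts to a single regular-expression scan over the raw file. Below is a minimal Python 3 sketch of that idea, assuming the whole document is already in memory as bytes. The function name `fast_page_count` and the sample data are hypothetical and not part of pkpgcounter; only the pattern itself comes from the changeset.

```python
import re

# Same pattern as in the changeset, compiled for bytes (Python 3).
# The trailing character class keeps "/Type /Pages" from matching.
NEW_PAGE = re.compile(rb"/Type\s*/Page[/>\s]")


def fast_page_count(data: bytes) -> int:
    """Count every '/Type /Page' marker, whether or not its object is still live."""
    return len(NEW_PAGE.findall(data))


# Hypothetical two-object document: one page tree node and one page.
sample = (
    b"1 0 obj\n<< /Type /Pages /Kids [2 0 R] /Count 1 >>\nendobj\n"
    b"2 0 obj\n<< /Type /Page /Parent 1 0 R >>\nendobj\n"
)
print(fast_page_count(sample))
```

Because the scan never looks at object numbers, any superseded copy of a page object left in the file is counted again, which may be one reason the method miscounts on some documents.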

Files:

1 modified

pkpgcounter/trunk/pkpgpdls/pdf.py (modified) (2 diffs)

pkpgcounter/trunk/pkpgpdls/pdf.py

r564	r3384
20	20	#
21	21
22		"""This module implements a page counter for PDF documents."""
	22	"""This module implements a page counter for PDF documents.
	23
	24	Some information taken from PDF Reference v1.7 by Adobe.
	25	"""
23	26
24	27	import re
25	28
26	29	import pdlparser
	30
	31	PDFWHITESPACE = chr(0) \
	32	+ chr(9) \
	33	+ chr(10) \
	34	+ chr(12) \
	35	+ chr(13) \
	36	+ chr(32)
	37
	38	PDFDELIMITERS = r"()<>[]{}/%"
	39	PDFCOMMENT = r"%" # Up to next EOL
	40
	41	PDFPAGEMARKER = "<< /Type /Page " # Where spaces are any whitespace char
	42
	43	PDFMEDIASIZE = "/MediaBox [xmin ymin xmax ymax]" # an example. MUST be present in Page objects
	44	PDFOBJREGEX = r"\s+(\d+)\s+(\d+)\s+(obj\s.+\sendobj)" # Doesn't work as expected
27	45
28	46	class PDFObject :
…	…
106	124	pagecount += count
107	125	return pagecount
	126
	127	    def veryFastAndNotAlwaysCorrectgetJobSize(self) :
	128	        """Counts pages in a PDF document."""
	129	        newpageregexp = re.compile(r"/Type\s*/Page[/>\s]")
	130	        return len(newpageregexp.findall(self.infile.read()))
	131
	132	    def thisOneIsSlowButCorrectgetJobSize(self) :
	133	        """Counts pages in a PDF document."""
	134	        oregexp = re.compile(r"\s+(\d+)\s+(\d+)\s+(obj\s.+?\s?endobj)", \
	135	                             re.DOTALL)
	136	        objtokeep = {}
	137	        for (smajor, sminor, content) in oregexp.findall(self.infile.read()) :
	138	            major = int(smajor)
	139	            minor = int(sminor)
	140	            (prevmin, prevcont) = objtokeep.get(major, (None, None))
	141	            if (minor >= prevmin) : # Handles both None and real previous minor
	142	                objtokeep[major] = (minor, content)
	143	                #if prevmin is not None :
	144	                #    self.logdebug("Object %i.%i overwritten with %i.%i" \
	145	                #                     % (major, prevmin, \
	146	                #                        major, minor))
	147	                #else :
	148	                #    self.logdebug("Object %i.%i OK" % (major, minor))
	149	        npregexp = re.compile(r"/Type\s*/Page[/>\s]")
	150	        pagecount = 0
	151	        for (major, (minor, content)) in objtokeep.items() :
	152	            count = len(npregexp.findall(content))
	153	            if count :
	154	                emptycount = content.count("obj\n<< \n/Type /Page \n>> \nendobj") + content.count("obj\n<< \n/Type /Page \n\n>> \nendobj") # TODO : make this clean
	155	                if not emptycount :
	156	                    self.logdebug("%i.%i : %s\n" % (major, minor, repr(content)))
	157	                pagecount += count - emptycount
	158	        return pagecount
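
The object-versioning step of the slow method can be sketched in Python 3 as follows. This is an illustration under stated assumptions, not the changeset's code: it works on bytes, the names `live_objects` and `slow_page_count` are hypothetical, and the empty-page special case is left out. The explicit `None` check stands in for the `minor >= prevmin` comparison, which relies on Python 2 ordering integers above `None` and raises `TypeError` in Python 3.

```python
import re

# Patterns taken from the changeset, compiled for bytes (Python 3).
OBJ = re.compile(rb"\s+(\d+)\s+(\d+)\s+(obj\s.+?\s?endobj)", re.DOTALL)
NEW_PAGE = re.compile(rb"/Type\s*/Page[/>\s]")


def live_objects(data: bytes) -> dict:
    """Map each object number to (generation, content), keeping the highest
    generation seen; on a tie the later definition in the file wins."""
    kept = {}
    for smajor, sminor, content in OBJ.findall(data):
        major, minor = int(smajor), int(sminor)
        prev = kept.get(major)
        # Explicit None check: Python 3 cannot order an int against None.
        if prev is None or minor >= prev[0]:
            kept[major] = (minor, content)
    return kept


def slow_page_count(data: bytes) -> int:
    """Count page markers only inside the objects that survived versioning."""
    return sum(len(NEW_PAGE.findall(content))
               for _, content in live_objects(data).values())


# Hypothetical document in which object 2 is defined twice.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Pages /Kids [2 0 R] /Count 1 >>\nendobj\n"
    b"2 0 obj\n<< /Type /Page /Parent 1 0 R >>\nendobj\n"
    b"2 1 obj\n<< /Type /Page /Parent 1 0 R /Rotate 90 >>\nendobj\n"
)
print(slow_page_count(sample))           # versioned count
print(len(NEW_PAGE.findall(sample)))     # plain regex scan, for comparison
```

On this sample the plain scan sees both definitions of object 2, while the versioned count keeps only the later one, which is the difference the commit message describes.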

Changeset 3384 for pkpgcounter/trunk/pkpgpdls/pdf.py
