Recently, I was working on a PDF summarizer and faced the challenge of extracting images from PDFs. During my research, I discovered that some parts of the PDF I initially classified as images were actually drawings, requiring a different extraction method.
I found the PyMuPDF library (imported as pymupdf), which has excellent documentation and easy-to-follow guides for my initial goal: extracting images from PDFs.
Extract images
We can break down this task into a series of steps:
- Open the PDF file
- Iterate over each page of the PDF searching for images
- Extract the image
- Save image to output folder
Based on the documentation, we can expand on the “Extract the image” step. Some images have a separate transparency layer, and if we don’t collect this layer and combine it with the image, the result will differ from the original. Not all images have one, so we need to identify each case and treat it accordingly.
Here’s the final code to extract images:
import os

import pymupdf


def extract_img_from_pdf(pdf_path):
    output_dir = "output/img"
    os.makedirs(output_dir, exist_ok=True)
    image_count = 0
    output_images_list = []
    doc = pymupdf.open(pdf_path)
    for page in doc:
        # Each entry is a tuple: item 0 is the image xref, item 1 the smask xref
        for img in page.get_images(full=True):
            xref = img[0]
            smask = img[1]
            image_dict = doc.extract_image(xref)
            ext = image_dict["ext"]
            if smask != 0:
                # Recombine the base image with its transparency layer
                pix_without_alpha = pymupdf.Pixmap(image_dict["image"])
                mask = pymupdf.Pixmap(doc.extract_image(smask)["image"])
                image_bytes = pymupdf.Pixmap(pix_without_alpha, mask).tobytes()
                ext = "png"  # tobytes() encodes as PNG by default
            else:
                # No transparency layer: write the embedded image bytes as-is
                image_bytes = image_dict["image"]
            img_path = os.path.join(output_dir, f"img-{image_count}.{ext}")
            with open(img_path, "wb") as imgout:
                imgout.write(image_bytes)
            output_images_list.append(img_path)
            image_count += 1
    doc.close()
    return output_images_list
The method get_images returns a list of tuples, one per image referenced by the page. The first item, xref, identifies the image, and the second, smask, identifies its transparency layer. When there is no transparency layer, smask is zero, which is how the code tells the two cases apart.
When testing the function, I found that some images were missing from my test PDFs. I then discovered that those “images” were actually drawings.
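To make the tuple layout concrete, here is a small sketch that works on plain tuples shaped like the entries returned by page.get_images(full=True). The helper name split_image_entries and the sample values are hypothetical; only the positions of xref (index 0) and smask (index 1) come from the code above. It also skips an xref it has already seen, on the assumption that the same image may be referenced from more than one page.

```python
def split_image_entries(entries):
    """Split get_images-style tuples into plain and masked images.

    Each entry is assumed to start with (xref, smask, ...).
    """
    plain, masked = [], []
    seen = set()
    for entry in entries:
        xref, smask = entry[0], entry[1]
        if xref in seen:
            continue  # same image referenced again, e.g. a logo on every page
        seen.add(xref)
        if smask != 0:
            masked.append((xref, smask))  # needs the transparency layer joined
        else:
            plain.append(xref)  # can be written out as-is
    return plain, masked


# Made-up sample entries: (xref, smask, width, height)
sample = [
    (12, 0, 640, 480),
    (15, 16, 200, 200),
    (12, 0, 640, 480),
]
plain, masked = split_image_entries(sample)
print(plain)
print(masked)
```

With real data, the entries would come from page.get_images(full=True) collected across all pages of the document.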
Extract drawings
The pymupdf library also has a guide to extract drawings. However, I wasn’t satisfied with the final result because some drawings contain text, which the proposed method doesn’t extract. I then discovered a method called cluster_drawings. This method walks through the output of Page.get_drawings() and joins paths whose path["rect"] are closer to each other than some tolerance values (given in the arguments), returning one bounding box per cluster. Rendering each bounding box as an image captures everything inside it, so the text in the drawings is also extracted.
The zoom value passed to pymupdf.Matrix controls the scaling factor applied when rendering PDF content to a pixel map. In my tests, I used a zoom of 2, which doubles the width and height of the original drawing.
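As a rough illustration of what the zoom factor means in pixels, the sketch below relies on the fact that PDF coordinates are measured in points (1/72 inch), so a zoom of 1 corresponds to 72 dpi. The function names are hypothetical, and the pixel size is approximate because the renderer rounds to whole pixel boundaries.

```python
POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def zoom_to_dpi(zoom):
    """Resolution that a given zoom factor corresponds to."""
    return POINTS_PER_INCH * zoom


def approx_pixel_size(width_pt, height_pt, zoom):
    """Approximate rendered size in pixels of a region given in points."""
    return round(width_pt * zoom), round(height_pt * zoom)


print(zoom_to_dpi(2))
print(approx_pixel_size(150, 100, 2))
```

A higher zoom gives sharper output at the cost of larger files, since the pixel count grows with the square of the zoom factor.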
Here’s the code to extract drawings:
import os

import pymupdf


def extract_drawings_from_pdf(pdf_path):
    output_dir = "output/drawings"
    os.makedirs(output_dir, exist_ok=True)
    output_drawing_list = []
    doc = pymupdf.open(pdf_path)
    zoom = 2  # render at twice the original width and height
    for page_num, page in enumerate(doc):
        # One bounding box per group of nearby vector paths
        clusters = page.cluster_drawings(x_tolerance=10, y_tolerance=10)
        for cluster_num, bbox in enumerate(clusters):
            # Render only the clustered region, including any text inside it
            pix = page.get_pixmap(
                clip=bbox,
                matrix=pymupdf.Matrix(zoom, zoom)
            )
            img_path = os.path.join(output_dir, f"page_{page_num+1}_cluster_{cluster_num+1}.png")
            pix.save(img_path)
            output_drawing_list.append(img_path)
    doc.close()
    return output_drawing_list
This method is sensitive to the x_tolerance and y_tolerance parameters. Large values (like 100) can cluster more than one drawing into the same image when two drawings sit close together on the same page. Small values can split parts of the same drawing into different images.
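The effect of the tolerance can be seen in a simplified, pure-Python sketch of the clustering idea. This is not PyMuPDF's implementation; the function names, the tuple-based rectangles (x0, y0, x1, y1), and the sample coordinates are all assumptions made for illustration. Two rectangles are joined when the gap between them is within the tolerance in both directions.

```python
def near(a, b, x_tol, y_tol):
    """True if rectangle b touches rectangle a once a is grown by the tolerances."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return (ax0 - x_tol <= bx1 and bx0 <= ax1 + x_tol
            and ay0 - y_tol <= by1 and by0 <= ay1 + y_tol)


def union(a, b):
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def cluster_rects(rects, x_tol, y_tol):
    """Merge rectangles repeatedly until no two clusters are within tolerance."""
    clusters = list(rects)
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if near(clusters[i], clusters[j], x_tol, y_tol):
                    clusters[i] = union(clusters[i], clusters[j])
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters


# Made-up layout: two shapes 5 points apart, a third one 55 points further right
rects = [(0, 0, 10, 10), (15, 0, 25, 10), (80, 0, 90, 10)]
print(len(cluster_rects(rects, 2, 2)))      # tolerance smaller than every gap
print(len(cluster_rects(rects, 10, 10)))    # bridges only the 5-point gap
print(len(cluster_rects(rects, 100, 100)))  # bridges every gap
```

In this toy layout the three tolerance settings produce three, two, and one cluster respectively, which mirrors the trade-off described above: too small splits a drawing apart, too large fuses neighbours together.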
Conclusion
With these methods, I was able to extract both images and drawings from my PDF documents and store them appropriately so I can include the best visuals in the PDF summary. As a popular saying goes, “A picture is worth a thousand words.”