How to extract images and drawings from PDF with Python

Recently, I was working on a PDF summarizer and faced the challenge of extracting images from PDFs. During my research, I discovered that some parts of the PDF I initially classified as images were actually drawings, requiring a different extraction method.

I found the library pymupdf, which has excellent documentation and easy-to-follow guides to achieve my initial mission: extracting images from PDFs.

Extract images

We can break down this task into a series of steps:

Open the PDF file
Iterate over each page of the PDF searching for images
Extract the image
Save image to output folder

Based on the documentation, we can expand on the “Extract image” step. I discovered that some images have a transparency layer, and if we don’t collect this layer and join it with the image, the result will differ from the original. However, not all images have this characteristic, so we need to identify each case and treat it accordingly.

Here’s the final code to extract images:

import os

import pymupdf


def extract_img_from_pdf(pdf_path):
    output_dir = "output/img"
    os.makedirs(output_dir, exist_ok=True)

    image_count = 0
    output_images_list = []
    doc = pymupdf.open(pdf_path)
    for page in doc:
        # Each entry is a tuple: item 0 is the image xref, item 1 the smask xref
        for image in page.get_images(full=True):
            xref = image[0]
            smask = image[1]
            image_dict = doc.extract_image(xref)
            ext = image_dict["ext"]
            if smask != 0:
                # Recombine the base image with its transparency mask
                pix_without_alpha = pymupdf.Pixmap(image_dict["image"])
                mask = pymupdf.Pixmap(doc.extract_image(smask)["image"])
                image_bytes = pymupdf.Pixmap(pix_without_alpha, mask).tobytes()
                # tobytes() encodes as PNG by default, so the extension must match
                ext = "png"
            else:
                # No transparency layer: write the embedded bytes unchanged
                image_bytes = image_dict["image"]
            img_path = f"{output_dir}/img-{image_count}.{ext}"
            with open(img_path, "wb") as imgout:
                imgout.write(image_bytes)
            output_images_list.append(img_path)
            image_count += 1

    doc.close()
    return output_images_list

The method get_images returns one tuple per image found on the page. The first item, xref, identifies the image, and the second, smask, identifies its transparency layer. When there is no transparency layer, smask is zero, which is how the code tells the two cases apart.
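To make the xref/smask distinction concrete, here is a minimal sketch that sorts the entries returned by get_images into images with and without a transparency layer. The helper names split_by_smask and report are hypothetical, not part of pymupdf; the sketch only assumes that item 0 of each entry is the xref and item 1 is the smask, as described above.

```python
from typing import Iterable, List, Tuple


def split_by_smask(image_items: Iterable[tuple]) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Hypothetical helper: separate plain images from images with a soft mask.

    Each entry is expected to look like an item of page.get_images(full=True):
    item[0] is the image xref and item[1] is the smask xref (0 when absent).
    """
    plain = []   # xrefs of images without a transparency layer
    masked = []  # (xref, smask) pairs whose mask must be recombined
    for item in image_items:
        xref, smask = item[0], item[1]
        if smask == 0:
            plain.append(xref)
        else:
            masked.append((xref, smask))
    return plain, masked


def report(pdf_path: str) -> None:
    """Hypothetical helper: print, per page, how many images carry a mask."""
    import pymupdf  # imported here so split_by_smask works without the library

    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            plain, masked = split_by_smask(page.get_images(full=True))
            print(f"page {page.number + 1}: {len(plain)} plain, {len(masked)} with smask")
```

Running report on a test PDF before extracting anything is a quick way to see whether the transparency branch will be exercised at all.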

When testing the function, I found that some images were missing from the output for my test PDFs. I then discovered that those “images” were actually drawings.

Extract drawings

The pymupdf library also has a guide to extract drawings. However, I wasn’t satisfied with the final result because some drawings contain text, which the proposed method doesn’t extract. I then discovered a method called cluster_drawings. This method walks through the output of Page.get_drawings() and joins paths whose path["rect"] rectangles are closer to each other than the tolerance values given in the arguments, returning one bounding box per cluster. Rendering each bounding box as an image captures everything inside it, so the text in the drawings is also extracted.

The zoom value passed to pymupdf.Matrix controls the scaling factor applied when rendering PDF content to a pixel map. In my tests, I used a zoom of 2, which doubles the width and height of the rendered drawing.
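As a rough illustration of what the zoom factor does to the output, the sketch below estimates the pixel size and resolution of a rendered clip. The function rendered_size is a hypothetical helper, and it assumes the usual PDF convention of 72 points per inch; the real pixmap can differ by a pixel because the clip is rounded to whole pixels.

```python
def rendered_size(width_pt: float, height_pt: float, zoom: float = 2.0):
    """Hypothetical helper: estimate the output of Matrix(zoom, zoom).

    width_pt and height_pt are the clip dimensions in PDF points.
    Returns (pixel_width, pixel_height, dpi).
    """
    dpi = 72 * zoom  # zoom 1 corresponds to 72 dpi
    return round(width_pt * zoom), round(height_pt * zoom), dpi


# A 100 x 50 point drawing rendered with zoom = 2
print(rendered_size(100, 50, zoom=2))
```

A higher zoom gives sharper images at the cost of larger files, so the value is worth tuning to how the images will be used.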

Here’s the code to extract drawings:

import os

import pymupdf


def extract_drawings_from_pdf(pdf_path):
    output_dir = "output/drawings"
    os.makedirs(output_dir, exist_ok=True)

    output_drawing_list = []
    doc = pymupdf.open(pdf_path)
    zoom = 2  # render at twice the original width and height

    for page_num, page in enumerate(doc):
        # One bounding box per group of nearby vector paths
        clusters = page.cluster_drawings(x_tolerance=10, y_tolerance=10)

        for cluster_num, bbox in enumerate(clusters):
            # Render only the area of the cluster, scaled by the zoom factor
            pix = page.get_pixmap(
                clip=bbox,
                matrix=pymupdf.Matrix(zoom, zoom)
            )
            img_path = os.path.join(output_dir, f"page_{page_num+1}_cluster_{cluster_num+1}.png")
            pix.save(img_path)
            output_drawing_list.append(img_path)

    doc.close()

    return output_drawing_list

This method is sensitive to the x_tolerance and y_tolerance parameters. Large values (like 100) can merge more than one drawing into the same image when two drawings sit close together on the same page. Low values can split parts of the same drawing into different images.
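One way to pick reasonable tolerances is to count how many clusters each candidate value produces on a page. The sketch below is only an illustration: count_clusters and sweep are hypothetical helpers, and the candidate values are arbitrary starting points rather than recommendations.

```python
def count_clusters(page, tolerances=(3, 10, 30, 100)):
    """Hypothetical helper: return {tolerance: number of clusters} for a page.

    `page` is expected to be a pymupdf Page; the same value is used for
    x_tolerance and y_tolerance to keep the comparison simple.
    """
    counts = {}
    for tol in tolerances:
        clusters = page.cluster_drawings(x_tolerance=tol, y_tolerance=tol)
        counts[tol] = len(clusters)
    return counts


def sweep(pdf_path: str) -> None:
    """Hypothetical helper: print the cluster counts for every page."""
    import pymupdf  # imported here so count_clusters can be tested without it

    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            print(f"page {page.number + 1}: {count_clusters(page)}")
```

If the count drops sharply between two values, separate drawings are probably being merged somewhere in that range, so it is worth inspecting the images rendered on either side of it.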

Conclusion

With these methods, I was able to extract both images and drawings from my PDF documents and store them appropriately so I can include the best visuals in the PDF summary. As a popular saying goes, “A picture is worth a thousand words.”

Sergio Henrique

Data Analyst Building Things and Sharing Learning Along the Way
