Written by Sergio Henrique, February 11, 2025

How to extract images and drawings from PDF with Python

Extracting images and drawings from PDF files can be a challenging task, but with the right tools and techniques, it’s entirely achievable. This blog post explores how to use the PyMuPDF library in Python to extract both images and drawings from PDF documents. We’ll dive into the nuances of handling transparency layers in images and clustering drawings to preserve embedded text. Whether you’re building a PDF summarizer or simply need to extract visual content from PDFs, these methods provide a robust solution to automate the process.

Written by Sergio Henrique, December 10, 2024

Download data from Kaggle competition and upload in Azure ML

In some Kaggle competitions, the provided machines cannot handle the volume of data available. In these cases, I think it can be beneficial to train the model elsewhere.

Written by Sergio Henrique, December 2, 2024

ARIMA and Online Learning in Financial Forecasting

I discuss the development of an online learning system using the Jane Street Real-Time Market Data Forecasting challenge as a practice ground for time-series forecasting. The project involves predicting the responder_6 variable using an ARIMA model, with a focus on adapting to new data by re-training the model whenever a new date_id is encountered. This approach leverages multiprocessing to meet strict time constraints.
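The re-train-on-new-date_id idea can be sketched in a few lines of plain Python. To keep the example dependency-free, it swaps the ARIMA model for a much simpler AR(1) model fitted by least squares, so it illustrates only the update loop, not the model from the post. The class name `OnlineAR1` and its methods are hypothetical, and the multiprocessing part is left out.

```python
class OnlineAR1:
    """Toy online forecaster that refits whenever date_id changes.

    Stand-in for an ARIMA model: y[t] is predicted as phi * y[t-1].
    """

    def __init__(self):
        self.history = []         # every target value seen so far
        self.current_date = None  # date_id of the most recent observation
        self.phi = 0.0            # AR(1) coefficient
        self.refits = 0           # number of re-training runs so far

    def _refit(self):
        # Least-squares fit (no intercept) of each value on its predecessor.
        x, y = self.history[:-1], self.history[1:]
        denom = sum(v * v for v in x)
        if denom > 0:
            self.phi = sum(a * b for a, b in zip(x, y)) / denom
        self.refits += 1

    def update(self, date_id, value):
        # A new date_id means the previous day is complete: re-train on it.
        if self.current_date is not None and date_id != self.current_date:
            self._refit()
        self.current_date = date_id
        self.history.append(value)

    def predict(self):
        # One-step-ahead forecast from the latest observation.
        return self.phi * self.history[-1] if self.history else 0.0


model = OnlineAR1()
for date_id, value in [(0, 1.0), (0, 2.0), (0, 4.0), (0, 8.0), (1, 16.0)]:
    model.update(date_id, value)
print(model.phi, model.predict())  # 2.0 32.0
```

Refitting only when the date changes keeps the per-row cost low, which matters when each prediction has a time budget.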

Sergio Henrique

Data Analyst Building Things and Sharing Learning Along the Way

Category: Python
