Written by Sergio Henrique, March 29, 2025

How to use Gemini to bypass image captcha when web scraping

In this project, I tackled the challenge of automating text summarization for my wife’s judicial studies by scraping Brazil’s Supreme Federal Court decisions. Along the way, I hit a snag with image captchas and devised a clever workaround using an LLM to solve them, feeding it screenshots and simulating clicks on canvas elements. This notebook showcases that process, offering a practical example of integrating LLMs into web scraping—perfect for anyone facing similar hurdles.

Written by Sergio Henrique, February 14, 2025

Which Economic Tasks are Performed with AI? Evidence from Millions of Claude Conversations

This paper, “Which Economic Tasks are Performed with AI? Evidence from Millions of Claude Conversations,” presents a novel approach to understanding the real-world integration of Artificial Intelligence (AI) into the economy. By analyzing over four million conversations from the Claude.ai platform, the authors provide empirical evidence of how AI is currently being used across various […]

Written by Sergio Henrique, February 11, 2025

How to extract images and drawings from PDF with Python

Extracting images and drawings from PDF files can be a challenging task, but with the right tools and techniques, it’s entirely achievable. This blog post explores how to use the PyMuPDF library in Python to extract both images and drawings from PDF documents. We’ll dive into the nuances of handling transparency layers in images and clustering drawings to preserve embedded text. Whether you’re building a PDF summarizer or simply need to extract visual content from PDFs, these methods provide a robust solution to automate the process.

Written by Sergio Henrique, January 17, 2025

Configuring Nginx and Certbot in Docker

In this article, I explain how I configured Nginx and Certbot in Docker so that a Django app can be accessed over HTTPS with a valid certificate.

Written by Sergio Henrique, December 10, 2024

Download data from Kaggle competition and upload in Azure ML

In some Kaggle competitions, the provided machines cannot handle the volume of data available. In these cases, I think it can be beneficial to train the model somewhere else.

Written by Sergio Henrique, December 2, 2024

ARIMA and Online Learning in Financial Forecasting

I discuss the development of an online learning system using the Jane Street Real-Time Market Data Forecasting challenge as a practice ground for time-series forecasting. The project involves predicting the responder_6 variable using an ARIMA model, with a focus on adapting to new data by re-training the model whenever a new date_id is encountered. This approach leverages multiprocessing to meet strict time constraints.
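The retrain-on-new-date_id idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `MeanForecaster` is a hypothetical placeholder that predicts the historical mean (it is not ARIMA), the `(date_id, value)` stream format is assumed, and multiprocessing is left out for brevity.

```python
from typing import Iterable, List, Optional, Tuple


class MeanForecaster:
    """Hypothetical placeholder model (NOT ARIMA): predicts the mean of its training data."""

    def __init__(self) -> None:
        self.mean = 0.0

    def fit(self, values: List[float]) -> None:
        # Refit from scratch on everything seen so far.
        self.mean = sum(values) / len(values) if values else 0.0

    def predict(self) -> float:
        return self.mean


def run_online(stream: Iterable[Tuple[int, float]]) -> Tuple[List[float], int]:
    """Predict each row, refitting the model whenever a new date_id appears."""
    model = MeanForecaster()
    history: List[float] = []
    current_date: Optional[int] = None
    predictions: List[float] = []
    refits = 0

    for date_id, value in stream:
        if date_id != current_date:
            # New date_id: retrain on all data observed before this date.
            if history:
                model.fit(history)
                refits += 1
            current_date = date_id
        # Predict before the true value is revealed, then store it.
        predictions.append(model.predict())
        history.append(value)

    return predictions, refits


# Assumed toy stream of (date_id, observed value) pairs.
stream = [(0, 1.0), (0, 3.0), (1, 5.0), (1, 7.0), (2, 9.0)]
predictions, refits = run_online(stream)
```

In a real setup the placeholder would be swapped for an actual ARIMA fit, and the refit step is the part that would most likely need to be parallelized to stay within the time limit.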

Written by Sergio Henrique, November 14, 2024

Walk Forward Validation on Jane Street Real-Time Market Data Forecast

Walk Forward Validation (WFV) involves a training window that moves forward in time, training the model on historical data and then validating it on future, unseen data points. Unlike traditional cross-validation where data is randomly split, WFV respects the sequence of time, making it ideal for datasets with time-dependent features like stock prices, weather patterns, or sales figures.
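A minimal sketch of how such a moving window could generate train/validation index splits is shown below. The function name `walk_forward_splits` and its parameters are illustrative assumptions, not code from the post, and it uses a fixed-size sliding window (an expanding window is another common variant).

```python
from typing import Iterator, List, Optional, Tuple


def walk_forward_splits(
    n_samples: int,
    train_size: int,
    test_size: int,
    step: Optional[int] = None,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs where the test rows always follow the train rows."""
    step = step or test_size  # by default, advance by one test block
    start = 0
    while start + train_size + test_size <= n_samples:
        train_end = start + train_size
        train_idx = list(range(start, train_end))
        test_idx = list(range(train_end, train_end + test_size))
        yield train_idx, test_idx
        start += step


# Toy example: 10 time-ordered rows, train on 4, validate on the next 2.
splits = list(walk_forward_splits(n_samples=10, train_size=4, test_size=2))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Because every validation index is strictly later than every training index in the same split, the model is never evaluated on data that precedes what it was trained on.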

Written by Sergio Henrique, November 6, 2024

reAct, WESE, Plan-and-Execute and ChatDB architectures applied to question-answer database use case

An overview of the reAct, WESE, Plan-and-Execute and ChatDB architectures applied to the question-answer database use case of the GDSC7 challenge.

Written by Sergio Henrique, October 20, 2024

How to create and save charts with CrewAI agents and AWS S3

In the GDSC7 challenge, we’ve upgraded our agent system to create and display charts in response to user queries, using AWS S3 for image storage. The new chart.py tool leverages Pandas, Matplotlib, and Seaborn to generate various chart types, enhancing our system’s capability to present data visually. This integration allows us to effectively showcase complex information, such as the correlation between GDP and reading skills from the PIRLS 2021 study, improving user engagement and interaction.

Written by Sergio Henrique, October 18, 2024 (updated October 21, 2024)

Adding site and video as sources for CrewAI agent system

In my recent work on the GDSC7 challenge, I’ve been exploring how to enhance responses to subjective questions using data from the PIRLS 2021 study. By integrating the Embedchain package, I can efficiently connect to various sources like websites and videos, extract information, and store it in a vector database.

Sergio Henrique

Data Analyst Building Things and Sharing Learning Along the Way

Author: Sergio Henrique