How to use Gemini to bypass image captcha when web scraping

My wife is studying to become a judge, and sometimes she asks me to do my “black magic” (text summarization with LLMs) to get the main points from decisions periodically published by Brazil’s Supreme Federal Court. The task is very repetitive, and I saw the opportunity to automate the process.

While scraping the site, I ran into an image captcha like the one below, and I asked myself whether an LLM could help me get past it.

Image captcha

The core idea was to feed the LLM a screenshot of the page and ask it to solve the challenge. In the prompt, I gave each box in the image a numeric identifier and asked for only the solution in return.

prompt = """can you solve the following challenge?
The positions are 1 2 3 4 5 6 7 8 9.
Only answer the numbers in your final answer separated by space"""
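To make the round trip concrete, here is a minimal sketch of sending the screenshot plus the prompt to Gemini and turning the reply into a list of box numbers. It is not the notebook's code: the `google-genai` SDK, the model name, and the helper names `ask_gemini` and `parse_positions` are all assumptions for illustration.

```python
import re


def ask_gemini(image_bytes: bytes, prompt: str, api_key: str,
               model: str = "gemini-2.0-flash") -> str:
    """Send a PNG screenshot and a text prompt to Gemini; return the raw reply.

    Assumptions: the google-genai SDK (pip install google-genai) and a
    placeholder model name. Swap in whatever SDK and model you actually use.
    """
    # Imported lazily so parse_positions() works without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=api_key)
    response = client.models.generate_content(
        model=model,
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
            prompt,
        ],
    )
    return response.text or ""


def parse_positions(answer: str) -> list[int]:
    """Extract the distinct box numbers (1-9) from the model's reply, in order."""
    positions = []
    # Match standalone single digits 1-9, so "10" or "0" are ignored.
    for token in re.findall(r"\b[1-9]\b", answer):
        value = int(token)
        if value not in positions:
            positions.append(value)
    return positions
```

With Playwright, the screenshot bytes can come from `await page.screenshot()`. The parser is deliberately forgiving about commas and surrounding words, but it will also pick up stray digits if the model restates the position list, so asking for "only the numbers" in the prompt still matters.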

With the captcha solution in hand, I ran into another issue. The image boxes are drawn inside a canvas, so they aren't accessible as individual elements in the DOM, and libraries like Playwright can't find them using XPath locators. To solve this, I had to simulate a mouse click at a computed screen position.

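async def click_canvas_square(self, page, index):
    """Click the center of square `index` (1-9, row by row) in the 3x3 captcha canvas."""
    if not 1 <= index <= 9:
        raise ValueError("Index must be between 1 and 9")

    # Canvas size in CSS pixels, hard-coded for this site's captcha
    canvas_width = 320
    canvas_height = 320
    square_width = canvas_width / 3
    square_height = canvas_height / 3

    # Convert the 1-based index into a zero-based (row, col) pair
    row, col = divmod(index - 1, 3)

    # Center of the target square, relative to the canvas's top-left corner
    x_offset = (col * square_width) + (square_width / 2)
    y_offset = (row * square_height) + (square_height / 2)

    canvas = page.locator('xpath=//*[@id="root"]/div/form/div[3]/div/div[2]/canvas')

    # bounding_box() returns None when the element is not visible
    box = await canvas.bounding_box()
    if not box:
        raise RuntimeError("Canvas not found or not visible")

    # Translate the canvas-relative offsets into page coordinates and click
    click_x = box["x"] + x_offset
    click_y = box["y"] + y_offset

    await page.mouse.click(click_x, click_y)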
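The coordinate arithmetic can be checked without a browser by pulling it into a pure function. The sketch below is a variation, not the notebook's code: the name `square_center` is hypothetical, and it derives the square size from the bounding box instead of the hard-coded 320 pixels, on the assumption that the grid fills the whole canvas.

```python
def square_center(index, box, grid=3):
    """Return page coordinates (x, y) for the center of square `index`.

    `index` is 1-based and counts row by row. `box` is a dict with
    "x", "y", "width" and "height" keys, the shape Playwright's
    bounding_box() returns.
    """
    if not 1 <= index <= grid * grid:
        raise ValueError(f"Index must be between 1 and {grid * grid}")

    # Zero-based row and column of the target square
    row, col = divmod(index - 1, grid)

    # Half a square past the square's left/top edge lands on its center
    x = box["x"] + (col + 0.5) * box["width"] / grid
    y = box["y"] + (row + 0.5) * box["height"] / grid
    return x, y
```

For a 320x320 canvas whose top-left corner sits at (100, 50), square 5 maps to (260, 210), the middle of the canvas. Inside the click helper this would be used as `await page.mouse.click(*square_center(index, box))`.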

I published a Kaggle notebook with the full example. I first automated the process on my local machine with the sync version of Playwright, then decided to make part of the project publicly available for learning purposes. Since notebooks only work with the async version of the Playwright library, I had to convert the code from sync to async. You'll also have to download the notebook and run it locally, because the site of Brazil's Supreme Federal Court blocks Kaggle's IP (it returns a 403 Forbidden when the code is executed from there).
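For readers who haven't used the async API, here is a minimal sketch of what the async style looks like, including a check of the HTTP status that would reveal a 403 block. The function name `fetch_status` and the example URL are placeholders, not taken from the notebook.

```python
import asyncio


async def fetch_status(url: str) -> int:
    """Open `url` in headless Chromium and return the main response's HTTP status."""
    # Imported lazily; needs `pip install playwright` and `playwright install chromium`.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            response = await page.goto(url)
            # goto() may return None (e.g. for about:blank); report that as 0.
            return response.status if response else 0
        finally:
            await browser.close()


# In a notebook cell:  status = await fetch_status("https://example.com")
# In a plain script:   status = asyncio.run(fetch_status("https://example.com"))
```

A returned status of 403 from a hosted notebook, but 200 from your own machine, would match the blocking behavior described above.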

I hope you can look past the parts that are specific to this site's navigation and use the example to learn how to integrate an LLM into your own web scraping process.

Kaggle Notebook: https://www.kaggle.com/code/serjhenrique/bypass-image-captcha-with-gemini
