As the GDSC7 challenge approaches its conclusion, I now have the opportunity to reflect on one of the most critical aspects, in my view: system architecture. Before delving into the details, let’s quickly recap.
The GDSC7 was a challenge to build an agentic system capable of answering users' questions about the PIRLS study. The database has the schema below, and we were encouraged to enhance the system with features like chart generation and integration with other sources of information (e.g. socioeconomic and demographic data).
I did a series of posts documenting some of my findings along the way:
- Text-to-SQL Agents in Relational Databases with CrewAI
- Enhancing Relational Database Agents with Retrieval Augmented Generation (RAG)
- Adding site and video as sources for CrewAI agent system
- How to create and save charts with CrewAI agents and AWS S3
I was primarily developing my solution with CrewAI until about halfway through the challenge, when I realized: “Do I really need a framework just to call an LLM?“. Don’t get me wrong – CrewAI is an amazing library. Joao Moura did an excellent job creating it, and it has the smoothest learning curve compared to other libraries. However, all the concepts around role-playing agents with backstories, roles, and goals were pulling my focus in the wrong direction. I was spending too much time distributing responsibilities across a variety of agents, which only increased costs and response times.
To my surprise, I wasn’t alone in this thought. A quick web search revealed others questioning the necessity of these agent frameworks, discussing the greenfield nature of this technology and how various frameworks are trying to carve out market share. So, I decided to drop the frameworks and build the solution from scratch (based, of course, on some great ideas presented in a few papers).
reAct
Starting with reAct. The idea, introduced in the paper ReAct: Synergizing Reasoning and Acting in Language Models, is to guide the model to accomplish a task by interleaving Thought, Action, and Observation steps.
To implement this, we essentially need to create a loop that calls an action after every observation, continuing until either a maximum number of steps is reached or the final answer is obtained.
The Thought step is where the model reasons and chooses an Action to take. This part is handled in the system prompt, where the model is instructed to follow the steps and use a template that allows for easy extraction of the tool name and input parameters.
The Action step involves calling a tool. We need to create tools and make them available to the LLM. Tools are just functions. I liked CrewAI’s idea of using a function’s __doc__ attribute to get its docstring and provide it to the model as instructions on how to use the tool, and I implemented my reAct following this principle.
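As an illustration, here is a minimal sketch of that principle. The tool and helper names are my own and the database call is stubbed out; this is not CrewAI’s code, just the docstring-as-description idea applied to plain functions:

```python
import inspect

def query_database(sql: str) -> str:
    """Executes a read-only SQL query against the PIRLS database
    and returns the result as a formatted string."""
    # Stub: in the real tool this would run the query and return the rows as text.
    return "<query results>"

def render_tool_descriptions(tools: list) -> str:
    # Build the tool section of the system prompt from each function's docstring.
    return "\n\n".join(
        f"Tool: {tool.__name__}\n{inspect.getdoc(tool)}" for tool in tools
    )

TOOLS = {t.__name__: t for t in [query_database]}
print(render_tool_descriptions(list(TOOLS.values())))
```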
The Observation step is the return value of the function called during the Action step. You append it to the conversation and ask the model for the next step.
Very simple, elegant and powerful.
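Here is a stripped-down sketch of the loop described above. The prompt template, the Action/Final Answer markers, and the call_llm helper are assumptions for illustration, not the exact code I used:

```python
import re

MAX_STEPS = 10

def react_loop(question: str, system_prompt: str, tools: dict, call_llm) -> str:
    # call_llm(messages) -> str is any helper that sends the conversation to the model.
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": question}]
    for _ in range(MAX_STEPS):
        reply = call_llm(messages)  # Thought + Action (or Final Answer)
        messages.append({"role": "assistant", "content": reply})

        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()

        # Extract the tool name and input from the templated Action line.
        match = re.search(r"Action:\s*(\w+)\s*Action Input:\s*(.*)", reply, re.DOTALL)
        if not match:
            messages.append({"role": "user", "content": "Observation: could not parse an Action."})
            continue

        tool_name, tool_input = match.group(1), match.group(2).strip()
        observation = tools[tool_name](tool_input) if tool_name in tools else f"Unknown tool {tool_name}"
        # The Observation goes back into the conversation for the next Thought.
        messages.append({"role": "user", "content": f"Observation: {observation}"})

    return "Maximum number of steps reached without a final answer."
```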
However, with this architecture we were not leveraging all the resources available in the challenge. We had access to at least two models: Claude Haiku (weak model) and Claude 3.5 Sonnet (strong model).
Another area for improvement lies in token consumption. Consider the following question: “Does the gender of a child have an impact on the reading capabilities of 4th graders? Base your answer on the findings of the PIRLS 2021 study.“ Since gender isn’t a column in the Student table, the LLM has to explore the database to figure out where this information is stored. It will likely run queries with LIKE clauses against the student questionnaire tables. Once the information is found, we no longer need the results of these exploration queries in the conversation. However, they remain there. Remember, every step grows the conversation with its observation (the function call results), which translates into higher token consumption and cost.
WESE – Weak Exploration Strong Exploitation
The word explore kept resonating in my head until I found this incredible paper: WESE: Weak Exploration to Strong Exploitation for LLM Agents. The core idea is to leverage a weak model to explore the environment (the database, in our case) and build a knowledge graph containing only useful information to feed the strong model. As the authors put it, “the knowledge acquired by the LLM from environmental exploration tends to be excessive, including irrelevant information to the task“. Wow! This aligned perfectly with my thoughts on reAct.
To implement this, I reused my previous reAct implementation and created an exploration step that uses different tools (and a different LLM) from the exploitation step. There’s plenty of room for experimentation here, with creativity being your only limit: you could use Retrieval-Augmented Generation (RAG) tools for exploration and database tools for exploitation, mix RAG and database tools for exploration while keeping database tools for exploitation, and so on.
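In code, the split can be as simple as running the reAct loop twice with different models and tool sets. This is a sketch under my own naming, reusing react_loop and query_database from the earlier snippets; note that the original paper builds a knowledge graph from the exploration findings, which is replaced here by a plain text hand-off:

```python
# Assumes react_loop and query_database from the earlier sketches.
EXPLORATION_PROMPT = "Explore the database with the available tools and report every relevant finding."
EXPLOITATION_PROMPT = "Answer the user's question with the available tools, using the knowledge provided."

def answer_with_wese(question: str, call_weak_llm, call_strong_llm) -> str:
    # Exploration: a cheap model pokes around the schema and questionnaire tables.
    findings = react_loop(
        question=f"Explore the database and gather everything relevant to: {question}",
        system_prompt=EXPLORATION_PROMPT,
        tools={"query_database": query_database},  # RAG tools could be added here
        call_llm=call_weak_llm,
    )

    # Exploitation: the strong model answers using only the distilled findings.
    return react_loop(
        question=f"{question}\n\nKnowledge gathered during exploration:\n{findings}",
        system_prompt=EXPLOITATION_PROMPT,
        tools={"query_database": query_database},
        call_llm=call_strong_llm,
    )
```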
The knowledge graph was the tricky part. In the paper, the observations were typically textual information that could be summarized and used to build the graph. In my case, I sometimes ended up with a list of countries with average scores, or a list of relevant question codes, as the result of an Action, and I couldn’t figure out how to build a useful graph from that. I ended up skipping the graph and tried to instruct the model to summarize the exploration findings instead. This is where things began to go wrong: lists were incomplete, crucial information wasn’t passed forward, and I found myself constantly tweaking the prompt text. So I decided to drop the idea and move on. Maybe the exploration-exploitation paradigm wasn’t a perfect fit for the database question-answering use case.
Plan-and-Execute
Inspired mainly by this repository, I tried another approach to address the issues of the reAct architecture: split the process into Plan, Execute, and Replan.
In the Plan step, we use a strong LLM to create a plan to solve the user’s task. It must be a complete step-by-step plan with detailed descriptions and dependencies between steps.
Next, we iterate through this plan. We take the first available step, add the context from its dependencies (if they exist), pass it to our reAct implementation, and wait for its conclusion. Once we have the final answer for that step, we store it along with the step description.
In the Replan step we ask the model two questions:
- Do you have enough information to answer the user’s question?
- Do we need to replan?
The first question acts as a shortcut in case the initial plan was longer than necessary.
The second question allows for real-time updates based on new evidence. Let’s revisit the earlier question: “Does the gender of a child have an impact on the reading capabilities of 4th graders? Base your answer on the findings of the PIRLS 2021 study.“ Imagine that the first step is to identify the student’s gender, and suppose that, to accomplish this task, the model queries the student questionnaire tables using “%male%” and “%female%“. We know that the database stores this information as “boy” and “girl”, so the weak model would fail at this step. But thanks to the replan step, the strong model would replan and add another task to identify the gender (perhaps suggesting “boy” and “girl” as keywords).
Now you might be asking, “How does this solve the reAct issues?” Well, we are leveraging a strong model for planning/replanning and a weak model for execution. Context is cheaper in the weak model, and we only pass forward the final answer of each step as context to the strong model in the replan step. All the intermediate Thought, Action, and Observation messages from the execution model are discarded. This reduces both the size of the strong model’s conversation (context) during replanning and the overall cost of the solution.
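A bare-bones sketch of this Plan / Execute / Replan loop follows. The JSON plan shape, the prompts, and the helper names are illustrative assumptions; react_loop comes from the earlier sketch:

```python
import json

# Assumes react_loop from the earlier sketch.
EXECUTOR_PROMPT = "Solve the given step using the available tools and return a final answer."

def plan_and_execute(question: str, call_weak_llm, call_strong_llm, tools: dict) -> str:
    # Plan: the strong model returns a JSON list of steps with ids, descriptions and dependencies.
    plan = json.loads(call_strong_llm([{
        "role": "user",
        "content": f"Return a JSON list of steps (id, description, depends_on) to answer:\n{question}",
    }]))
    results = {}

    while plan:
        step = plan.pop(0)

        # Execute: the weak model runs one step via reAct, seeing only its dependencies' answers.
        context = "\n".join(str(results[d]) for d in step.get("depends_on", []) if d in results)
        results[step["id"]] = react_loop(
            question=f"{step['description']}\n\nContext from previous steps:\n{context}",
            system_prompt=EXECUTOR_PROMPT,
            tools=tools,
            call_llm=call_weak_llm,
        )

        # Replan: the strong model sees only step descriptions and final answers; the executor's
        # intermediate Thought/Action/Observation messages are discarded.
        verdict = call_strong_llm([{
            "role": "user",
            "content": (
                f"Question: {question}\nCompleted steps: {json.dumps(results)}\n"
                "If you can already answer, reply 'ANSWER: <answer>'. "
                "Otherwise return an updated JSON plan for the remaining steps."
            ),
        }])
        if verdict.startswith("ANSWER:"):
            return verdict[len("ANSWER:"):].strip()
        plan = json.loads(verdict)

    return call_strong_llm([{
        "role": "user",
        "content": f"Answer the question '{question}' using these step results: {json.dumps(results)}",
    }])
```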
A really powerful and flexible architecture. However, while exploring this solution I found some pitfalls. Our weak model is really, really weak (poor thing). Sometimes it struggles even with simple steps, which increases response time. Additionally, due to the stochastic nature of LLMs, some answers that should have been complete lists were summarized into a top-10 list or a few key observations about those lists. For questions like “Show a plot of the correlation of a education systems GDP and its readings skills according to the PIRLS 2021 study”, this significantly impacts answer quality. You could solve this by replacing the weak model with a strong one, but then you’d lose all the cost-saving benefits.
Even when replacing the weak model with a strong one, some questions remain highly challenging. For example, the question “How do teacher’s job satisfaction relate to student learning outcomes, as reflected by their level of reading proficiency?” could result in a plan where one step asks for a list of all teacher IDs along with their satisfaction scores, followed by another step asking for the average reading scores of the students associated with those teachers from step one. While this question could easily be answered in a single step, breaking it into two produces a large list from step one, and large lists are costly and also risk being summarized instead of fully returned. So how can we solve this?
ChatDB
In the last weekend of the challenge I read this paper: ChatDB: Augmenting LLMs with Databases as Their Symbolic Memory. The idea of using a database as symbolic memory to perform reasoning, avoid hallucinations, and leverage the LLM’s SQL programming skills sounded great!
To adapt this idea to my use case I kept the plan step: asking the strong model for a step-by-step plan with complete descriptions and dependencies to solve the task.
Next, I continued iterating through the plan. I took the first available step, used a weak model to translate it into SQL, and executed the corresponding SQL query. Here’s where the differences begin. Instead of simply executing the query and getting the result back, each step execution creates a new table in the database (e.g., step_1, step_2, step_3, etc.). These tables store the results of each executed step. The database schema, containing all available tables, is updated and provided as context to the weak model for generating new SQL queries in subsequent steps. For steps with explicit dependencies on previous ones, the model is aware of these earlier step tables in the database and uses them to generate queries for the current step (e.g., joining a table from the PIRLS schema with a step table). At the end of the plan, we select the data from the final step and use it as context for a strong model to generate the final answer.
This approach ensures that no summarization of intermediate step results occurs. It also reduces costs since minimal context is needed. Questions like “How does teacher job satisfaction relate to student learning outcomes, as reflected by their level of reading proficiency?” are no longer an issue because each step is stored as a table that can be referenced in future steps. Thanks to DuckDB, you can use this solution as an in-memory database and access both your newly created step tables and PIRLS original tables as if they were part of a single database.
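A sketch of the execution loop under this design, assuming the PIRLS tables have already been loaded or attached into an in-memory DuckDB connection, and that call_weak_llm / call_strong_llm are simple chat helpers (both assumptions of this sketch, not code from the paper):

```python
import duckdb

def run_chatdb_plan(question: str, plan: list, call_weak_llm, call_strong_llm) -> str:
    con = duckdb.connect(":memory:")  # PIRLS tables assumed already loaded/attached here

    last_table = None
    for i, step in enumerate(plan, start=1):
        # The schema now lists both the original PIRLS tables and the step_N tables created so far.
        schema = con.execute(
            "SELECT table_name, column_name FROM information_schema.columns"
        ).fetchall()

        sql = call_weak_llm([{
            "role": "user",
            "content": (
                f"Write a single DuckDB SQL query for this step:\n{step['description']}\n\n"
                f"Available tables and columns:\n{schema}\n"
                "Return only the SQL."
            ),
        }])

        # Materialize the result so later steps can join against it instead of passing it as text.
        last_table = f"step_{i}"
        con.execute(f"CREATE TABLE {last_table} AS {sql}")

    final_rows = con.execute(f"SELECT * FROM {last_table}").fetchall()
    return call_strong_llm([{
        "role": "user",
        "content": f"Answer the question using only this data:\nQuestion: {question}\nData: {final_rows}",
    }])
```

The key design choice is that intermediate results never re-enter the conversation as text: they live in step tables, and only the final step’s rows are handed to the strong model.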
However, this solution—at least as I implemented it—has some drawbacks when one of the steps needs to explore the database for subsequent steps to succeed. For example, consider the question “Does the gender of a child have an impact on the reading capabilities of 4th graders? Base your answer on the findings of the PIRLS 2021 study.” If it fails to identify gender in the first step, all subsequent steps will fail. The natural improvement would be to implement a Replan step for cases like this.
Conclusion
It was a great experience exploring agentic systems. Until now, I hadn’t delved into this field—my focus had been on training and fine-tuning models. Implementing everything from scratch was a valuable way to understand how and why things work. For the database question-answer use case, I recommend using a ChatDB architecture with an added replan step to enhance its performance.
I hope you find it useful!!