Adding a website and a YouTube video as sources for a CrewAI agent system

Following the work done in the previous posts about the GDSC7 challenge, I was thinking about how to improve answers for subjective questions. As an example, read the following question extracted from the public set of questions: “As a teacher of 4th graders, how can I improve the reading performance of my students? Provide data that supports your suggestions. In particular consider the findings of the PIRLS 2021 study.“ With a model that is good at reasoning and has access to the database, we are able to answer this question. After some database interactions, the system can give the teacher very good improvement suggestions. But again, it comes with a cost.

As we saw in the “Enhancing Relational Database Agents with Retrieval Augmented Generation (RAG)” post, we can use semantic search over the questionnaire data to improve the answer and reduce the cost of database interactions. We can apply a similar strategy to leverage the information available on the PIRLS 2021 site. Take a look at the https://pirls2021.org/results page and at the PIRLS 2021 YouTube video. It is clear that the conclusions drawn from the study are presented there in summarized form. How can we use them?

The Embedchain package

As suggested by the competition organizers, we are using the CrewAI package to build our agent system. The documentation shows a list of tools that can be used, and we can also see these tools in the GitHub repository of the CrewAI-Tools package. One specific tool caught my attention: WebsiteSearchTool. If you look at the code in website_search_tool.py you will find a “WebsiteSearchTool” class that extends a “RagTool” class defined in rag_tool.py. The rag_tool.py file defines the “RagTool” class with an “app” attribute that is initialized in the _set_default_adapter() method. The app is then passed as an input parameter to the EmbedchainAdapter defined in embedchain_adapter.py. And here I became aware of the Embedchain package.

The Embedchain package allows us to connect to a variety of sources, extract information, store embeddings in a vector database and then query it. This is basically the same process we implemented from scratch for our SQL database.
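To make that pipeline concrete, here is a toy, library-free sketch of the same add-then-query flow. The hashing-free bag-of-words “embedder”, the in-memory store and the sample sentences are simplifications I made up for illustration; Embedchain uses real embedding models and a real vector database:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words term-frequency vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ToyRagApp:
    def __init__(self):
        self.store = []  # list of (source, chunk, vector) triples

    def add(self, source: str, text: str, chunk_size: int = 50):
        # Split the document into fixed-size word chunks and index each one.
        words = text.split()
        for i in range(0, len(words), chunk_size):
            chunk = " ".join(words[i:i + chunk_size])
            self.store.append((source, chunk, embed(chunk)))

    def query(self, question: str, top_k: int = 1):
        # Rank stored chunks by cosine similarity to the question.
        q = embed(question)
        ranked = sorted(self.store, key=lambda item: cosine(q, item[2]), reverse=True)
        return [(source, chunk) for source, chunk, _ in ranked[:top_k]]

app = ToyRagApp()
app.add("pirls2021.org/results", "Students with many books at home achieved higher reading scores.")
app.add("pirls2021.org/results/trends", "Average reading achievement declined between 2016 and 2021.")
print(app.query("What is the trend in reading achievement?"))
```

The real package does the same three things, just with production-grade pieces: loaders that crawl the source, an embedding model in place of `embed`, and a vector database in place of the Python list.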

Adding site as a source

Using Embedchain to add a site as a source is very simple. We just have to instantiate our app and use the “add” method to include sites as sources. To initialize the app, we pass a YAML file with the configuration needed to connect to our LLM, embedder model and vector database.

from embedchain import App

def create_pirls_site_collection():
    app = App.from_config(config_path='src/rag/embedchain_site_config.yaml')
    # Bind the custom Bedrock login method (defined below) to this LLM instance.
    app.llm.get_answer_from_llm = new_get_llm_model_answer.__get__(app.llm, type(app.llm))
    app.add('https://pirls2021.org/results')
    app.add('https://pirls2021.org/results/achievement/')
    app.add('https://pirls2021.org/results/trends')
    app.add('https://pirls2021.org/results/relative-achievement/')
    app.add('https://pirls2021.org/results/international-benchmarks/')
    app.add('https://pirls2021.org/results/context-home')
    app.add('https://pirls2021.org/results/context-school/')
    app.add('https://pirls2021.org/results/context-student')
    print('pirls_site collection count ', app.db.count())

The embedchain_site_config.yaml file looks like this:
app:
  config:
    id: "my-app"

vectordb:
  provider: "chroma"
  config:
    collection_name: "pirls_site"
    dir: "src/rag/collections"
    allow_reset: False

llm:
  provider: "aws_bedrock"
  config:
    model: "anthropic.claude-3-haiku-20240307-v1:0"
    model_kwargs:
      temperature: 0.0

embedder:
  provider: "huggingface"
  config:
    model: "sentence-transformers/all-MiniLM-L6-v2"
    api_key: "<YOUR_HUGGINGFACE_KEY_HERE>"
    model_kwargs:
      trust_remote_code: true

Note that I am taking care to use the same vector database directory created in the previous post to handle the questionnaire data extracted from the SQL database.

The main issue I faced here was logging in to AWS to use ChatBedrock. The Embedchain package was using a method that did not allow me to log in, so I overrode the method, as explained in more depth in this post.

import boto3
from langchain_aws import ChatBedrock

def new_get_llm_model_answer(self, prompt) -> str:
    '''
    Override the default get_llm_model_answer to log in to ChatBedrock using a boto3 session.
    '''
    session = boto3.Session()
    self.boto_client = session.client("bedrock-runtime")

    kwargs = {
        "model_id": self.config.model or "amazon.titan-text-express-v1",
        "client": self.boto_client,
        "disable_streaming": True,
        "verbose": True,
        "model_kwargs": self.config.model_kwargs
        or {
            "temperature": self.config.temperature,
        },
    }

    if self.config.stream:
        from langchain.callbacks.streaming_stdout import \
            StreamingStdOutCallbackHandler

        callbacks = [StreamingStdOutCallbackHandler()]
        llm = ChatBedrock(**kwargs, streaming=self.config.stream, callbacks=callbacks)
    else:
        llm = ChatBedrock(**kwargs)

    response = llm.invoke(prompt).content
    return response
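A side note on the `new_get_llm_model_answer.__get__(app.llm, type(app.llm))` expression used when creating the collections: `__get__` turns a plain function into a method bound to an existing instance, which lets us swap in our own implementation without subclassing. In plain Python (the `Greeter` class is just a made-up example), the pattern works like this:

```python
class Greeter:
    def greet(self) -> str:
        return "hello"

def new_greet(self) -> str:
    # Replacement behavior; 'self' is the instance the function gets bound to.
    return "bonjour"

g = Greeter()
# Bind new_greet to this specific instance, overriding greet for g only.
g.greet = new_greet.__get__(g, Greeter)

print(g.greet())          # bonjour
print(Greeter().greet())  # other instances keep the original behavior: hello
```

Assigning the bound function to the attribute overrides the method for that one instance only, which is exactly what we want: the rest of Embedchain keeps working unchanged.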

Adding a YouTube video as a source

Adding a YouTube video is very similar to adding a site. We just have to fill in an additional “data_type” argument. Behind the scenes the package retrieves the video transcript and stores it in the vector database.

def create_youtube_collection():
    app = App.from_config(config_path='src/rag/embedchain_youtube_config.yaml')
    app.llm.get_answer_from_llm = new_get_llm_model_answer.__get__(app.llm, type(app.llm))
    app.add('https://www.youtube.com/watch?v=jUv1QowWmqI', data_type='youtube_video')

The embedchain_youtube_config.yaml file looks like this:
app:
  config:
    id: "my-app"

vectordb:
  provider: "chroma"
  config:
    collection_name: "youtube"
    dir: "src/rag/collections"
    allow_reset: False

llm:
  provider: "aws_bedrock"
  config:
    model: "anthropic.claude-3-haiku-20240307-v1:0"
    model_kwargs:
      temperature: 0.0

embedder:
  provider: "huggingface"
  config:
    model: "sentence-transformers/all-MiniLM-L6-v2"
    api_key: "<YOUR_HUGGINGFACE_KEY_HERE>"
    model_kwargs:
      trust_remote_code: true
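Conceptually, the YouTube loader's job reduces to two steps: fetch the transcript segments and join them into text chunks for embedding. Here is a toy, library-free sketch of the chunking step; the segment format mimics what transcript APIs typically return, and the chunk size is an arbitrary choice for illustration:

```python
def chunk_transcript(segments, max_chars=200):
    """Join transcript segments into chunks of at most max_chars characters."""
    chunks, current = [], ""
    for seg in segments:
        text = seg["text"].strip()
        if current and len(current) + len(text) + 1 > max_chars:
            # Adding this segment would overflow the chunk: flush and start anew.
            chunks.append(current)
            current = text
        else:
            current = f"{current} {text}".strip()
    if current:
        chunks.append(current)
    return chunks

# Made-up segments in the shape transcript APIs usually return.
segments = [
    {"start": 0.0, "text": "PIRLS 2021 assessed the reading achievement"},
    {"start": 3.2, "text": "of fourth grade students in 57 countries."},
    {"start": 7.1, "text": "Results show the impact of the pandemic."},
]
print(chunk_transcript(segments, max_chars=80))
```

Each resulting chunk is then embedded and stored exactly like a piece of web page text, which is why querying the video collection looks identical to querying the site collection.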

Creating tools to query the site and YouTube video data

Before creating the tools, remember that you have to run the two new functions to populate the vector database with our new sources.

Below is an example of tools to query the PIRLS site and the YouTube video.

@tool('search_in_pirls_site')
def search_in_pirls_site(question: str):
    """
    Search for information in the PIRLS 2021 results website.
    The main subjects on the results site are:
    - Overview of PIRLS
    - Countries’ Reading Achievement
    - Trends in Reading Achievement
    - Relative Achievement in Reading Purposes and Comprehension Processes
    - Performance at International Benchmarks
    - Home Environment Support
    - School Composition, Resources, and Climate
    - Students’ Reading Attitudes and Behaviors

    This function uses an embedchain app to query the PIRLS 2021 results website
    and retrieve relevant information based on the provided question.

    Args:
        question (str): The query or question to search for in the PIRLS 2021 results.

    Returns:
        dict: A dictionary containing:
            - 'source' (List[str]): The URLs of the PIRLS 2021 results pages used as sources.
            - 'answer' (str): The answer retrieved from the embedchain app query.
    """

    app = App.from_config(config_path='src/rag/embedchain_site_config.yaml')
    app.llm.get_answer_from_llm = new_get_llm_model_answer.__get__(app.llm, type(app.llm))

    answer, sources = app.query(question, citations=True)
    sources = [item[1]['url'] for item in sources]
    sources = list(set(sources))

    result = {
        'source': sources,
        'answer': answer
    }

    return result

@tool('search_in_youtube_videos')
def search_in_youtube_videos(question: str):
    """
    Search for information in the PIRLS 2021 results YouTube video.

    This function uses an embedchain app to query the PIRLS 2021 results YouTube video
    and retrieve relevant information based on the provided question.

    Args:
        question (str): The query or question to search for in the PIRLS 2021 results.

    Returns:
        dict: A dictionary containing:
            - 'source' (List[str]): The URL of the PIRLS 2021 results YouTube video.
            - 'answer' (str): The answer retrieved from the embedchain app query.
    """

    app = App.from_config(config_path='src/rag/embedchain_youtube_config.yaml')
    app.llm.get_answer_from_llm = new_get_llm_model_answer.__get__(app.llm, type(app.llm))

    answer, sources = app.query(question, citations=True)
    sources = [item[1]['url'] for item in sources]
    sources = list(set(sources))

    result = {
        'source': sources,
        'answer': answer
    }

    return result
