Revisions

  1. gehlotabhishek revised this gist Jul 3, 2024. No changes.
  2. gehlotabhishek revised this gist Jan 18, 2024. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion Core Concepts For building an LLM based application.md
    Original file line number Diff line number Diff line change
    @@ -138,7 +138,7 @@ Now, where does Aubai store all this information? In a vector database.
    A vector database is a type of database that indexes and stores vector embeddings for fast retrieval and similarity search, with capabilities like CRUD operations, and metadata filtering.
    We use [ChromaDB](https://www.trychroma.com/), which is an open-source vector database.

    ## Implementation of VectorEmbeddings on Policy Documents
    ## Implementation of VectorEmbeddings and Vector Database on Policy Documents

    A core component of building Aubai is `loading` the policy data into the vector database without losing its meaning. First, we `load` and `parse` company policies, `chunk` them into digestible pieces, and then convert them into `vector embeddings`. Next, we create a collection in ChromaDB for these vectors, much as a librarian might catalog new books.

  3. gehlotabhishek revised this gist Jan 18, 2024. 1 changed file with 7 additions and 5 deletions.
    12 changes: 7 additions & 5 deletions Core Concepts For building an LLM based application.md
    Original file line number Diff line number Diff line change
    @@ -15,7 +15,9 @@ Constructing a chatbot like Aubai isn't overly complex, yet it involves integrat
    6. Vector Database

    ## Large Language model (LLM)
    Our journey begins with understanding the brains behind Aubai—the Large Language Model (LLM). An LLM is like a vast library of information and conversations, all bundled into a digital brain. Aubai is built on state-of-the-art LLMs such as OpenAI's GPT-3.5 Turbo and the even more advanced GPT-4-Turbo. These models can handle complex tasks, from generating human-like chat responses to executing various out-of-the-box functions accessible through APIs like Function calling, Tools, and JSON response. To use the LLM for the chatbot use case we have to understand a little bit about the input of the model. Let's quickly go through the concept of *context window* and *tokens*.
    Our journey begins with understanding the brain behind Aubai—the Large Language Model (LLM). An LLM is like a vast library of information and conversations, all bundled into a digital brain. Aubai is built on state-of-the-art LLMs such as OpenAI's GPT-3.5 Turbo and the even more advanced GPT-4-Turbo.
    Large language models, when implemented as chatbots, offer incredible versatility. They can handle various tasks such as answering questions, summarizing information, translating languages, and completing sentences in a conversational manner. By integrating LLMs into chatbots, we can transform the way users interact with virtual assistants, enhancing their ability to engage in natural and dynamic conversations.
    GPT models by OpenAI provide out-of-the-box functions accessible through APIs like [Function calling](https://platform.openai.com/docs/guides/function-calling) and [JSON response](https://platform.openai.com/docs/api-reference/chat/create#chat-create-response_format). To use the LLM for the chatbot use case, we have to understand a little about the model's input. Let's quickly go through the concepts of *context window* and *tokens*.

    ### Context Window
    Think of a conversation you've been a part of. Remembering what was said a few sentences ago helps you keep track of the discussion. For Aubai, the chatbot's ability to "remember" is defined by what's called the Context Window. It's the span of the conversation that the AI can keep in mind when crafting its responses.
    @@ -25,7 +27,7 @@ Why does this matter? If the context window is too small, Aubai might "forget" e
    ### Tokens
    So, what's a token? In the language of LLMs, tokens are like puzzle pieces of language—they can be whole words, parts of words, or even punctuation. When you're chatting with Aubai, it slices your sentences into these tokens to process and understand what you're saying.

    Here's a cool analogy: think of tokens as the bread in a sandwich. Your whole sentence is the whole sandwich, while tokens are the individual slices of bread that make it easier to handle and eat. They determine how much of your conversation Aubai can "chew on" in its context window. Head over to the OpenAI [tokenizer](https://platform.openai.com/tokenizer) page to learn more about the token count.
    Here's a cool analogy: think of tokens as ingredients in a recipe. Your sentence, in this context, is the dish. Each token, much like different ingredients, plays a role in contributing to the overall flavor and meaning of the conversation. Head over to the OpenAI [tokenizer](https://platform.openai.com/tokenizer) page to learn more about the tokens.

    ## Conversational Memory
    One of the critical features of Aubai is its conversational memory. Think of it as the chatbot being able to "remember" past interactions, much like a human conversation partner would. This continuity is essential for a chatbot, so it doesn't treat each new question as if it's the first time you've talked. Here’s a peek at how developers help Aubai remember using Python:
    @@ -51,12 +53,12 @@ user_query_2 = "Can I take a leave for more than 10 days consecutively?"

    # Initial interaction
    conversation_history.append({"role": "user", "content": user_query_1})
    response_1 = " ".join(response for response in leave_bot.llm_call(chain=conversation_history))
    response_1 = leave_bot.llm_call(chain=conversation_history)
    conversation_history.append({"role": "assistant", "content": response_1})

    # User's next query
    conversation_history.append({"role": "user", "content": user_query_2})
    response_2 = " ".join(response for response in leave_bot.llm_call(chain=conversation_history))
    response_2 = leave_bot.llm_call(chain=conversation_history)
    conversation_history.append({"role": "assistant", "content": response_2})

    ```
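Because the context window is finite, a long conversation history eventually has to be trimmed before each LLM call. Here is a minimal illustrative sketch; the `count_tokens` helper and the 4,096 budget are assumptions for the example, not part of Aubai's code:

```python
# Illustrative sketch: keep only as many recent messages as fit in the
# model's context window. count_tokens is a stand-in for a real
# tokenizer; here we approximate one token per whitespace-separated word.
def count_tokens(message):
    return len(message["content"].split())

def trim_history(history, max_tokens=4096):
    """Drop the oldest messages until the total fits the token budget."""
    trimmed = list(history)
    while trimmed and sum(count_tokens(m) for m in trimmed) > max_tokens:
        trimmed.pop(0)  # discard the oldest message first
    return trimmed

history = [
    {"role": "user", "content": "word " * 3000},  # a very long old message
    {"role": "user", "content": "How many leave days do I get?"},
]
print(len(trim_history(history, max_tokens=2000)))  # the long message is dropped
```

In practice, you would use the model's real tokenizer for counting, and you might summarize old turns instead of dropping them outright.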
    @@ -67,7 +69,7 @@ To see the full implementation, make sure to check out the [Brain](https://githu
    ## Prompt Engineering
    Prompt engineering is a fresh field that focuses on creating and fine-tuning prompts to make language models (LMs) more efficient across various applications. Having prompt engineering skills helps in understanding what large language models (LLMs) can and cannot do.

    At its core, it is about arranging text in a way that an AI model can understand what a user wants to accomplish. A prompt is the task described in plain language that the AI needs to do. prompt engineering is a must to enhance the performance of LLMs in tasks like question answering and Chatbots.
    At its core, it is about arranging text so that an AI model can understand what a user wants to accomplish. A prompt is the task described in plain language that the AI needs to perform. Prompt engineering is a must to enhance the performance of LLMs in tasks like question answering and chatbots, and to keep the bot within its intended constraints: you do not want your domain-specific bot to answer questions outside your domain, even if a user asks.

    To chat effectively, Aubai needs clear instructions—this is where prompt engineering comes into play. It's all about framing questions and prompts in a way that guides the AI to respond correctly. For instance, we might use a method called `Chain of Thought` to solve problems step-by-step:

  4. gehlotabhishek renamed this gist Jan 18, 2024. 1 changed file with 0 additions and 0 deletions.
  5. gehlotabhishek renamed this gist Jan 18, 2024. 1 changed file with 1 addition and 1 deletion.
    Original file line number Diff line number Diff line change
    @@ -1,7 +1,7 @@

    # Supercharge your work with Aubai: real-time knowledge and data Insights – All with a Dash of Aubergine Flair!

    ### This gist focuses on core concepts to build any type of chatbot application around LLMs. I have taken a specific use case for policy documents, but developers can apply the same foundational concepts to their unique use cases. The principles discussed here provide a versatile framework applicable across various scenarios.
    ### This gist focuses on core concepts that will help you build and connect any type of application around LLMs. I have taken a specific use case of a chatbot for policy documents, but developers can apply the same foundational concepts to their unique use cases. The principles discussed here provide a versatile framework applicable across various scenarios.

    Imagine you have questions about your company's leave policy. Who do you turn to? For many, it might be a colleague or an HR representative. But what if there were an even easier way—a way that didn't involve waiting for email replies or tracking someone down at their desk? Enter Aubai, an advanced chatbot powered by the latest AI technologies that can answer all your policy-related queries instantly. Let's dive into the world of Aubai and explore how it's built to make your life easier.

  6. gehlotabhishek renamed this gist Jan 18, 2024. 1 changed file with 0 additions and 0 deletions.
  7. gehlotabhishek revised this gist Jan 18, 2024. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion Core Concepts To build an Application around LLMs.md
    Original file line number Diff line number Diff line change
    @@ -1,7 +1,7 @@

    # Supercharge your work with Aubai: real-time knowledge and data Insights – All with a Dash of Aubergine Flair!

    ### This gist focuses on core concepts to build any application around LLMs. I have taken a specific use case for policy documents, but developers can apply the same foundational concepts to their unique use cases. The principles discussed here provide a versatile framework applicable across various scenarios.
    ### This gist focuses on core concepts to build any type of chatbot application around LLMs. I have taken a specific use case for policy documents, but developers can apply the same foundational concepts to their unique use cases. The principles discussed here provide a versatile framework applicable across various scenarios.

    Imagine you have questions about your company's leave policy. Who do you turn to? For many, it might be a colleague or an HR representative. But what if there were an even easier way—a way that didn't involve waiting for email replies or tracking someone down at their desk? Enter Aubai, an advanced chatbot powered by the latest AI technologies that can answer all your policy-related queries instantly. Let's dive into the world of Aubai and explore how it's built to make your life easier.

  8. gehlotabhishek revised this gist Jan 17, 2024. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion Core Concepts To build an Application around LLMs.md
    Original file line number Diff line number Diff line change
    @@ -1,7 +1,7 @@

    # Supercharge your work with Aubai: real-time knowledge and data Insights – All with a Dash of Aubergine Flair!

    ### This gist focuses on a specific use case for policy documents, but developers can apply the same foundational concepts to their unique use cases. The principles discussed here provide a versatile framework applicable across various scenarios.
    ### This gist focuses on core concepts to build any application around LLMs. I have taken a specific use case for policy documents, but developers can apply the same foundational concepts to their unique use cases. The principles discussed here provide a versatile framework applicable across various scenarios.

    Imagine you have questions about your company's leave policy. Who do you turn to? For many, it might be a colleague or an HR representative. But what if there were an even easier way—a way that didn't involve waiting for email replies or tracking someone down at their desk? Enter Aubai, an advanced chatbot powered by the latest AI technologies that can answer all your policy-related queries instantly. Let's dive into the world of Aubai and explore how it's built to make your life easier.

  9. gehlotabhishek revised this gist Jan 17, 2024. No changes.
  10. gehlotabhishek renamed this gist Jan 17, 2024. 1 changed file with 0 additions and 0 deletions.
  11. gehlotabhishek revised this gist Jan 17, 2024. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions Aubai.md
    Original file line number Diff line number Diff line change
    @@ -1,6 +1,8 @@

    # Supercharge your work with Aubai: real-time knowledge and data Insights – All with a Dash of Aubergine Flair!

    ### This gist focuses on a specific use case for policy documents, but developers can apply the same foundational concepts to their unique use cases. The principles discussed here provide a versatile framework applicable across various scenarios.

    Imagine you have questions about your company's leave policy. Who do you turn to? For many, it might be a colleague or an HR representative. But what if there were an even easier way—a way that didn't involve waiting for email replies or tracking someone down at their desk? Enter Aubai, an advanced chatbot powered by the latest AI technologies that can answer all your policy-related queries instantly. Let's dive into the world of Aubai and explore how it's built to make your life easier.

    ### The Rocket Science Behind Aubai
  12. gehlotabhishek revised this gist Jan 11, 2024. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion Aubai.md
    Original file line number Diff line number Diff line change
    @@ -4,7 +4,7 @@
    Imagine you have questions about your company's leave policy. Who do you turn to? For many, it might be a colleague or an HR representative. But what if there were an even easier way—a way that didn't involve waiting for email replies or tracking someone down at their desk? Enter Aubai, an advanced chatbot powered by the latest AI technologies that can answer all your policy-related queries instantly. Let's dive into the world of Aubai and explore how it's built to make your life easier.

    ### The Rocket Science Behind Aubai
    Building a chatbot like Aubai isn't rocket science, but it does involve some stellar technology. At its heart, Aubai is a constellation of six key components:
    Constructing a chatbot like Aubai isn't overly complex, yet it involves integrating six crucial components at its core.
    1. Large Language model(LLM)
    2. Conversational Memory
    3. Prompt Engineering
  13. gehlotabhishek renamed this gist Jan 11, 2024. 1 changed file with 0 additions and 0 deletions.
    File renamed without changes.
  14. gehlotabhishek revised this gist Jan 11, 2024. 1 changed file with 40 additions and 32 deletions.
    72 changes: 40 additions & 32 deletions Aubai Backend Implementation.md
    Original file line number Diff line number Diff line change
    @@ -1,36 +1,35 @@

    # Supercharge your work with Aubai: real-time knowledge and data Insights – All with a Dash of Aubergine Flair!

    Imagine you have questions about your company's leave policy. Who do you turn to? For many, it might be a colleague or an HR representative. But what if there were an even easier way—a way that didn't involve waiting for email replies or tracking someone down at their desk? Enter Aubai, an advanced chatbot powered by the latest AI technologies that can answer all your policy-related queries instantly. Let's dive into the world of Aubai and explore how it's built to make your life easier.

    # Aubai Implementation and concepts

    ## Aubai is an LLM-based chatbot assistant that is designed to answer questions related to company policies.

    ### Key concepts for the approach and implementation of Aubai:
    There are some key components you have to be clear about while building a chatbot like Aubai.
    ### The Rocket Science Behind Aubai
    Building a chatbot like Aubai isn't rocket science, but it does involve some stellar technology. At its heart, Aubai is a constellation of six key components:
    1. Large Language model(LLM)
    2. Conversational Memory
    3. Prompt Engineering
    4. Function Calling
    5. Vector Embeddings
    6. Vector Database

    ## Large Language model(LLM)
    There are many different LLMs available in the market which are fine-tuned for generating chat-like responses. among them, one of the best models is Openai's GPT-3.5 Turbo and GPT-4-Turbo (current).
    There are multiple reasons to select these models. These are the biggest models available in the current market which can be easily accessible by APIs and hosted by Openai apart from generating chat-like responses they also provide out-of-the-box functions like Function calling, Tools, and JSON response.
    ## Large Language model (LLM)
    Our journey begins with understanding the brains behind Aubai—the Large Language Model (LLM). An LLM is like a vast library of information and conversations, all bundled into a digital brain. Aubai is built on state-of-the-art LLMs such as OpenAI's GPT-3.5 Turbo and the even more advanced GPT-4-Turbo. These models can handle complex tasks, from generating human-like chat responses to executing various out-of-the-box functions accessible through APIs like Function calling, Tools, and JSON response. To use the LLM for the chatbot use case we have to understand a little bit about the input of the model. Let's quickly go through the concept of *context window* and *tokens*.

    ### Context Window
    Context window is the maximum amount of text the model can consider at any one time when generating a response. the memory of any LLM-based system is also affected by the size of the context window, you will get the idea when you read about conversational memory.
    The context window is very large for GPT models. for example, GPT-3.5-Turbo has 4,096 tokens where GPT-4-Turbo has 128,000 tokens.
    Think of a conversation you've been a part of. Remembering what was said a few sentences ago helps you keep track of the discussion. For Aubai, the chatbot's ability to "remember" is defined by what's called the Context Window. It's the span of the conversation that the AI can keep in mind when crafting its responses.

    Why does this matter? If the context window is too small, Aubai might "forget" earlier parts of the conversation, which isn't ideal when you need coherent, consistent help. That's why the GPT models are equipped with pretty large context windows—GPT-3.5 Turbo can handle up to 4,096 tokens of text, while the newer GPT-4 Turbo can consider a massive 128,000 tokens at once!

    ### Tokens
    Tokens can be viewed as fragments of words. When the API deals with prompts, it breaks down the input into tokens. These tokens may not exactly match where words start or end; they can have spaces and parts of sub-words. To know how to count tokens please refer to https://platform.openai.com/tokenizer.
    The size of the context window is very important when you want a long continuous interaction with LLM in your use case.
    So, what's a token? In the language of LLMs, tokens are like puzzle pieces of language—they can be whole words, parts of words, or even punctuation. When you're chatting with Aubai, it slices your sentences into these tokens to process and understand what you're saying.

    Here's a cool analogy: think of tokens as the bread in a sandwich. Your whole sentence is the whole sandwich, while tokens are the individual slices of bread that make it easier to handle and eat. They determine how much of your conversation Aubai can "chew on" in its context window. Head over to the OpenAI [tokenizer](https://platform.openai.com/tokenizer) page to learn more about the token count.

    ## Conversational Memory
    Conversational memory gives power to an LLM-based chatbot to respond like it's having a real conversation. It helps the chatbot to remember what was said before, so each question isn't treated separately, and it considers past talks to give better replies. from the programming perspective, developers have to somehow keep the history of the interaction going with the LLM and feed it with each next input. without it, every query would be treated as an entirely independent input without considering past interactions.
    Example of conversational memory using Aubai's brain class.
    One of the critical features of Aubai is its conversational memory. Think of it as the chatbot being able to "remember" past interactions, much like a human conversation partner would. This continuity is essential for a chatbot, so it doesn't treat each new question as if it's the first time you've talked. Here’s a peek at how developers help Aubai remember using Python:
    ```python
    # An example snippet from Aubai's memory system

    from brain import Brain

    # Example conversation history
    @@ -59,18 +58,18 @@ response_2 = " ".join(response for response in leave_bot.llm_call(chain=conversa
    conversation_history.append({"role": "assistant", "content": response_2})

    ```
    Checkout the implementation for more details GitHub link: https://github.com/gehlotabhishek/aubai/blob/aubai-dev/code/brain.py
    To see the full implementation, make sure to check out the [Brain](https://github.com/gehlotabhishek/aubai/blob/aubai-dev/code/brain.py) class in the Aubai GitHub repo.



    ## Prompt Engineering
    Prompt engineering is a fresh field that focuses on creating and fine-tuning prompts to make language models (LMs) more efficient across various applications and research areas. Having prompt engineering skills helps in understanding what large language models (LLMs) can and cannot do.
    Prompt engineering is a fresh field that focuses on creating and fine-tuning prompts to make language models (LMs) more efficient across various applications. Having prompt engineering skills helps in understanding what large language models (LLMs) can and cannot do.

    At its core, it is about arranging text in a way that an AI model can understand what a user wants to accomplish. A prompt is the task described in plain language that the AI needs to do. prompt engineering is a must to enhance the performance of LLMs in tasks like question answering and Chatbots.
    Developers leverage prompt engineering to craft reliable techniques for effective communication with LLMs and other tools.

    One example will be the `Chain of thought`,
    Prompt:
    To chat effectively, Aubai needs clear instructions—this is where prompt engineering comes into play. It's all about framing questions and prompts in a way that guides the AI to respond correctly. For instance, we might use a method called `Chain of Thought` to solve problems step-by-step:

    ***User Prompt:***
    ```python
    The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
    A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False.
    @@ -83,30 +82,31 @@ A: Adding all the odd numbers (17, 9, 13) gives 39. The answer is False.
    The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
    A:
    ```
    Output:
    ***LLM Output:***
    ```python
    Adding all the odd numbers (15, 5, 13, 7, 1) gives 41. The answer is False.
    ```

    You can perform the above similarly with Zero-shot COT Prompting,
    Prompt:
    You can perform the above similarly with the `Zero-shot COT` Prompt technique,

    ***User Prompt:***
    ```python
    Note: 'Let's Think step by step' <---- add this special prompt.
    Question: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
    Answer:
    ```
    Output:
    ***LLM Output:***
    ```python
    There are 16 balls in total. Half of the balls are golf balls. That means that there are 8 golf balls. Half of the golf balls are blue. That means that there are 4 blue golf balls.
    ```


    To complete tasks LLM needs to understand the details of the tasks as much as possible and simply by just asking in one go may not solve the problem at this time different prompting techniques come into the picture. where adding some extra details related to the task or passing a specific prompt which tells the LLM to think differently for example, a vastly use method is the chain of thought technique. To implement the COT add this prompt 'Think step by step' in your query. LLM will solve the given task in a step-by-step manner.
    By adding the prompt 'Think step by step' to your query, the LLM will solve the given task in a step-by-step manner. These techniques shape how Aubai understands and tackles the task at hand.
    To learn about different prompting techniques, please refer to https://www.promptingguide.ai/

    ## Function calling
    A function call is a very powerful approach to connecting an LLM with the external digital world.
    In an API request, you can specify functions, and the model will smartly create a JSON object with the necessary arguments for calling one or more functions. The OpenAI API doesn't execute the function; instead, it provides JSON that you can use in your code to make the function call.

    Aubai needs to access real-time data or specific document details, that's where function call comes into the picture. Aubai doesn’t make the calls itself—it generates JSON objects that your software can use to make those calls and retrieve the information needed to respond accurately to your queries. Here’s a glance at how Aubai breaks down the task:
    ```mermaid
    sequenceDiagram
    User ->> Aubai: Can I take 13 days' leave in one go?
    @@ -120,7 +120,8 @@ Aubai ->> User: You cannot take 13 days' leave in one go.
    Check out the function calling implementation in the [Aubai](https://github.com/gehlotabhishek/aubai/tree/aubai-dev) GitHub repo.
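The function definitions the model chooses from are plain JSON schemas passed alongside the chat request. A minimal sketch of what one might look like for the leave example (the `get_leave_policy` name and its parameters are hypothetical, not Aubai's actual schema):

```python
# A hypothetical tool definition in the OpenAI function-calling format.
# The model never runs this function; it only returns a JSON object
# naming the function and its arguments, which your own code executes.
leave_policy_tool = {
    "type": "function",
    "function": {
        "name": "get_leave_policy",  # hypothetical function in your backend
        "description": "Fetch the company's leave policy details.",
        "parameters": {
            "type": "object",
            "properties": {
                "leave_type": {
                    "type": "string",
                    "description": "e.g. 'sick', 'casual', 'earned'",
                },
                "days_requested": {"type": "integer"},
            },
            "required": ["leave_type"],
        },
    },
}
print(leave_policy_tool["function"]["name"])
```

When the model decides this tool is needed, its response contains the function name and an arguments JSON string; your code parses that, calls the real function, and feeds the result back to the model.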

    ## Vector Embeddings
    Vector embeddings, in terms of Large Language Models (LLMs), are numerical representations of words, phrases, or even entire documents in a multi-dimensional space. These embeddings capture the semantic meaning of the text, allowing the LLM to understand and process language. Each point in the space represents a different word or text snippet, and the distance or angle between points reflects the similarity between their meanings. Embeddings enable the LLM to perform tasks such as classification, translation, and question answering more accurately.
    You might wonder how Aubai understands the meaning behind words or sentences stored in the policy documents. The secret lies in vector embeddings.
    Vector embeddings, in terms of Large Language Models (LLMs), are numerical representations of words, phrases, or even entire documents in a multi-dimensional space. These embeddings capture the semantic meaning of the text, allowing the LLM to understand and process language. Each point in the space represents a different word or text snippet. The distance or angle between points reflects the similarity between their meanings. Embeddings enable the LLM to perform tasks such as classification, translation, and question answering more accurately.

    ![](https://miro.medium.com/v2/resize:fit:700/1*jptD3Rur8gUOftw-XHrezQ.png)

    @@ -129,11 +130,13 @@ Aubai utilizes `multi-qa-mpnet-base-dot-v1` for sentence-to-vector embeddings.
    `multi-qa-mpnet-base-dot-v1` is a sentence-transformers model. It maps sentences & paragraphs to a 768-dimensional dense vector space and was designed for semantic search. We chose this model because it has been trained on 215M (question, answer) pairs from diverse sources, which matches our Q&A-style use case. You can check out the other models by [sentence transformers](https://www.sbert.net/) and their implementations.
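To make "distance reflects similarity" concrete, here is a toy sketch using made-up 3-dimensional vectors (real embeddings from a model like `multi-qa-mpnet-base-dot-v1` have 768 dimensions; the values below are invented for illustration):

```python
import numpy as np

# Toy 3-d "embeddings"; a real model maps sentences to 768-d vectors.
leave_query = np.array([0.9, 0.1, 0.0])    # "Can I take 13 days' leave?"
leave_policy = np.array([0.8, 0.2, 0.1])   # a semantically close passage
lunch_menu = np.array([0.0, 0.1, 0.9])     # an unrelated topic

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(leave_query, leave_policy))  # high (same topic)
print(cosine_similarity(leave_query, lunch_menu))    # low (different topic)
```

Semantic search is then just "embed the query, return the stored texts whose vectors score highest against it".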

    ## Vector Database
    Now, where does Aubai store all this information? In a vector database.
    A vector database is a type of database that indexes and stores vector embeddings for fast retrieval and similarity search, with capabilities like CRUD operations, and metadata filtering.
    In Aubai we are using [ChromaDB](https://www.trychroma.com/) which is an open source vector database.
    We use [ChromaDB](https://www.trychroma.com/), which is an open-source vector database.
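To build intuition for what a vector database does, here is a toy in-memory version with the same basic operations (add, delete, similarity query). This is an illustrative sketch of the concept, not how ChromaDB is actually implemented:

```python
# Toy in-memory "vector database": stores (id, vector, metadata) records
# and answers nearest-neighbour queries by dot-product similarity.
class TinyVectorDB:
    def __init__(self):
        self.records = {}  # doc_id -> (vector, metadata)

    def add(self, doc_id, vector, metadata=None):
        self.records[doc_id] = (vector, metadata or {})

    def delete(self, doc_id):
        self.records.pop(doc_id, None)

    def query(self, vector, top_k=1):
        """Return the ids of the top_k most similar stored vectors."""
        def score(item):
            vec, _meta = item[1]
            return sum(a * b for a, b in zip(vector, vec))
        ranked = sorted(self.records.items(), key=score, reverse=True)
        return [doc_id for doc_id, _ in ranked[:top_k]]

db = TinyVectorDB()
db.add("leave-policy", [0.9, 0.1], {"topic": "leave"})
db.add("lunch-menu", [0.1, 0.9], {"topic": "food"})
print(db.query([0.8, 0.2]))  # ['leave-policy']
```

A real vector database adds fast approximate-nearest-neighbour indexes and metadata filtering on top of this idea, so queries stay quick even over millions of vectors.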

    ## Implementation of VectorEmbeddings on Policy Documents

    ## Implementation of vector embeddings
    Before converting the text data into vector embedding, data should be divided into batches of chunks. The chunking process is highly dependent on the use case and the input format (pdf, txt, CSV, JSON, docx), in our use case we have to store the policy documents which are in pdf format. Let's see the approach of chunking the policy documents, converting them into vector embeddings, and storing them in our Chromadb vector database.
    A core component of building Aubai is `loading` the policy data into the vector database without losing its meaning. First, we `load` and `parse` company policies, `chunk` them into digestible pieces, and then convert them into `vector embeddings`. Next, we create a collection in ChromaDB for these vectors, much as a librarian might catalog new books.

    ### Loading, parsing, and chunking
    ```python
    @@ -186,7 +189,7 @@ def data_loader():
    # Display the total number of chunks
    print(f"\nTotal chunks: {len(character_split_texts)}")

    # Append each chunk to the data list with associated topic
    # Append each chunk to the data list with an associated topic
    for chunk in character_split_texts:
    all_data.append({"topic": file_name, "content": chunk})

    @@ -325,3 +328,8 @@ def data_loader():
    return True
    ```
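The chunking step itself can be as simple as splitting on a character budget with a small overlap between neighbouring chunks. A minimal sketch (the 200-character size and 20-character overlap are illustrative choices, not Aubai's actual settings):

```python
def chunk_text(text, chunk_size=200, overlap=20):
    """Split text into fixed-size chunks that overlap slightly, so a
    sentence cut at a chunk boundary still appears whole in one chunk."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some overlap
    return chunks

policy_text = "Employees may take up to 10 consecutive days of leave. " * 10
chunks = chunk_text(policy_text)
print(len(chunks), len(chunks[0]))  # 4 200
```

Chunk size is a trade-off: too small and chunks lose surrounding context; too large and a single retrieved chunk eats up the LLM's context window.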

    Once indexed in our vector database, these embeddings are ready for Aubai to search through. This means when you ask about the number of leave days you can take, Aubai consults its *digital library* to fetch the most accurate and relevant policy details to answer your question.

    ## Conclusion

    Aubai isn’t just a chatbot; it’s a sophisticated AI assistant, ready to guide you through the maze of company policies with ease. By integrating cutting-edge AI technologies, such as LLMs, vector databases, and prompt engineering, we’ve crafted an assistant that’s knowledgeable, efficient, and incredibly user-friendly. So, the next time you’ve got a question about your company’s policies, turn to Aubai and chat away—getting the answers you need has never been easier.
  15. gehlotabhishek revised this gist Jan 11, 2024. 1 changed file with 34 additions and 33 deletions.
    67 changes: 34 additions & 33 deletions Aubai Backend Implementation.md
    Original file line number Diff line number Diff line change
    @@ -1,10 +1,11 @@



    # Aubai Implementation and concepts

    ## Aubai is an LLM based chatbot assistant which is design to answer questions realted to company policies.
    ## Aubai is an LLM-based chatbot assistant that is designed to answer questions related to company policies.

    ### Key concepts for the approach and implemenation of Aubai:
    ### Key concepts for the approach and implementation of Aubai:
    There are some key components you have to be clear about while building a chatbot like Aubai.
    1. Large Language model(LLM)
    2. Conversational Memory
    @@ -14,22 +15,22 @@ There are some key components you have to be clear about while building a chatbo
    6. Vector Database

## Large Language model (LLM)
There are many different LLMs available in the market which are fine-tuned for generating chat-like responses. Among them, some of the best models are OpenAI's GPT-3.5 Turbo and GPT-4-Turbo (current).
There are multiple reasons to select these models: they are the largest models available in the current market, they are easily accessible through APIs hosted by OpenAI, and apart from generating chat-like responses they also provide out-of-the-box features like Function calling, Tools, and JSON response.

    ### Context Window
The context window is the maximum amount of text the model can consider at any one time when generating a response. The memory of any LLM-based system is also affected by the size of the context window; you will see why when you read about conversational memory.
The context window is very large for GPT models. For example, GPT-3.5-Turbo has 4,096 tokens while GPT-4-Turbo has 128,000 tokens.

    ### Tokens
Tokens can be viewed as fragments of words. When the API deals with prompts, it breaks down the input into tokens. These tokens may not exactly match where words start or end; they can include spaces and parts of sub-words. To learn how to count tokens, please refer to https://platform.openai.com/tokenizer.
The size of the context window is very important when you want a long, continuous interaction with the LLM in your use case.
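For a quick back-of-the-envelope check against a context window, a common heuristic is roughly 4 characters per token for English text. The sketch below uses that approximation (the 4-chars-per-token ratio is an assumption for illustration, not the model's real tokenizer — use the tokenizer linked above for exact counts):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def fits_context(messages, limit=4096):
    # Sum the estimated tokens of every message against the model's window
    # (4,096 for GPT-3.5-Turbo, 128,000 for GPT-4-Turbo).
    total = sum(estimate_tokens(m["content"]) for m in messages)
    return total <= limit

history = [{"role": "user", "content": "What is the leave policy of the company?"}]
print(fits_context(history))  # a single short message easily fits
```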


    ## Conversational Memory
Conversational memory gives an LLM-based chatbot the power to respond as if it were having a real conversation. It helps the chatbot remember what was said before, so each question isn't treated separately and past exchanges are considered when generating replies. From a programming perspective, the developer has to keep a history of the ongoing interaction with the LLM and feed it in with each new input; without it, every query would be treated as an entirely independent input.
An example of conversational memory using Aubai's Brain class:
```python
    from brain import Brain

    # Example conversation history
# ...
response_2 = " ".join(response for response in leave_bot.llm_call(chain=conversation_history))
conversation_history.append({"role": "assistant", "content": response_2})

    ```
Check out the implementation for more details on GitHub: https://github.com/gehlotabhishek/aubai/blob/aubai-dev/code/brain.py
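Because the whole history is re-sent on every call, it has to keep fitting inside the context window; a common approach is to drop the oldest turns while always keeping the system message. A minimal sketch of that idea (the word-count proxy and budget below are illustrative assumptions, not Aubai's actual trimming logic):

```python
def trim_history(history, max_words=3000):
    # Keep the system message, then retain only the most recent turns that fit.
    system = [m for m in history if m["role"] == "system"]
    turns = [m for m in history if m["role"] != "system"]
    kept, used = [], 0
    for msg in reversed(turns):          # walk from newest to oldest
        words = len(msg["content"].split())
        if used + words > max_words:
            break
        kept.append(msg)
        used += words
    return system + list(reversed(kept))  # restore chronological order

history = [
    {"role": "system", "content": "LeaveBot"},
    {"role": "user", "content": "Hey"},
    {"role": "assistant", "content": "Hello there! How can I assist you today?"},
    {"role": "user", "content": "How many leave days am I entitled to per year?"},
]
trimmed = trim_history(history, max_words=12)
print([m["role"] for m in trimmed])
```

With a generous budget the history passes through unchanged; with a tight one, only the system message and the latest user turn survive.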



    ## Prompt Engineering
Prompt engineering is an emerging field that focuses on creating and fine-tuning prompts to make language models (LMs) more effective across various applications and research areas. Having prompt-engineering skills helps in understanding what large language models (LLMs) can and cannot do.

At its core, it is about arranging text so that an AI model can understand what a user wants to accomplish. A prompt is the task, described in plain language, that the AI needs to perform. Prompt engineering is a must for enhancing the performance of LLMs in tasks like question answering and chatbots.
    Developers leverage prompt engineering to craft reliable techniques for effective communication with LLMs and other tools.

One example is the `Chain of Thought` technique.
    Prompt:
```python
    The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
    A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False.
    The odd numbers in this group add up to an even number: 17, 10, 19, 4, 8, 12, 24.
The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
    A:
    ```
    Output:
```python
    Adding all the odd numbers (15, 5, 13, 7, 1) gives 41. The answer is False.
    ```

You can achieve the above similarly with Zero-shot CoT prompting.
    Prompt:
```python
    Note: 'Let's Think step by step' <---- add this special prompt.
    Question: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
    Answer:
    ```
    Output:
```python
    There are 16 balls in total. Half of the balls are golf balls. That means that there are 8 golf balls. Half of the golf balls are blue. That means that there are 4 blue golf balls.
    ```


To complete a task, the LLM needs to understand its details as fully as possible, and simply asking in one go may not solve the problem. This is where different prompting techniques come into the picture: adding extra details related to the task, or passing a specific phrase that tells the LLM to think differently. For example, a widely used method is the Chain-of-Thought technique: add the phrase 'Think step by step' to your query, and the LLM will solve the given task in a step-by-step manner.
To understand different prompting techniques, please refer to https://www.promptingguide.ai/

    ## Function calling
Function calling is a very powerful approach to connecting an LLM with the external digital world.
In an API request, you can specify functions, and the model will smartly create a JSON object with the necessary arguments for calling one or more functions. The OpenAI API doesn't execute the function; instead, it returns JSON that you can use in your code to make the function call.
    ```mermaid
    sequenceDiagram
User ->> Aubai: Can I take 13 days' leave in one go?
    Aubai-->>Aubai: Analysed the query, need to call a function to get the information related to leave policy.
Aubai ->> ChromaDB Query Engine: query: can an employee take leaves for more than 10 days?
ChromaDB Query Engine ->> Vector Database: Get a list of related documents that are most similar and near to the query in vector space
    Vector Database ->> Aubai: <list of related documents which are most similar and near to the query in vector space>
    Aubai -->> Aubai: Curate the answer based on the user's query and database information
Aubai ->> User: You cannot take 13 days' leave in one go.
    ```
Check out the function-calling implementation on [Aubai](https://github.com/gehlotabhishek/aubai/tree/aubai-dev) GitHub.
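The key point is that the model only returns a JSON description of the call; your own code executes it. Below is a minimal sketch of that dispatch step, where both the `search_policy` helper and the JSON payload are made-up stand-ins for the ChromaDB query engine and the model's output:

```python
import json

def search_policy(query: str) -> str:
    # Stand-in for the real ChromaDB query-engine lookup.
    return f"Top policy chunks for: {query!r}"

# Functions the application exposes to the model
AVAILABLE_FUNCTIONS = {"search_policy": search_policy}

# A hypothetical function-call payload, shaped like what the model returns
model_output = json.dumps({
    "name": "search_policy",
    "arguments": {"query": "can an employee take leaves for more than 10 days?"},
})

# The application parses the JSON and performs the actual call itself
call = json.loads(model_output)
result = AVAILABLE_FUNCTIONS[call["name"]](**call["arguments"])
print(result)
```

The result would then be appended to the conversation history so the model can curate its final answer from it.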

    ## Vector Embeddings
    Vector embeddings, in terms of Large Language Models (LLMs), are numerical representations of words, phrases, or even entire documents in a multi-dimensional space. These embeddings capture the semantic meaning of the text, allowing the LLM to understand and process language. Each point in the space represents a different word or text snippet, and the distance or angle between points reflects the similarity between their meanings. Embeddings enable the LLM to perform tasks such as classification, translation, and question answering more accurately.

    ![](https://miro.medium.com/v2/resize:fit:700/1*jptD3Rur8gUOftw-XHrezQ.png)

There are different models available to generate vector embeddings from text.
Aubai utilizes `multi-qa-mpnet-base-dot-v1` for sentence-to-vector embeddings.
`multi-qa-mpnet-base-dot-v1` is a sentence-transformers model. It maps sentences & paragraphs to a 768-dimensional dense vector space and was designed for semantic search. The reason we chose this model is that it has been trained on 215M (question, answer) pairs from diverse sources, which matches our use case, which is in the form of Q&A. You can check out the other models by [sentence transformers](https://www.sbert.net/) and their implementations.
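As a toy illustration of "distance reflects similarity" in such a dense vector space (the 3-dimensional vectors below are made up for readability; real sentence embeddings have 768 dimensions):

```python
import math

def cosine_similarity(a, b):
    # Angle-based similarity between two dense vectors: 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Made-up 3-d vectors standing in for 768-d sentence embeddings
leave_q   = [0.9, 0.1, 0.0]   # "How many leave days do I get?"
leave_doc = [0.8, 0.2, 0.1]   # a leave-policy chunk
wifi_doc  = [0.0, 0.1, 0.9]   # an office-wifi chunk

# The leave question sits much closer to the leave-policy chunk
print(cosine_similarity(leave_q, leave_doc) > cosine_similarity(leave_q, wifi_doc))  # True
```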

    ## Vector Database
A vector database is a type of database that indexes and stores vector embeddings for fast retrieval and similarity search, with capabilities like CRUD operations and metadata filtering.
In Aubai we are using [ChromaDB](https://www.trychroma.com/), which is an open-source vector database.

    ## Implementation of vector embeddings
Before converting the text data into vector embeddings, the data should be divided into batches of chunks. The chunking process is highly dependent on the use case and the input format (pdf, txt, csv, json, docx). In our use case we have to store the policy documents, which are in PDF format. Let's look at the approach of chunking the policy documents, converting them into vector embeddings, and storing them in our ChromaDB vector database.

    ### Loading, parsing, and chunking
    ```python
    import os
    from decouple import config
    from chromadb.utils import embedding_functions
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pypdf import PdfReader

    def data_loader():
    # Set the directory path where PDF files are stored
    data_dir = 'data/policy'

    # Check if the directory exists
    if not os.path.exists(data_dir) or not os.path.isdir(data_dir):
    print("The 'data' directory does not exist in the current working directory.")
    return

  16. gehlotabhishek revised this gist Jan 11, 2024. 1 changed file with 167 additions and 76 deletions.
    243 changes: 167 additions & 76 deletions Aubai Backend Implementation.md

# Aubai Implementation and Concepts

## Aubai is an LLM-based chatbot assistant that is designed to answer questions related to company policies.

### Key concepts for the approach and implementation of Aubai:
    There are some key components you have to be clear about while building a chatbot like Aubai.

## Large Language model (LLM)
There are many different LLMs available in the market which are fine-tuned for generating chat-like responses. Among them, some of the best models are OpenAI's GPT-3.5 Turbo and GPT-4-Turbo (current).
There are multiple reasons to select these models: they are the largest models available in the current market, they are easily accessible through APIs hosted by OpenAI, and apart from generating chat-like responses they also provide out-of-the-box features like Function calling, Tools, and JSON response.

### Context Window
The context window is the maximum amount of text the model can consider at any one time when generating a response. The memory of any LLM-based system is also affected by the size of the context window; you will see why when you read about conversational memory.
The context window is very large for GPT models. For example, GPT-3.5-Turbo has 4,096 tokens while GPT-4-Turbo has 128,000 tokens.

### Tokens
Tokens can be viewed as fragments of words. When the API deals with prompts, it breaks down the input into tokens. These tokens may not exactly match where words start or end; they can include spaces and parts of sub-words. To learn how to count tokens, please refer to https://platform.openai.com/tokenizer.
The size of the context window is very important when you want a long, continuous interaction with the LLM in your use case.


    ## Conversational Memory
Conversational memory gives an LLM-based chatbot the power to respond as if it were having a real conversation. It helps the chatbot remember what was said before, so each question isn't treated separately and past exchanges are considered when generating replies. From a programming perspective, the developer has to keep a history of the ongoing interaction with the LLM and feed it in with each new input; without it, every query would be treated as an entirely independent input.
An example of conversational memory using Aubai's Brain class:
```python
from brain import Brain

# Example conversation history
conversation_history = [
    {"role": "system", "content": "LeaveBot"},
    {"role": "user", "content": "Hey"},
    {"role": "assistant", "content": "Hello there! How can I assist you today? If you have any questions related to Aubergine Solutions company's policy or guidelines, feel free to ask!"},
]

# Instantiate the Brain
# ...

user_query_1 = "How many leave days am I entitled to per year?"
user_query_2 = "Can I take a leave for more than 10 days consecutively?"

# Initial interaction
conversation_history.append({"role": "user", "content": user_query_1})
response_1 = " ".join(response for response in leave_bot.llm_call(chain=conversation_history))
conversation_history.append({"role": "assistant", "content": response_1})

# User's next query
conversation_history.append({"role": "user", "content": user_query_2})
response_2 = " ".join(response for response in leave_bot.llm_call(chain=conversation_history))
conversation_history.append({"role": "assistant", "content": response_2})
```
Check out the implementation for more details on GitHub: https://github.com/gehlotabhishek/aubai/blob/aubai-dev/code/brain.py
Check out the function-calling implementation on [Aubai](https://github.com/gehlotabhishek/aubai/tree/aubai-dev) GitHub.

    ## Vector Embeddings
    Vector embeddings, in terms of Large Language Models (LLMs), are numerical representations of words, phrases, or even entire documents in a multi-dimensional space. These embeddings capture the semantic meaning of the text, allowing the LLM to understand and process language. Each point in the space represents a different word or text snippet, and the distance or angle between points reflects the similarity between their meanings. Embeddings enable the LLM to perform tasks such as classification, translation, and question answering more accurately.

    ![](https://miro.medium.com/v2/resize:fit:700/1*jptD3Rur8gUOftw-XHrezQ.png)

There are different models available to generate vector embeddings from text.
Aubai utilizes `multi-qa-mpnet-base-dot-v1` for sentence-to-vector embeddings.
`multi-qa-mpnet-base-dot-v1` is a sentence-transformers model. It maps sentences & paragraphs to a 768-dimensional dense vector space and was designed for semantic search. The reason we chose this model is that it has been trained on 215M (question, answer) pairs from diverse sources, which matches our use case, which is in the form of Q&A. You can check out the other models by [sentence transformers](https://www.sbert.net/) and their implementations.

    ## Vector Database
A vector database is a type of database that indexes and stores vector embeddings for fast retrieval and similarity search, with capabilities like CRUD operations and metadata filtering.
In Aubai we are using [ChromaDB](https://www.trychroma.com/), which is an open-source vector database.

    ## Implementation of vector embeddings
Before converting the text data into vector embeddings, the data should be divided into batches of chunks. The chunking process is highly dependent on the use case and the input format (pdf, txt, csv, json, docx). In our use case we have to store the policy documents, which are in PDF format. Let's look at the approach of chunking the policy documents, converting them into vector embeddings, and storing them in our ChromaDB vector database.

### Loading, parsing, and chunking
In the loading stage we have a function named `data_loader()` which loads the PDFs and does the parsing and chunking.
```python
    import os
    from decouple import config
    from chromadb.utils import embedding_functions
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from pypdf import PdfReader

    def data_loader():
    # Set the directory path where PDF files are stored
    data_dir = 'data/policy'

    # Check if the directory exists
    if not os.path.exists(data_dir) or not os.path.isdir(data_dir):
    print("The 'data' directory does not exist in the current working directory.")
    return

    # Initialize an empty list to store data
    all_data = []

    # Loop through each file in the directory
    for file in os.listdir(data_dir):
    if file.endswith(".pdf"):
    # Extract the file name without the extension
    file_name = file.split(".")[0]

    # Construct the full PDF file path
    pdf_url = os.path.join(data_dir, file)

    # Read the PDF file
    pdf_reader = PdfReader(pdf_url)

    # Extract text from each page, remove spaces, and join lines
    pdf_texts = [" ".join(page.extract_text().replace(" ", "").split('\n')) for page in pdf_reader.pages]

    # Remove empty texts
    pdf_texts = [text for text in pdf_texts if text]

    # Initialize a text splitter
    character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=0
    )

    # Split text into chunks
    character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

    # Display the total number of chunks
    print(f"\nTotal chunks: {len(character_split_texts)}")

    # Append each chunk to the data list with associated topic
    for chunk in character_split_texts:
    all_data.append({"topic": file_name, "content": chunk})

    ```
**Explanation:**

- A function `data_loader` loads data from PDF files, parses the text, and splits it into chunks.
    - It iterates through each PDF file in the specified directory, extracts text from each page, and joins the text.
    - The text is then split into chunks using a `RecursiveCharacterTextSplitter` with specified separators and chunk size.
    - The resulting chunks are stored in a list (`all_data`) along with the associated topic.


    ### Create Collection
    ```python
    # Get the collection name and Chroma client host from environment variables
    collection_name = config('COMPANY_POLICY_DATA')
    chroma_client_host = config('CHROMA_CLIENT_HOST')

    # Initialize a ChromaDB HTTP client
    chroma_client = chromadb.HttpClient(host=chroma_client_host, port=8000)

    # Try to delete the existing collection (if any)
    try:
    chroma_client.delete_collection(name=collection_name)
    except:
    pass

    # Define distance functions for indexing
    distance_functions = ["l2", "ip", "cosine"]

    # Initialize a SentenceTransformer embedding function
    sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="multi-qa-mpnet-base-dot-v1")

    # Create a new ChromaDB collection with specified metadata
    collection = chroma_client.create_collection(
    name=collection_name, embedding_function=sentence_transformer_ef, metadata={"hnsw:space": distance_functions[0]})
    ```
    **Explanation:**

    - Initializes a ChromaDB collection by creating a connection to ChromaDB, deleting any existing collection with the same name, and creating a new one.
    - Defines a collection name, Chroma client host, and distance functions for indexing.
    - The `SentenceTransformerEmbeddingFunction` is used for embedding text, and a new collection is created with specified metadata.

    ### Store Embeddings into Vector Database
    ```python
    # Initialize lists to store documents, metadata, and document IDs
    documents = []
    metadata = []
    ids = []
    n = 1

    # Iterate through each data chunk and extract content and topic
    for data in all_data:
    documents.append(data['content'])
    metadata.append({"topic": data['topic']})
    ids.append(f"{data['topic']}_{n}")
    n += 1

    # Add documents, metadata, and IDs to the ChromaDB collection
    collection.add(
    documents=documents,
    metadatas=metadata,
    ids=ids
    )

    # Return True indicating successful data loading
    return True

    ```
    **Explanation:**

- Prepares data for storage by iterating through each chunk and extracting content, topic, and unique IDs.
    - It then adds the documents, metadata, and IDs to the ChromaDB collection using the `collection.add` method.
    - Finally, the function returns `True` to indicate successful data loading into the database.


    Full code:
```python
    import chromadb
    import os
    from decouple import config
    from pypdf import PdfReader
    from chromadb.utils import embedding_functions
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    def data_loader():
    data_dir = 'data/policy'
    if not os.path.exists(data_dir) or not os.path.isdir(data_dir):
    if file.endswith(".pdf"):
    file_name = file.split(".")[0]
    pdf_url = os.path.join(data_dir, file)
    pdf_reader = PdfReader(pdf_url)
    pdf_texts = [" ".join(page.extract_text().replace(" ", "").split('\n')) for page in pdf_reader.pages]
    pdf_texts = [text for text in pdf_texts if text]
    character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=0
    )
    character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))
    print(f"\nTotal chunks: {len(character_split_texts)}")
    for chunk in character_split_texts:
    all_data.append({"topic": file_name, "content": chunk})

    collection_name = config('COMPANY_POLICY_DATA')
    chroma_client_host = config('CHROMA_CLIENT_HOST')
    chroma_client.delete_collection(name=collection_name)
    except:
    pass
    distance_functions = ["l2", "ip", "cosine"]
    sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="multi-qa-mpnet-base-dot-v1")
    collection = chroma_client.create_collection(
    name=collection_name, embedding_function=sentence_transformer_ef, metadata={"hnsw:space": distance_functions[0]})
    documents = []
    metadata = []
    ids = []
    n=1
    for data in all_data:
    documents.append(data['content'])
    metadata.append({"topic": data['topic']})
    ids.append(data['page_number'])
    ids.append(f"{data['topic']}_{n}")
    n+=1
    collection.add(
    documents=documents,
    metadatas=metadata,
    ids=ids
    )
    return True
    data_loader()
    ```

  17. gehlotabhishek revised this gist Jan 10, 2024. 3 changed files with 0 additions and 97 deletions.
    84 changes: 0 additions & 84 deletions policy_embeddings.tsv
    Binary file removed policy_embeddings_metadata.tsv
    13 changes: 0 additions & 13 deletions template_projector_config.json
    Original file line number Diff line number Diff line change
    @@ -1,13 +0,0 @@
    {
    "embeddings": [
    {
    "tensorName": "Policies",
    "tensorShape": [
    84,
    768
    ],
    "tensorPath": "https://gist.github.com/gehlotabhishek/0e1facce1a3a48db23cdcab061dd63f8#file-policy_embeddings-tsv",
    "metadataPath": "https://gist.github.com/gehlotabhishek/0e1facce1a3a48db23cdcab061dd63f8#file-policy_embeddings_metadata-tsv"
    }
    ]
    }
  18. gehlotabhishek revised this gist Jan 10, 2024. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions template_projector_config.json
    Original file line number Diff line number Diff line change
    @@ -6,8 +6,8 @@
    84,
    768
    ],
    "tensorPath": "https://drive.google.com/file/d/1Y_wYn87FvlHhyxL6LKtatJQO2C8vuLMJ/view",
    "metadataPath": "https://drive.google.com/file/d/1EDUgbeWjjJQDdCga3JZiEmdelEWMKk7q/view"
    "tensorPath": "https://gist.github.com/gehlotabhishek/0e1facce1a3a48db23cdcab061dd63f8#file-policy_embeddings-tsv",
    "metadataPath": "https://gist.github.com/gehlotabhishek/0e1facce1a3a48db23cdcab061dd63f8#file-policy_embeddings_metadata-tsv"
    }
    ]
    }
  19. gehlotabhishek revised this gist Jan 10, 2024. 3 changed files with 97 additions and 0 deletions.
    84 changes: 84 additions & 0 deletions policy_embeddings.tsv
    Binary file added policy_embeddings_metadata.tsv
    13 changes: 13 additions & 0 deletions template_projector_config.json
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,13 @@
    {
    "embeddings": [
    {
    "tensorName": "Policies",
    "tensorShape": [
    84,
    768
    ],
    "tensorPath": "https://drive.google.com/file/d/1Y_wYn87FvlHhyxL6LKtatJQO2C8vuLMJ/view",
    "metadataPath": "https://drive.google.com/file/d/1EDUgbeWjjJQDdCga3JZiEmdelEWMKk7q/view"
    }
    ]
    }
  20. gehlotabhishek revised this gist Jan 10, 2024. 2 changed files with 0 additions and 84 deletions.
    Binary file removed embeddings_metadata.tsv
    84 changes: 0 additions & 84 deletions policy_embeddings.tsv
  21. gehlotabhishek revised this gist Jan 10, 2024. 2 changed files with 84 additions and 0 deletions.
    Binary file added embeddings_metadata.tsv
    84 changes: 84 additions & 0 deletions policy_embeddings.tsv
  22. gehlotabhishek revised this gist Jan 10, 2024. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion Aubai Backend Implementation.md
    Original file line number Diff line number Diff line change
    @@ -1,5 +1,5 @@

    # Aubie Implementation Concept
    # Aubai Implementation Concepts

    ## Aubai is an AI chatbot assistant which is design to answer questions realted to company's information and policies.

  23. gehlotabhishek renamed this gist Jan 3, 2024. 1 changed file with 2 additions and 2 deletions.
    Original file line number Diff line number Diff line change
    @@ -1,9 +1,9 @@

    # Concepts for Implementation of Aubai
    # Aubie Implementation Concept

    ## Aubai is an AI chatbot assistant which is design to answer questions realted to company's information and policies.

    ### Key concepts for the approach and implemenation for Aubai:
    ### Key concepts for the approach and implemenation of Aubai:
    There are some key components you have to be clear about while building a chatbot like Aubai.
    1. Large Language model(LLM)
    2. Conversational Memory
  24. gehlotabhishek renamed this gist Jan 3, 2024. 1 changed file with 0 additions and 0 deletions.
  25. gehlotabhishek renamed this gist Jan 3, 2024. 1 changed file with 0 additions and 0 deletions.
    File renamed without changes.
  26. gehlotabhishek created this gist Jan 3, 2024.
    235 changes: 235 additions & 0 deletions aubai.md
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,235 @@

    # Concepts for Implementation of Aubai

## Aubai is an AI chatbot assistant designed to answer questions related to the company's information and policies.

### Key concepts for the approach and implementation of Aubai:
There are some key components you have to be clear about while building a chatbot like Aubai.
1. Large Language Model (LLM)
    2. Conversational Memory
    3. Prompt Engineering
    4. Function Calling
    5. Vector Embeddings
    6. Vector Database

## Large Language Model (LLM)
There are many different LLMs available on the market that are fine-tuned for generating chat-like responses; among the best of them are OpenAI's GPT-3.5 Turbo and GPT-4 Turbo (at the time of writing).
There are multiple reasons to select these models. They are among the biggest models currently available, they are easily accessible through APIs hosted by OpenAI, and apart from generating chat-like responses they also provide out-of-the-box functionality such as function calling and JSON-mode responses.

    ### Concepts:

    #### Context Window
The context window is the maximum amount of text the model can consider at any one time when generating a response. The memory of any LLM-based system is also limited by the size of the context window; you will get the idea when you read the memory concept below.
The context window is very large for GPT models: for example, GPT-3.5-Turbo has 4,096 tokens, while GPT-4-Turbo has 128,000 tokens.

    #### Tokens
Tokens can be viewed as fragments of words. When the API processes prompts, it breaks the input down into tokens. Token boundaries do not necessarily match where words start or end; tokens can include spaces and sub-word pieces. To learn how to count tokens, please refer to https://platform.openai.com/tokenizer.
The size of the context window is especially important when your use case requires a long, continuous interaction with the LLM.
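As a rough illustration, token counts can be estimated with a simple heuristic. The 4-characters-per-token ratio below is a common approximation for English, not a real tokenizer; use the tokenizer page above for exact counts:

```python
def estimate_tokens(text: str) -> int:
    # Rough rule of thumb for English text: ~4 characters per token.
    # For exact counts, use a real tokenizer.
    return max(1, len(text) // 4)

question = "How many leave days am I entitled to per year?"
print(estimate_tokens(question))
```

A 4,096-token window therefore holds very roughly 16,000 characters of English text, shared between the prompt, the conversation history, and the model's reply.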


    ## Conversational Memory
Conversational memory helps an LLM-based chatbot respond as if it were having a real conversation. It lets the chatbot remember what was said before, so each question isn't treated separately and past turns are considered when composing replies. From a programming perspective, the developer has to keep the history of the ongoing interaction with the LLM and feed it back with each new input; without it, every query would be treated as an entirely independent input with no regard for past interactions.
Example of conversational memory using Aubai's Brain class:
    ```
from brain import Brain

# Example conversation history
conversation_history = [
    {"role": "system", "content": "LeaveBot"},
    {"role": "user", "content": "What is the leave policy of the company?"},
]

# Instantiate the Brain
leave_bot = Brain()

# User interacts with the chatbot
user_query_1 = "How many leave days am I entitled to per year?"
user_query_2 = "Can I take a leave for more than 10 days consecutively?"

# Initial interaction (llm_call streams its answer, so join the chunks)
full_response_1 = " ".join(response for response in leave_bot.llm_call(chain=conversation_history, prompt=user_query_1))
print("Full Response 1:", full_response_1)

# Extend the conversation: record both the user's turn and the assistant's reply
conversation_history.append({"role": "user", "content": user_query_1})
conversation_history.append({"role": "assistant", "content": full_response_1})

# The next query is answered with the full history as context
full_response_2 = " ".join(response for response in leave_bot.llm_call(chain=conversation_history, prompt=user_query_2))
print("Full Response 2:", full_response_2)
    ```
Check out the implementation for more details on GitHub: https://github.com/gehlotabhishek/aubai/blob/aubai-dev/code/brain.py
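Because the history itself must fit inside the context window, long conversations eventually need trimming. A minimal sketch of that bookkeeping, using word counts as a stand-in for a real tokenizer:

```python
def trim_history(messages, max_words):
    """Drop the oldest non-system messages until the history fits the budget."""
    trimmed = list(messages)
    while sum(len(m["content"].split()) for m in trimmed) > max_words:
        if len(trimmed) <= 1:
            break  # only the system message is left; nothing more to drop
        trimmed.pop(1)  # index 0 holds the system message, so drop the next-oldest turn
    return trimmed

history = [
    {"role": "system", "content": "You are LeaveBot."},
    {"role": "user", "content": "What is the leave policy of the company?"},
    {"role": "assistant", "content": "You are entitled to 15 leave days per year."},
    {"role": "user", "content": "Can I take 13 days in one go?"},
]
short = trim_history(history, max_words=25)
print([m["role"] for m in short])
```

The original list is left untouched; only the copy passed to the model is shortened.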



    ## Prompt Engineering
Prompt engineering is an emerging field that focuses on creating and fine-tuning prompts to make language models (LMs) more effective across various applications and research areas. Prompt engineering skills also help in understanding what large language models (LLMs) can and cannot do.

At its core, it is about arranging text in a way that lets an AI model understand what a user wants to accomplish. A prompt is the task, described in plain language, that the AI needs to do. Prompt engineering is a must for enhancing the performance of LLMs on tasks like question answering and chatbots.
Developers leverage prompt engineering to craft reliable techniques for communicating effectively with LLMs and other tools.

One example is `Chain of Thought` (CoT) prompting.
    Prompt:
    ```
    The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
    A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False.
    The odd numbers in this group add up to an even number: 17, 10, 19, 4, 8, 12, 24.
    A: Adding all the odd numbers (17, 19) gives 36. The answer is True.
    The odd numbers in this group add up to an even number: 16, 11, 14, 4, 8, 13, 24.
    A: Adding all the odd numbers (11, 13) gives 24. The answer is True.
    The odd numbers in this group add up to an even number: 17, 9, 10, 12, 13, 4, 2.
    A: Adding all the odd numbers (17, 9, 13) gives 39. The answer is False.
    The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
    A:
    ```
    Output:
    ```
    Adding all the odd numbers (15, 5, 13, 7, 1) gives 41. The answer is False.
    ```

You can achieve the above similarly with Zero-shot CoT prompting,
    Prompt:
    ```
Question: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
Let's think step by step.
Answer:
    ```
    Output:
    ```
    There are 16 balls in total. Half of the balls are golf balls. That means that there are 8 golf balls. Half of the golf balls are blue. That means that there are 4 blue golf balls.
    ```


To complete a task, an LLM needs to understand the details of that task as precisely as possible, and simply asking in one go may not solve the problem. This is where different prompting techniques come into the picture: adding extra task-related details, or passing a special phrase that tells the LLM to reason differently. A widely used method is the Chain-of-Thought technique: add the prompt 'Think step by step' to your query, and the LLM will solve the given task in a step-by-step manner.
To understand different prompting techniques, please refer to: https://www.promptingguide.ai/

## Function Calling
Function calling is a very powerful approach for connecting an LLM to the external digital world.
In an API request, you can specify functions, and the model will smartly create a JSON object with the necessary arguments for calling one or more of them. OpenAI's API doesn't execute the function; instead, it provides JSON that you can use in your code to make the function call.
    ```mermaid
    sequenceDiagram
    User ->> Aubai: Can I take 13 days leave in one go?
    Aubai-->>Aubai: Analysed the query, need to call a function to get the information related to leave policy.
    Aubai ->> ChromaDB Query Engine: query: can an employee take leaves of more than 10 days?
    ChromaDB Query Engine ->> Vector Database: Get a list of related documents which are most similar and near to the query in vector space
    Vector Database ->> Aubai: <list of related documents which are most similar and near to the query in vector space>
    Aubai -->> Aubai: Curate the answer based on the user's query and database information
    Aubai ->> User: You cannot take 13 days leave in one go.
    ```
Check out the function calling implementation on Aubai's GitHub: https://github.com/gehlotabhishek/aubai/tree/aubai-dev
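The dispatch step can be sketched in isolation. In the sketch below, the tool schema and the `query_policy_db` helper are hypothetical stand-ins for Aubai's real ones, and `model_tool_call` is a hand-written example of the JSON the model would return; real code would take it from the API response:

```python
import json

# Hypothetical tool schema, in the shape OpenAI's function-calling API expects.
tools = [{
    "type": "function",
    "function": {
        "name": "query_policy_db",
        "description": "Search the company policy vector database.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def query_policy_db(query: str) -> str:
    # Stand-in for the real ChromaDB lookup.
    return f"Top policy chunks for: {query!r}"

AVAILABLE_FUNCTIONS = {"query_policy_db": query_policy_db}

# Example of the JSON the model emits; the model never executes anything itself.
model_tool_call = {
    "name": "query_policy_db",
    "arguments": '{"query": "can an employee take leaves of more than 10 days?"}',
}

# Your code parses the arguments and makes the actual call.
fn = AVAILABLE_FUNCTIONS[model_tool_call["name"]]
args = json.loads(model_tool_call["arguments"])
result = fn(**args)
print(result)
```

The result string is then sent back to the model as a tool message so it can curate the final answer for the user.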

    ## Vector Embeddings
Vector embeddings, in the context of Large Language Models (LLMs), are numerical representations of words, phrases, or even entire documents in a multi-dimensional space. These embeddings capture the semantic meaning of the text, allowing the LLM to understand and process language. Each point in the space represents a different word or text snippet, and the distance or angle between points reflects the similarity between their meanings. Embeddings enable the LLM to perform tasks such as classification, translation, and question answering more accurately.

    ![](https://miro.medium.com/v2/resize:fit:700/1*jptD3Rur8gUOftw-XHrezQ.png)

There are different models available for generating vector embeddings from text. Aubai uses SentenceTransformers to convert text into vector embeddings.
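The "distance reflects similarity" idea can be sketched with plain Python. The three toy 3-dimensional vectors below are made up for illustration; real sentence embeddings have hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: "leave" and "vacation" point in similar directions,
# "database" points elsewhere.
leave = [0.9, 0.8, 0.1]
vacation = [0.85, 0.75, 0.2]
database = [0.1, 0.2, 0.95]

print(cosine_similarity(leave, vacation))  # close to 1
print(cosine_similarity(leave, database))  # much smaller
```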

    ## Vector Database
    A vector database is a type of database that indexes and stores vector embeddings for fast retrieval and similarity search, with capabilities like CRUD operations, metadata filtering, and horizontal scaling.
In Aubai we are using ChromaDB, which is an open-source vector database.
    Link: https://www.trychroma.com/
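To make the idea concrete, here is a toy in-memory stand-in for what a vector database does; ChromaDB adds persistence, approximate-nearest-neighbor indexing (HNSW), and metadata filtering on top of this:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

class ToyVectorStore:
    """Minimal in-memory vector store: add embeddings, query by similarity."""

    def __init__(self):
        self.records = []  # (id, embedding, document)

    def add(self, doc_id, embedding, document):
        self.records.append((doc_id, embedding, document))

    def query(self, embedding, n_results=1):
        # Rank all stored records by similarity to the query embedding.
        ranked = sorted(self.records,
                        key=lambda r: cosine_similarity(r[1], embedding),
                        reverse=True)
        return [(r[0], r[2]) for r in ranked[:n_results]]

store = ToyVectorStore()
store.add("leave_1", [0.9, 0.8, 0.1], "Employees get 15 leave days per year.")
store.add("it_1", [0.1, 0.2, 0.95], "Password resets are handled by the IT desk.")

# A query embedding near the "leave" region retrieves the leave document.
print(store.query([0.85, 0.75, 0.2], n_results=1))
```

The toy vectors and document texts are illustrative only; in Aubai the embeddings come from SentenceTransformers and the store is ChromaDB.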

    ## Implementation of vector embeddings
Before embedding, the text data should be divided into batches of chunks. The chunking strategy is highly dependent on the use case; in our case we have to store the policy documents. Let's look at the approach we take for dividing a policy document, converting it into vector embeddings, and storing it in our ChromaDB vector database.
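As background, a minimal fixed-size chunker with overlap looks like this; the sizes are illustrative defaults, not the values Aubai ships with (Aubai chunks per PDF page, as shown below):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100):
    """Split text into fixed-size chunks; consecutive chunks share `overlap` characters."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some shared context
    return chunks

doc = "All employees are entitled to fifteen days of paid leave per calendar year. " * 20
pieces = chunk_text(doc, chunk_size=200, overlap=50)
print(len(pieces), len(pieces[0]))
```

The overlap means a sentence cut at a chunk boundary still appears whole in the neighboring chunk, which helps retrieval quality.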

    ### Loading and Indexing stage:
In the loading stage we have a function named `data_loader()` which loads the PDFs and does the parsing and chunking.
#### Parsing the PDF files and chunking
```
import os

import PyPDF2

def data_loader():
    data_dir = 'data/policy'
    if not os.path.exists(data_dir) or not os.path.isdir(data_dir):
        print("The 'data' directory does not exist in the current working directory.")
        return
    all_data = []
    for file in os.listdir(data_dir):
        if file.endswith(".pdf"):
            file_name = file.split(".")[0]
            pdf_url = os.path.join(data_dir, file)
            # A context manager closes the file even if extraction fails
            with open(pdf_url, 'rb') as pdf_file:
                pdf_reader = PyPDF2.PdfReader(pdf_file)
                for page_num in range(len(pdf_reader.pages)):
                    page = pdf_reader.pages[page_num]
                    page_text = page.extract_text()
                    # Normalize whitespace and join the page's lines: one chunk per page
                    chunk = " ".join(page_text.split())
                    all_data.append({"topic": file_name, "content": chunk,
                                     "page_number": f"{file_name}_{page_num + 1}"})
```
#### Create a collection in the vector database
```
import chromadb
from chromadb.utils import embedding_functions
from decouple import config  # assuming settings are read via python-decouple

collection_name = config('COMPANY_POLICY_DATA')
chroma_client_host = config('CHROMA_CLIENT_HOST')
chroma_client = chromadb.HttpClient(host=chroma_client_host, port=8000)
try:
    # Start fresh: drop any existing collection with this name
    chroma_client.delete_collection(name=collection_name)
except Exception:
    pass
distance_functions = ["l2", "ip", "cosine"]
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="multi-qa-mpnet-base-dot-v1")
collection = chroma_client.create_collection(
    name=collection_name,
    embedding_function=sentence_transformer_ef,
    metadata={"hnsw:space": distance_functions[0]})  # "l2" (Euclidean) distance
```
#### Storing the embeddings in the database
```
documents = []
metadata = []
ids = []
for data in all_data:
    documents.append(data['content'])
    metadata.append({"topic": data['topic']})
    ids.append(data['page_number'])
# ChromaDB embeds the documents with the collection's embedding function on insert
collection.add(documents=documents, metadatas=metadata, ids=ids)
```
    Full code:
```
import os

import PyPDF2
import chromadb
from chromadb.utils import embedding_functions
from decouple import config  # assuming settings are read via python-decouple

def data_loader():
    data_dir = 'data/policy'
    if not os.path.exists(data_dir) or not os.path.isdir(data_dir):
        print("The 'data' directory does not exist in the current working directory.")
        return
    all_data = []
    for file in os.listdir(data_dir):
        if file.endswith(".pdf"):
            file_name = file.split(".")[0]
            pdf_url = os.path.join(data_dir, file)
            with open(pdf_url, 'rb') as pdf_file:
                pdf_reader = PyPDF2.PdfReader(pdf_file)
                for page_num in range(len(pdf_reader.pages)):
                    page = pdf_reader.pages[page_num]
                    page_text = page.extract_text()
                    # Normalize whitespace and join the page's lines: one chunk per page
                    chunk = " ".join(page_text.split())
                    all_data.append({"topic": file_name, "content": chunk,
                                     "page_number": f"{file_name}_{page_num + 1}"})
    collection_name = config('COMPANY_POLICY_DATA')
    chroma_client_host = config('CHROMA_CLIENT_HOST')
    chroma_client = chromadb.HttpClient(host=chroma_client_host, port=8000)
    try:
        chroma_client.delete_collection(name=collection_name)
    except Exception:
        pass
    distance_functions = ["l2", "ip", "cosine"]
    sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="multi-qa-mpnet-base-dot-v1")
    collection = chroma_client.create_collection(
        name=collection_name,
        embedding_function=sentence_transformer_ef,
        metadata={"hnsw:space": distance_functions[0]})
    documents = []
    metadata = []
    ids = []
    for data in all_data:
        documents.append(data['content'])
        metadata.append({"topic": data['topic']})
        ids.append(data['page_number'])
    collection.add(documents=documents, metadatas=metadata, ids=ids)
    return True

data_loader()
```