Train OpenAI GPT with Your Own Data and Unleash Limitless Possibilities!

Empower Your Creativity: Build Your Own Chatbot Like ChatGPT with Accessible Source Code — Unleash Your Imagination and Enjoy the Journey!

Raghunaathan
10 min read · Apr 26, 2023
Photo by Andrew Neel from Pexels: https://www.pexels.com/photo/monitor-screen-with-openai-logo-on-black-background-15863044/

Welcome to an exciting journey of building your very own custom chatbot like ChatGPT! In this blog post, we will explore the step-by-step process of creating OpenAI keys, leveraging ChatGPT to prepare the training data, understanding the essential components of a custom chatbot, and gaining insights into what the program does. We will also see the final result of our creation and discuss potential scopes for improvement. With accessible source code and a fun learning approach, get ready to unlock your creativity and embark on an adventure of building your own chatbot! Let’s dive in!

Develop Your Training Dataset

To prepare the training data for our custom chatbot, I used the power of ChatGPT to generate interview data from various kinds of artists. Different types of artists, including street artists, professional artists, and general artists, were passed in the prompts to capture a diverse range of perspectives and insights. While the amount and quality of the data are crucial factors in training a chatbot, prompt engineering also plays a significant role in improving the output. For the purposes of this blog post, I have generated six interviews as examples, but the potential for expanding and refining the dataset through further prompt engineering is immense. Let's see how to incorporate these interviews into our custom chatbot to create an engaging conversational experience!

Sample prompt and the response returned by ChatGPT

The output generated by ChatGPT can be copied into a plain-text editor such as Notepad and saved as a .txt file. These text files then serve as the trainable data for GPT, processed with the GPTIndex and LangChain tools.
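For example, after saving a few interviews, your data folder might look something like this (the folder and file names are just placeholders; any .txt files will work):

data/
├── street_artist_interview.txt
├── professional_artist_interview.txt
└── general_artist_interview.txt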

What is GPTIndex and Why Do We Use It?

Note: GPTIndex has been renamed to LlamaIndex; both names refer to the same tool.

GPTIndex is a tool for converting raw text data into a format that a language model like GPT (Generative Pre-trained Transformer) can be trained on. It indexes the text data, assigns unique indices to each token, and creates a dataset suitable for training. GPTIndex is popular because it handles large text datasets efficiently, manages memory requirements, and optimizes the training process, ensuring the data is in a structured format that yields high-quality text output. GPTIndex is widely used for the following reasons:

  1. Efficient Data Indexing: GPTIndex assigns unique indices to tokens in the text data, optimizing memory usage during training and enabling efficient management of large datasets.
  2. Streamlined Training Process: GPTIndex structures the data in a way that maximizes the training process, leading to improved model performance and higher-quality text generation.
  3. Memory Optimization: GPTIndex optimizes memory requirements during training, allowing for smoother and faster training of language models.
  4. High-Quality Text Output: By converting raw text data into a structured format, GPTIndex enables the model to generate coherent and contextually relevant text output that closely resembles human-like language.
  5. Scalability: GPTIndex is scalable, making it ideal for handling massive text datasets and large-scale language model training projects.

In a nutshell, GPTIndex is a game-changing tool that streamlines data indexing and training for language models like GPT, resulting in top-notch text generation. With its advanced capabilities, GPTIndex is paving the way for cutting-edge language modeling in natural language processing.

For detailed documentation, visit the llama_index repository.
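To make this concrete, here is a minimal preview of the gpt-index workflow we will build out step by step below. It assumes your OpenAI API key is already set as an environment variable and that a folder named data (a placeholder name) holds your interview text files:

from gpt_index import SimpleDirectoryReader, GPTSimpleVectorIndex

# Read every text file in ./data and build a vector index over the contents
documents = SimpleDirectoryReader("data").load_data()
index = GPTSimpleVectorIndex.from_documents(documents)

# Ask a question against the indexed interviews
print(index.query("What motivates a street artist?"))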

LangChain — The Next-Generation Framework for LLM-Powered Applications!

LangChain is a state-of-the-art framework that is reshaping the field of natural language processing (NLP). Rather than training models itself, it lets you chain together prompts, language models, and data sources into complete applications. Here's why LangChain is gaining momentum in the AI community:

  1. Advanced Text Generation: LangChain drives large language models with well-structured prompts and chains, producing coherent, contextually relevant text that closely resembles human-like language.
  2. Flexible Language Modeling: With LangChain, users can define their own chains and tasks, making it a versatile tool for a wide range of NLP applications, such as text generation, summarization, and translation.
  3. Scalability and Efficiency: LangChain is designed to connect language models to large document collections and external data stores, ensuring applications remain efficient and scalable as datasets grow.
  4. Customization: LangChain allows users to swap in different models, prompt templates, and parameters to suit their specific requirements, providing a tailored solution for individual use cases.
  5. Cutting-Edge Technology: LangChain wraps the latest advancements in NLP, including modern LLM providers, retrieval techniques, and agent-style workflows, behind a consistent interface.

In summary, LangChain is a powerful framework for building LLM-powered applications that offers advanced text generation, flexibility, scalability, and customization on top of cutting-edge technology. It is transforming the field of NLP and enabling AI practitioners to develop sophisticated language applications for a wide range of use cases.

For detailed documentation, visit the LangChain repository.
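In this post we use only a small slice of LangChain: its OpenAI wrapper, which gpt-index accepts as the underlying LLM. Here is a quick illustrative sketch (it assumes your OpenAI API key is already set as an environment variable, and the prompt text is just an example):

from langchain import OpenAI

# Wrap the text-davinci-003 completion model; temperature controls randomness
llm = OpenAI(temperature=0.5, model_name="text-davinci-003")
print(llm("Suggest one interview question for a street artist."))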

Next up is the pivotal step of generating an OpenAI key, as it serves as the driving force behind the entire GPT application. It’s time to take action and obtain your OpenAI key.

  1. Visit the OpenAI website and click on “Sign up”.
  2. You can sign up with a Gmail, Outlook, or other email address.
  3. If you already have an OpenAI account, you can “Log in” from this page.
  4. When you sign up, you will receive a verification email in your inbox. Once you click the link, you will be asked to fill in some basic details, followed by your phone number.
  5. After phone verification, you will be able to access your OpenAI account.

If any of these points are unclear, follow the steps described on this page.

Once you log in, click on “Personal” in the top-right corner of the navigation bar. From the dropdown that appears, select “View API keys”. On the “API keys” page, click on “Create new secret key” and save the API key somewhere safe, as it is shown only once.
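The simplest way to make the key available to the libraries used below is to set it as an environment variable (later in the post we will do the same thing interactively inside the notebook):

import os

# Replace the placeholder with the secret key you just saved
os.environ["OPENAI_API_KEY"] = "sk-..."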

Translating Data for Communicating with GPT!

Now that we have the data and the key in hand, it's time to roll up our sleeves and start building! First, we need to write a function that translates our interview data into a format the GPT model supports and can easily be trained on. We will use the text-davinci-003 model for our GPT.

We use the PromptHelper module to optimize the data indexing process by generating high-quality prompts that effectively guide the model during training. It tailors prompts to the specific task or goal of the language model, improving the effectiveness and efficiency of the training process and making it easier to get good results. For more details, refer to this documentation.

The LLMPredictor is another important module: it wraps the underlying language model (here, OpenAI's text-davinci-003 accessed through LangChain) and handles every call that GPTIndex makes to the model while indexing data and answering queries. It carries the model's configuration, such as the model name, temperature, and maximum output length. For more details, refer to this documentation.

The ServiceContext module is a container that bundles the objects commonly used for configuring every index and query: the LLMPredictor (for configuring the LLM), the PromptHelper (for configuring input size and chunk size), the BaseEmbedding (for configuring the embedding model), and more. For more details, refer to this documentation.
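Wired together, the three modules look roughly like this (a preview using the same values that appear in the full function later in this post):

from gpt_index import PromptHelper, LLMPredictor, ServiceContext
from langchain import OpenAI

# Prompt constraints: input window, reserved output tokens, chunk overlap, chunk size
prompt_helper = PromptHelper(4096, 2000, 20, chunk_size_limit=600)
# The LLM that GPTIndex will call for completions
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0.5, model_name="text-davinci-003", max_tokens=2000))
# Bundle both configurations for use by indices and queries
context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)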

Finally, we save the indexed data to disk in a format that can later be loaded and queried with the OpenAI GPT model.

Refer to the documentation links above to tune the parameters of each module.

Let’s Use Our Custom Chatbot

Now you can ask the chatbot any queries related to the trained dataset.

Sample response generated based on the trained data

Let Us Explore The Code

The first step is to install the necessary libraries. I suggest using Google Colaboratory notebooks or the Anaconda distribution with Jupyter notebooks, so that the basic, popular libraries come pre-installed. In addition, we need to install the GPTIndex and LangChain modules with the pip installer.

! pip install gpt-index
! pip install langchain

The next step is to import the necessary modules and methods:

from gpt_index import SimpleDirectoryReader, GPTListIndex, readers, GPTSimpleVectorIndex, LLMPredictor, PromptHelper, ServiceContext
from langchain import OpenAI
import sys, os
from IPython.display import Markdown, display

Now, let us write the function that transforms the dataset for our GPT model:

def data_constructor(dir):
    # PromptHelper settings: input window, reserved output tokens,
    # chunk overlap, and maximum chunk size
    max_input_size = 4096
    num_outputs = 2000
    max_chunk_overlap = 20
    chunk_size_limit = 600
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap,
                                 chunk_size_limit=chunk_size_limit)
    # Wrap the OpenAI completion model that GPTIndex will call
    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0.5,
                                            model_name="text-davinci-003",
                                            max_tokens=num_outputs))
    # Read every text file in the given directory
    documents = SimpleDirectoryReader(dir).load_data()
    # Bundle the configuration and build the vector index
    context = ServiceContext.from_defaults(llm_predictor=llm_predictor,
                                           prompt_helper=prompt_helper)
    index = GPTSimpleVectorIndex.from_documents(documents, service_context=context)
    # Persist the index so it can be reloaded without re-processing the data
    index.save_to_disk('index.json')
    return index

PromptHelper Parameters

  • max_input_size: the maximum size of input text that the PromptHelper module will process. It sets an upper limit on the length of input used to generate prompts for the language model; input exceeding this size may be truncated or omitted during prompt generation.
  • num_outputs: the number of tokens reserved for the model's generated answer. Because these tokens are set aside out of max_input_size, with our settings roughly 4096 - 2000 = 2096 tokens remain for the context chunks.
  • max_chunk_overlap: controls how much overlap can exist between adjacent chunks, avoiding redundancy or repetition in the prompts. By tuning this value, users influence the diversity and coherence of the prompts.
  • chunk_size_limit: the maximum size of the chunks into which a document is divided during processing. This keeps each prompt fed to the language model at a manageable size, so it does not exceed the model's processing capabilities or memory limitations.

LLMPredictor Parameters

  • temperature: controls the randomness or creativity of the generated text. A higher value (e.g., 1.0) yields more randomness and diversity, as the model makes more random choices from the probability distribution. A lower value (e.g., 0.5) makes the output more focused and deterministic, as the model tends to choose the most probable next token. See the sketch after this list for a comparison.
  • model_name: the pre-trained language model used for prompt completion. You can find all the available models here. Select the model based on your use case and budget.
  • max_tokens: limits the length of the generated text to a certain number of tokens, ensuring the output does not exceed a given length constraint. Select the value according to your needs.
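To get a feel for temperature, you can compare two wrappers on the same prompt (an illustrative sketch; the prompt text is just an example):

from langchain import OpenAI

# Low temperature: focused, nearly deterministic answers
factual_llm = OpenAI(temperature=0.0, model_name="text-davinci-003")
# High temperature: more varied, creative answers
creative_llm = OpenAI(temperature=1.0, model_name="text-davinci-003")

prompt = "Describe street art in one sentence."
print(factual_llm(prompt))
print(creative_llm(prompt))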

SimpleDirectoryReader

  • SimpleDirectoryReader offers efficient and streamlined functionality for reading text data from files, performing data pre-processing tasks such as tokenization, and managing the data ingestion process.
  • Pass the folder path as the argument to the loader; the reader will read the text data from every file in that folder.

ServiceContext

  • ServiceContext includes information such as API keys, data sources, model configurations, and other runtime settings required for the GPTIndex service to function properly. We pass the PromptHelper and LLMPredictor configurations as its parameters.
  • It acts as a container or holder for relevant information and settings that are used by the GPTIndex system during its operation.

GPTSimpleVectorIndex

  • It provides functionality for indexing and managing vector representations of prompts or text inputs in the GPT training data.
  • It returns a gpt_index object that contains the vector indices.
  • The index can be saved in a format of our choice; here I am saving the output in JSON format.

Now copy and paste your saved OpenAI key, which will be used by our model:

os.environ["OPENAI_API_KEY"] = input("Paste your OpenAI key here and hit enter ")

Now call the data_constructor function and pass the path to the directory where you saved your input text files (in my case, the text files containing the artist interviews). This will generate a file called index.json in the same directory as the notebook.
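For example, if the interviews live in a folder named data next to the notebook (the folder name is just a placeholder for wherever you saved your files):

# Build and persist the index from the files in ./data
index = data_constructor("data")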

Now that we have the indexed data available, let us build the function where we can communicate with our chatbot.

def ask_ai():
    # Load the previously built index from disk
    index = GPTSimpleVectorIndex.load_from_disk('index.json')
    # Keep prompting until the notebook cell is interrupted
    while True:
        query = input("What do you want to ask? ")
        response = index.query(query)
        # Render the answer in bold using IPython's Markdown display
        display(Markdown(f"Response: <b>{response.response}</b>"))

# call the function
ask_ai()

In the above function, we load the processed, indexed dataset and use it to respond to our queries. The indexed dataset enhances the chatbot by grounding its answers in the custom data we provided. We use the IPython display module to interact with our chatbot.

In conclusion, building a custom ChatGPT with GPT-3.5 and GPTIndex can be a powerful way to create dynamic and interactive conversational experiences. With the right prompts, prompt engineering, and an understanding of the GPTIndex workflow, you can create chatbots tailored to your specific needs and applications. From brainstorming ideas, generating creative content, and providing customer support to enhancing user engagement, the possibilities are endless. I hope this step-by-step tutorial, along with your own experimentation with different prompts and settings, enables you to unlock the full potential of LLMs and create unique conversational experiences. So, get started and unleash your creativity with your own custom ChatGPT!
