
Learning AI Poorly: How you might use ChatGPT to query your own pdfs...


(originally on LinkedIn)

ChatGPT is great because you can ask it questions and it will give you a fun answer. Those answers, of course, are based on whatever humongous corpus of data ChatGPT was trained on. It knows nothing of your local cache of PDF files about whatever obscure thing you’re interested in at the moment. How can you ask it questions about that stuff?

ChatGPT does allow fine-tuning, but it’s expensive to retrain every time you find a new file, and to actually do fine-tuning you have to format your data into “a diverse set of demonstration conversations that are similar to the conversations you will ask the model to respond to at inference time in production” - what a bummer.
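(To give you a feel for how much of a bummer, here’s a rough sketch of that demonstration-conversation format. The file name and conversation content are made up, but the JSONL shape matches what OpenAI’s chat fine-tuning expects:)

```python
import json

# One training example = one full demonstration conversation (hypothetical content)
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful go-kart expert."},
            {"role": "user", "content": "What motor should I pick for a kid's kart?"},
            {"role": "assistant", "content": "A 1kW brushless motor at 48V is a common, safe choice."},
        ]
    },
    # ...you'd need many more of these, covering the questions you expect in production
]

# Fine-tuning uploads are JSONL: one example per line
with open("training_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```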

Did you know you could possibly, sort of, fine-tune ChatGPT with just a little bit of prompt engineering?

Prompt Engineering #

Prompt engineering is the process of optimizing the prompts you provide to ChatGPT to… maximize the effectiveness of the output. In other words, asking the same question in different ways can give you wildly different answers. If you practice a little bit (and fiddle with some levers called things like “temperature” and “top_p”), you can get good at getting good answers. Problem is, ain’t nobody got time for that. That’s why something like LangChain was created.
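(Those levers live right on the API call, by the way. Here’s a minimal sketch using the openai Python package — the model name, question, and settings are just placeholders to show where the knobs are:)

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Explain electric go-kart motors to a ten year old."}],
    temperature=0.2,  # lower = more predictable, higher = more creative
    top_p=1.0,        # nucleus sampling: consider only the top probability mass
)
print(response.choices[0].message.content)
```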

LangChain #

LangChain is a framework for developing applications powered by language models. It enables applications that:

  • Are context-aware: connect a language model to sources of context (prompt instructions, few shot examples, content to ground its response in, etc.)

  • Reason: rely on a language model to reason (about how to answer based on provided context, what actions to take, etc.)

“Ok, great… how does that help us query our own PDFs?” I’m glad you asked. LangChain has a feature called retrieval-augmented generation (RAG) that lets you augment ChatGPT’s (or any LLM’s) knowledge with your own data.

RAG #

RAG consists of three things. A “Knowledge Base” - that is, custom information you have provided that is relevant to your needs. A “Retrieval System” that is able to pull information relevant to your question from that custom knowledge base. And finally, an LLM, or Large Language Model, like ChatGPT to answer your questions.

The process starts with the question. Let’s say, “How do I choose a motor for an electric go kart conversion for my kid?” The first step is called “Retrieval”. This step pulls the most relevant data (based on the question you provided) from your own custom knowledge base. Next, all of that information, along with your question, is sent to the LLM (like ChatGPT) in a format that is very much like a prompt that was crafted by someone who has spent years learning how to create good prompts for ChatGPT… That’s right! RAG helps you be a great little Prompt Engineer without putting in any work. Pretty cool, eh?
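Here’s a minimal sketch of that whole loop in LangChain — this assumes the classic langchain package with OpenAI embeddings and a Chroma vector store, with a toy two-note knowledge base standing in for your real data:

```python
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA

# 1. Knowledge base: a couple of made-up notes (your PDFs would go here)
notes = [
    "Brushed DC motors are cheap and simple; fine for a first kart build.",
    "A 1kW brushless motor at 48V is a common pick for a kid's go kart.",
]
db = Chroma.from_texts(notes, OpenAIEmbeddings())

# 2 + 3. Retrieval system + LLM, glued together by a chain: the relevant
# notes get stuffed into the prompt that is sent to ChatGPT
qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(), retriever=db.as_retriever())
print(qa.run("How do I choose a motor for an electric go kart conversion for my kid?"))
```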

(if you’re curious, here’s roughly what a prompt built from a custom knowledge base looks like - this follows LangChain’s default question-answering template, with a placeholder where your retrieved chunks would go)
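“Use the following pieces of context to answer the question at the end. If you don’t know the answer, just say that you don’t know, don’t try to make up an answer.

[…the chunks retrieved from your knowledge base go here…]

Question: How do I choose a motor for an electric go kart conversion for my kid?

Helpful Answer:”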

“Ok, what happens if you have a ton of data in your knowledge base? You can’t just send ChatGPT a ten gig query, can you?”

Of course not! Luckily, the people who made LangChain thought of this and created a thing called a “multi-step chain” that breaks the retrieved context up into manageable chunks and calls ChatGPT multiple times to refine the answer.
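In LangChain terms these show up as “chain types” like map_reduce and refine. A rough sketch, reusing the db vector store from the earlier example:

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# "refine" walks the retrieved chunks one at a time, calling the LLM on each
# chunk to improve its running answer -- many small calls instead of one huge prompt
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    chain_type="refine",          # alternatives include "stuff" and "map_reduce"
    retriever=db.as_retriever(),  # db is the Chroma store from the RAG sketch above
)
```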

“But… doesn’t it cost money to call OpenAI’s ChatGPT a ton of times?” - You’re right… It does…

Optimizations! #

This LangChain tutorial works its way up from simple questions to ones that wind up making thousands of calls to OpenAI. They suggest using vector space search engine techniques… which sounds bananas, but really just means you index your local knowledge base into its own search engine; when you ask a question, you hit that search engine, which returns only a small subset of relevant information, and you pass only that data to ChatGPT. This keeps the prompts small so you don’t spend so much money.
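That indexing step is exactly the vector store from the earlier sketch; the “search engine” part is just a similarity search that returns the top few matches. Roughly:

```python
# Embed the question and pull only the k most similar chunks from the index,
# instead of shipping the entire knowledge base to OpenAI
relevant_chunks = db.similarity_search(
    "How do I choose a motor for an electric go kart conversion for my kid?",
    k=3,  # only the three best matches end up in the prompt
)
for chunk in relevant_chunks:
    print(chunk.page_content)
```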

PDF #

Ok, so the whole point of this was to get very localized information from your own cache of boutique PDFs. How can we do that?

Luckily, LangChain has its own PDF document loader that you can use to fill your knowledge base with your localized information. Once that’s all loaded up, you can use LangChain to craft fantastic prompts.
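A sketch of that loading step — this assumes the classic langchain package’s PyPDFLoader (which needs the pypdf package installed) and a made-up file path:

```python
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Load a local PDF and split it into page-sized documents (path is a placeholder)
pages = PyPDFLoader("pdfs/go_kart_conversion_guide.pdf").load_and_split()

# Index the pages into a vector store -- this is your knowledge base
db = Chroma.from_documents(pages, OpenAIEmbeddings())
```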

Will this leak proprietary information to OpenAI? #

yes.