AI FAQs - GenAI, LLMs, and VectorDB
At Capacity, we are committed to helping teams do their best work. Since our founding in 2017, we have focused on using artificial intelligence (AI) to empower people to work more efficiently and better serve customers. As the AI field rapidly evolves, Capacity will remain grounded in our company values and stay true to our established guiding principles for the use of AI.
We have created this FAQ to help our customers and potential customers understand some of the nuances as to how we use AI in the Capacity platform.
What’s an LLM?
An LLM, or Large Language Model, is an AI model that can generate text (and, in some multimodal variants, images). Technically, an LLM is a very large set of numbers that reflects the relationships between the items found in text and images, such as letters and words, or the pixels in an image. The higher the number, the stronger the relationship between the items, and the more likely the LLM is to use that relationship when creating a response.
LLMs build these relationships by looking at enormous datasets, such as all of Wikipedia or all the images on Facebook. They learn that the word “pizza” frequently appears alongside words like “food” and “good.” They learn that images of horses often contain brown with black at the very bottom and sometimes white near the top (i.e., around the nose). When you ask an LLM to write a poem, for example, it has already looked at thousands of poems, and at comments people have made about poems, and it uses these relationships to generate your poem.
The reason LLMs are so good at these types of activities is that the datasets are so large and the LLM can store billions and billions of relationships. LLMs are also designed to traverse these relationships very quickly, and in very intelligent ways, on very powerful hardware.
What is a vector database?
A vector database is simply a database that is optimized to store a large number of vectors and perform calculations on them very quickly. Think of a vector as a line drawn from 0,0,0 to any point on a graph. One thing you can do on a graph is measure the distance between two vectors to see how near they are to each other. Vector databases are very fast at measuring these distances, especially on large amounts of data.
Institutions like Stanford figured out long ago that you could place words like “cat” and “kitten” near each other on a graph and then use the distance between the two points to test for similarity. “Kitten” might sit near “cat,” whereas “bulb” might be farther away. This proximity is often referred to as semantic similarity: the closer two words sit on the graph, the more similar their meanings. To create a vector for multiple words, you simply add the vectors of the individual words together. You can convert any word, phrase, paragraph, or document into a vector using this approach, and then calculate the distance between two vectors to see how similar the two pieces of text really are.
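As a rough illustration, here is a minimal sketch of measuring semantic similarity with cosine similarity. The three-dimensional vectors are made up for the example; real word embeddings have hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only.
vectors = {
    "cat":    [0.90, 0.80, 0.10],
    "kitten": [0.85, 0.75, 0.20],
    "bulb":   [0.10, 0.20, 0.90],
}

def cosine_similarity(a, b):
    """Higher values mean the two vectors point in more similar directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(vectors["cat"], vectors["kitten"]))  # close to 1.0 (similar)
print(cosine_similarity(vectors["cat"], vectors["bulb"]))    # noticeably lower (dissimilar)
```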
For example, when we perform a search, we convert the search request into a vector and then use semantic similarity to identify all the documents similar to that request, i.e., the documents whose vectors are near the vector of the request. The vector database stores both the vector of the document and a link to the original document, which allows us to retrieve the document once we determine the closest vector.
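Below is a toy sketch of that retrieval flow. The embed() function is a deliberately simplistic stand-in (it just counts letters) for a real embedding model, the in-memory list stands in for a real vector database, and the links are invented; none of this reflects Capacity’s actual implementation.

```python
from math import sqrt

def embed(text):
    """Stand-in embedding: counts the letters a-z. Real systems use a trained model."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def distance(a, b):
    """Euclidean distance between two vectors; smaller means more similar."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy "vector database": each record stores a vector plus a link to the original document.
records = [
    {"vector": embed("Understanding the structure of a 30 year fixed"), "link": "/kb/30-year-fixed"},
    {"vector": embed("How to submit an expense report"), "link": "/kb/expense-reports"},
]

# Convert the search request into a vector, then return the link of the nearest record.
query_vector = embed("loan payments")
closest = min(records, key=lambda r: distance(r["vector"], query_vector))
print(closest["link"])
```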
There are many blog posts and articles about exactly how words are converted to points on a graph, and therefore we are not covering that here.
What are Natural Language Understanding (NLU) and Natural Language Processing (NLP), and which do you use?
Capacity started out using leading-edge NLP techniques. These techniques take a set of words from a user and map it to another set of words in our database, which then allows us to formulate the best response. NLP encompasses a variety of techniques, such as handling misspellings (“aquire” vs. “acquire”) and matching phrases that are close (“loan payment” and “mortgage payment”).
NLU techniques can match against the entire context of words, phrases, paragraphs, and documents to determine what data should be used in a response. With its better understanding of language, NLU can match something like “loan payments” to an article titled “Understanding the structure of a 30 year fixed.” NLU can figure out that “30 year fixed” refers to some type of mortgage or loan, and that “structure” is likely about “loan payments,” even though the word “structure” is not similar to “loan” or “payments.” We continue to use parts of our NLP technology to enhance the success of our newer NLU-based approach.
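As a simplified illustration of the difference, the sketch below uses Python’s difflib for NLP-style fuzzy matching; embedding-based NLU matching works like the vector examples earlier. The phrase list is invented for the example and is not Capacity’s actual data.

```python
import difflib

# NLP-style matching (simplified): catch misspellings and near-identical phrases
# with plain string similarity. It has no notion of meaning, so "loan payments"
# would never match "structure of a 30 year fixed" this way; that is where
# embedding-based NLU matching (see the vector examples above) comes in.
known_phrases = ["acquire", "loan payment", "mortgage payment"]

def closest_phrase(user_text):
    matches = difflib.get_close_matches(user_text, known_phrases, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(closest_phrase("aquire"))         # -> "acquire" (misspelling handled)
print(closest_phrase("loan payments"))  # -> "loan payment" (close phrase matched)
```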
Why do LLMs hallucinate?
When people ask an LLM for a response, they provide a description of what they are looking for, referred to as the “prompt.” A simple prompt tells the LLM what to do, and may additionally tell the LLM what information to consider. For example, a simple prompt such as “write a poem from scratch” would cause the LLM to draw on all the relationships it stored from the large datasets it learned from and create the poem.
People can also use a more specific prompt, such as “summarize this meeting recording,” along with information they provide (i.e., the recording). The LLM follows the same process of creating a response based on instructions, but it looks instead at the data provided. The LLM still uses what it learned from larger datasets to know “how” to complete the task. So, as it summarizes, it may add external material, like a good introduction or conclusion. Sometimes, however, that external material does not improve the result, or includes something not actually represented in the provided information. This is called a hallucination.
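As an illustration, here are the two kinds of prompt described above, written out as plain strings. The wording and the transcript are invented for the example and are not Capacity’s actual prompts.

```python
# A "simple" prompt: the LLM relies entirely on what it learned during training.
simple_prompt = "Write a poem from scratch."

# A prompt that also supplies the information to consider: the LLM is asked to
# work from the provided data, though it still uses its training to know "how".
meeting_transcript = (
    "Alice: The launch moves to the first week of June.\n"
    "Bob: We still need two more reviewers for the security audit."
)
grounded_prompt = (
    "Summarize the meeting transcript below.\n\n"
    f"Transcript:\n{meeting_transcript}"
)

print(simple_prompt)
print(grounded_prompt)
```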
Companies like Capacity do their best to instruct the LLM what to do, including telling the LLMs we have selected what information to use in the response so that the model does not introduce inaccurate information of its own. Progress in quelling hallucinations has been tremendous in the last year, but LLMs are still evolving, and so are the techniques for telling an LLM how to respond to prompts.
If it’s true that LLMs hallucinate, how can we trust the information coming from an LLM?
The process Capacity uses with our LLMs, called Retrieval Augmented Generation (RAG), is specifically designed to compel the LLM to consider only the provided information. We provide the information and then ask the LLM to complete a task using only what we provided. From what we have seen, this method works largely as expected, and we feel it creates better results than the generalized search mechanisms in use today, which can produce links to irrelevant content. We will continue to monitor and improve these results.
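Below is a minimal sketch of the “generation” half of RAG, assuming retrieval has already returned the nearest passages (as in the vector-database example above). The instruction wording and the passage are invented for the example and do not reflect Capacity’s actual prompts.

```python
def build_rag_prompt(question, retrieved_passages):
    """Build a prompt that constrains the LLM to the retrieved information only."""
    context = "\n".join(f"- {p}" for p in retrieved_passages)
    return (
        "Answer the question using ONLY the information provided below. "
        "If the answer is not in the information, say you do not know.\n\n"
        f"Information:\n{context}\n\n"
        f"Question: {question}"
    )

passages = ["Payments on a 30 year fixed mortgage stay the same each month."]
print(build_rag_prompt("How are my loan payments structured?", passages))
```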
How do you keep our data separate from other customers in your vector database?
We use the same technique that we use with SQL data. Every record in the vector database includes unique identifiers that link the data back to the account and org, which allows us to verify whether the data should be used with that account. This practice is known as “logical separation” and is very common in multi-tenant platforms (such as AWS) where all customer data is stored in one or more large databases.
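To illustrate, here is a minimal sketch of what logical separation can look like when querying vectors. The field names (account_id, org_id) and the query shape are assumptions made for the example, not Capacity’s actual schema.

```python
from math import sqrt

def query_vectors(db_records, query_vector, account_id, org_id, top_k=5):
    """Rank only the records tagged with the caller's account and org."""
    def distance(a, b):
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Logical separation: records belonging to other accounts/orgs are never considered.
    tenant_records = [
        r for r in db_records
        if r["account_id"] == account_id and r["org_id"] == org_id
    ]
    ranked = sorted(tenant_records, key=lambda r: distance(r["vector"], query_vector))
    return ranked[:top_k]
```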
How do you make sure our data isn’t incorporated in or used to improve or train a LLM?
We have signed agreements (e.g., BAAs) with third-party LLM providers, such as OpenAI, that stipulate they cannot use our data to enhance their LLMs, nor can they retain the data. This is governed by data regulations related to privacy and security. For LLMs that we host, we control what data can be used to enhance an LLM, and we can say definitively that we do not use your data to improve our hosted LLMs.
How do you make sure our data isn’t used by AI for some other customer or that their data won’t show up for our users?
As the previous answers describe, we provide the data to the LLM when formulating the prompt. Because your customer data is logically separated by Capacity to its account and org, it will not be included in a prompt built for another customer’s user query. Similarly, another customer’s data would not be available for a prompt built from your user’s query.