Vector Databases - What are they? Why do we need them?

A comprehensive guide to understanding vector databases, their importance in modern AI systems, and how they provide long-term memory for Large Language Models.

Before we jump into what vector databases are, we need to understand what a vector is. In mathematics and computer science, a vector quantity is defined as an entity that has both a magnitude and a direction, for example velocity, acceleration and force, or even measures like employee productivity and revenue growth.

What is a Vector?

The counterparts of vector quantities are called scalar quantities. They have a magnitude but no direction, and they include things like mass, time, distance, speed and salary. The databases we have been accustomed to usually store and process these scalar values, alongside plain fields such as an address.

What is a Vector Database?

A vector database is a database that stores data in the form of high-dimensional vectors. These vectors are mathematical representations of features or attributes, and the number of dimensions they contain varies with the complexity and level of detail of the data.

Vectors are usually represented as lists of numbers and can be manipulated using linear algebra. Their dimensionality can range from a couple of dimensions to thousands or even millions.
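To make the "list of numbers" idea concrete, here is a minimal sketch in plain Python. The vectors `v` and `w` and the helper names are made up for illustration; real systems would normally use a numerical library rather than hand-written loops.

```python
import math

# A vector is an ordered list of numbers; its dimensionality is the list length.
# These two vectors are invented for illustration.
v = [3.0, 4.0]
w = [1.0, 2.0]


def add(a, b):
    """Element-wise sum of two vectors of equal length."""
    return [x + y for x, y in zip(a, b)]


def scale(a, factor):
    """Multiply every component of a vector by a scalar."""
    return [factor * x for x in a]


def dot(a, b):
    """Dot product: the sum of the pairwise products of the components."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Magnitude (Euclidean length) of a vector."""
    return math.sqrt(dot(a, a))


print(len(v))     # dimensionality of v
print(add(v, w))  # element-wise sum
print(norm(v))    # magnitude of v
```

The same operations work unchanged on vectors with thousands of dimensions; only the list length changes.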

A vector database allows us to store, query and manipulate such data. One thing to note is that queries are approximate rather than exact: instead of returning only the records that match exactly, they return the records most similar to the query.
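The difference between an exact lookup and a similarity query can be sketched as follows. The record names and their 2-dimensional vectors are hypothetical, and the brute-force scan stands in for the indexing a real vector database would use.

```python
import math

# Hypothetical 2-dimensional vectors, invented for illustration only.
records = {
    "cat": [0.90, 0.10],
    "kitten": [0.85, 0.20],
    "car": [0.10, 0.95],
}


def exact_lookup(key):
    """Traditional lookup: returns a record only if the key matches exactly."""
    return records.get(key)


def most_similar(query_vector, k=2):
    """Similarity lookup: returns the k record names closest to the query."""
    ranked = sorted(
        records,
        key=lambda name: math.dist(records[name], query_vector),
    )
    return ranked[:k]


print(exact_lookup("feline"))      # no exact match exists for this key
print(most_similar([0.88, 0.12]))  # closest records, nearest first
```

An exact lookup either finds the key or finds nothing, whereas the similarity query always returns the closest candidates, ordered by distance.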

Embeddings are numerical representations, in the form of vectors or arrays, that capture the meaning and contextual information of the tokens processed and generated by a model. Embeddings help the model and the database gain a semantic and syntactic understanding of the tokens.

Vectors are typically generated by applying a transformation, or embedding function, to raw data such as text, images, audio, video and other types of information. The embedding function can be a machine learning model, a word embedding, a feature extraction algorithm or even a large language model (LLM). Once the embedding has been applied, the resulting vectors are stored in the database. When a query arrives, the same embedding function is applied to it, and results are returned according to their distance or similarity to the query in vector space.
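The embed, store and query flow described above can be sketched with a toy in-memory store. Everything here is an assumption made for illustration: `ToyVectorStore` is a hypothetical class, the sample documents are invented, and the word-count embedding is a crude stand-in for a real embedding model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store; not the API of any real database."""

    def __init__(self, documents):
        # One dimension per distinct word seen in the stored documents.
        words = sorted({w for doc in documents for w in tokenize(doc)})
        self.index = {word: i for i, word in enumerate(words)}
        self.documents = list(documents)
        # Embed every document once, at insertion time.
        self.vectors = [self.embed(doc) for doc in self.documents]

    def embed(self, text):
        """Toy embedding: word counts over the store's vocabulary."""
        vector = [0.0] * len(self.index)
        for word, count in Counter(tokenize(text)).items():
            if word in self.index:
                vector[self.index[word]] = float(count)
        return vector

    def query(self, text, k=1):
        """Embed the query with the same function and rank by similarity."""
        q = self.embed(text)
        ranked = sorted(
            zip(self.documents, self.vectors),
            key=lambda pair: cosine(q, pair[1]),
            reverse=True,
        )
        return [doc for doc, _ in ranked[:k]]


docs = [
    "Reset your password from the account settings page",
    "Our office is closed on public holidays",
    "Invoices are emailed at the start of each month",
]
store = ToyVectorStore(docs)
print(store.query("How do I reset my password?"))
```

The point to notice is that the query goes through the same `embed` function as the stored documents, so both live in the same vector space and can be compared directly. A word-count embedding only matches shared words; a learned embedding model would also match text with similar meaning.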

Figure: Vector embeddings diagram

The similarity measure can be based on various metrics, such as cosine similarity, Euclidean distance, Manhattan distance, Hamming distance, Jaccard similarity and Minkowski distance.
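For reference, the metrics named above can be written out in a few lines each. This is a plain-Python sketch using their standard textbook definitions; the function names and sample vectors are chosen here for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def minkowski_distance(a, b, p):
    """General form: p=1 gives Manhattan distance, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return minkowski_distance(a, b, 2)


def manhattan_distance(a, b):
    """Sum of the absolute differences along each dimension."""
    return minkowski_distance(a, b, 1)


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def jaccard_similarity(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    return len(set_a & set_b) / len(set_a | set_b)


a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]
print(euclidean_distance(a, b))
print(manhattan_distance(a, b))
print(cosine_similarity(a, b))
print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))
print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}))
```

Cosine similarity compares direction only and ignores magnitude, while the Minkowski family measures how far apart two points are. Hamming and Jaccard are typically used for binary vectors and sets.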

Standalone similarity search libraries also exist, such as Facebook AI Similarity Search (FAISS), but on their own they lack the scalability, security and data management features that vector databases provide.

Why do we need Vector Databases?

Vector databases make it easy to store, index and retrieve multi-dimensional vector data. This data can include features and attributes representing the characteristics of text, images, geospatial data, audio or video, or any other type of data that can be represented numerically, such as genomics data or scientific simulations.

Vector databases can also store machine learning models, represented as vectors of weights and biases.

The stored vector data can be queried to perform similarity search, that is, nearest neighbor search, over large vector datasets. Applications include, but are not limited to:

Recommendation systems.
Content-based search: finding an image, video or PDF from a text query, or vice versa.
Computer vision: object recognition, and visually similar image search or generation.
Fraud or anomaly detection.
Natural language processing: sentiment analysis, translation, etc.

Another feature of vector databases is that they can provide long-term memory for large language models. Information is stored as vectors, and the entries most relevant to a user's prompt are retrieved and supplied to the model along with that prompt to refine the results returned. By using the vector database as a source of long-term memory, the language model can improve its performance and provide more relevant, personalized and context-aware responses.
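The long-term memory pattern can be sketched as a retrieval step that runs before the model is called. This is an illustration under stated assumptions: `toy_embed`, `build_prompt`, the memory notes and the vocabulary are all hypothetical, the word-count embedding stands in for a real embedding model, and no actual language model is invoked.

```python
import math
import re


def toy_embed(text, vocabulary):
    """Hypothetical stand-in for an embedding model: counts of known words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [float(words.count(term)) for term in vocabulary]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_prompt(question, memory, vocabulary, k=2):
    """Retrieve the k most relevant notes and prepend them to the question."""
    q = toy_embed(question, vocabulary)
    ranked = sorted(
        memory,
        key=lambda note: cosine(q, toy_embed(note, vocabulary)),
        reverse=True,
    )
    context = "\n".join(f"- {note}" for note in ranked[:k])
    return f"Context from memory:\n{context}\n\nUser question: {question}"


# Invented notes that a previous conversation might have stored.
memory = [
    "The user prefers vegetarian recipes",
    "The user lives in Lisbon",
    "The user is learning Rust",
]
vocabulary = ["vegetarian", "recipes", "lisbon", "rust", "learning"]

prompt = build_prompt("Suggest some recipes for dinner", memory, vocabulary, k=1)
print(prompt)
```

The resulting prompt would then be sent to the language model. The model's weights are not changed; it simply sees the retrieved context next to the user's question, which is what lets it give a more personalized answer.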

Popular Vector Databases

A few of the Vector databases on the market include:

Pinecone
Weaviate
Chroma
Milvus
Vespa

In conclusion, vector databases offer a powerful solution for storing and manipulating high-dimensional vector data. By leveraging mathematical representations of features and attributes, vector databases enable efficient storage, indexing, and retrieval of multi-dimensional data.