📦

Hands-on: Building a Vector Database

Author

宋明河 / MLOps Engineer

Category

Hands-on

Building a database

(Image source) Generated automatically with Microsoft Designer - Stunning designs in a flash, then edited

Hands-on environment setup

qdrant DB

docker pull qdrant/qdrant
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant

milvus DB

wget https://github.com/milvus-io/milvus/releases/download/v2.3.3/milvus-standalone-docker-compose.yml -O docker-compose.yml
sudo docker-compose up -d
Reference: https://milvus.io/docs/install_standalone-docker.md

postgres (pgvector)

postgres Dockerfile
docker build -t pg-vector .
docker run -it -p 5432:5432 -e POSTGRES_PASSWORD=1234 -e POSTGRES_HOST_AUTH_METHOD=trust pg-vector
Reference: https://github.com/pgvector/pgvector

Downloading the dataset

Installing libraries

Install the following packages before starting the exercise.

• transformers
• torch
• qdrant_client
• pymilvus
• psycopg2

# install library
!pip install transformers
!pip install torch
!pip install qdrant_client
!pip install pymilvus
!pip install psycopg2
[0]: Install each library package

What is the SQuAD v2.0 dataset?

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

https://rajpurkar.github.io/SQuAD-explorer/

import requests

# Download the SQuAD v2.0 dataset
squad_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json"
response = requests.get(squad_url)
squad_data = response.json()
[1]: Download the dataset
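The downloaded JSON is nested as articles, then paragraphs, then question/answer entries, which is why the loops later in this hands-on are three levels deep. The sketch below is one possible way to flatten that structure; `iter_questions` and the `sample` dictionary are hypothetical names made up for illustration and are not part of the original notebook. Note that unanswerable SQuAD v2.0 questions carry an empty `answers` list.

```python
from itertools import islice


def iter_questions(squad, limit=None):
    """Yield (question, answers) pairs from a SQuAD-style dict, up to `limit`."""
    def walk():
        for article in squad["data"]:
            for paragraph in article["paragraphs"]:
                for qa in paragraph["qas"]:
                    # Unanswerable SQuAD v2.0 questions have an empty "answers" list.
                    yield qa["question"], qa["answers"]
    # islice stops after `limit` items; limit=None means "no limit".
    return islice(walk(), limit)


# Tiny hand-made sample in the SQuAD layout (not real data).
sample = {
    "data": [
        {"paragraphs": [
            {"qas": [
                {"question": "Q1?", "answers": [{"text": "A1"}]},
                {"question": "Q2?", "answers": []},
            ]},
            {"qas": [{"question": "Q3?", "answers": [{"text": "A3"}]}]},
        ]}
    ]
}

pairs = list(iter_questions(sample, limit=2))
print(pairs)
```

With the real data, `iter_questions(squad_data, limit=1000)` would produce the same 1,000 questions that the nested loops with `break` statements collect.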

Setting up the embedding function

Define a function that converts the dataset into embeddings. The same embeddings are used for every vector database in the exercises below.

from transformers import BertTokenizer, BertModel
import torch

# Loading the tokenizer and BERT model for use
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
[2]: Load the BERT model and tokenizer to use

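def embed_text(text):
    """ Returns BERT embeddings for the given text. """
    # Truncate to the model's maximum input length so long texts do not raise an error
    inputs = tokenizer(text, return_tensors='pt', truncation=True)
    # Inference only, so gradient tracking is not needed
    with torch.no_grad():
        outputs = model(**inputs)
    # Average the token vectors into a single 768-dimensional vector
    return outputs.last_hidden_state.mean(1).numpy()[0]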
[3]: Define a function that returns the BERT embedding of a given text
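The call `last_hidden_state.mean(1)` averages the per-token vectors along the token axis (mean pooling), so each sentence becomes one vector regardless of its length. For bert-base-uncased that vector has 768 dimensions, and the average includes the special [CLS] and [SEP] tokens. The following is a minimal pure-Python sketch of the same operation on toy data; `mean_pool` is a hypothetical helper written for illustration only.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]


# Three "tokens" with two dimensions each (toy values, not real BERT output).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(tokens)
print(pooled)  # [3.0, 4.0]
```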

Running the hands-on

Qdrant

Initialize the qdrant client.

# prepare qdrant client
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams, PointStruct
qdrant = QdrantClient(host="localhost", port=6333)
[4]: Configure a client that connects to qdrant running in local Docker

Create a collection in qdrant.

# create collection
qdrant.create_collection(
    collection_name="questions",
    vectors_config=VectorParams(size= 768,distance=Distance.COSINE),
)
[5]: Specify an arbitrary name (questions)

After the collection is created, the following output is shown.

[Output]
True

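Store the embeddings of 1,000 questions from the SQuAD v2.0 dataset.

Qdrant uses HNSW indexing by default, and the index parameters can be configured when the collection is created, before any embeddings are stored. Alternatively, passing exact=True in the search parameters bypasses the index and runs an exact (brute-force) search. Qdrant does not provide an IVF Flat index.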

points = []
count = 0
questions = []
answers = []
max_questions = 1000

# Store embeddings for 1,000 questions from the SQuAD v2.0 dataset in Qdrant.
for article in squad_data["data"]:
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            question = qa["question"]
            embedding = embed_text(question)
            answer = qa["answers"]
            point = PointStruct(
                id=count,
                vector=embedding.tolist(),
                payload={"question": question}
            )
            questions.append(question)
            answers.append(answer)
            points.append(point)
            count += 1
            if count >= max_questions:
                break
        if count >= max_questions:
            break
    if count >= max_questions:
        break
[6]: Generate the embeddings

operation_info = qdrant.upsert(
    collection_name="questions",
    wait=True,
    points=points
)
[7]: Save the generated embeddings to the questions collection

• qdrant essentially supports only HNSW indexing.
• The usual flow is "insert embeddings -> build the index -> search", but the code here has no explicit index-build step because qdrant builds its index automatically.
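For comparison with the approximate HNSW index, an exact search scores every stored vector against the query and keeps the best matches. The sketch below shows that idea in plain Python with cosine similarity, the metric configured for the collection; `cosine` and `exact_search` are hypothetical helpers for illustration, not part of qdrant_client, and the vectors are toy data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_search(query, vectors, limit=5):
    """Score every stored vector and return the top `limit` (id, score) pairs."""
    scored = [(idx, cosine(query, vec)) for idx, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
hits = exact_search([1.0, 0.1], stored, limit=2)
print(hits)
```

The cost grows linearly with the number of stored vectors, which is acceptable for 1,000 questions but is the reason an approximate index becomes attractive on larger collections.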

Generate the embedding of the query for the vector search, then run the search.
You can check how long the search takes and what it returns.

vector = embed_text("who is Beyonce")
[8]: Generate the embedding of the query for the vector search

%%time
results = qdrant.search(
    collection_name="questions", query_vector= vector, limit=5
)
[9]: Run the vector search

The execution time is printed as follows.

[Output]
CPU times: user 1.91 ms, sys: 1.1 ms, total: 3.02 ms
Wall time: 3.79 ms
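`%%time` is a Jupyter cell magic and reports a single run, so a figure of a few milliseconds can vary noticeably between executions. Outside a notebook, or when a steadier number is wanted, the call can be repeated and summarized. This is a minimal sketch, assuming a hypothetical helper `time_call`; the lambda is a stand-in workload and would be replaced by the search call.

```python
import statistics
import time


def time_call(fn, repeats=20):
    """Return (median, max) wall-clock latency in milliseconds over `repeats` calls."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples), max(samples)


# Stand-in workload; in the notebook this would wrap the search call instead.
median_ms, worst_ms = time_call(lambda: sum(range(10_000)))
print(f"median {median_ms:.3f} ms, worst {worst_ms:.3f} ms")
```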

You can check the search results and their similarity scores.

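for hit in results:
    # Unanswerable SQuAD v2.0 questions have an empty answer list
    answer = answers[hit.id][0]["text"] if answers[hit.id] else None
    print("question : {},  answer : {},  similarity score : {}".format(questions[hit.id], answer, hit.score))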
[10]: Check the results

[Output] 
question : Who is Beyoncé married to?,  answer : Jay Z,  similarity score : 0.831488
question : Who influenced Beyonce?,  answer : Michael Jackson,  similarity score : 0.8148161
question : Who did Beyoncé marry?,  answer : Jay Z.,  similarity score : 0.8004105
question : When did Beyoncé release Formation?,  answer : February 6, 2016,  similarity score : 0.79073215
question : Which artist did Beyonce marry?,  answer : Jay Z,  similarity score : 0.7883082

milvus

Initialize the milvus client.

from pymilvus import Collection, CollectionSchema, FieldSchema, DataType, connections
import numpy as np

# prepare milvus client
connections.connect(host='localhost', port='19530')
[11]: Connection settings for milvus running in local Docker

Create a collection in milvus.

# create collection

id_field = FieldSchema(name="id", dtype=DataType.INT64, is_primary=True)
vector_field = FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768)
schema = CollectionSchema(fields=[id_field, vector_field ], description="SQuAD Questions")
collection_name = "questions101"
collection = Collection(name=collection_name, schema=schema)
[12]: Specify an arbitrary name (questions101)

Store the embeddings of 1,000 questions from the SQuAD v2.0 dataset.

count = 0
max_count = 1000
points = []
ids = []
embeddings = []
questions = []
answers = []
for article in squad_data['data']:
    for paragraph in article['paragraphs']:
        for qa in paragraph['qas']:
            question = qa['question']
            answer = qa["answers"]
            embedding = embed_text(question)

            # Collect the IDs and embeddings in separate lists
            ids.append(count)
            embeddings.append(embedding.tolist())
            questions.append(question)
            answers.append(answer)
            count += 1
            if count >= max_count:
                break
        if count >= max_count:
            break
    if count >= max_count:
        break
[13]: Generate the embeddings

# save milvus db
index_params = {
    "metric_type": "COSINE",
    "index_type": "IVF_FLAT",
    "params": {"nlist": 768}
}
insert_data = [ids, embeddings]
collection.create_index(field_name="embedding", index_params=index_params)
ids = collection.insert(insert_data)
[14]: Save the generated embeddings to the questions101 collection

After loading the saved collection, generate the embedding of the query and run the vector search. You can check how long the search takes and what it returns.

collection.load()
[15]: Load the saved collection

search_params = {
    "metric_type": "COSINE",
    "offset": 0,
    "ignore_growing": False,
    "params": {"nprobe": 10}
}
[16]: Set search_params
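In an IVF_FLAT index, nlist is the number of clusters the vectors are partitioned into, and nprobe is how many of those clusters are scanned per query. The toy sketch below illustrates the trade-off in plain Python. It is not Milvus code: the helper names are hypothetical, the centroids are fixed by hand rather than learned, and it uses L2 distance for simplicity where the hands-on configures COSINE.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(idx)
    return lists


def ivf_search(query, vectors, centroids, lists, nprobe, limit):
    """Scan only the `nprobe` lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [idx for c in order[:nprobe] for idx in lists[c]]
    candidates.sort(key=lambda idx: l2(query, vectors[idx]))
    return candidates[:limit]


# Toy data: centroids are fixed by hand here, whereas a real IVF index
# learns them from the data (typically with k-means).
vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [4.9, 4.9]]
centroids = [[0.0, 0.0], [10.0, 10.0]]  # nlist = 2
lists = build_ivf(vectors, centroids)

query = [5.2, 5.2]
top_nprobe1 = ivf_search(query, vectors, centroids, lists, nprobe=1, limit=1)
top_nprobe2 = ivf_search(query, vectors, centroids, lists, nprobe=2, limit=1)
print(top_nprobe1, top_nprobe2)  # [3] [5]
```

With nprobe=1 the true nearest neighbor (id 5) is missed because it sits in the other cluster; scanning both clusters finds it. In this hands-on, 1,000 vectors are spread over an nlist of 768, so each list holds fewer than two vectors on average and an nprobe of 10 inspects only a small share of the data.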

question = "who is Beyonce"
embedding = embed_text(question)
[17]: Generate the embedding of the query for the vector search

%%time
results = collection.search(
    data=[embedding.tolist()],
    anns_field="embedding",
    param=search_params,
    limit=5
)
[18]: Run the vector search

The execution time is printed as follows.

[Output]
CPU times: user 950 µs, sys: 895 µs, total: 1.85 ms
Wall time: 5.3 ms

You can check the search results and their similarity scores.

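# collection.search returns one list of hits per query vector; only one query was sent
for hit in results[0]:
    answer = answers[hit.id][0]["text"] if answers[hit.id] else None
    print("question : {},  answer : {},  similarity score : {}".format(questions[hit.id], answer, hit.score))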
[19]: Check the results

[Output]
question : Who is Beyoncé married to?,  answer : Jay Z,  similarity score : 0.8314879536628723
question : Who influenced Beyonce?,  answer : Michael Jackson,  similarity score : 0.8148161768913269
question : Who did Beyoncé marry?,  answer : Jay Z.,  similarity score : 0.8004105091094971
question : When did Beyoncé release Formation?,  answer : February 6, 2016,  similarity score : 0.79073215
question : Which artist did Beyonce marry?,  answer : Jay Z,  similarity score : 0.7883082

pgvector

Initialize the pgvector client.

import psycopg2
[20]: package import

# prepare pgvector client
# Port 5432 matches the -p 5432:5432 mapping used when starting the container
conn = psycopg2.connect(host="localhost",user="postgres", password="1234",port=5432)
cur = conn.cursor()
# Install the extension
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
[21]: Connection settings for pgvector running in local Docker

Create a table in pgvector.

# create table
cur.execute('CREATE TABLE questions (id bigserial PRIMARY KEY, embedding vector(768))')
conn.commit()
[22]: Specify an arbitrary name (questions)

Store the embeddings of 1,000 questions from the SQuAD v2.0 dataset.

points = []
count = 0
max_questions = 1000  # maximum number of questions
questions = []
answers = []
try:
    for article in squad_data['data']:
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                question = qa['question']
                answer = qa["answers"]
                embedding = embed_text(question)
                cur.execute('INSERT INTO questions (embedding) VALUES (%s)', (embedding.tolist(),))
                conn.commit()
                questions.append(question)
                answers.append(answer)
                count += 1
                if count >= max_questions:
                    break
            if count >= max_questions:
                break
        if count >= max_questions:
            break
except psycopg2.DatabaseError as e:
    print(f"Database error: {e}")
    conn.rollback()
[23]: Generate the embeddings

cur.execute("""SET maintenance_work_mem = '128MB';""")
[24]: Set maintenance_work_mem

cur.execute('''
CREATE INDEX IF NOT EXISTS idx_vector
ON questions
USING ivfflat (embedding vector_cosine_ops) WITH (lists = 768);
''')
[25]: Create an INDEX on the questions table
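The `lists` value controls how many clusters the ivfflat index partitions the rows into. As a starting point, the pgvector README suggests rows / 1000 for tables of up to 1M rows and sqrt(rows) beyond that, with sqrt(lists) as an initial value for the number of probes at query time. The sketch below simply encodes that rule of thumb; the function names are hypothetical, and the results are starting points to tune rather than fixed recommendations.

```python
import math


def suggest_ivfflat_lists(rows):
    """Starting point from the pgvector README: rows/1000 up to 1M rows, sqrt(rows) above."""
    if rows <= 1_000_000:
        return max(1, rows // 1000)
    return int(math.sqrt(rows))


def suggest_probes(lists):
    """Starting point from the pgvector README: sqrt(lists)."""
    return max(1, int(math.sqrt(lists)))


for rows in (1_000, 500_000, 4_000_000):
    lists = suggest_ivfflat_lists(rows)
    print(rows, lists, suggest_probes(lists))
```

By this heuristic, the 1,000-row table in this hands-on would start from a much smaller `lists` value than the 768 used above.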

With the embeddings stored and indexed, generate the embedding of the query and run the vector search. You can check how long the search takes and what it returns.

question = "who is Beyonce"
embedding = embed_text(question)
[26]: Generate the embedding of the query for the vector search

%%time
try:
    # Note: <-> is pgvector's L2 (Euclidean) distance operator; <=> is cosine distance
    cur.execute("SELECT * FROM questions ORDER BY embedding <-> %s::vector LIMIT 5;", (embedding.tolist(),))
    results = cur.fetchall()
except psycopg2.DatabaseError as e:
    print(f"Database error: {e}")
    conn.rollback()
[27]: Run the vector search
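The query orders rows with `<->`, which in pgvector is Euclidean (L2) distance, while the index was created with vector_cosine_ops and the Qdrant and Milvus searches use cosine similarity (pgvector's cosine distance operator is `<=>`). The two metrics can rank the same rows differently when the vectors are not normalized, which may explain why the pgvector results differ from the other two databases. The toy sketch below shows such a case; the function names and vectors are illustrative assumptions only.

```python
import math


def l2_distance(a, b):
    """Euclidean distance, the quantity pgvector's <-> operator computes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity, the quantity pgvector's <=> operator computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
rows = {
    "same direction, long": [10.0, 0.0],
    "slightly off, short": [1.0, 0.5],
}

by_l2 = sorted(rows, key=lambda name: l2_distance(query, rows[name]))
by_cosine = sorted(rows, key=lambda name: cosine_distance(query, rows[name]))
print(by_l2)
print(by_cosine)
```

Here L2 prefers the short vector that is close in space, while cosine prefers the long vector that points in exactly the same direction.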

The execution time is printed as follows.

[Output]
CPU times: user 1.36 ms, sys: 1.05 ms, total: 2.42 ms
Wall time: 4.1 ms

You can check the search results and their similarity scores.

def cosine_similarity(vec_a, vec_b):
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
[28]: Define a function to measure similarity: cosine_similarity

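import json

for row in results:
    # The vector comes back as text like '[0.1,0.2,...]'; parse it without eval
    db_vector = np.array(json.loads(row[1]))
    similarity = cosine_similarity(embedding, db_vector)
    answer = answers[row[0]-1][0]["text"] if answers[row[0]-1] else None
    print("question : {},  answer : {},  similarity score : {}".format(questions[row[0]-1], answer, similarity))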
[29]: Check the results

[Output]
question : Who is Beyoncé married to?,  answer : Jay Z,  similarity score : 0.8314879702500422
question : Who influenced Beyonce?,  answer : Michael Jackson,  similarity score : 0.8148160879479863
question : What band did Beyonce introduce in 2006?,  answer : Suga Mama,  similarity score : 0.7825187277864076
question : What solo album did Beyonce release in 2003?,  answer : Dangerously in Love,  similarity score : 0.7818855973596779
question : When did Beyoncé release Formation?,  answer : February 6, 2016,  similarity score : 0.7907322425502744