Building a RAG-Based Hybrid Search Chatbot with Vector DB and Cache Storage SaaS, Part 1

Digital Architects

RAG: An Essential Technique for Building Chatbots

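As conversational AI advances, Retrieval-Augmented Generation (RAG) has become an essential technique for building chatbots grounded in a specific domain. RAG improves large language models (LLMs) such as ChatGPT by retrieving facts from an external knowledge base, so that responses rest on accurate, up-to-date information rather than on the model's training data alone. Because RAG can incorporate specific datasets, from knowledge snippets to per-user data, it raises answer quality and makes personalized responses possible.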

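I first tried building a chatbot by handing the AI specific URLs, such as Wikipedia pages and a company website, as its base knowledge and then carefully engineering the prompt to draw out the answers I wanted. The answers did not reach the level I was hoping for, so I turned to a few SaaS products that support RAG and built a more advanced chatbot in under five days. For an efficient chatbot service, it is a good idea to use database and cache SaaS (Software as a Service) offerings.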

In this post, I walk through a chatbot built on watch sales data from an online marketplace to show how to build a RAG-based hybrid search chatbot with vector DB and cache storage SaaS.

1. Choosing a topic for the chatbot

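The first step is to choose a topic. I decided to build a chatbot that combines information about products currently on sale in an online shopping mall with what the model already knows from pre-training, so that it can talk with users using the mall's own information. For the shopping data, I took a Kaggle dataset of watches sold on eBay and preprocessed it. (Dataset: https://www.kaggle.com/datasets/kanchana1990/trending-ebay-watch-listings-2024)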

2. Storing data in the knowledge base

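The next step was to store the data in a knowledge base, in a form the AI can consult. To hold the preprocessed data I used Pinecone, a vector DB offered as SaaS. I created the index by clicking through the Pinecone web console.

Once the index exists, the data needs to be stored in it. Unlike a typical relational DB, a vector DB stores embeddings: the natural-language values in the data are embedded so that their similarity to a user's question can be computed and relationships between words can be captured, with words of similar meaning mapped to nearby positions in the embedding space.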
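To make "nearby in the embedding space" concrete, here is a minimal sketch of cosine similarity, a measure commonly used to compare embeddings. The three-dimensional vectors and their labels are made up for illustration; real embeddings have hundreds or thousands of dimensions, and this post does not state which similarity metric its index uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, not produced by any real embedding model.
watch = [0.9, 0.1, 0.0]
wristwatch = [0.8, 0.2, 0.1]
banana = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(watch, wristwatch)  # similar meaning, small angle
sim_far = cosine_similarity(watch, banana)        # unrelated meaning, large angle
print(sim_close > sim_far)
```

A score near 1 means the two vectors point in almost the same direction, which is the sense in which related words end up "close" to each other.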

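There are many embedding models; for this chatbot I used one of OpenAI's. The embedding dimension differs from model to model, so once you have chosen a model, note its dimension, because you need it when creating the vector DB index. OpenAI's text-embedding-ada-002, which I used, produces 1536-dimensional vectors.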

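import openai  # openai>=1.0; reads OPENAI_API_KEY from the environment

def get_embedding(text):
    """Return the embedding vector for `text` (1536 floats for ada-002)."""
    response = openai.embeddings.create(input=text, model="text-embedding-ada-002")
    return response.data[0].embedding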

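With the embedding model chosen, the next step is to initialize Pinecone and load the data into it. A Pinecone record consists of an id and values, optionally with metadata. Because the values are embedded vectors that a person cannot read, the natural-language form of the data goes into the metadata.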

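from pinecone import Pinecone

pc = Pinecone(
    api_key=""  # your Pinecone API key
)
index = pc.Index('quickstart')

# One sentence per row; `batch` is a slice of the DataFrame (see the full loop further down).
text = [
    f"The title is {row['title']} "
    f"and the product type is {row['type']}. "
    f"The seller is {row['seller']} "
    f"and the price is {row['priceWithCurrency']}. "
    f"They have been sold {row['sold']} times."
    for _, row in batch.iterrows()
]

emb_vec = [get_embedding(doc) for doc in text]
# Pair each id with its vector and metadata; `ids` and `metadata` are built in the full loop.
to_upsert = list(zip(ids, emb_vec, metadata))
index.upsert(vectors=to_upsert)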
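As a sketch of the record shape described above, the helper below pairs ids, vectors, and metadata and checks each vector against the index dimension before anything is sent. `build_records` and `EXPECTED_DIM` are hypothetical names introduced here for illustration; the `(id, values, metadata)` tuple is the form this post passes to `upsert`.

```python
EXPECTED_DIM = 1536  # dimension of text-embedding-ada-002, as noted in this post

def build_records(ids, vectors, metadatas, dim=EXPECTED_DIM):
    """Zip parallel lists into (id, values, metadata) tuples, validating sizes."""
    if not (len(ids) == len(vectors) == len(metadatas)):
        raise ValueError("ids, vectors and metadatas must have the same length")
    records = []
    for record_id, vector, metadata in zip(ids, vectors, metadatas):
        if len(vector) != dim:
            raise ValueError(
                f"vector for id {record_id} has {len(vector)} dims, expected {dim}"
            )
        records.append((str(record_id), list(vector), dict(metadata)))
    return records

# Usage with placeholder data (no real embeddings involved).
records = build_records(
    ids=["0", "1"],
    vectors=[[0.0] * EXPECTED_DIM, [0.1] * EXPECTED_DIM],
    metadatas=[{"title": "watch a"}, {"title": "watch b"}],
)
```

Catching a length or dimension mismatch locally is usually easier to debug than an error returned by the remote index.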


The data I want to store looks like this:

"The title is {row['title']} and the product type is {row['type']}. The seller is {row['seller']} and the price is {row['priceWithCurrency']}. They have been sold {row['sold']} times."

Substituting the first row gives the following sentence, which is what gets stored: "The title is hamilton men's h77705145 khaki navy 42mm automatic watch and the product type is wristwatch. The seller is watchgooroo and the price is $519.99. They have been sold 10 times."

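Because the sentence template is fixed and only the values change, some words repeat in every sentence. I would like to remove these repetitive, unnecessary words.

A quick aside: words that carry little meaning when we interpret language, such as the, a, is, and of, are called stopwords.

To cut down on repetitive, unnecessary words at embedding time, I removed the stopwords. I then added a new column, 'filtered_text', to the dataset to hold the sentences with the stopwords removed.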

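import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')  # newer NLTK releases may need 'punkt_tab' instead

stop_words = set(stopwords.words('english'))

filtered_texts = []
for _, row in df.iterrows():
    # Build the sentence for this row, then drop the stopwords from it.
    text = (
        f"The title is {row['title']} and the product type is {row['type']}. "
        f"The seller is {row['seller']} and the price is {row['priceWithCurrency']}. "
        f"They have been sold {row['sold']} times."
    )
    words = word_tokenize(text.lower())
    filtered_words = [word for word in words if word not in stop_words]
    filtered_texts.append(' '.join(filtered_words))

df['filtered_text'] = filtered_texts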
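The same filtering idea can be tried without downloading any NLTK data. The sketch below uses a tiny hand-picked stopword set and plain whitespace tokenization, both stand-ins chosen for illustration; NLTK's English list is longer and `word_tokenize` splits punctuation differently, so its results will not match this version exactly.

```python
# Hypothetical, deliberately small stopword set (NLTK's list is much longer).
STOP_WORDS = {"the", "is", "and", "they", "have", "been"}

def remove_stopwords(sentence, stop_words=STOP_WORDS):
    """Lowercase the sentence, split on whitespace, and drop stopwords."""
    tokens = sentence.lower().split()
    return " ".join(token for token in tokens if token not in stop_words)

sentence = (
    "The title is hamilton men's h77705145 khaki navy 42mm automatic watch "
    "and the product type is wristwatch. The seller is watchgooroo "
    "and the price is $519.99. They have been sold 10 times."
)
filtered = remove_stopwords(sentence)
print(filtered)
```

Only the template words disappear; the product-specific values (title, seller, price, sold count) are kept for embedding.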


With NLTK's default stopwords removed, the first sentence becomes:

"hamilton men's h77705145 khaki navy 42mm automatic watch product type wristwatch. seller watchgooroo price $ 519.99 . sold 10 times ."

Here is the sentence before stopword removal; compared with it, the result looks a bit cleaner:

"The title is hamilton men's h77705145 khaki navy 42mm automatic watch and the product type is wristwatch. The seller is watchgooroo and the price is $519.99. They have been sold 10 times."

Now that all the data is ready, let's store it in Pinecone.

from tqdm import tqdm

batch_size = 100
for i in tqdm(range(0, len(df), batch_size)):
    i_end = min(i + batch_size, len(df))
    batch = df.iloc[i:i_end]
    # Embed the stopword-filtered sentences.
    text = [
        f"{row['filtered_text']}"
        for _, row in batch.iterrows()
    ]

    # Human-readable fields are stored as metadata next to each vector.
    metadata = [
        {
            "index": str(row['Index']),
            "itemNumber": str(row['itemNumber']),
            "price": str(row['priceWithCurrency']),
            "seller": str(row['seller']),
            "title": str(row["title"]),
            "type": str(row["type"]),
            "sold": str(row["sold"]),
            # Sparse vector embedding (left commented out here):
            # "sparse_vector": str(sparse_embedding(str(row["filtered_text"]))),
        }
        for _, row in batch.iterrows()
    ]

    emb_vec = [get_embedding(doc) for doc in text]
    ids = batch['Index'].astype(str).tolist()
    to_upsert = list(zip(ids, emb_vec, metadata))
    index.upsert(vectors=to_upsert)
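The batching arithmetic in the loop above can be checked in isolation. The generator below is a small sketch with a hypothetical name, `batch_ranges`; it reproduces the `range(0, len(df), batch_size)` and `min(...)` logic and shows that the last batch is simply shorter when the row count is not a multiple of the batch size.

```python
def batch_ranges(n_rows, batch_size):
    """Yield (start, end) index pairs that cover range(n_rows) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

# 250 rows in batches of 100: two full batches and one of 50 rows.
ranges = list(batch_ranges(250, 100))
print(ranges)  # [(0, 100), (100, 200), (200, 250)]
```

Each pair can be passed straight to `df.iloc[start:end]`, so no row is skipped or sent twice.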


In the Pinecone GUI, you can confirm that the data has been loaded correctly.

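3. Building the chatbot

Now that the data is ready, let's build a simple chatbot that can hold a conversation.

I used gpt-3.5-turbo as the model and set up the prompt so that, when it receives a question, it gives information and recommendations about relevant products based on the DB. (Please provide your insights and recommendations about the item based on the information from the database)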

from datetime import datetime

def get_response(prompt, input_time):
    """Answer `prompt` using the three most relevant records from Pinecone."""
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user",
             "content": f"Customer Question: {prompt}"},
            # The retrieved records are passed to the model as context.
            {"role": "assistant",
             "content": f"Information from the database: {query_top3_vector(prompt)}"},
            {"role": "user",
             "content": "Please provide your insights and recommendations "
                        "about the item based on the information from the database."},
        ],
    )
    ai_response = response.choices[0].message.content
    # Elapsed time since input_time, reported alongside the answer.
    duration = datetime.now() - input_time
    answer = f"Duration: {duration} <br>Bot Answer: {ai_response}"
    return answer
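One way to make the prompt easier to inspect is to assemble the message list in a separate function that needs no network access. The sketch below is a variation on the code above, not what this post does: `build_messages` and `format_matches` are hypothetical helpers, and the retrieved metadata is rendered as numbered lines instead of being interpolated as a Python list.

```python
def format_matches(matches):
    """Render retrieved metadata dicts as one readable line per item."""
    lines = []
    for i, meta in enumerate(matches, start=1):
        fields = ", ".join(f"{key}: {value}" for key, value in meta.items())
        lines.append(f"{i}. {fields}")
    return "\n".join(lines)

def build_messages(question, matches):
    """Build the three-message prompt: question, retrieved context, instruction."""
    return [
        {"role": "user", "content": f"Customer Question: {question}"},
        {"role": "assistant",
         "content": f"Information from the database:\n{format_matches(matches)}"},
        {"role": "user",
         "content": "Please provide your insights and recommendations "
                    "about the item based on the information from the database."},
    ]

# Usage with placeholder metadata shaped like the records stored earlier.
messages = build_messages(
    "Which automatic watch do you recommend?",
    [{"title": "hamilton khaki navy", "price": "$519.99", "sold": "10"}],
)
```

The resulting list can be passed as `messages` to the chat completion call, and printed or unit-tested beforehand.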


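But how should the AI extract the information related to the DB? In the code above, the prompt includes a function call, Information from the database: {query_top3_vector(prompt)}, so that the three most relevant records are retrieved.

The query_top3_vector function finds the three records closest to the question and returns their metadata, which was stored in human-readable form, from Pinecone so it can be passed to the chatbot. This is where the effort spent on embedding during preprocessing starts to pay off.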

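def query_top3_vector(question):
    """Return the metadata of the three records most similar to `question`."""
    # The index expects a vector, so embed the question with the same model used for the data.
    results = index.query(vector=get_embedding(question),
                          top_k=3, include_metadata=True)
    metadata_values = [match['metadata'] for match in results['matches']]
    return metadata_values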
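To show what a top-3 query does conceptually, here is a brute-force sketch over a handful of in-memory records. It is not how Pinecone works internally (a managed vector DB typically uses an index rather than scanning every record), and the two-dimensional vectors, ids, and the `top_k` helper are all made up for illustration.

```python
import math

def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, records, k=3):
    """Score every (id, vector, metadata) record and return the k best matches."""
    scored = [
        {"id": rid, "score": _cosine(query_vec, vec), "metadata": meta}
        for rid, vec, meta in records
    ]
    scored.sort(key=lambda match: match["score"], reverse=True)
    return scored[:k]

# Hypothetical toy records; real vectors would come from get_embedding.
records = [
    ("a", [1.0, 0.0], {"title": "automatic watch"}),
    ("b", [0.9, 0.1], {"title": "wristwatch"}),
    ("c", [0.0, 1.0], {"title": "phone case"}),
    ("d", [0.5, 0.5], {"title": "watch strap"}),
]
matches = top_k([1.0, 0.0], records, k=3)
metadata_values = [match["metadata"] for match in matches]
```

The last line mirrors the list comprehension in `query_top3_vector`: only the human-readable metadata of the best matches is handed to the chatbot.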