표 질의 응답(Table Question Answering)을 파인콘(Pinecone), 랭체인(Langchain)으로 만들기

GPT 활용법 탐구 2023. 6. 2. 08:00

표 질의응답 모델(Table Question Answering)을 파인콘(Pinecone)과 랭체인(Langchain)을 사용해서 간단하게 만들어볼 수 있습니다. 이 글에서는 파인콘에서 제공하는 Colab을 참고해서 표 질의 응답을 만들어보겠습니다. 먼저, '표 질의응답 모델', '질의응답 모델'이 무엇인지 알아보도록 하겠습니다.

질의응답 모델

질의응답(Question Answering) 모델이란, 문서 또는 자연어 텍스트를 기반으로 주어진 질문에 대한 대답을 생성하는 모델입니다. 질의응답을 하려면 모델은 텍스트를 이해하고 사실에 대한 추론 능력이 있어야 하기 때문에 다소 까다롭지만, 자주 묻는 질문에 대한 응답을 자동화할 수 있다는 편리함이 있어 이미 다양한 분야에서 많이 사용되고 있습니다. 질의응답 모델을 가장 접하기 쉬운 기능은 서비스의 고객 문의 챗봇이나 문서 검색입니다.

질의응답 모델의 종류

질의응답은 입력 및 출력에 따라 다양한 종류가 있으며, 도메인을 기준으로 분류하기도 합니다.

입력 및 출력에 따른 종류

Extract QA: 모델이 콘텍스트로 제공된 텍스트 또는 HTML에서 답변을 추출합니다. 일반적으로 BERT 모델을 사용합니다.
Open Generative QA: 모델이 콘텍스트를 기반으로 직접 텍스트를 생성합니다.
Closed Generative QA: 모델에게 콘텍스트를 제공하지 않고, 모델이 처음부터 끝까지 텍스트를 생성합니다.

도메인에 따른 분류

오픈 도메인(Open domain): 질문, 답변이 특정 도메인(예: 법률, 의료)으로 제한되지 않음
클로즈 도메인(Closed domain): 질문, 답변이 특정 도메인으로 제한됨

표 질의 응답(Table Question Answering) 모델은 표 형태로 주어진 문서를 기반으로 질문에 대한 답변을 생성하는 모델입니다. 아래에서 표 질의응답 모델에 대해서 더 자세하게 알아보겠습니다.

표 질의 응답 모델

표 질의 응답(Table Question-Answering) 모델은 질의응답 모델과 마찬가지로 질문에 대한 답변을 생성하는 모델입니다. 모델에게 줄 콘텍스트는 표 형태로 제공해야 합니다. 아래는 표 질의응답 모델을 간략하게 설명하는 예시입니다.

예를 들어, 아래와 같은 표를 모델에게 제공합니다.

이름	출연작
이경영	35
오달수	23
조진웅	19

모델에게 '출연작이 가장 많은 배우가 누구인지 알려줘' 라고 질문한다면 모델은 출연작이 35편인 이경영을 답변으로 생성합니다.

표 질의 응답질의응답 모델은 다른 질의응답 모델과 자연어를 인식하고 처리하는 과정은 동일합니다. 그러나 표 질의응답 모델이 답변할 수 있는 내용은 주어진 표로 한정됩니다. 대체로 질문이 제한된 범위로 한정될 때 유용하게 사용할 수 있기 때문에, 고객 문의 챗봇에 많이 활용되고 있습니다.

표 질의 응답 모델을 활용해서 시스템을 만들려면, 아래와 같은 세 가지 요소가 필요합니다.

테이블 임베딩을 저장할 벡터 인텍스
임베딩 쿼리와 테이블을 검색하는 검색(retriever) 모델
테이블을 읽고 답변을 추출하는 리더(reader) 모델

파인콘, 랭체인으로 표 질의응답 만들기

이제 파인콘에서 제공하는 Colab을 참고해서 표 질의 응답을 만들어보겠습니다. Colab 사용이 처음이라면 Google Colab 질문과 답변을 참고합니다.

라이브러리 설치

!pip install datasets pinecone-client sentence_transformers torch-scatter

아래는 각 라이브러리에 대한 설명입니다.

datasets: ML 모델을 학습하고 평가하는데 사용할 수 있는 공용 데이터 세트를 제공하는 라이브러리입니다.
pinecone-client: 파인콘 파이선 클라이언트입니다.
sentence_transformers: 문장, 단락, 이미지에 대한 고밀더 벡터 표현을 쉽게 계산하는 방법을 제공하며 BERT/RoBERTa/XLM-RoBERTa등과 같은 트랜스포머를 기반으로 하는 프레임워크입니다.
torch-scatter: 파이토치에서 사용하기에 최적화된 sparse update(scatter and segment) 연산의 확장 라이브러리들로 구성되어 있습니다.

데이터 세트 로드하기

위키피디아의 텍스트와 표로 구성된 OTT-QA (Open Table-and-Text Question Answering) 데이터 세트의 하위 세트로 작업할 것입니다. 하위 세트에는 20,000개의 테이블이 있으며, 아래와 같이 Huggingface Datasets 허브에서 로드할 수 있습니다.

from datasets import load_dataset

# load the dataset from huggingface datasets hub
data = load_dataset("ashraq/ott-qa-20k", split="train")
data

Dataset({
    features: ['url', 'title', 'header', 'data', 'section_title', 'section_text', 'uid', 'intro'],
    num_rows: 20000
})

data[2]

{'url': 'https://en.wikipedia.org/wiki/1976_New_York_Mets_season',
 'title': '1976 New York Mets season',
 'header': ['Level', 'Team', 'League', 'Manager'],
 'data': [['AAA', 'Tidewater Tides', 'International League', 'Tom Burgess'],
  ['AA', 'Jackson Mets', 'Texas League', 'John Antonelli'],
  ['A', 'Lynchburg Mets', 'Carolina League', 'Jack Aker'],
  ['A', 'Wausau Mets', 'Midwest League', 'Bill Monbouquette'],
  ['Rookie', 'Marion Mets', 'Appalachian League', 'Al Jackson']],
 'section_title': 'Farm system',
 'section_text': 'See also : Minor League Baseball',
 'uid': '1976_New_York_Mets_season_7',
 'intro': 'The New York Mets season was the 15th regular season for the Mets, who played home games at Shea Stadium. Led by manager Joe Frazier, the team had an 86-76 record and finished in third place in the National League East.'}

테이터 세트에는 서로 관련있는 텍스트와 표가 있습니다. 이 예제에서는 표만 사용할 것 이기 때문에, 표를 추출해서 pandas 데이터 프레임으로 변환합니다.

import pandas as pd

# store all tables in the tables list
tables = []
# loop through the dataset and convert tabular data to pandas dataframes
for doc in data:
    table = pd.DataFrame(doc["data"], columns=doc["header"])
    tables.append(table)

tables[2]

	Level	Team	League	Manager
0	AAA	Tidewater Tides	International League	Tom Burgess
1	AA	Jackson Mets	Texas League	John Antonelli
2	A	Lynchburg Mets	Carolina League	Jack Aker
3	A	Wausau Mets	Midwest League	Bill Monbouquette
4	Rookie	Marion Mets	Appalachian League	Al Jackson

검색기(Retriever) 초기화

검색기는 자연어 쿼리와 표 형식 데이터를 임베딩/벡터로 변환합니다. 자연어 질문과, 질문에 대한 답변이 포함된 표가 벡터 공간 가까운 곳에 위치하도록 임베딩을 생성합니다.

검색 작업을 위해서 표 형식 데이터를 포함하도록 훈련한 Sentence Transformer 모델을 사용할 것입니다. 아래와 같이 Huggingface Models 허브에서 모델을 로드합니다.

import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the table embedding model from huggingface models hub
retriever = SentenceTransformer("deepset/all-mpnet-base-v2-table", device=device)
retriever

SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

표를 특정 형식으로 변환하는 함수를 작성합니다.

def _preprocess_tables(tables: list):
    processed = []
    # loop through all tables
    for table in tables:
        # convert the table to csv and 
        processed_table = "\n".join([table.to_csv(index=False)])
        # add the processed table to processed list
        processed.append(processed_table)
    return processed

여기에는 표만 사용하고 있습니다. 그러나 표를 검색하는 동안, 검색기가 메타데이터를 고려하도록 하려면 표의 시작 부분에서 줄 바꿈 문자로 구분된 title, section_title 등과 같은 모든 메타데이터 문자열을 조인할 수 있습니다.

서식이 지정된 표를 살펴봅니다.

# format all the dataframes in the tables list
processed_tables = _preprocess_tables(tables)
# display the formatted table
processed_tables[2]

'Level,Team,League,Manager\nAAA,Tidewater Tides,International League,Tom Burgess\nAA,Jackson Mets,Texas League,John Antonelli\nA,Lynchburg Mets,Carolina League,Jack Aker\nA,Wausau Mets,Midwest League,Bill Monbouquette\nRookie,Marion Mets,Appalachian League,Al Jackson\n'

임베딩 모델이 형식이 지정된 표를 이해하고 정확하게 임베딩을 생성하도록 훈련됩니다.

파인콘 인덱스 초기화

여기서는 파인콘 벡터 데이터베이스를 벡터 인덱스로 사용합니다. 파인콘 인덱스는 자연어 쿼리(쿼리 벡터)를 사용해 검색할 수 있는 표의 벡터 표현을 저장합니다. 파인콘은 쿼리 벡터와 벡터 인덱스에 저장된 포함된 표 간의 유사성을 계산합니다.

파인콘을 사용하려면 먼저 파인콘과 연결을 초기화해야 합니다. 파인콘에서 무료 API 키를 발급받아, 아래와 같이 초기화합니다.

import pinecone

# connect to pinecone environment
pinecone.init(
    api_key="YOUR API KEY",
    environment="YOUR_ENVIRONMENT"
)

새로운 인덱스를 생성합니다. 콘텍스트 임베딩을 생성하는 데 사용하는 검색기가 768차원 벡터를 출력하기 때문에, 메트릭 유형을 '코사인'으로 지정하고 차원을 768로 지정합니다. 파인콘은 코사인 유사성을 사용해 쿼리와 테이블 임베딩 간의 유사성을 계산합니다.

# you can choose any name for the index
index_name = "table-qa"

# check if the table-qa index exists
if index_name not in pinecone.list_indexes():
    # create the index if it does not exist
    pinecone.create_index(
        index_name,
        dimension=768,
        metric="cosine"
    )

# connect to table-qa index we created
index = pinecone.Index(index_name)

임베딩 생성 및 업서트

테이블 임베딩을 생성하고, 이를 파인콘 인덱스에 업로드해야 합니다. 아래 코드는 테이블 임베딩을 생성하고 파인콘 인덱스에 업로드합니다.

from tqdm.auto import tqdm

# we will use batches of 64
batch_size = 64

for i in tqdm(range(0, len(processed_tables), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(processed_tables))
    # extract batch
    batch = processed_tables[i:i_end]
    # generate embeddings for batch
    emb = retriever.encode(batch).tolist()
    # create unique IDs ranging from zero to the total number of tables in the dataset
    ids = [f"{idx}" for idx in range(i, i_end)]
    # add all to upsert list
    to_upsert = list(zip(ids, emb))
    # upsert/insert these records to pinecone
    _ = index.upsert(vectors=to_upsert)

# check that we have all vectors in index
index.describe_index_stats()

  100%|██████████| 313/313 [09:12<00:00, 1.49s/it]

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 20000}},
 'total_vector_count': 20000}

이제 파인콘 인덱스를 쿼리 할 준비가 되었습니다. 쿼리와 관련된 표를 변환하는지 테스트해 보겠습니다.

query = "which country has the highest GDP in 2020?"
# generate embedding for the query
xq = retriever.encode([query]).tolist()
# query pinecone index to find the table containing answer to the query
result = index.query(xq, top_k=1)
result

{'matches': [{'id': '19931', 'score': 0.822087, 'values': []}], 'namespace': ''}

파인콘 인덱스는 82.2% 신뢰도로 쿼리에 대한 답변을 포함하는 표를 반환했습니다. 이 표에 실제로 답이 포함되어 있는지 확인해 보겠습니다. 반환된 표를, 인덱스를 사용해서 관련 pandas 데이터 프레임을 가져옵니다.

id = int(result["matches"][0]["id"])
tables[id].head()

	Rank	Country	GDP (PPP, Peak Year) millions of USD	Peak Year
0	1	China	27,804,953	2020
1	2	India	11,321,280	2020
2	3	Russia	4,389,960	2019
3	4	Indonesia	3,778,134	2020
4	5	Brazil	3,596,841	2020

파인콘 인덱스에 의해 반환된 표는 실제로 쿼리에 대한 답변을 포함하고 있습니다. 이제 이 표를 읽고 정확한 답을 추출할 수 있는 모델을 만들어봅니다.

표 리더 초기화

표 QA 작업을 위해 미세 조정된 TAPAS 모델을 사용합니다. TAPAS는 위키피디아의 대규모 영어 데이터 모음에서 자체 감독 방식으로 사전 훈련된 BERT와 유사한 트랜스포머(Transformer) 모델입니다. 허깅페이스 모델 허브에서 질문 응답 파이프라인으로 모델과 토크나이저를 로드합니다.

from transformers import pipeline, TapasTokenizer, TapasForQuestionAnswering

model_name = "google/tapas-base-finetuned-wtq"
# load the tokenizer and the model from huggingface model hub
tokenizer = TapasTokenizer.from_pretrained(model_name)
model = TapasForQuestionAnswering.from_pretrained(model_name, local_files_only=False)
# load the model and tokenizer into a question-answering pipeline
pipe = pipeline("table-question-answering",  model=model, tokenizer=tokenizer, device=device)

답을 추출하기 위해 이전에 사용했던 질문을 응답 파이프라인에 파인콘 인덱스와 쿼리가 반환한 표에 실행해 봅니다.

pipe(table=tables[id], query=query)

{'answer': 'China',
 'coordinates': [(0, 1)],
 'cells': ['China'],
 'aggregator': 'NONE'}

모델이 질문에 정확하게 대답했습니다.

쿼리

먼저 쿼리를 처리학고 표에서 답변을 추출하는 두 가지 함수를 정의합니다.

def query_pinecone(query):
    # generate embedding for the query
    xq = retriever.encode([query]).tolist()
    # query pinecone index to find the table containing answer to the query
    result = index.query(xq, top_k=1)
    # return the relevant table from the tables list
    return tables[int(result["matches"][0]["id"])]

def get_answer_from_table(table, query):
    # run the table and query through the question-answering pipeline
    answers = pipe(table=table, query=query)
    return answers

query = "which car manufacturers produce cars with a top speed of above 180 kph?"
table = query_pinecone(query)
table

	Manufacturer	Model	Engine	Power Output	Max-Speed(kph)	Dry Weight (kg)
0	Fiat	805-405	FIAT 1979ccS6 supercharged	130 bhp	220	680
1	Alfa Romeo	GPR (P1)	Alfa Romeo 1990cc S6	95 bhp	180	850
2	Diatto	Tipo 20 S	Diatto 1997cc S4	75 bhp	155	700
3	Bugatti	Type 32	Bugatti 1991cc S8	100 bhp	190	660
4	Voisin	C6 Laboratoire	Voisin 1978cc S6	90 bhp	175	710
5	Sunbeam		Sunbeam 1988cc S6	108 bhp	180	675
6	Mercedes	M7294	Mercedes 1990cc S4 superchared	120 bhp	180	750
7	Benz	RH Tropfenwagen	Benz 1998cc S6	95 bhp	185	745
8	Miller	122	Miller 1978cc S8	120 bhp	186	850

참고

'GPT 활용법 탐구' 카테고리의 다른 글

파인콘, 허깅페이스로 시맨틱 검색 만들기 (0)	2023.06.08
SEO를 위한 ChatGPT 3가지 사용 사례 (0)	2023.06.05
ChatGPT 플러그인 활용하기 (0)	2023.06.01
AutoGPT를 사용하는 3가지 방법 (0)	2023.05.31
Google의 PaLM2와 Bard (0)	2023.05.31

ABOUT ME

ai-method

질의응답 모델

질의응답 모델의 종류

표 질의 응답 모델

파인콘, 랭체인으로 표 질의응답 만들기

라이브러리 설치

데이터 세트 로드하기

검색기(Retriever) 초기화

파인콘 인덱스 초기화

임베딩 생성 및 업서트

표 리더 초기화

쿼리

참고

'GPT 활용법 탐구' 카테고리의 다른 글

티스토리툴바

ABOUT ME

질의응답 모델

질의응답 모델의 종류

표 질의 응답 모델

파인콘, 랭체인으로 표 질의응답 만들기

라이브러리 설치

데이터 세트 로드하기

검색기(Retriever) 초기화

파인콘 인덱스 초기화

임베딩 생성 및 업서트

표 리더 초기화

쿼리

참고

'GPT 활용법 탐구' 카테고리의 다른 글

관련글 관련글 더보기

티스토리툴바

표 리더 초기화