LeeYejun1324 edited this page May 8, 2023 · 11 revisions

KoGPT2

  • Core model of the chatbot: it generates answers to users' chat messages

Baseline & Dataset

Baseline code : Korean Language Model for Wellness Conversation

Dataset : Topic-based Daily Conversation Text Data (주제별 텍스트 일상 대화 데이터)

Process

  • Data Preparation
  • Data Preprocessing
  • Using pretrained weights
  • Train the model
  • Validate the model

How to build AIfriend

1. Data Preparation

  • Download the dataset

  • This project used only the Kakao data from the source-data archives (원천데이터 압축파일) in the dataset's Training and Validation folders.
    To reproduce the project setup, you only need the Kakao data.

  • If necessary, you can refine the data further before training.

(Project folder)/data/kakao/ - Place the prepared KAKAO.txt files here (collect all the .txt files in this one folder)
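
Collecting the files can be scripted. The following is a minimal sketch, assuming the archives have already been extracted somewhere on disk; the function name `collect_txt_files` and both directory arguments are illustrative and not part of the project.

```python
import shutil
from pathlib import Path


def collect_txt_files(extracted_dir: str, target_dir: str) -> int:
    """Copy every .txt file found under extracted_dir into target_dir (flat).

    Assumption: file names are unique across subfolders; a later file with
    the same name would overwrite an earlier copy.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # sorted() materializes the file list before any copy is made
    sources = sorted(Path(extracted_dir).rglob("*.txt"))
    for src in sources:
        shutil.copy2(src, target / src.name)
    return len(sources)
```

For example, `collect_txt_files("extracted", "data/kakao")` would gather the text files under a hypothetical `extracted` folder into `data/kakao/`.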


2. Data Preprocessing

  • In (Project folder)/preprocess/split_new.py, check the paths defined at the top of the file, set them correctly, and run the script.
    It preprocesses the data so that KoGPT2 can be fine-tuned into AIfriend.

  • During preprocessing, content containing certain words (for example, political or ideological terms) is excluded so that the model does not learn from it. You can edit the script directly to refine these exclusion rules.
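
The exclusion idea can be sketched as a simple keyword filter. This is only an illustration: the actual rules live in split_new.py and are not shown on this page, so the word list and the `filter_lines` helper below are assumptions.

```python
# Hypothetical block list: "정치" (politics) and "이념" (ideology).
BLOCKED_WORDS = ("정치", "이념")


def filter_lines(lines, blocked_words=BLOCKED_WORDS):
    """Keep only the lines that contain none of the blocked words."""
    return [
        line for line in lines
        if not any(word in line for word in blocked_words)
    ]


sample = ["오늘 날씨 좋다", "정치 얘기 하자"]
kept = filter_lines(sample)  # only the first line survives
```

A substring check like this is coarse; it also drops harmless sentences that merely contain a blocked word, which may or may not be acceptable for your data.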


3. Using pretrained weights

  • Use the pretrained Word2Vec model to extract interest categories (download link).
    Place the files in (Project folder)/core/AIfriend/Keybert/

  • Use the pretrained KoGPT2 model released by SKT (available through the transformers library)

  • The trained AIfriend weights used in this project are also available (download link)
    (Please note that these weights may have learned biases or incorrect information present in the data)


4. Train the model

  • In (Project folder)/train/run_auto_regressive.py, check the paths defined at the top of the file, set them correctly, and run the script.
    It automatically loads SKT's pretrained KoGPT2 model and trains it on the preprocessed data.

  • Hyperparameters such as the learning rate and the number of epochs can be adjusted as needed.


5. Validate the model

  • You can test the model directly from code before testing it with the server and client.

  • Run (Project folder)/example/kogpt2-text-generation.py to try AIfriend with the trained KoGPT2 model before it is used on the server.

Server with Firestore(DB)

  • Connects the server and the clients over network sockets, using one thread per connection to serve multiple clients
  • Manages conversations with the chatbot and the expansion of interest channels, one of the core features of the AIfriend project

Demo.py

Load various pre-trained models

#model loading
root_path = '../..'   
...   
model_W2V = word2vec.Word2Vec.load('./Keybert/model_W2V')
  • Loads the pretrained models: KoGPT2, Word2Vec, the tokenizer, and SentenceTransformer.
  • Set the paths correctly for your environment.

Modify the category

# Add categories here.
# (travel, music, games, animals, clothes, food, exercise, reading, cooking)
category = ['여행', '음악', '게임', '동물', '옷', '음식', '운동', '독서', '요리']
  • Edit the category list as needed.
    If you change a category, you must also create the matching new fav and Board entries (Reference)

Main loop

try:
    while True:
        print('>> Waiting for a new connection')
        client_socket, addr = server_socket.accept()  
        user_sockets.append(client_socket)
        print("Current user : ", len(user_sockets))
        start_new_thread(threaded, (client_socket, addr))
except ...
finally ...
  • Accepts each incoming connection on the server socket and starts a new thread to handle that client
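
The accept-and-spawn pattern can be tried in isolation. The sketch below is self-contained and makes two substitutions that should be read as assumptions: it uses `threading.Thread` rather than `start_new_thread`, and a placeholder echo handler (`handle_client`) stands in for the project's `threaded` function. `serve` and `start_server` are illustrative names.

```python
import socket
import threading


def handle_client(client_socket: socket.socket, addr) -> None:
    """Placeholder handler: echo each message back until the client leaves."""
    with client_socket:
        while True:
            data = client_socket.recv(2048)
            if not data:  # empty bytes means the client closed the connection
                break
            client_socket.sendall(data)


def serve(server_socket: socket.socket) -> None:
    """Accept connections in a loop, handing each one to its own thread."""
    while True:
        try:
            client_socket, addr = server_socket.accept()
        except OSError:  # raised once the listening socket has been closed
            break
        threading.Thread(
            target=handle_client, args=(client_socket, addr), daemon=True
        ).start()


def start_server(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Bind, listen, and run the accept loop in a background thread."""
    server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server_socket.bind((host, port))  # port 0 lets the OS pick a free port
    server_socket.listen()
    threading.Thread(target=serve, args=(server_socket,), daemon=True).start()
    return server_socket
```

Calling `start_server()` returns the listening socket; `getsockname()[1]` on it gives the port a test client can connect to.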

Thread server

def threaded(client_socket, addr):
  • When a new user accesses the server, it is managed through a new thread.
...
data = client_socket.recv(2048)
...
original_data = data.decode('utf-8')
...
if original_data[:6] == 'AIchat':
    uid = original_data[6:]
    KoGPT(uid, db, model, tokenizer, push_service, model_ST, model_W2V, category)
...
  • The client prefixes its message with AIchat to start a conversation with the chatbot.
    The rest of the message is the uid used to access Firestore.
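
The message format described above (a fixed prefix followed by the uid) can be parsed as follows. This is a minimal sketch; `parse_request` is a hypothetical helper, not a function from the project.

```python
PREFIX = "AIchat"


def parse_request(data: bytes):
    """Split a raw socket message into (command, payload).

    Returns (PREFIX, uid) for chatbot requests and (None, text) otherwise.
    """
    text = data.decode("utf-8")
    if text.startswith(PREFIX):
        return PREFIX, text[len(PREFIX):]
    return None, text


command, uid = parse_request(b"AIchatabc123")
# command is "AIchat" and uid is "abc123"
```

Using `len(PREFIX)` instead of a hard-coded 6 keeps the slice correct if the prefix ever changes.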

functions.py

Main function(KoGPT)

  • Called when a socket message carries the AIchat prefix. It generates AIfriend's answer and decides whether to add an interest.
def KoGPT(...):
...
# Document search
    document_name = db.collection(u'AIChat').where('uid', 'array_contains', uid).get()[0].id
    AIchat_ref = db.collection(u'AIChat').document(document_name).collection('Chats')
  • The document search locates the user's chat log document
def KoGPT(...):
...
    tokenized_indexs = tokenizer.encode(user_chat_list[0])
    input_ids = torch.tensor([tokenizer.bos_token_id, ] + tokenized_indexs + [tokenizer.eos_token_id]).unsqueeze(0)
    sample_output = model.generate(input_ids=input_ids)
    answer = tokenizer.decode(sample_output[0].tolist()[len(tokenized_indexs) + 1:], skip_special_tokens=True)
  • Uses the tokenizer and the trained KoGPT2 model to generate an answer to the user's chat message
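
The slice `[len(tokenized_indexs) + 1:]` is easier to follow with plain lists. In the sketch below the token ids are made-up placeholders and `strip_special` only imitates `skip_special_tokens=True`; it assumes the EOS token is registered as a special token in the tokenizer. Because `generate` returns the input ids followed by the newly generated ids, the slice removes BOS and the prompt but still starts at the prompt's EOS, which the decoding step then drops.

```python
BOS_ID = 0  # placeholder ids; the real values come from the tokenizer
EOS_ID = 1


def build_input_ids(prompt_ids):
    """Wrap the prompt as [BOS] + prompt + [EOS], as in the snippet above."""
    return [BOS_ID] + list(prompt_ids) + [EOS_ID]


def extract_answer_ids(output_ids, prompt_len):
    """Drop BOS and the prompt tokens; the prompt's EOS is still included."""
    return list(output_ids[prompt_len + 1:])


def strip_special(ids):
    """Stand-in for decode(..., skip_special_tokens=True)."""
    return [t for t in ids if t not in (BOS_ID, EOS_ID)]


prompt = [10, 11, 12]
generated = build_input_ids(prompt) + [20, 21]  # pretend the model added 20, 21
answer_ids = extract_answer_ids(generated, len(prompt))
print(answer_ids)                 # [1, 20, 21]
print(strip_special(answer_ids))  # [20, 21]
```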
def KoGPT(...):
...
    # Every 'fav_max_count' chats, the server starts extracting interests
    if keybert_check != -1:
        keybert_check = keybert_check % fav_max_count
    if keybert_check == 0:
        # Connect the user to a category every 'fav_max_count' user chats
        category_connect(uid, db, model_ST, model_W2V, category)
  • Every fav_max_count chats, category_connect(...) is called to connect the user to an interest board
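
The trigger condition can be isolated into a small predicate. This sketch assumes that `keybert_check` is a running count of the user's chats and that -1 is a sentinel meaning "skip"; the page does not state either, and `should_extract_interests` is a hypothetical name.

```python
def should_extract_interests(keybert_check: int, fav_max_count: int) -> bool:
    """Return True when interest extraction should run for this chat."""
    if keybert_check != -1:
        keybert_check = keybert_check % fav_max_count
    return keybert_check == 0


# With fav_max_count = 5, extraction runs on counts 0, 5, 10, ...
triggers = [n for n in range(12) if should_extract_interests(n, 5)]
print(triggers)  # [0, 5, 10]
```

The sentinel is checked before the modulo on purpose: in Python `-1 % 5` is 4, so without the guard the sentinel would be indistinguishable from an ordinary count.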

Main function(category_connect)

def category_connect(...):
    bert_keyword = key_bert(uid, db, model_ST, model_W2V, category)
  • Calls key_bert(...) to extract a keyword from the user's chats.
def category_connect(...):
...
    if bert_keyword in category:
        email = db.collection(u'user').where('uid', '==', uid).get()[0].id
        check = db.collection("fav").document(bert_keyword).get().to_dict()['users']
...
            db.collection("fav").document(bert_keyword).update({"users": firestore.ArrayUnion([email])})
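            # Message (Korean): "You're interested in <keyword>! I'll introduce you to friends with similar tastes! Want to visit the My Interests tab?"
            AIchat_ref.add({'message': bert_keyword + '에 관심있구나! 내가 비슷한 취향을 가진 친구들을 소개시켜줄게! 내 관심사 탭에 가볼래?', 'time': firestore.SERVER_TIMESTAMP, 'uid': 'AIfriend'})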
  • If the user's interest keyword matches a category, the user is added to that interest board.
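
The membership logic can be exercised without Firestore. In this sketch a plain dict stands in for the `fav` collection, and `add_user_to_board` is a hypothetical helper. The check for an existing member is an assumption about the elided lines; it mirrors the fact that `ArrayUnion` only adds elements that are not already present.

```python
def add_user_to_board(boards: dict, keyword: str, email: str, category: list) -> bool:
    """Add email to the board for keyword; return True if the user was added."""
    if keyword not in category:
        return False  # the keyword is not one of the known interest categories
    users = boards.setdefault(keyword, [])
    if email in users:
        return False  # already a member, nothing to do
    users.append(email)
    return True


category = ["여행", "음악"]
boards = {"여행": []}
added_first = add_user_to_board(boards, "여행", "user@example.com", category)
added_again = add_user_to_board(boards, "여행", "user@example.com", category)
# added_first is True, added_again is False
```

Returning a boolean lets the caller send the "new interest" chat message only when the user was actually added.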

Sub functions

def getToken(...):
  • Gets the client's token.
def sendMessage(...):
  • Uses the client's token to send a push notification.
def chatting_delay(...):
  • Adds a delay so that AIfriend's replies feel more human
def key_bert(...):