Server

KoGPT2

The core model of the chatbot: it generates answers to users' chat messages.

Baseline & Dataset

Baseline code : Korean Language Model for Wellness Conversation

Dataset : 주제별 텍스트 일상 대화 데이터 (topic-based daily text conversation data)

Process

Data Preparation
Data Preprocessing
Using pretrained weight
Train the model
Validate the model

How to make AIFriend?

1. Data Preparation

Download the dataset.
This project used only the Kakao data from the source-data archives (원천데이터 압축파일) in the dataset's Training and Validation folders.
To reproduce the project's environment, the Kakao data alone is sufficient.
If needed, you can also train on additional data after refining it yourself.

(Project folder)/data/kakao - Place the prepared KAKAO.txt files here (collect all the txt files in /data/kakao/).
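The collection step above can be scripted. The following is a minimal sketch, not part of the project: the function name collect_kakao_txt and the example source path are hypothetical, and it assumes the extracted archives contain .txt files somewhere under a single folder.

```python
import shutil
from pathlib import Path


def collect_kakao_txt(source_dir, target_dir):
    """Copy every .txt file found under source_dir into target_dir (flat).

    Hypothetical helper: files that share a name overwrite each other,
    so check for name clashes if your archives reuse file names.
    """
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    # sorted() materializes the listing before any file is copied
    for txt_path in sorted(source.rglob("*.txt")):
        destination = target / txt_path.name
        if destination.resolve() == txt_path.resolve():
            continue  # the file is already in the target folder
        shutil.copy2(txt_path, destination)
        copied.append(destination.name)
    return copied
```

For example, collect_kakao_txt("extracted_archives", "data/kakao") would gather the files into the folder the preprocessing script expects, where extracted_archives stands for wherever you unpacked the download.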

2. Data Preprocessing

In (Project folder)/preprocess/split_new.py, check the paths defined at the top of the file, set them correctly, and run the script.
It preprocesses the data so that KoGPT2 can be fine-tuned into AIfriend.
During this step, documents containing certain words, such as those related to politics or ideology, are excluded so that they are not learned. You can modify the project file directly to improve these exclusion rules.
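To illustrate the kind of exclusion rule described above, here is a minimal sketch. It is not the code in split_new.py: the function name, the data layout (a dialogue as a list of utterance strings), and the blocked-word list are all assumptions made for the example.

```python
def filter_dialogues(dialogues, blocked_words):
    """Keep only dialogues in which no utterance contains a blocked word.

    Hypothetical stand-in for the project's exclusion step: a whole
    dialogue is dropped as soon as one utterance matches.
    """
    kept = []
    for dialogue in dialogues:
        has_blocked_word = any(
            word in utterance
            for utterance in dialogue
            for word in blocked_words
        )
        if not has_blocked_word:
            kept.append(dialogue)
    return kept


# Illustrative data only; the real word list lives in the project file.
dialogues = [
    ["what did you eat today?", "I had noodles"],
    ["did you watch the election debate?", "yes, it was long"],
]
clean = filter_dialogues(dialogues, blocked_words=["election"])
```

Substring matching is the simplest possible rule. A stricter or looser policy (whole-word matching, dropping only the matching utterance) would be a design choice for whoever edits the preprocessing script.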

3. Using pretrained weight

Use the pretrained Word2Vec model for extracting interest categories (download link).

Place the files in (Project folder)/core/AIfriend/Keybert/
Use the pretrained KoGPT2 model provided by SKT (available through the transformers library).
The pretrained AIfriend weights used in this project are also available (download link).
(Please note that the model weights may have learned biases or incorrect knowledge from the data.)

4. Train the model

In (Project folder)/train/run_auto_regressive.py, check the paths defined at the top of the file, set them correctly, and run the script.
It automatically loads SKT's pretrained KoGPT2 model and trains it on the preprocessed data.
Hyperparameters such as the learning rate and the number of epochs can be adjusted as needed.

5. Validate the model

You can test the model directly in code before testing it through the server and client.
(Project folder)/example/kogpt2-text-generation.py lets you try AIfriend with the trained KoGPT2 before it is used on the server.

Server with Firestore(DB)

Connects the server and clients over network sockets, using one thread per connection to serve multiple clients.
Manages conversations with the chatbot and the expansion of interest channels, one of the core features of the AIfriend project.

Demo.py

Load various pre-trained models

#model loading
root_path = '../..'   
...   
model_W2V = word2vec.Word2Vec.load('./Keybert/model_W2V')

Load the pretrained models: KoGPT2, Word2Vec, the tokenizer, and SentenceTransformer.
Make sure each path is correct.

Modify the category

# Add categories here.
# (travel, music, games, animals, clothes, food, exercise, reading, cooking)
category = ['여행', '음악', '게임', '동물', '옷', '음식', '운동', '독서', '요리']

Edit the category list to suit your needs.
If you change the categories, you must also create the corresponding fav and Board entries (Reference).

Main loop

try:
    while True:
        print('>> Waiting for a new connection')
        client_socket, addr = server_socket.accept()  
        user_sockets.append(client_socket)
        print("Current user : ", len(user_sockets))
        start_new_thread(threaded, (client_socket, addr))
except ...
finally ...

A new thread is created for each user connection accepted on the network socket.
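The accept-and-spawn pattern in the main loop can be tried in isolation. The sketch below is an assumption-laden stand-in, not Demo.py: it uses threading.Thread instead of start_new_thread, guards the shared socket list with a lock, and replaces the chatbot logic with a placeholder that echoes data back. All function names here are hypothetical.

```python
import socket
import threading


def handle_client(client_socket, addr, user_sockets, lock):
    """Serve one client until it disconnects (placeholder: echo the data)."""
    try:
        while True:
            data = client_socket.recv(2048)
            if not data:  # empty bytes means the client closed the connection
                break
            client_socket.sendall(data)
    except ConnectionError:
        pass  # the client dropped the connection abruptly
    finally:
        with lock:
            if client_socket in user_sockets:
                user_sockets.remove(client_socket)
        client_socket.close()


def serve(server_socket, user_sockets, lock):
    """Accept connections and hand each one to its own thread."""
    while True:
        try:
            client_socket, addr = server_socket.accept()
        except OSError:  # raised once the listening socket is closed
            break
        with lock:
            user_sockets.append(client_socket)
        threading.Thread(
            target=handle_client,
            args=(client_socket, addr, user_sockets, lock),
            daemon=True,
        ).start()


def start_server(host="127.0.0.1", port=0):
    """Bind, listen, and run the accept loop in a background thread."""
    server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server_socket.bind((host, port))  # port 0 lets the OS pick a free port
    server_socket.listen()
    user_sockets, lock = [], threading.Lock()
    threading.Thread(
        target=serve, args=(server_socket, user_sockets, lock), daemon=True
    ).start()
    return server_socket, user_sockets
```

The lock matters because several threads append to and remove from the same list. Whether the project's own loop needs one depends on how user_sockets is used elsewhere, which the excerpt does not show.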

Thread server

def threaded(client_socket, addr):

Each new user that connects to the server is handled in its own thread.

...
data = client_socket.recv(2048)
...
original_data = data.decode('utf-8')
...
if original_data[:6] == 'AIchat':
    uid = original_data[6:]
    KoGPT(uid, db, model, tokenizer, push_service, model_ST, model_W2V, category)
...

The client sends the prefix AIchat to start a conversation with the chatbot.
The rest of the message is the uid, which is used to access Firestore.
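The message format described above (the prefix AIchat followed directly by the uid) can be parsed in one small function. This is a sketch: parse_request and the returned labels are hypothetical names, and other prefixes the server may handle are not shown in the excerpt, so they fall through to "unknown" here.

```python
CHAT_PREFIX = "AIchat"


def parse_request(raw_bytes):
    """Split a received message into (kind, payload).

    For a chat request, the payload is the uid that follows the prefix.
    """
    text = raw_bytes.decode("utf-8")
    if text.startswith(CHAT_PREFIX):
        return "chat", text[len(CHAT_PREFIX):]
    return "unknown", text


kind, uid = parse_request("AIchatabc123".encode("utf-8"))
```

Using len(CHAT_PREFIX) instead of a literal 6 keeps the slice correct if the prefix is ever renamed.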

functions.py

Main function(KoGPT)

Called when a socket message starts with the prefix AIchat. It generates AIfriend's answer and decides whether to add an interest.

def KoGPT(...):
...
# Document search
    document_name = db.collection(u'AIChat').where('uid', 'array_contains', uid).get()[0].id
    AIchat_ref = db.collection(u'AIChat').document(document_name).collection('Chats')

The document search above loads the user's chat log document.

def KoGPT(...):
...
    tokenized_indexs = tokenizer.encode(user_chat_list[0])
    input_ids = torch.tensor([tokenizer.bos_token_id, ] + tokenized_indexs + [tokenizer.eos_token_id]).unsqueeze(0)
    sample_output = model.generate(input_ids=input_ids)
    answer = tokenizer.decode(sample_output[0].tolist()[len(tokenized_indexs) + 1:], skip_special_tokens=True)

The tokenizer and the trained KoGPT2 generate an answer to the user's chat.
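The slice in the snippet above, len(tokenized_indexs) + 1, can look opaque. The sketch below reproduces it with plain lists, without torch or a real tokenizer. It assumes the generated sequence begins with the input ids followed by the new tokens (the usual behaviour of generate for GPT-2-style models). The token ids and helper names are made up for the example.

```python
BOS_ID, EOS_ID = 0, 1  # hypothetical special-token ids


def build_input(prompt_ids):
    """Wrap the prompt the way the snippet does: BOS + prompt + EOS."""
    return [BOS_ID] + prompt_ids + [EOS_ID]


def extract_answer_ids(output_ids, prompt_ids):
    """Drop BOS and the prompt, then drop special tokens.

    The final filter mimics decode(..., skip_special_tokens=True),
    which removes the EOS left at the start of the slice.
    """
    tail = output_ids[len(prompt_ids) + 1:]
    return [token_id for token_id in tail if token_id not in (BOS_ID, EOS_ID)]


prompt_ids = [11, 12, 13]
# Pretend the model appended tokens 21, 22 and a closing EOS.
output_ids = build_input(prompt_ids) + [21, 22, EOS_ID]
answer_ids = extract_answer_ids(output_ids, prompt_ids)
```

The + 1 skips the BOS token. The EOS that closed the prompt survives the slice and is only removed by the special-token filter.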

def KoGPT(...):
...
    # Every 'fav_max_count' user chats, the server starts extracting interests
    if keybert_check != -1:
        keybert_check = keybert_check % fav_max_count
    if keybert_check == 0:
        # Connect a category every 'fav_max_count' user chats
        category_connect(uid, db, model_ST, model_W2V, category)

Every fav_max_count user chats, category_connect(...) is called to connect the user to an interest board.
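The modulo check above can be isolated into a small predicate. This sketch assumes that keybert_check holds a running count of the user's chats and that -1 means "skip extraction". The excerpt does not show how the value is computed, so both points are assumptions, and the function name is hypothetical.

```python
def should_extract_interests(user_chat_count, fav_max_count):
    """Return True when the count is a multiple of fav_max_count.

    -1 is treated as a sentinel that disables extraction, mirroring
    the 'keybert_check != -1' guard. fav_max_count must be positive.
    """
    if user_chat_count == -1:
        return False
    return user_chat_count % fav_max_count == 0


# With fav_max_count = 5, extraction runs on the 5th and 10th chats.
trigger_points = [n for n in range(1, 11) if should_extract_interests(n, 5)]
```

Note that a count of 0 also satisfies the check, so whether extraction should run before the first chat depends on how the counter is initialised.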

Main function(category_connect)

def category_connect(...):
    bert_keyword = key_bert(uid, db, model_ST, model_W2V, category)

Call key_bert(...) to extract a keyword from the user's chats.

def category_connect(...):
...
    if bert_keyword in category:
        email = db.collection(u'user').where('uid', '==', uid).get()[0].id
        check = db.collection("fav").document(bert_keyword).get().to_dict()['users']
...
            db.collection("fav").document(bert_keyword).update({"users": firestore.ArrayUnion([email])})
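            # Message (Korean): "You're interested in <keyword>! I'll introduce you to friends with similar tastes! Want to check the My Interests tab?"
            AIchat_ref.add({'message': bert_keyword + '에 관심있구나! 내가 비슷한 취향을 가진 친구들을 소개시켜줄게! 내 관심사 탭에 가볼래?', 'time': firestore.SERVER_TIMESTAMP, 'uid': 'AIfriend'})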

If the user's interest keyword is one of the categories, the user is added to that interest board.
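The membership logic can be followed without a Firestore connection by swapping the database for a plain dictionary. The sketch below is such a stand-in, not the project's function: connect_category, the boards dictionary, and the English category names are hypothetical, and the duplicate check is an assumption about what the elided lines do.

```python
def connect_category(keyword, email, category, boards):
    """Add email to boards[keyword] once; return True only if newly added.

    'boards' maps a category name to its list of user emails and stands
    in for the "fav" collection.
    """
    if keyword not in category:
        return False  # the keyword is not one of the known interests
    users = boards.setdefault(keyword, [])
    if email in users:
        return False  # already a member, so no new notification is needed
    users.append(email)  # Firestore's ArrayUnion likewise skips duplicates
    return True


category = ["travel", "music"]
boards = {"music": ["a@example.com"]}
first = connect_category("music", "b@example.com", category, boards)
second = connect_category("music", "b@example.com", category, boards)
```

Returning a boolean lets the caller decide whether to post the follow-up chat message, which keeps the board update and the notification separate.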

Sub functions

def getToken(...):

Gets the client's token.

def sendMessage(...):

Uses the client's token to send a push notification.

def chatting_delay(...):

Adds a delay so that AIfriend's replies feel like they come from a person.
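The body of chatting_delay is not shown, so the following is only one plausible way to produce a human-like pause: wait longer for longer answers, up to a cap. The function names, the per-character rate, and the cap are all hypothetical.

```python
import time


def typing_delay_seconds(answer, seconds_per_char=0.1, max_seconds=3.0):
    """Return a pause that grows with answer length, capped at max_seconds."""
    return min(len(answer) * seconds_per_char, max_seconds)


def send_with_delay(answer, send, sleep=time.sleep):
    """Wait as if typing, then deliver the answer through 'send'.

    'sleep' is injectable so the pause can be skipped in tests.
    """
    sleep(typing_delay_seconds(answer))
    send(answer)
```

A caller would pass its own delivery function, for example send_with_delay(answer, send=print) during local testing.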

def key_bert(...):
