
Conversation


@github-classroom github-classroom bot commented Sep 30, 2024

👋! GitHub Classroom created this pull request as a place for your teacher to leave feedback on your work. It will update automatically. Don’t close or merge this pull request, unless you’re instructed to do so by your teacher.
In this pull request, your teacher can leave comments and feedback on your code. Click the Subscribe button to be notified if that happens.
Click the Files changed or Commits tab to see all of the changes pushed to the default branch since the assignment started. Your teacher can see this too.

Notes for teachers

Use this PR to leave feedback. Here are some tips:

  • Click the Files changed tab to see all of the changes pushed to the default branch since the assignment started. To leave comments on specific lines of code, put your cursor over a line of code and click the blue + (plus sign). To learn more about comments, read “Commenting on a pull request”.
  • Click the Commits tab to see the commits pushed to the default branch. Click a commit to see specific changes.
  • If you turned on autograding, then click the Checks tab to see the results.
  • This page is an overview. It shows commits, line comments, and general comments. You can leave a general comment below.
    For more information about this pull request, read “Leaving assignment feedback in GitHub”.

Subscribed: @jin-jae @LHANTAEK @simigami @ssunbear @doraemon500

jin-jae and others added 26 commits October 27, 2024 16:52
Add:: Role assignment

@gunny97 gunny97 left a comment


optimize_retriever.py seems to be identical to main.py, so it would be better to delete the unnecessary file.


@gunny97 gunny97 left a comment


The code works correctly overall, but a few improvements would help maintainability and readability. Removing unnecessary comments and deleting unused commented-out code would make the code clearer. Also, modularizing the source files and organizing them into folders (e.g., modeling, pre-processing, retrieval) would improve the project structure and make the code easier to manage. The code shows a deep understanding of the problem, and applying these improvements will make it even more efficient and easier to maintain.

from mecab import MeCab
from sklearn.feature_extraction.text import TfidfVectorizer

data_path = "/data/ephemeral/home/level2-mrc-nlp-15/data/train_dataset"

It would be better to write this as a relative path. (This applies not only here but to the other places as well.)
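For example, a rough sketch using pathlib (the directory names are taken from the absolute path above; the exact location of this file within the project is an assumption):

ex)
from pathlib import Path

# resolve the dataset directory relative to this source file instead of an absolute path
project_root = Path(__file__).resolve().parent
data_path = project_root / "data" / "train_dataset"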


if val > 0: cnt += 1

print(len(top_tokens_per_doc))

Rather than printing just the raw number, it would be good to include a descriptive message as well.

ex)
total_docs = len(top_tokens_per_doc)
matching_docs = cnt
print(f"Total number of documents: {total_docs}")
print(f"Documents matching the question keywords: {matching_docs}")
print(f"Matching ratio: {matching_docs / total_docs * 100:.2f}%")

self.num_labels = config.num_labels
self.roberta = RobertaModel(config, add_pooling_layer=False)

self.cnn_block1 = CNN_block(config.hidden_size, config.hidden_size)

Duplicated code can hurt readability, so it would be good to use nn.ModuleList here.

self.cnn_blocks = nn.ModuleList([
    CNN_block(config.hidden_size, config.hidden_size) for _ in range(5)
])

sequence_output = outputs[0]

# Apply CNN layers
sequence_output = self.cnn_block1(sequence_output)

If you use nn.ModuleList as above, the forward pass here can also be rewritten as follows.

for cnn_block in self.cnn_blocks:
    sequence_output = cnn_block(sequence_output)

},
)
#################################################################################
batch_size: int = field(

For batch_size and num_epochs as well, adding a help entry in metadata, consistent with the args above, would look cleaner.


In addition, batch_size and num_epochs are more closely related to training than to ModelArgs, so it would be more natural to create a new TrainArgs class and define them there, as sketched below.
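A minimal sketch of such a class (the defaults and help strings here are illustrative, not the project's actual values):

ex)
from dataclasses import dataclass, field

@dataclass
class TrainArgs:
    batch_size: int = field(
        default=16,
        metadata={"help": "Batch size used for training and evaluation."},
    )
    num_epochs: int = field(
        default=3,
        metadata={"help": "Number of training epochs."},
    )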



for i in range(len(test_ids)):
    id = test_ids[i]

id is a built-in Python function, so reusing it as a variable name risks conflicts later; I recommend declaring a different name.
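For example (the name test_id is only a suggestion):

ex)
for i in range(len(test_ids)):
    test_id = test_ids[i]  # avoids shadowing the built-in id()

or, more idiomatically, iterate directly: for test_id in test_ids: ...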

    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

class DenseRetrieval:
    def __init__(

Looking at the __init__ function, the model path is hard-coded, so switching to a different model requires editing the code directly. It would be more useful to accept the model path as a parameter in __init__.

ex)
def __init__(
    self,
    data_path: Optional[str] = "../data/",
    context_path: Optional[str] = "wikipedia_documents.json",
    model_name: str = 'sentence-transformers/paraphrase-multilingual-mpnet-base-v2',
    corpus: Optional[pd.DataFrame] = None,
) -> NoReturn:
    # ...
    self.model_name = model_name
    self.tokenize_fn = AutoTokenizer.from_pretrained(self.model_name)
    self.dense_embeder = AutoModel.from_pretrained(self.model_name)

def get_dense_embedding(self, question=None, contexts=None):
    if contexts is not None:
        self.contexts = contexts
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In general, it is better to define the device once in __init__ and then use self.device wherever it is needed inside the class.
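A minimal sketch of the pattern (the attribute names follow the snippets above; the encoded_input handling is only illustrative):

ex)
# in __init__
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.dense_embeder = AutoModel.from_pretrained(self.model_name).to(self.device)

# in get_dense_embedding (and any other method that needs a device)
encoded_input = {k: v.to(self.device) for k, v in encoded_input.items()}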

with torch.no_grad():
    model_output = self.dense_embeder(**encoded_input)
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    self.dense_embeds.append(sentence_embeddings.cpu())

Doing it this way can cause problems as the dataset grows. It would be better to define self.dense_embeds outside the for loop, e.g. self.dense_embeds = torch.zeros(len(self.contexts), self.dense_embeder.config.hidden_size), and then write each embedding into it by index, as sketched below.
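A rough sketch of that suggestion (the batch_size variable and the slicing loop are illustrative; the actual batching in the code may differ):

ex)
self.dense_embeds = torch.zeros(len(self.contexts), self.dense_embeder.config.hidden_size)

for start in range(0, len(self.contexts), batch_size):
    end = start + batch_size
    encoded_input = self.tokenize_fn(self.contexts[start:end], padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = self.dense_embeder(**encoded_input)
        sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    # write each batch into the preallocated tensor by index instead of appending to a list
    self.dense_embeds[start:end] = sentence_embeddings.cpu()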
