Feedback #1

Conversation
Create Baseline Code
change default=klue/roberta-large
feat: wandb integration
Add feature/bm25
fix: error with bm25
add CNN layer
Updated for CNN_layer_model.py
Feature/jungmin
Add ensemble file
add: llm answer post-preprocessing
Feature/korquad finetuning
Add:: role assignment
Project Final Completion
fix: leaderboard rank
gunny97
left a comment
optimize_retriever.py appears to be identical to main.py; it would be better to delete the unnecessary file.
gunny97
left a comment
Overall the code runs correctly, but a few improvements are needed to raise maintainability and readability. Removing unnecessary comments and deleting unused commented-out code would make the code clearer. Also, modularizing the source files and organizing them into folders (e.g., modeling, pre-processing, retrieval) would improve the project structure and make the code easier to manage. The code shows a deep overall understanding, and incorporating these improvements will make it even more efficient and easier to maintain.
from mecab import MeCab
from sklearn.feature_extraction.text import TfidfVectorizer

data_path = "/data/ephemeral/home/level2-mrc-nlp-15/data/train_dataset"
It would be better to write this as a relative path. (This is a common comment that applies not only here but to other places as well.)
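For instance, a minimal sketch of the relative-path idea; the exact directory layout is an assumption, so the number of `.parent` hops should be adjusted to where the script actually lives:

```python
from pathlib import Path

# Resolve the dataset relative to the repository instead of an absolute /data/ephemeral path.
PROJECT_ROOT = Path(__file__).resolve().parent  # e.g. .parent.parent if the script is nested deeper
data_path = str(PROJECT_ROOT / "data" / "train_dataset")
```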
if val > 0: cnt += 1

print(len(top_tokens_per_doc))
Rather than printing only the numeric value, it would be better to add a readable description as well.
e.g.:
total_docs = len(top_tokens_per_doc)
matching_docs = cnt
print(f"Total number of documents: {total_docs}")
print(f"Number of documents whose keywords match the question: {matching_docs}")
print(f"Match ratio: {matching_docs / total_docs * 100:.2f}%")
self.num_labels = config.num_labels
self.roberta = RobertaModel(config, add_pooling_layer=False)

self.cnn_block1 = CNN_block(config.hidden_size, config.hidden_size)
Duplicated code can hurt readability, so using nn.ModuleList would be a good option here.
self.cnn_blocks = nn.ModuleList([
    CNN_block(config.hidden_size, config.hidden_size) for _ in range(5)
])
sequence_output = outputs[0]

# Apply CNN layers
sequence_output = self.cnn_block1(sequence_output)
If you use nn.ModuleList as above, the forward pass here can be rewritten as follows:
for cnn_block in self.cnn_blocks:
    sequence_output = cnn_block(sequence_output)
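Putting the two suggestions together, a self-contained sketch could look like the following. The actual implementation of CNN_block in CNN_layer_model.py is not shown in this diff, so the Conv1d-based block and the residual connection below are assumptions for illustration only:

```python
import torch.nn as nn

class CNN_block(nn.Module):
    # Hypothetical block: the real CNN_block in CNN_layer_model.py may differ.
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, padding=kernel_size // 2)
        self.act = nn.GELU()

    def forward(self, x):
        # x: (batch, seq_len, hidden) -> Conv1d expects (batch, hidden, seq_len)
        out = self.act(self.conv(x.transpose(1, 2))).transpose(1, 2)
        return out + x  # residual connection (assumption)

class CNNHead(nn.Module):
    def __init__(self, hidden_size, num_blocks=5):
        super().__init__()
        self.cnn_blocks = nn.ModuleList(
            [CNN_block(hidden_size, hidden_size) for _ in range(num_blocks)]
        )

    def forward(self, sequence_output):
        # Apply the stacked CNN blocks sequentially instead of five hard-coded attributes.
        for cnn_block in self.cnn_blocks:
            sequence_output = cnn_block(sequence_output)
        return sequence_output
```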
},
)
#################################################################################
batch_size: int = field(
For batch_size and num_epochs, adding a metadata help entry consistent with the args above would look cleaner.
In addition, batch_size and num_epochs are more closely related to TrainArgs than to ModelArgs, so it would be more natural to create a new TrainArgs class and define them there.
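A minimal sketch of such a TrainArgs class; the defaults and help strings below are illustrative, not taken from the repository:

```python
from dataclasses import dataclass, field

@dataclass
class TrainArgs:
    batch_size: int = field(
        default=16,
        metadata={"help": "Per-device batch size for training and evaluation."},
    )
    num_epochs: int = field(
        default=3,
        metadata={"help": "Total number of training epochs."},
    )
```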
for i in range(len(test_ids)):
    id = test_ids[i]
id is a Python built-in function, so reusing it as a variable risks conflicts later; I recommend declaring a different name.
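For example (the name test_id is only illustrative), iterating directly over the list also removes the index bookkeeping:

```python
for test_id in test_ids:
    # test_id does not shadow the built-in id()
    ...
```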
return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

class DenseRetrieval:
    def __init__(
Looking at the __init__ function, the model path is hard-coded, so switching to a different model requires editing the code directly. It would be more useful to accept the model path as a parameter in __init__.
e.g.:
def __init__(
    self,
    data_path: Optional[str] = "../data/",
    context_path: Optional[str] = "wikipedia_documents.json",
    model_name: str = 'sentence-transformers/paraphrase-multilingual-mpnet-base-v2',
    corpus: Optional[pd.DataFrame] = None
) -> NoReturn:
    # ...
    self.model_name = model_name
    self.tokenize_fn = AutoTokenizer.from_pretrained(self.model_name)
    self.dense_embeder = AutoModel.from_pretrained(self.model_name)
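With the constructor parameterized this way, swapping the encoder becomes a call-site change only. A usage sketch with the same illustrative argument values as above:

```python
retriever = DenseRetrieval(
    data_path="../data/",
    context_path="wikipedia_documents.json",
    model_name="sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
)
```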
def get_dense_embedding(self, question=None, contexts=None):
    if contexts is not None:
        self.contexts = contexts
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
Generally, the device should be defined in __init__ and then referenced as self.device wherever it is needed inside the class.
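A minimal sketch of that suggestion; constructor parameters other than the model name are omitted, and the model name is the illustrative one from the example above:

```python
import torch
from transformers import AutoModel, AutoTokenizer

class DenseRetrieval:
    def __init__(self, model_name="sentence-transformers/paraphrase-multilingual-mpnet-base-v2"):
        self.tokenize_fn = AutoTokenizer.from_pretrained(model_name)
        self.dense_embeder = AutoModel.from_pretrained(model_name)
        # Decide the device once here ...
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.dense_embeder.to(self.device)

    def get_dense_embedding(self, question=None, contexts=None):
        # ... and reuse self.device wherever tensors or models need to be moved,
        # instead of re-detecting it inside each method.
        ...
```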
with torch.no_grad():
    model_output = self.dense_embeder(**encoded_input)
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    self.dense_embeds.append(sentence_embeddings.cpu())
Doing it this way can cause problems as the dataset grows. It would be better to define self.dense_embeds outside the for loop as
self.dense_embeds = torch.zeros(len(self.contexts), self.dense_embeder.config.hidden_size)
and then fill in the embedding values by index.
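A rough sketch of that idea as a method body; the batching scheme, batch_size, and variable names are assumptions based on the surrounding code:

```python
def get_dense_embedding(self, question=None, contexts=None, batch_size=64):
    # Sketch only: encode every context into a preallocated matrix instead of appending per batch.
    if contexts is not None:
        self.contexts = contexts
    hidden_size = self.dense_embeder.config.hidden_size
    self.dense_embeds = torch.zeros(len(self.contexts), hidden_size)
    with torch.no_grad():
        for start in range(0, len(self.contexts), batch_size):
            batch = self.contexts[start:start + batch_size]
            encoded_input = self.tokenize_fn(
                batch, padding=True, truncation=True, return_tensors="pt"
            ).to(self.device)
            model_output = self.dense_embeder(**encoded_input)
            sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
            # Write each batch into its slice of the preallocated tensor.
            self.dense_embeds[start:start + len(batch)] = sentence_embeddings.cpu()
```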
👋! GitHub Classroom created this pull request as a place for your teacher to leave feedback on your work. It will update automatically. Don’t close or merge this pull request, unless you’re instructed to do so by your teacher.
In this pull request, your teacher can leave comments and feedback on your code. Click the Subscribe button to be notified if that happens.
Click the Files changed or Commits tab to see all of the changes pushed to the default branch since the assignment started. Your teacher can see this too.
Notes for teachers
Use this PR to leave feedback. Here are some tips:
For more information about this pull request, read “Leaving assignment feedback in GitHub”.
Subscribed: @jin-jae @LHANTAEK @simigami @ssunbear @doraemon500