
Conversation


@github-classroom github-classroom bot commented Sep 30, 2024

👋! GitHub Classroom created this pull request as a place for your teacher to leave feedback on your work. It will update automatically. Don’t close or merge this pull request, unless you’re instructed to do so by your teacher.
In this pull request, your teacher can leave comments and feedback on your code. Click the Subscribe button to be notified if that happens.
Click the Files changed or Commits tab to see all of the changes pushed to the default branch since the assignment started. Your teacher can see this too.

Notes for teachers

Use this PR to leave feedback. Here are some tips:

  • Click the Files changed tab to see all of the changes pushed to the default branch since the assignment started. To leave comments on specific lines of code, put your cursor over a line of code and click the blue + (plus sign). To learn more about comments, read “Commenting on a pull request”.
  • Click the Commits tab to see the commits pushed to the default branch. Click a commit to see specific changes.
  • If you turned on autograding, then click the Checks tab to see the results.
  • This page is an overview. It shows commits, line comments, and general comments. You can leave a general comment below.
    For more information about this pull request, read “Leaving assignment feedback in GitHub”.

Subscribed: @jin-jae @LHANTAEK @simigami @ssunbear @doraemon500

jin-jae and others added 26 commits October 27, 2024 16:52
Add:: Role assignment

@gunny97 gunny97 left a comment


optimize_retriever.py seems to be identical to main.py, so it would be better to delete the unnecessary file.


@gunny97 gunny97 left a comment


The code works correctly overall, but a few improvements would help maintainability and readability. Removing unnecessary comments and deleting unused commented-out code would make the code clearer. Also, modularizing the source files and organizing them into folders (e.g., modeling, pre-processing, retrieval) would improve the project structure and make the code easier to manage. The code shows a deep understanding of the problem, and applying these improvements will make it even more efficient and easier to maintain.

from mecab import MeCab
from sklearn.feature_extraction.text import TfidfVectorizer

data_path = "/data/ephemeral/home/level2-mrc-nlp-15/data/train_dataset"

It would be better to write this as a relative path. (This applies not only here but to the other places as well.)
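For example, a rough sketch using pathlib (the directory names are taken from the absolute path above; the exact location of this file within the project is an assumption):

ex)
from pathlib import Path

# resolve the dataset directory relative to this source file instead of an absolute path
project_root = Path(__file__).resolve().parent
data_path = project_root / "data" / "train_dataset"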


if val > 0: cnt += 1

print(len(top_tokens_per_doc))

Rather than printing just the raw number, it would be good to include a descriptive message as well.

ex)
total_docs = len(top_tokens_per_doc)
matching_docs = cnt
print(f"Total number of documents: {total_docs}")
print(f"Documents matching the question keywords: {matching_docs}")
print(f"Matching ratio: {matching_docs / total_docs * 100:.2f}%")

self.num_labels = config.num_labels
self.roberta = RobertaModel(config, add_pooling_layer=False)

self.cnn_block1 = CNN_block(config.hidden_size, config.hidden_size)

Duplicated code can hurt readability, so it would be good to use nn.ModuleList here.

self.cnn_blocks = nn.ModuleList([
    CNN_block(config.hidden_size, config.hidden_size) for _ in range(5)
])

sequence_output = outputs[0]

# Apply CNN layers
sequence_output = self.cnn_block1(sequence_output)

If you use nn.ModuleList as above, the forward pass here can also be rewritten as follows.

for cnn_block in self.cnn_blocks:
    sequence_output = cnn_block(sequence_output)

},
)
#################################################################################
batch_size: int = field(

For batch_size and num_epochs as well, adding a help entry in metadata, consistent with the args above, would look cleaner.


In addition, batch_size and num_epochs are more closely related to training than to ModelArgs, so it would be more natural to create a new TrainArgs class and define them there, as sketched below.
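A minimal sketch of such a class (the defaults and help strings here are illustrative, not the project's actual values):

ex)
from dataclasses import dataclass, field

@dataclass
class TrainArgs:
    batch_size: int = field(
        default=16,
        metadata={"help": "Batch size used for training and evaluation."},
    )
    num_epochs: int = field(
        default=3,
        metadata={"help": "Number of training epochs."},
    )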



for i in range(len(test_ids)):
    id = test_ids[i]

id is a built-in Python function, so reusing it as a variable name risks conflicts later; I recommend declaring a different name.
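For example (the name test_id is only a suggestion):

ex)
for i in range(len(test_ids)):
    test_id = test_ids[i]  # avoids shadowing the built-in id()

or, more idiomatically, iterate directly: for test_id in test_ids: ...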

    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

class DenseRetrieval:
    def __init__(

Looking at the __init__ function, the model path is hard-coded, so switching to a different model requires editing the code directly. It would be more useful to accept the model path as a parameter in __init__.

ex)
def __init__(
    self,
    data_path: Optional[str] = "../data/",
    context_path: Optional[str] = "wikipedia_documents.json",
    model_name: str = 'sentence-transformers/paraphrase-multilingual-mpnet-base-v2',
    corpus: Optional[pd.DataFrame] = None,
) -> NoReturn:
    # ...
    self.model_name = model_name
    self.tokenize_fn = AutoTokenizer.from_pretrained(self.model_name)
    self.dense_embeder = AutoModel.from_pretrained(self.model_name)

def get_dense_embedding(self, question=None, contexts=None):
    if contexts is not None:
        self.contexts = contexts
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In general, it is better to define the device once in __init__ and then use self.device wherever it is needed inside the class.
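A minimal sketch of the pattern (the attribute names follow the snippets above; the encoded_input handling is only illustrative):

ex)
# in __init__
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.dense_embeder = AutoModel.from_pretrained(self.model_name).to(self.device)

# in get_dense_embedding (and any other method that needs a device)
encoded_input = {k: v.to(self.device) for k, v in encoded_input.items()}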

with torch.no_grad():
    model_output = self.dense_embeder(**encoded_input)
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    self.dense_embeds.append(sentence_embeddings.cpu())

Doing it this way can cause problems as the dataset grows. It would be better to define self.dense_embeds outside the for loop, e.g. self.dense_embeds = torch.zeros(len(self.contexts), self.dense_embeder.config.hidden_size), and then write each embedding into it by index, as sketched below.
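A rough sketch of that suggestion (the batch_size variable and the slicing loop are illustrative; the actual batching in the code may differ):

ex)
self.dense_embeds = torch.zeros(len(self.contexts), self.dense_embeder.config.hidden_size)

for start in range(0, len(self.contexts), batch_size):
    end = start + batch_size
    encoded_input = self.tokenize_fn(self.contexts[start:end], padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = self.dense_embeder(**encoded_input)
        sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    # write each batch into the preallocated tensor by index instead of appending to a list
    self.dense_embeds[start:end] = sentence_embeddings.cpu()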
