BERT vs ModernBERT Comparison

An experiment comparing the performance of BERT and ModernBERT on IMDB movie review sentiment analysis.

🎯 Project Overview

Dataset: IMDB movie reviews (positive/negative sentiment classification)
Models: bert-base-uncased vs answerdotai/ModernBERT-base
Experiment tracking: Weights & Biases (WandB)
Environment: Python 3.9, PyTorch, Transformers

📁 Project Structure

exp_1/
├── src/
│   ├── main.py          # Main experiment script
│   ├── model.py         # Classification model definition
│   ├── data.py          # Data loading and preprocessing (pre-tokenization)
│   └── utils.py         # Config loading utilities
├── configs/
│   └── configs.yaml     # Model, data, and training settings
├── wandb/              # WandB experiment logs (excluded from git)
└── README.md

Environment Setup

# Create a Python 3.9 virtual environment
python3.9 -m venv venv39
source venv39/bin/activate

# Install the required packages
pip install torch tqdm wandb datasets numpy omegaconf "transformers>=4.48.0"

Running the Experiment

cd exp_1
python src/main.py

📊 Experiment Settings

Model Settings

BERT: bert-base-uncased (hidden size 768)
ModernBERT: answerdotai/ModernBERT-base (hidden size 768)

Data Settings

Max length: 128 tokens
Batch size: 32
Split: Train (45,000) / Valid (2,500) / Test (2,500)
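
The three split sizes above sum to 50,000 examples. As a minimal sketch of one way such a split could be produced, the snippet below shuffles example indices with a fixed seed and cuts them into three parts. The function name `split_indices` and the shuffle-then-slice approach are assumptions for illustration; the project's `data.py` may split the data differently.

```python
import random
from typing import Dict, List


def split_indices(n_total: int, n_valid: int, n_test: int, seed: int = 42) -> Dict[str, List[int]]:
    """Shuffle indices 0..n_total-1 once and cut them into train/valid/test."""
    if n_valid + n_test > n_total:
        raise ValueError("validation and test sizes exceed the dataset size")
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # local RNG, so global random state is untouched
    n_train = n_total - n_valid - n_test
    return {
        "train": indices[:n_train],
        "valid": indices[n_train:n_train + n_valid],
        "test": indices[n_train + n_valid:],
    }


# Hypothetical usage with the sizes listed in the data settings.
splits = split_indices(n_total=50_000, n_valid=2_500, n_test=2_500)
print({name: len(idx) for name, idx in splits.items()})
```

Using a local `random.Random(seed)` keeps the split reproducible regardless of what else touches the global random state.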

Training Settings

Epochs: 5
Learning rate: 5e-5
Optimizer: Adam
Loss function: CrossEntropyLoss

🎯 Key Implementation Details

1. Pre-tokenized Dataset

def __getitem__(self, idx: int) -> dict:
    # Assumes self.dataset maps each field name to a tensor of pre-tokenized rows.
    return {key: self.dataset[key][idx] for key in self.dataset}

2. Concise Model Definition

def forward(self, input_ids, attention_mask, label, token_type_ids=None):
    # token_type_ids is accepted but not forwarded to the encoder
    outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
    hidden = outputs.last_hidden_state
    pooled = hidden[:, 0, :]  # use the CLS token representation
    logits = self.classifier(pooled)
    loss = self.loss_fn(logits, label)
    return {'logits': logits, 'loss': loss}

3. Efficient Batching

@staticmethod
def collate_fn(batch: List[dict]) -> dict:
    keys = batch[0].keys()
    return {key: torch.stack([sample[key] for sample in batch]) for key in keys}

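📈 Results

Model	Accuracy
ModernBERT	0.90902
BERT	0.89241

ModernBERT outperforms BERT by about 1.7 percentage points in accuracy.

ModernBERT has a parameter count similar to BERT's, but its finer design details, such as the tokenizer and other architectural components, have been optimized.

(Source: https://jina.ai/ko/news/what-should-we-learn-from-modernbert/ )

This suggests that performance gains come not only from increasing the parameter count, as scaling laws describe, but also from optimizing the architecture.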

Results - Batch Size and Model Comparison Using Gradient Accumulation

Method: Using gradient accumulation with a base batch size of 64, the number of accumulation steps was set to 1, 4, and 16, giving effective batch sizes of 64, 256, and 1024.
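
To make the accumulation arithmetic concrete, here is a minimal pure-Python sketch, not the project's accumulator. It uses a toy one-parameter regression (an assumption chosen so the example runs without PyTorch) to show that summing the gradients of 4 micro-batches of 64, each scaled by 1/4, matches the gradient of a single batch of 256. The names `mean_grad` and `accumulated_grad` are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair for a toy 1-D regression


def mean_grad(w: float, batch: List[Sample]) -> float:
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, data: List[Sample], micro_batch: int, accum_steps: int) -> float:
    """Sum micro-batch gradients, each scaled by 1/accum_steps, before one update."""
    total = 0.0
    for step in range(accum_steps):
        chunk = data[step * micro_batch:(step + 1) * micro_batch]
        total += mean_grad(w, chunk) / accum_steps  # scaling turns the sum into a mean
    return total


# Effective batch sizes used in this section: 64 * {1, 4, 16}.
for steps in (1, 4, 16):
    print(f"accum_steps={steps:>2} -> effective batch size {64 * steps}")

# Toy check: 4 micro-batches of 64 versus one batch of 256.
data = [(float(i % 7), float(i % 5)) for i in range(256)]
full = mean_grad(0.5, data)
accum = accumulated_grad(0.5, data, micro_batch=64, accum_steps=4)
print(abs(full - accum) < 1e-9)
```

In a typical PyTorch training loop the same idea is expressed by dividing the loss by the number of accumulation steps before `loss.backward()` and calling `optimizer.step()` and `optimizer.zero_grad()` only once every `accum_steps` micro-batches. Whether this project scales the loss that way is not stated here.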

Batch size	Model	Run time (s)	Accuracy
64	ModernBERT	591	0.90273
64	BERT	416	0.87539
256	ModernBERT	563	0.91523
256	BERT	393	0.89687
1024	ModernBERT	548	0.83984
1024	BERT	386	0.89297

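For the same batch size, training was faster when gradient accumulation was used.
As the batch size increased, the run time (s) decreased, whereas accuracy appeared to follow the order in which the runs were executed rather than the batch size.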

Additional Experiment 1: Runs executed in the reverse order (256 -> 64 -> 1024) of the original (1024 -> 64 -> 256)

Batch size	Model	Run time (s)	Accuracy
64	ModernBERT	617	0.91328
64	BERT	441	0.90469
256	ModernBERT	579	0.86016
256	BERT	416	0.90078
1024	ModernBERT	580	0.91602
1024	BERT	410	0.89180

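In both runs, the lowest-accuracy combination is the ModernBERT configuration that was executed first. (Why does this happen only with ModernBERT?)
BERT does not show the same pattern, so the hypothesis that execution order affects accuracy cannot be confirmed.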
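
The observation above can be checked by regrouping the reported accuracies by execution position instead of batch size. The sketch below only rearranges the numbers from the two tables; the helper name `by_position` is hypothetical, and the execution orders are the ones given in the heading of Additional Experiment 1.

```python
from typing import Dict, List

# Accuracies copied from the two tables above, keyed by model and batch size.
first_run = {
    "ModernBERT": {64: 0.90273, 256: 0.91523, 1024: 0.83984},
    "BERT": {64: 0.87539, 256: 0.89687, 1024: 0.89297},
}
second_run = {
    "ModernBERT": {64: 0.91328, 256: 0.86016, 1024: 0.91602},
    "BERT": {64: 0.90469, 256: 0.90078, 1024: 0.89180},
}
# Execution orders as stated in the heading of Additional Experiment 1.
orders = {"first": [1024, 64, 256], "second": [256, 64, 1024]}


def by_position(results: Dict[str, Dict[int, float]], order: List[int]) -> Dict[str, List[float]]:
    """Reorder each model's accuracies by the position at which the run executed."""
    return {model: [accs[size] for size in order] for model, accs in results.items()}


for name, results in (("first", first_run), ("second", second_run)):
    for model, accs in by_position(results, orders[name]).items():
        worst = accs.index(min(accs)) + 1  # 1-based execution position
        print(f"{name:>6} run | {model:<10} | {accs} | lowest at position {worst}")
```

With only two runs per model, this regrouping describes the pattern but cannot separate an execution-order effect from ordinary run-to-run variance.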

Additional Experiment 2: Learning Rate Adjustment

Motivation: With a batch size of 1024, the loss does not converge sufficiently.
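
One reason a large effective batch can under-train is that, for a fixed number of epochs, it gets far fewer optimizer updates. The sketch below counts updates per epoch for the 45,000 training examples listed in the data settings and applies two common learning-rate scaling heuristics (linear and square-root). These heuristics are assumptions for illustration, not the adjustment the project used, and treating 5e-5 as the reference learning rate at batch size 64 is also an assumption.

```python
import math


def scaled_lr(base_lr: float, base_batch: int, new_batch: int, rule: str = "linear") -> float:
    """Scale a learning rate for a new effective batch size using a common heuristic."""
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule!r}")


def updates_per_epoch(n_train: int, effective_batch: int) -> int:
    """Optimizer updates in one epoch, counting a final partial batch."""
    return math.ceil(n_train / effective_batch)


for batch in (64, 256, 1024):
    print(
        f"batch={batch:>4}  updates/epoch={updates_per_epoch(45_000, batch):>3}  "
        f"linear lr={scaled_lr(5e-5, 64, batch, 'linear'):.1e}  "
        f"sqrt lr={scaled_lr(5e-5, 64, batch, 'sqrt'):.1e}"
    )
```

Either rule is only a starting point; a scaled learning rate is usually validated with a short sweep and often paired with warmup.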

🧑🏼‍💻 Notes on Writing the Code

Use configs to adjust parameters efficiently.
Annotate input and output types on functions to prevent errors.
Pre-tokenize the data to improve speed.
Fix the seed for reproducibility (seed = 42).
Replace the wandb API key with your own key.
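
As a minimal sketch of the last two notes, the snippet below fixes the seed across whichever random number generators are installed and reads the WandB key from the `WANDB_API_KEY` environment variable instead of hard-coding it. The function names `set_seed` and `wandb_key_from_env` are hypothetical and are not taken from the project's `utils.py`.

```python
import os
import random


def set_seed(seed: int = 42) -> None:
    """Seed every random number generator that is available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


def wandb_key_from_env() -> str:
    """Read the WandB API key from the environment instead of hard-coding it."""
    key = os.environ.get("WANDB_API_KEY", "")
    if not key:
        raise RuntimeError("set the WANDB_API_KEY environment variable first")
    return key


# Re-seeding reproduces the same random draw.
set_seed(42)
first = random.random()
set_seed(42)
print(first == random.random())
```

Keeping the key in the environment also prevents it from being committed to the repository by accident.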

🛠️ Tech Stack

Python: 3.9+
PyTorch: 2.8.0
Transformers: 4.55.2 (ModernBERT support)
Datasets: 4.0.0 (IMDB data loading)
WandB: 0.21.1 (experiment tracking)
OmegaConf: 2.3.0 (configuration management)

