<!DOCTYPE html>
<!--[if lt IE 9 ]><html class="no-js oldie" lang="zh-hant-tw"> <![endif]-->
<!--[if IE 9 ]><html class="no-js oldie ie9" lang="zh-hant-tw"> <![endif]-->
<!--[if (gte IE 9)|!(IE)]><!-->
<html class="no-js" lang="zh-hant-tw">
<!--<![endif]-->
<head>
<!--- basic page needs
================================================== -->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="author" content="Lee Meng" />
<title>LeeMeng - 進擊的 BERT:NLP 界的巨人之力與遷移學習</title>
<!--- article-specific meta data
================================================== -->
<meta name="description" content="這篇是給所有人的 BERT 科普文以及操作入門手冊。文中將簡單介紹知名的語言代表模型 BERT 以及如何用其實現兩階段的遷移學習。讀者將有機會透過 PyTorch 的程式碼來直觀理解 BERT 的運作方式並實際 fine tune 一個真實存在的假新聞分類任務。閱讀完本文的讀者將能把 BERT 與遷移學習運用到其他自己感興趣的 NLP 任務。" />
<meta name="keywords" content="自然語言處理, NLP, PyTorch" />
<meta name="tags" content="自然語言處理" />
<meta name="tags" content="NLP" />
<meta name="tags" content="PyTorch" />
<!--- Open Graph Object metas
================================================== -->
<meta property="og:image" content="https://leemeng.tw/theme/images/background/attack_on_bert.jpg" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html" />
<meta property="og:title" content="進擊的 BERT:NLP 界的巨人之力與遷移學習" />
<meta property="og:description" content="這篇是給所有人的 BERT 科普文以及操作入門手冊。文中將簡單介紹知名的語言代表模型 BERT 以及如何用其實現兩階段的遷移學習。讀者將有機會透過 PyTorch 的程式碼來直觀理解 BERT 的運作方式並實際 fine tune 一個真實存在的假新聞分類任務。閱讀完本文的讀者將能把 BERT 與遷移學習運用到其他自己感興趣的 NLP 任務。" />
<!-- mobile specific metas
================================================== -->
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- CSS
================================================== -->
<!--for customized css in individual page-->
<link rel="stylesheet" type="text/css" href="https://leemeng.tw/theme/css/bootstrap.min.css">
<!--for showing toc navigation which slide in from left-->
<link rel="stylesheet" type="text/css" href="https://leemeng.tw/theme/css/toc-nav.css">
<!--for responsive embed youtube video-->
<link rel="stylesheet" type="text/css" href="https://leemeng.tw/theme/css/embed_youtube.css">
<!--for prettify dark-mode result-->
<link rel="stylesheet" type="text/css" href="https://leemeng.tw/theme/css/darkmode.css">
<link rel="stylesheet" type="text/css" href="https://leemeng.tw/theme/css/base.css">
<link rel="stylesheet" type="text/css" href="https://leemeng.tw/theme/css/vendor.css">
<link rel="stylesheet" type="text/css" href="https://leemeng.tw/theme/css/main.css">
<link rel="stylesheet" type="text/css" href="https://leemeng.tw/theme/css/ipython.css">
<link rel="stylesheet" type="text/css" href='https://leemeng.tw/theme/css/progress-bar.css' />
<!--TiqueSearch-->
<link href="https://fonts.googleapis.com/css?family=Roboto:100,300,400">
<link rel="stylesheet" href="https://leemeng.tw/theme/tipuesearch/css/normalize.css">
<link rel="stylesheet" href="https://leemeng.tw/theme/tipuesearch/css/tipuesearch.css">
<!-- script
================================================== -->
<script src="https://leemeng.tw/theme/js/modernizr.js"></script>
<script src="https://leemeng.tw/theme/js/pace.min.js"></script>
<!-- favicons
================================================== -->
<link rel="shortcut icon" href="../theme/images/favicon.ico" type="image/x-icon"/>
<link rel="icon" href="../theme/images/favicon.ico" type="image/x-icon"/>
<!-- Global Site Tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-106559980-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments)};
gtag('js', new Date());
gtag('config', 'UA-106559980-1');
</script>
</head>
<body id="top">
<!-- header
================================================== -->
<header class="s-header">
<div class="header-logo">
<a class="site-logo" href="../index.html"><img src="https://leemeng.tw/theme/images/logo.png" alt="Homepage"></a>
</div>
<!--navigation bar ref: http://jinja.pocoo.org/docs/2.10/tricks/-->
<nav class="header-nav-wrap">
<ul class="header-nav">
<li>
<a href="../index.html#home">Home</a>
</li>
<li>
<a href="../index.html#about">About</a>
</li>
<li>
<a href="../index.html#projects">Projects</a>
</li>
<li class="current">
<a href="../blog.html">Blog</a>
</li>
<li>
<a href="https://demo.leemeng.tw">Demo</a>
</li>
<li>
<a href="../books.html">Books</a>
</li>
<li>
<a href="../index.html#contact">Contact</a>
</li>
</ul>
<!--<div class="search-container">-->
<!--<form action="../search.html">-->
<!--<input type="text" placeholder="Search.." name="search">-->
<!--<button type="submit"><i class="im im-magnifier" aria-hidden="true"></i></button>-->
<!--</form>-->
<!--</div>-->
</nav>
<a class="header-menu-toggle" href="#0"><span>Menu</span></a>
</header> <!-- end s-header -->
<!--TOC navigation displayed when clicked from left-navigation button-->
<div id="tocNav" class="overlay" onclick="closeTocNav()">
<div class="overlay-content">
<div id="toc"><ul><li><a class="toc-href" href="#" title="進擊的 BERT:NLP 界的巨人之力與遷移學習">進擊的 BERT:NLP 界的巨人之力與遷移學習</a><ul><li><a class="toc-href" href="#BERT:理解上下文的語言代表模型" title="BERT:理解上下文的語言代表模型">BERT:理解上下文的語言代表模型</a></li><li><a class="toc-href" href="#用-BERT-fine-tune-下游任務" title="用 BERT fine tune 下游任務">用 BERT fine tune 下游任務</a><ul><li><a class="toc-href" href="#1.-準備原始文本數據" title="1. 準備原始文本數據">1. 準備原始文本數據</a></li><li><a class="toc-href" href="#2.-將原始文本轉換成-BERT-相容的輸入格式" title="2. 將原始文本轉換成 BERT 相容的輸入格式">2. 將原始文本轉換成 BERT 相容的輸入格式</a></li><li><a class="toc-href" href="#3.-在-BERT-之上加入新-layer-成下游任務模型" title="3. 在 BERT 之上加入新 layer 成下游任務模型">3. 在 BERT 之上加入新 layer 成下游任務模型</a></li><li><a class="toc-href" href="#4.-訓練該下游任務模型" title="4. 訓練該下游任務模型">4. 訓練該下游任務模型</a></li><li><a class="toc-href" href="#5.-對新樣本做推論" title="5. 對新樣本做推論">5. 對新樣本做推論</a></li></ul></li><li><a class="toc-href" href="#結語_1" title="結語">結語</a></li></ul></li></ul></div>
</div>
</div>
<!--custom images with icon shown on left nav-->
<!--the details are set in `pelicanconf.py` as `LEFT_NAV_IMAGES`-->
<article class="blog-single">
<!-- page header/blog hero, use custom cover image if available
================================================== -->
<div class="page-header page-header--single page-hero" style="background-image:url(https://leemeng.tw/theme/images/background/attack_on_bert.jpg)">
<div class="row page-header__content narrow">
<article class="col-full">
<div class="page-header__info">
<div class="page-header__cat">
<a href="https://leemeng.tw/tag/zi-ran-yu-yan-chu-li.html" rel="tag">自然語言處理</a>
<a href="https://leemeng.tw/tag/nlp.html" rel="tag">NLP</a>
<a href="https://leemeng.tw/tag/pytorch.html" rel="tag">PyTorch</a>
</div>
</div>
<h1 class="page-header__title">
<a href="https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html" title="">
進擊的 BERT:NLP 界的巨人之力與遷移學習
</a>
</h1>
<ul class="page-header__meta">
<li class="date">2019-07-10 (Wed)</li>
<li class="page-view">
205,512 views
</li>
</ul>
</article>
</div>
</div> <!-- end page-header -->
<div class="KW_progressContainer">
<div class="KW_progressBar"></div>
</div>
<div class="row blog-content" style="position: relative">
<div id="left-navigation">
<div id="search-wrap">
<i class="im im-magnifier" aria-hidden="true"></i>
<div id="search">
<form action="../search.html">
<div class="tipue_search_right"><input type="text" name="q" id="tipue_search_input" pattern=".{2,}" title="想搜尋什麼呢?(請至少輸入兩個字)" required></div>
</form>
</div>
</div>
<div id="toc-wrap">
<a title="顯示/隱藏 文章章節">
<i class="im im-menu" aria-hidden="true" onclick="toggleTocNav()"></i>
</a>
</div>
<div id="social-wrap" style="cursor: pointer">
<a class="open-popup" title="訂閱最新文章">
<i class="im im-newspaper-o" aria-hidden="true"></i>
</a>
</div>
<div id="social-wrap">
<a href="https://www.facebook.com/sharer/sharer.php?u=https%3A//leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html" target="_blank" title="分享到 Facebook">
<i class="im im-facebook" aria-hidden="true"></i>
</a>
</div>
<div id="social-wrap">
<a href="https://www.linkedin.com/shareArticle?mini=true&url=https%3A//leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html&title=%E9%80%B2%E6%93%8A%E7%9A%84%20BERT%EF%BC%9ANLP%20%E7%95%8C%E7%9A%84%E5%B7%A8%E4%BA%BA%E4%B9%8B%E5%8A%9B%E8%88%87%E9%81%B7%E7%A7%BB%E5%AD%B8%E7%BF%92&summary=%E9%80%99%E7%AF%87%E6%98%AF%E7%B5%A6%E6%89%80%E6%9C%89%E4%BA%BA%E7%9A%84%20BERT%20%E7%A7%91%E6%99%AE%E6%96%87%E4%BB%A5%E5%8F%8A%E6%93%8D%E4%BD%9C%E5%85%A5%E9%96%80%E6%89%8B%E5%86%8A%E3%80%82%E6%96%87%E4%B8%AD%E5%B0%87%E7%B0%A1%E5%96%AE%E4%BB%8B%E7%B4%B9%E7%9F%A5%E5%90%8D%E7%9A%84%E8%AA%9E%E8%A8%80%E4%BB%A3%E8%A1%A8%E6%A8%A1%E5%9E%8B%20BERT%20%E4%BB%A5%E5%8F%8A%E5%A6%82%E4%BD%95%E7%94%A8%E5%85%B6%E5%AF%A6%E7%8F%BE%E5%85%A9%E9%9A%8E%E6%AE%B5%E7%9A%84%E9%81%B7%E7%A7%BB%E5%AD%B8%E7%BF%92%E3%80%82%E8%AE%80%E8%80%85%E5%B0%87%E6%9C%89%E6%A9%9F%E6%9C%83%E9%80%8F%E9%81%8E%20PyTorch%20%E7%9A%84%E7%A8%8B%E5%BC%8F%E7%A2%BC%E4%BE%86%E7%9B%B4%E8%A7%80%E7%90%86%E8%A7%A3%20BERT%20%E7%9A%84%E9%81%8B%E4%BD%9C%E6%96%B9%E5%BC%8F%E4%B8%A6%E5%AF%A6%E9%9A%9B%20fine%20tune%20%E4%B8%80%E5%80%8B%E7%9C%9F%E5%AF%A6%E5%AD%98%E5%9C%A8%E7%9A%84%E5%81%87%E6%96%B0%E8%81%9E%E5%88%86%E9%A1%9E%E4%BB%BB%E5%8B%99%E3%80%82%E9%96%B1%E8%AE%80%E5%AE%8C%E6%9C%AC%E6%96%87%E7%9A%84%E8%AE%80%E8%80%85%E5%B0%87%E8%83%BD%E6%8A%8A%20BERT%20%E8%88%87%E9%81%B7%E7%A7%BB%E5%AD%B8%E7%BF%92%E9%81%8B%E7%94%A8%E5%88%B0%E5%85%B6%E4%BB%96%E8%87%AA%E5%B7%B1%E6%84%9F%E8%88%88%E8%B6%A3%E7%9A%84%20NLP%20%E4%BB%BB%E5%8B%99%E3%80%82&source=https%3A//leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html" target="_blank" title="分享到 LinkedIn">
<i class="im im-linkedin" aria-hidden="true"></i>
</a>
</div>
<div id="social-wrap">
<a href="https://twitter.com/intent/tweet?text=%E9%80%B2%E6%93%8A%E7%9A%84%20BERT%EF%BC%9ANLP%20%E7%95%8C%E7%9A%84%E5%B7%A8%E4%BA%BA%E4%B9%8B%E5%8A%9B%E8%88%87%E9%81%B7%E7%A7%BB%E5%AD%B8%E7%BF%92&url=https%3A//leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html&hashtags=zi-ran-yu-yan-chu-li,nlp,pytorch" target="_blank" title="分享到 Twitter">
<i class="im im-twitter" aria-hidden="true"></i>
</a>
</div>
<!--custom images with icon shown on left nav-->
</div>
<div class="col-full blog-content__main">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<blockquote>
<p>
這是一篇 BERT 科普文,帶你直觀理解並實際運用現在 NLP 領域的巨人之力。
<br/>
<br/>
</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>如果你還有印象,在<a href="https://leemeng.tw/shortest-path-to-the-nlp-world-a-gentle-guide-of-natural-language-processing-and-deep-learning-for-everyone.html">自然語言處理(NLP)與深度學習入門指南</a>裡我使用了 LSTM 以及 Google 的語言代表模型 <a href="https://github.com/google-research/bert">BERT</a> 來分類中文假新聞。而最後因為 BERT 本身的強大,我不費吹灰之力就在<a href="https://www.kaggle.com/c/fake-news-pair-classification-challenge/leaderboard">該 Kaggle 競賽</a>達到 85 % 的正確率,距離第一名 3 %,總排名前 30 %。</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<img src="https://leemeng.tw/images/nlp-kaggle-intro/kaggle-final-result.png"/>
<br/>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>當初我是使用 <a href="https://github.com/google-research/bert">TensorFlow 官方釋出的 BERT</a> 進行 fine tuning,但使用方式並不是那麼直覺。最近適逢 <a href="https://pytorch.org/hub">PyTorch Hub</a> 上架 <a href="https://pytorch.org/hub/huggingface_pytorch-pretrained-bert_bert/">BERT</a>,李宏毅教授的<a href="http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML19.html">機器學習課程</a>也推出了 <a href="https://www.youtube.com/watch?v=UYPa347-DdE">BERT 的教學影片</a>,我認為現在正是你了解並<strong>實際運用</strong> BERT 的最佳時機!</p>
<p>這篇文章會簡單介紹 BERT 並展示如何使用 BERT 做<a href="https://docs.google.com/presentation/d/1DJI1yX4U5IgApGwavt0AmOCLWwso7ou1Un93sMuAWmA/edit?usp=sharing">遷移學習(Transfer Learning)</a>。我在文末也會提供一些有趣的研究及應用 ,讓你可以進一步探索變化快速的 NLP 世界。</p>
<p>如果你完全不熟 NLP 或是壓根子沒聽過什麼是 BERT,我強力建議你之後找時間(或是現在!)觀看李宏毅教授說明 <a href="https://allennlp.org/elmo">ELMo</a>、BERT 以及 <a href="https://github.com/openai/gpt-2">GPT</a> 等模型的影片,淺顯易懂:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<div class="resp-container">
<iframe allow="accelerometer;
autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" class="resp-iframe" frameborder="0" src="https://www.youtube-nocookie.com/embed/UYPa347-DdE">
</iframe>
</div>
<center>
李宏毅教授講解目前 NLP 領域的最新研究是如何讓機器讀懂文字的(我超愛這截圖)
<br/>
<br/>
</center>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>我接下來會花點篇幅闡述 BERT 的基礎概念。如果你已經十分熟悉 BERT 而且迫不及待想要馬上將 BERT 應用到自己的 NLP 任務上面,可以直接跳到<a href="#用-BERT-fine-tune-下游任務">用 BERT fine tune 下游任務</a>一節。</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="BERT:理解上下文的語言代表模型">BERT:理解上下文的語言代表模型<a class="anchor-link" href="#BERT:理解上下文的語言代表模型">¶</a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>一個簡單的 convention,等等文中會穿插使用的:</p>
<ul>
<li>代表</li>
<li>representation</li>
<li>repr.</li>
<li>repr. 向量</li>
</ul>
<p>指的都是一個可以用來<strong>代表</strong>某詞彙(在某個語境下)的多維連續向量(continuous vector)。</p>
<p>現在在 NLP 圈混的,應該沒有人會說自己不曉得 Transformer 的<a href="https://arxiv.org/abs/1706.03762">經典論文 Attention Is All You Need</a> 以及其知名的<a href="https://leemeng.tw/neural-machine-translation-with-transformer-and-tensorflow2.html#Encoder-Decoder-%E6%A8%A1%E5%9E%8B-+-%E6%B3%A8%E6%84%8F%E5%8A%9B%E6%A9%9F%E5%88%B6">自注意力機制(Self-attention mechanism)</a>。<a href="https://arxiv.org/abs/1810.04805">BERT</a> 全名為 <strong>B</strong>idirectional <strong>E</strong>ncoder <strong>R</strong>epresentations from <strong>T</strong>ransformers,是 Google 以無監督的方式利用大量無標註文本「煉成」的<strong>語言代表模型</strong>,其架構為 Transformer 中的 Encoder。</p>
<p>我在<a href="https://leemeng.tw/neural-machine-translation-with-transformer-and-tensorflow2.html">淺談神經機器翻譯 & 用 Transformer 英翻中</a>一文已經鉅細靡遺地解說過所有 Transformer 的相關概念,這邊就不再贅述。</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<center>
<img src="https://leemeng.tw/images/bert/bert-intro.jpg"/>
</center>
<center>
BERT 其實就是 Transformer 中的 Encoder,只是有很多層
(<a href="https://youtu.be/UYPa347-DdE?list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4" target="_blank">圖片來源</a>)
<br/>
<br/>
</center>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>BERT 是傳統語言模型的一種變形,而<a href="https://youtu.be/iWea12EAu6U">語言模型(<strong>L</strong>anguage <strong>M</strong>odel, LM)</a>做的事情就是在給定一些詞彙的前提下, 去估計下一個詞彙出現的機率分佈。在<a href="https://leemeng.tw/how-to-generate-interesting-text-with-tensorflow2-and-tensorflow-js.html">讓 AI 給我們寫點金庸</a>裡的 LSTM 也是一個語言模型 ,只是跟 BERT 差了很多個數量級。</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<center>
<img src="https://leemeng.tw/images/bert/lm-equation.jpg" style="mix-blend-mode: initial;"/>
</center>
<center>
給定前 t 個在字典裡的詞彙,語言模型要去估計第 t + 1 個詞彙的機率分佈 P
<br/>
<br/>
</center>
</div>
</div>
</div>
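<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>為了讓上面的機率分佈更具體一點,底下附上一個極簡的示意程式碼:用一個<strong>隨機初始化</strong>的 LSTM 權充語言模型,示範「給定前 t 個詞彙,輸出下一個詞彙在整個字典上的機率分佈」這件事。注意這純粹是幫助理解的玩具範例,字典大小與維度都是假設的數值,並非 BERT 或文中提到的金庸模型的實際實作:</p>
<div class="highlight"><pre><span></span>import torch
import torch.nn as nn

vocab_size, hidden_size = 21128, 768            # 假設性的數值,僅供示意
context = torch.randint(0, vocab_size, (1, 5))  # 模擬前 t = 5 個詞彙的索引

embeddings = nn.Embedding(vocab_size, hidden_size)
lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
to_vocab = nn.Linear(hidden_size, vocab_size)

hidden_states, _ = lstm(embeddings(context))  # (1, 5, hidden_size)
logits = to_vocab(hidden_states[:, -1, :])    # 只取最後一個位置的隱狀態
probs = torch.softmax(logits, dim=-1)         # P(第 t + 1 個詞彙 | 前 t 個詞彙)
print(probs.shape)                            # torch.Size([1, 21128]),總和為 1 的機率分佈
</pre></div>
</div>
</div>
</div>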
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>為何會想要訓練一個 LM?因為有種種好處:</p>
<ul>
<li>好處 1:無監督數據無限大。不像 <a href="http://www.image-net.org/">ImageNet</a> 還要找人標注數據,要訓練 LM 的話網路上所有文本都是你潛在的資料集(BERT 預訓練使用的數據集共有 33 <strong>億</strong>個字,其中包含維基百科及 <a href="https://arxiv.org/abs/1506.06724">BooksCorpus</a>)</li>
<li>好處 2:厲害的 LM 能夠學會語法結構、解讀語義甚至<a href="http://ckip.iis.sinica.edu.tw/project/coreference/">指代消解</a>。透過特徵擷取或是 fine-tuning 能更有效率地訓練下游任務並提升其表現</li>
<li>好處 3:減少處理不同 NLP 任務所需的 architecture engineering 成本</li>
</ul>
<p>一般人很容易理解前兩點的好處,但事實上第三點的影響也十分深遠。以往為了解決不同的 NLP 任務,我們會為該任務設計一個最適合的神經網路架構並做訓練。以下是一些簡單例子:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<center>
<img src="https://leemeng.tw/images/bert/model_architecture_nlp_tasks.jpg" style="mix-blend-mode: initial;"/>
</center>
<center>
一般會依照不同 NLP 任務的性質為其貼身打造特定的模型架構
<br/>
<br/>
</center>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>在這篇文章裡頭我不會一一介紹上述模型的運作原理,在這邊只是想讓你了解不同的 NLP 任務通常需要不同的模型,而設計這些模型並測試其 performance 是非常耗費成本的(人力、時間、計算資源)。</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<blockquote>
<p>
如果有一個能直接處理各式 NLP 任務的通用架構該有多好?
<br/>
<br/>
</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>隨著時代演進,不少人很自然地有了這樣子的想法,而 BERT 就是其中一個將此概念付諸實踐的例子。<a href="https://arxiv.org/pdf/1810.04805.pdf">BERT 論文</a>的作者們使用 Transfomer Encoder、大量文本以及兩個預訓練目標,事先訓練好一個可以套用到多個 NLP 任務的 BERT 模型,再以此為基礎 fine tune 多個下游任務。</p>
<p>這就是近來 NLP 領域非常流行的<strong>兩階段</strong>遷移學習:</p>
<ul>
<li>先以 LM Pretraining 的方式預先訓練出一個對自然語言有一定「理解」的通用模型</li>
<li>再將該模型拿來做特徵擷取或是 fine tune 下游的(監督式)任務</li>
</ul>
</div>
</div>
</div>
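<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>「特徵擷取」與「fine-tuning」的差別可以用底下的極簡 sketch 來體會(僅為示意,這裡假設在 BERT 之上接一個簡單的線性分類器,並非本文稍後實際使用的程式碼):前者凍結 BERT 的參數、只訓練新加的 layer;後者則讓 BERT 的參數跟著下游任務一起更新。</p>
<div class="highlight"><pre><span></span>import torch.nn as nn
from transformers import BertModel

# 第一階段的成果:直接載入已經預訓練好的 BERT
bert = BertModel.from_pretrained("bert-base-chinese")
# 第二階段:在 BERT 之上加一個針對下游任務的新 layer(這裡假設是 3 類別的線性分類器)
classifier = nn.Linear(bert.config.hidden_size, 3)

# 作法 A(特徵擷取):凍結 BERT 的參數,只有新加的 classifier 會被訓練
for param in bert.parameters():
    param.requires_grad = False

# 作法 B(fine-tuning):不做凍結,把 bert 與 classifier 的參數一起交給 optimizer 更新
</pre></div>
</div>
</div>
</div>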
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<center>
<img src="https://leemeng.tw/images/bert/bert-2phase.jpg" style="mix-blend-mode: initial;"/>
</center>
<center>
兩階段遷移學習在 BERT 下的應用:使用預先訓練好的 BERT 對下游任務做 fine tuning
<br/>
<br/>
</center>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>上面這個示意圖最重要的概念是預訓練步驟跟 fine-tuning 步驟所用的 BERT 是<strong>一模一樣</strong>的。當你學會使用 BERT 就能用同個架構訓練多種 NLP 任務,大大減少自己設計模型的 architecture engineering 成本,投資報酬率高到爆炸。</p>
<p>壞消息是,天下沒有白吃的午餐。</p>
<p>要訓練好一個有 1.1 億參數的 12 層 <strong>BERT-BASE</strong> 得用 16 個 <a href="https://cloudplatform.googleblog.com/2018/06/Cloud-TPU-now-offers-preemptible-pricing-and-global-availability.html">TPU chips</a> 跑上整整 4 天,<a href="https://medium.com/syncedreview/the-staggering-cost-of-training-sota-ai-models-e329e80fa82">花費 500 鎂</a>;24 層的 <strong>BERT-LARGE</strong> 則有 3.4 億個參數,得用 64 個 TPU chips(約 7000 鎂)訓練。喔對,別忘了多次實驗得把這些成本乘上幾倍。<a href="https://twitter.com/arnicas/status/1147426600180494337?s=20">最近也有 NLP 研究者呼籲大家把訓練好的模型開源釋出</a>以減少重複訓練對環境造成的影響。</p>
<p>好消息是,BERT 作者們有開源釋出訓練好的模型,只要使用 <a href="https://github.com/google-research/bert">TensorFlow</a> 或是 <a href="https://github.com/huggingface/pytorch-pretrained-BERT">PyTorch</a> 將已訓練好的 BERT 載入,就能省去預訓練步驟的所有昂貴成本。好 BERT 不用嗎?</p>
<p>雖然一般來說我們只需要用訓練好的 BERT 做 fine-tuning,稍微瞭解預訓練步驟的內容能讓你直觀地理解它在做些什麼。</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<center>
<img src="https://leemeng.tw/images/bert/bert-pretrain-tasks.jpg" style="mix-blend-mode: initial;"/>
</center>
<center>
BERT 在預訓練時需要完成的兩個任務
(<a href="https://youtu.be/UYPa347-DdE?list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4" target="_blank">圖片來源</a>)
<br/>
<br/>
</center>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Google 在預訓練 BERT 時讓它<strong>同時</strong>進行兩個任務:</p>
<ul>
<li>克漏字填空(<a href="https://journals.sagepub.com/doi/abs/10.1177/107769905303000401">1953 年被提出的 Cloze task</a>,學術點的說法是 <strong>M</strong>asked <strong>L</strong>anguage <strong>M</strong>odel, MLM)</li>
<li>判斷第 2 個句子在原始文本中是否跟第 1 個句子相接(<strong>N</strong>ext <strong>S</strong>entence <strong>P</strong>rediction, NSP)</li>
</ul>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>對上通天文下知地理的鄉民們來說,要完成這兩個任務簡單到爆。只要稍微看一下<strong>前後文</strong>就能知道左邊克漏字任務的 <code>[MASK]</code> 裡頭該填 <code>退了</code>;而 <code>醒醒吧</code> 後面接 <code>你沒有妹妹</code> 也十分合情合理。</p>
<p>讓我們馬上載入 <a href="https://pytorch.org/hub">PyTorch Hub</a> 上的 <a href="https://pytorch.org/hub/huggingface_pytorch-pretrained-bert_bert/">BERT 模型</a>體驗看看。首先我們需要安裝一些簡單的函式庫:</p>
<p>(2019/10/07 更新:因應 HuggingFace 團隊最近將 GitHub 專案大翻新並更名成 <a href="https://github.com/huggingface/transformers">transformers</a>,本文已直接 <code>import</code> 該 repo 並使用新的方法調用 BERT。底下的程式碼將不再使用該團隊在 PyTorch Hub 上 host 的模型。感謝網友 Hsien 提醒)</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class="highlight hl-ipython3"><pre><span></span>%%bash
pip<span class="w"> </span>install<span class="w"> </span>transformers<span class="w"> </span>tqdm<span class="w"> </span>boto3<span class="w"> </span>requests<span class="w"> </span>regex<span class="w"> </span>-q
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>接著載入中文 BERT 使用的 tokenizer:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class="highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">BertTokenizer</span>
<span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">clear_output</span>
<span class="n">PRETRAINED_MODEL_NAME</span> <span class="o">=</span> <span class="s2">"bert-base-chinese"</span> <span class="c1"># 指定繁簡中文 BERT-BASE 預訓練模型</span>
<span class="c1"># 取得此預訓練模型所使用的 tokenizer</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">BertTokenizer</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">PRETRAINED_MODEL_NAME</span><span class="p">)</span>
<span class="n">clear_output</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"PyTorch 版本:"</span><span class="p">,</span> <span class="n">torch</span><span class="o">.</span><span class="n">__version__</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>PyTorch 版本: 1.4.0
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>為了讓你直觀了解 BERT 運作,本文使用包含繁體與簡體中文的預訓練模型。 你可以在 <a href="https://github.com/huggingface/transformers/blob/master/hubconf.py">Hugging Face 團隊的 repo </a> 裡看到所有可從 PyTorch Hub 載入的 BERT 預訓練模型。截至目前為止有以下模型可供使用:</p>
<ul>
<li>bert-base-chinese</li>
<li>bert-base-uncased</li>
<li>bert-base-cased</li>
<li>bert-base-german-cased</li>
<li>bert-base-multilingual-uncased</li>
<li>bert-base-multilingual-cased</li>
<li>bert-large-cased</li>
<li>bert-large-uncased</li>
<li>bert-large-uncased-whole-word-masking</li>
<li>bert-large-cased-whole-word-masking</li>
</ul>
<p>這些模型的參數都已經被訓練完成,而主要差別在於:</p>
<ul>
<li>預訓練步驟時用的文本語言</li>
<li>有無分大小寫</li>
<li>層數的不同</li>
<li>預訓練時遮住 wordpieces 或是整個 word</li>
</ul>
</div>
</div>
</div>
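<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>想確認某個預訓練模型的層數、隱狀態維度等資訊,其實不必把整個模型載下來,先讀取它的 config 就行了。底下是一個簡單的示意(實際數值以你載入的 config 為準):</p>
<div class="highlight"><pre><span></span>from transformers import BertConfig

config = BertConfig.from_pretrained("bert-base-chinese")
print("層數:", config.num_hidden_layers)           # BERT-BASE 為 12 層
print("隱狀態維度:", config.hidden_size)           # BERT-BASE 為 768
print("自注意力 heads 數:", config.num_attention_heads)
print("字典大小:", config.vocab_size)
</pre></div>
</div>
</div>
</div>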
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>除了本文使用的中文 BERT 以外,常被拿來應用與研究的是英文的 <code>bert-base-cased</code> 模型。</p>
<p>現在讓我們看看 tokenizer 裡頭的字典資訊:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class="highlight hl-ipython3"><pre><span></span><span class="n">vocab</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">vocab</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"字典大小:"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">vocab</span><span class="p">))</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>字典大小: 21128
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>如上所示,中文 BERT 的字典大小約有 2.1 萬個 tokens。沒記錯的話,英文 BERT 的字典則大約是 3 萬 tokens 左右。我們可以瞧瞧中文 BERT 字典裡頭紀錄的一些 tokens 以及其對應的索引:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class="highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">random</span>
<span class="n">random_tokens</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">vocab</span><span class="p">),</span> <span class="mi">10</span><span class="p">)</span>
<span class="n">random_ids</span> <span class="o">=</span> <span class="p">[</span><span class="n">vocab</span><span class="p">[</span><span class="n">t</span><span class="p">]</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">random_tokens</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"</span><span class="si">{0:20}{1:15}</span><span class="s2">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s2">"token"</span><span class="p">,</span> <span class="s2">"index"</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"-"</span> <span class="o">*</span> <span class="mi">25</span><span class="p">)</span>
<span class="k">for</span> <span class="n">t</span><span class="p">,</span> <span class="nb">id</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">random_tokens</span><span class="p">,</span> <span class="n">random_ids</span><span class="p">):</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"</span><span class="si">{0:15}{1:10}</span><span class="s2">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="nb">id</span><span class="p">))</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>token index
-------------------------
##荘 18834
##尉 15259
詬 6278
32gb 11155
荨 5787
##狙 17376
兹 1074
##诈 19457
蠣 6112
gp 13228
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>BERT 使用當初 <a href="https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html">Google NMT</a> 提出的 <a href="https://arxiv.org/abs/1609.08144">WordPiece Tokenization</a> ,將本來的 words 拆成更小粒度的 wordpieces,有效處理<a href="https://en.wiktionary.org/wiki/OOV">不在字典裡頭的詞彙</a> 。中文的話大致上就像是 character-level tokenization,而有 <code>##</code> 前綴的 tokens 即為 wordpieces。</p>
<p>以詞彙 <code>fragment</code> 來說,其可以被拆成 <code>frag</code> 與 <code>##ment</code> 兩個 pieces,而一個 word 也可以獨自形成一個 wordpiece。wordpieces 可以由蒐集大量文本並找出其中常見的 pattern 取得。</p>
<p>另外有趣的是ㄅㄆㄇㄈ也有被收錄:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class="highlight hl-ipython3"><pre><span></span><span class="n">indices</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">647</span><span class="p">,</span> <span class="mi">657</span><span class="p">))</span>
<span class="n">some_pairs</span> <span class="o">=</span> <span class="p">[(</span><span class="n">t</span><span class="p">,</span> <span class="n">idx</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span><span class="p">,</span> <span class="n">idx</span> <span class="ow">in</span> <span class="n">vocab</span><span class="o">.</span><span class="n">items</span><span class="p">()</span> <span class="k">if</span> <span class="n">idx</span> <span class="ow">in</span> <span class="n">indices</span><span class="p">]</span>
<span class="k">for</span> <span class="n">pair</span> <span class="ow">in</span> <span class="n">some_pairs</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="n">pair</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>('ㄅ', 647)
('ㄆ', 648)
('ㄇ', 649)
('ㄉ', 650)
('ㄋ', 651)
('ㄌ', 652)
('ㄍ', 653)
('ㄎ', 654)
('ㄏ', 655)
('ㄒ', 656)
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>讓我們利用中文 BERT 的 tokenizer 將一個中文句子斷詞看看:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class="highlight hl-ipython3"><pre><span></span><span class="n">text</span> <span class="o">=</span> <span class="s2">"[CLS] 等到潮水 [MASK] 了,就知道誰沒穿褲子。"</span>
<span class="n">tokens</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">tokenize</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
<span class="n">ids</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">convert_tokens_to_ids</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">tokens</span><span class="p">[:</span><span class="mi">10</span><span class="p">],</span> <span class="s1">'...'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">ids</span><span class="p">[:</span><span class="mi">10</span><span class="p">],</span> <span class="s1">'...'</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>[CLS] 等到潮水 [MASK] 了,就知道誰沒穿褲子。
['[CLS]', '等', '到', '潮', '水', '[MASK]', '了', ',', '就', '知'] ...
[101, 5023, 1168, 4060, 3717, 103, 749, 8024, 2218, 4761] ...
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>除了一般的 wordpieces 以外,BERT 裡頭有 5 個特殊 tokens 各司其職:</p>
<ul>
<li><code>[CLS]</code>:在做分類任務時其最後一層的 repr. 會被視為整個輸入序列的 repr.</li>
<li><code>[SEP]</code>:有兩個句子的文本會被串接成一個輸入序列,並在兩句之間插入這個 token 以做區隔</li>
<li><code>[UNK]</code>:沒出現在 BERT 字典裡頭的字會被這個 token 取代</li>
<li><code>[PAD]</code>:zero padding 遮罩,將長度不一的輸入序列補齊方便做 batch 運算</li>
<li><code>[MASK]</code>:未知遮罩,僅在預訓練階段會用到</li>
</ul>
<p>如上例所示,<code>[CLS]</code> 一般會被放在輸入序列的最前面,而 zero padding 在之前的 <a href="https://leemeng.tw/neural-machine-translation-with-transformer-and-tensorflow2.html#%E7%9B%B4%E8%A7%80%E7%90%86%E8%A7%A3%E9%81%AE%E7%BD%A9%E5%9C%A8%E6%B3%A8%E6%84%8F%E5%87%BD%E5%BC%8F%E4%B8%AD%E7%9A%84%E6%95%88%E6%9E%9C">Transformer 文章裡已經有非常詳細的介紹</a>。<code>[MASK]</code> token 一般在 fine-tuning 或是 feature extraction 時不會用到,這邊只是為了展示預訓練階段的克漏字任務才使用的。</p>
<p>現在馬上讓我們看看給定上面有 <code>[MASK]</code> 的句子,BERT 會填入什麼字:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class="highlight hl-ipython3"><pre><span></span><span class="sd">"""</span>
<span class="sd">這段程式碼載入已經訓練好的 masked 語言模型並對有 [MASK] 的句子做預測</span>
<span class="sd">"""</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">BertForMaskedLM</span>
<span class="c1"># 除了 tokens 以外我們還需要辨別句子的 segment ids</span>
<span class="n">tokens_tensor</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="n">ids</span><span class="p">])</span> <span class="c1"># (1, seq_len)</span>
<span class="n">segments_tensors</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">tokens_tensor</span><span class="p">)</span> <span class="c1"># (1, seq_len)</span>
<span class="n">maskedLM_model</span> <span class="o">=</span> <span class="n">BertForMaskedLM</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">PRETRAINED_MODEL_NAME</span><span class="p">)</span>
<span class="n">clear_output</span><span class="p">()</span>
<span class="c1"># 使用 masked LM 估計 [MASK] 位置所代表的實際 token </span>
<span class="n">maskedLM_model</span><span class="o">.</span><span class="n">eval</span><span class="p">()</span>
<span class="k">with</span> <span class="n">torch</span><span class="o">.</span><span class="n">no_grad</span><span class="p">():</span>
<span class="n">outputs</span> <span class="o">=</span> <span class="n">maskedLM_model</span><span class="p">(</span><span class="n">tokens_tensor</span><span class="p">,</span> <span class="n">segments_tensors</span><span class="p">)</span>
<span class="n">predictions</span> <span class="o">=</span> <span class="n">outputs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="c1"># (1, seq_len, num_hidden_units)</span>
<span class="k">del</span> <span class="n">maskedLM_model</span>
<span class="c1"># 將 [MASK] 位置的機率分佈取 top k 最有可能的 tokens 出來</span>
<span class="n">masked_index</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">k</span> <span class="o">=</span> <span class="mi">3</span>
<span class="n">probs</span><span class="p">,</span> <span class="n">indices</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">topk</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">predictions</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="n">masked_index</span><span class="p">],</span> <span class="o">-</span><span class="mi">1</span><span class="p">),</span> <span class="n">k</span><span class="p">)</span>
<span class="n">predicted_tokens</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">convert_ids_to_tokens</span><span class="p">(</span><span class="n">indices</span><span class="o">.</span><span class="n">tolist</span><span class="p">())</span>
<span class="c1"># 顯示 top k 可能的字。一般我們就是取 top 1 當作預測值</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"輸入 tokens :"</span><span class="p">,</span> <span class="n">tokens</span><span class="p">[:</span><span class="mi">10</span><span class="p">],</span> <span class="s1">'...'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'-'</span> <span class="o">*</span> <span class="mi">50</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">p</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">predicted_tokens</span><span class="p">,</span> <span class="n">probs</span><span class="p">),</span> <span class="mi">1</span><span class="p">):</span>
<span class="n">tokens</span><span class="p">[</span><span class="n">masked_index</span><span class="p">]</span> <span class="o">=</span> <span class="n">t</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Top </span><span class="si">{}</span><span class="s2"> (</span><span class="si">{:2}</span><span class="s2">%):</span><span class="si">{}</span><span class="s2">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="n">p</span><span class="o">.</span><span class="n">item</span><span class="p">()</span> <span class="o">*</span> <span class="mi">100</span><span class="p">),</span> <span class="n">tokens</span><span class="p">[:</span><span class="mi">10</span><span class="p">]),</span> <span class="s1">'...'</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>輸入 tokens : ['[CLS]', '等', '到', '潮', '水', '[MASK]', '了', ',', '就', '知'] ...
--------------------------------------------------
Top 1 (82%):['[CLS]', '等', '到', '潮', '水', '來', '了', ',', '就', '知'] ...
Top 2 (11%):['[CLS]', '等', '到', '潮', '水', '濕', '了', ',', '就', '知'] ...
Top 3 ( 2%):['[CLS]', '等', '到', '潮', '水', '過', '了', ',', '就', '知'] ...
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Google 在訓練中文 BERT 鐵定沒看<a href="https://term.ptt.cc/">批踢踢</a>,還無法預測出我們最想要的那個 <code>退</code> 字。而最接近的 <code>過</code> 的出現機率只有 2%,但我會說以語言代表模型以及自然語言理解的角度來看這結果已經不差了。BERT 透過關注 <code>潮</code> 與 <code>水</code> 這兩個字,從 2 萬多個 wordpieces 的可能性中選出 <code>來</code> 作為這個情境下 <code>[MASK]</code> token 的預測值,也還算說得過去。</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<img src="https://leemeng.tw/images/bert/bert-attention.jpg" style="mix-blend-mode: initial;"/>
<br/>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>這是 <a href="https://github.com/jessevig/bertviz">BertViz</a> 視覺化 BERT 注意力的結果,我等等會列出安裝步驟讓你自己玩玩。值得一提的是,以上是第 8 層 Encoder block 中 <a href="https://leemeng.tw/neural-machine-translation-with-transformer-and-tensorflow2.html#Multi-head-attention%EF%BC%9A%E4%BD%A0%E7%9C%8B%E4%BD%A0%E7%9A%84%EF%BC%8C%E6%88%91%E7%9C%8B%E6%88%91%E7%9A%84">Multi-head attention</a> 裡頭某一個 head 的自注意力結果。並不是每個 head 都會關注在一樣的位置。透過 multi-head 自注意力機制,BERT 可以讓不同 heads 在不同的 representation subspaces 裡學會關注不同位置的不同 repr.。</p>
<p>學會填克漏字讓 BERT 更好地 model 每個詞彙在不同語境下該有的 repr.,而 NSP 任務則能幫助 BERT model 兩個句子之間的關係,這在<a href="https://zh.wikipedia.org/wiki/%E5%95%8F%E7%AD%94%E7%B3%BB%E7%B5%B1">問答系統 QA</a>、<a href="http://nlpprogress.com/english/natural_language_inference.html">自然語言推論 NLI </a>或是後面我們會看到的<a href="#用-BERT-fine-tune-下游任務">假新聞分類任務</a>都很有幫助。</p>
</div>
</div>
</div>
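<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>順帶一提,除了剛剛示範的克漏字以外,transformers 裡頭也有對應 NSP 任務的 <code>BertForNextSentencePrediction</code> 可以直接載入來玩玩。底下是一個簡單的示意,沿用前面載入的 <code>tokenizer</code> 與 <code>PRETRAINED_MODEL_NAME</code>(兩個句子是我隨意舉的例子,實際機率以執行結果為準):給定句子 A 與 B,模型會輸出 B「接在 A 後面」的可能性。</p>
<div class="highlight"><pre><span></span>import torch
from transformers import BertForNextSentencePrediction

nsp_model = BertForNextSentencePrediction.from_pretrained(PRETRAINED_MODEL_NAME)
nsp_model.eval()

sent_a = "醒醒吧,"
sent_b = "你沒有妹妹。"
encoded = tokenizer.encode_plus(sent_a, sent_b,
                                return_tensors="pt", add_special_tokens=True)

with torch.no_grad():
    seq_relationship_logits = nsp_model(encoded["input_ids"],
                                        token_type_ids=encoded["token_type_ids"])[0]

# index 0 代表「B 是 A 的下一句」,index 1 代表「B 是隨機句子」
probs = torch.softmax(seq_relationship_logits, dim=-1)
print("B 接在 A 後面的機率:", probs[0, 0].item())
</pre></div>
</div>
</div>
</div>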
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>這樣的 word repr. 就是近年十分盛行的 <a href="https://youtu.be/S-CspeZ8FHc">contextual word representation</a> 概念。跟以往沒有蘊含上下文資訊的 <a href="https://youtu.be/8rXD5-xhemo">Word2Vec、GloVe</a> 等無語境的詞嵌入向量有很大的差異。用稍微學術一點的說法就是:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<blockquote>
<p>
Contextual word repr. 讓同 word type 的 word token 在不同語境下有不同的表示方式;而傳統的詞向量無論上下文,都會讓同 type 的 word token 的 repr. 相同。
<br/>
<br/>
</p>
</blockquote>
</div>
</div>
</div>
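<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>我們也可以用一個小實驗來體會這段話(以下僅為示意,「行」這個字與兩個例句是我額外挑的例子,相似度數值以實際執行結果為準):把同一個字放進兩個語境不同的句子,再比較 BERT 為它輸出的兩個 repr. 向量有多相似。</p>
<div class="highlight"><pre><span></span>import torch
from transformers import BertTokenizer, BertModel

ctx_tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
ctx_model = BertModel.from_pretrained("bert-base-chinese")
ctx_model.eval()

def char_repr(text, char):
    # 回傳 `char` 這個字在句子 `text` 中,BERT 最後一層輸出的 repr. 向量
    inputs = ctx_tokenizer.encode_plus(text, return_tensors="pt", add_special_tokens=True)
    with torch.no_grad():
        last_hidden = ctx_model(inputs["input_ids"],
                                token_type_ids=inputs["token_type_ids"])[0]
    tokens = ctx_tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return last_hidden[0, tokens.index(char)]

v1 = char_repr("我把錢存進銀行。", "行")      # 「銀行」的「行」
v2 = char_repr("他沿著小路步行回家。", "行")  # 「步行」的「行」
print(torch.cosine_similarity(v1, v2, dim=0).item())  # 語境不同,相似度通常明顯小於 1
</pre></div>
</div>
</div>
</div>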
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>直覺上 contextual word representation 比較能反映人類語言的真實情況,畢竟同個詞彙的含義在不同情境下相異是再正常不過的事情。在不同語境下給同個詞彙相同的 word repr. 這件事情在近年的 NLP 領域裡頭顯得越來越不合理。</p>
<p>為了讓你加深印象,讓我再舉個具體的例子:</p>
<div class="highlight"><pre><span></span>情境 1:
胖虎叫大雄去買漫畫,回來慢了就打他。
情境 2:
妹妹說胖虎是「胖子」,他聽了很不開心。
</pre></div>
<p>很明顯地,在這兩個情境裡頭「他」所代表的語義以及指稱的對象皆不同。如果仍使用沒蘊含上下文 / 語境資訊的詞向量,機器就會很難正確地「解讀」這兩個句子所蘊含的語義了。</p>
<p>現在讓我們跟隨<a href="https://colab.research.google.com/drive/1g2nhY9vZG-PLC3w3dcHGqwsHBAXnD9EY">這個 Colab 筆記本</a>安裝 BERT 的視覺化工具 <a href="https://github.com/jessevig/bertviz">BertViz</a>,看看 BERT 會怎麼處理這兩個情境:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class="highlight hl-ipython3"><pre><span></span><span class="c1"># 安裝 BertViz</span>
<span class="kn">import</span> <span class="nn">sys</span>
<span class="o">!</span><span class="nb">test</span><span class="w"> </span>-d<span class="w"> </span>bertviz_repo<span class="w"> </span><span class="o">||</span><span class="w"> </span>git<span class="w"> </span>clone<span class="w"> </span>https://github.com/jessevig/bertviz<span class="w"> </span>bertviz_repo
<span class="k">if</span> <span class="ow">not</span> <span class="s1">'bertviz_repo'</span> <span class="ow">in</span> <span class="n">sys</span><span class="o">.</span><span class="n">path</span><span class="p">:</span>
<span class="n">sys</span><span class="o">.</span><span class="n">path</span> <span class="o">+=</span> <span class="p">[</span><span class="s1">'bertviz_repo'</span><span class="p">]</span>
<span class="c1"># import packages</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">BertTokenizer</span><span class="p">,</span> <span class="n">BertModel</span>
<span class="kn">from</span> <span class="nn">bertviz</span> <span class="kn">import</span> <span class="n">head_view</span>
<span class="c1"># 在 jupyter notebook 裡頭顯示 visualzation 的 helper</span>
<span class="k">def</span> <span class="nf">call_html</span><span class="p">():</span>
<span class="kn">import</span> <span class="nn">IPython</span>
<span class="n">display</span><span class="p">(</span><span class="n">IPython</span><span class="o">.</span><span class="n">core</span><span class="o">.</span><span class="n">display</span><span class="o">.</span><span class="n">HTML</span><span class="p">(</span><span class="s1">'''</span>
<span class="s1"> <script src="/static/components/requirejs/require.js"></script></span>
<span class="s1"> <script></span>
<span class="s1"> requirejs.config({</span>
<span class="s1"> paths: {</span>
<span class="s1"> base: '/static/base',</span>
<span class="s1"> "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.8/d3.min",</span>
<span class="s1"> jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',</span>
<span class="s1"> },</span>
<span class="s1"> });</span>
<span class="s1"> </script></span>
<span class="s1"> '''</span><span class="p">))</span>
<span class="n">clear_output</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Setup 以後就能非常輕鬆地將 BERT 內部的注意力機制視覺化出來:</p>
<div class="highlight"><pre><span></span><span class="c1"># 記得我們是使用中文 BERT</span>
<span class="n">model_version</span> <span class="o">=</span> <span class="s1">'bert-base-chinese'</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">BertModel</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">model_version</span><span class="p">,</span> <span class="n">output_attentions</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">BertTokenizer</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">model_version</span><span class="p">)</span>
<span class="c1"># 情境 1 的句子</span>
<span class="n">sentence_a</span> <span class="o">=</span> <span class="s2">"胖虎叫大雄去買漫畫,"</span>
<span class="n">sentence_b</span> <span class="o">=</span> <span class="s2">"回來慢了就打他。"</span>
<span class="c1"># 得到 tokens 後丟入 BERT 取得 attention</span>
<span class="n">inputs</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">encode_plus</span><span class="p">(</span><span class="n">sentence_a</span><span class="p">,</span> <span class="n">sentence_b</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s1">'pt'</span><span class="p">,</span> <span class="n">add_special_tokens</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">token_type_ids</span> <span class="o">=</span> <span class="n">inputs</span><span class="p">[</span><span class="s1">'token_type_ids'</span><span class="p">]</span>
<span class="n">input_ids</span> <span class="o">=</span> <span class="n">inputs</span><span class="p">[</span><span class="s1">'input_ids'</span><span class="p">]</span>
<span class="n">attention</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">input_ids</span><span class="p">,</span> <span class="n">token_type_ids</span><span class="o">=</span><span class="n">token_type_ids</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">input_id_list</span> <span class="o">=</span> <span class="n">input_ids</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span> <span class="c1"># Batch index 0</span>
<span class="n">tokens</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">convert_ids_to_tokens</span><span class="p">(</span><span class="n">input_id_list</span><span class="p">)</span>
<span class="n">call_html</span><span class="p">()</span>
<span class="c1"># 交給 BertViz 視覺化</span>
<span class="n">head_view</span><span class="p">(</span><span class="n">attention</span><span class="p">,</span> <span class="n">tokens</span><span class="p">)</span>
<span class="c1"># 注意:執行這段程式碼以後只會顯示下圖左側的結果。</span>
<span class="c1"># 為了方便你比較,我把情境 2 的結果也同時附上</span>
</pre></div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<img src="https://leemeng.tw/images/bert/bert-coreference.jpg" style="mix-blend-mode: initial;"/>
<br/>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>這是 BERT 裡第 9 層 Encoder block 其中一個 head 的注意力結果。</p>
<p>圖中的線條代表該 head 在更新「他」(左側)的 repr. 時關注其他詞彙(右側)的注意力程度。越粗代表關注權重(attention weights)越高。很明顯地這個 head 具有一定的<a href="https://youtu.be/i19m4GzBhfc">指代消解(Coreference Resolution)</a>能力,能正確地關注「他」所指代的對象。</p>
<p>要處理指代消解需要對自然語言有不少理解,而 BERT 在沒有標注數據的情況下透過自注意力機制、深度雙向語言模型以及「閱讀」大量文本達到這樣的水準,是一件令人雀躍的事情。</p>
<p>當然 BERT 並不是第一個嘗試產生 contextual word repr. 的語言模型。在它之前最知名的例子有剛剛提到的 <a href="https://allennlp.org/elmo">ELMo</a> 以及 <a href="https://github.com/openai/gpt-2">GPT</a>:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<center>
<img src="https://leemeng.tw/images/bert/bert_elmo_gpt.jpg" style="mix-blend-mode: initial;"/>
</center>
<center>
ELMo、GPT 以及 BERT 都透過訓練語言模型來獲得 contextual word representation
<br/>
<br/>
</center>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>ELMo 利用獨立訓練的雙向兩層 LSTM 做語言模型並將中間得到的隱狀態向量串接當作每個詞彙的 contextual word repr.;GPT 則是使用 Transformer 的 Decoder 來訓練一個中規中矩,從左到右的<strong>單向</strong>語言模型。你可以參考我另一篇文章:<a href="https://leemeng.tw/gpt2-language-model-generate-chinese-jing-yong-novels.html">直觀理解 GPT-2 語言模型並生成金庸武俠小說</a>來深入了解 GPT 與 GPT-2。</p>
<p>BERT 跟它們的差異在於利用 MLM(即克漏字)的概念及 Transformer Encoder 的架構,擺脫以往語言模型只能從單個方向(由左到右或由右到左)估計下個詞彙出現機率的窘境,訓練出一個<strong>雙向</strong>的語言代表模型。這使得 BERT 輸出的每個 token 的 repr. <code>Tn</code> 都同時蘊含了前後文資訊,真正的<strong>雙向</strong> representation。</p>
<p>跟以往模型相比,BERT 能更好地處理自然語言,在著名的問答任務 <a href="https://rajpurkar.github.io/SQuAD-explorer/">SQuAD2.0</a> 也有卓越表現:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<center>
<img src="https://leemeng.tw/images/bert/squad2.jpg" style="mix-blend-mode: initial;"/>
</center>
<center>
SQuAD 2.0 目前排行榜的前 5 名有 4 個有使用 BERT
<br/>
<br/>
</center>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>我想我又犯了解說癖,這些東西你可能在看這篇文章之前就全懂了。但希望這些對 BERT 的 high level 介紹能幫助更多人直覺地理解 BERT 的強大之處以及為何值得學習它。</p>
<p>假如你仍然似懂非懂,只需記得:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<blockquote>
<p>
BERT 是一個強大的語言代表模型,給它一段文本序列,它能回傳一段相同長度且蘊含上下文資訊的 word repr. 序列,對下游的 NLP 任務很有幫助。
<br/>
<br/>
</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>有了這樣的概念以後,我們接下來要做的事情很簡單,就是將自己感興趣的 NLP 任務的文本丟入 BERT ,為文本裡頭的每個 token 取得有語境的 word repr.,並以此 repr. 進一步 fine tune 當前任務,取得更好的結果。</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="用-BERT-fine-tune-下游任務">用 BERT fine tune 下游任務<a class="anchor-link" href="#用-BERT-fine-tune-下游任務">¶</a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>我們在<a href="https://leemeng.tw/shortest-path-to-the-nlp-world-a-gentle-guide-of-natural-language-processing-and-deep-learning-for-everyone.html">給所有人的 NLP 入門指南</a>碰過的<a href="https://www.kaggle.com/c/fake-news-pair-classification-challenge/submissions">假新聞分類任務</a>將會是本文拿 BERT 來做 fine-tuning 的例子。選擇這個任務的最主要理由是因為中文數據容易理解,另外網路上針對兩個句子做分類的例子也較少。</p>
<p>就算你對假新聞分類沒興趣也建議繼續閱讀。因為本節談到的所有概念完全可以被套用到其他語言的文本以及不同的 NLP 任務之上。因此我希望接下來你能一邊閱讀一邊想像如何用同樣的方式把 BERT 拿來處理你自己感興趣的 NLP 任務。</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<center>
<img src="https://leemeng.tw/images/nlp-kaggle-intro/view-data-on-kaggle.jpg" style="mix-blend-mode: initial;"/>
</center>
<center>
給定假新聞 title1,判斷另一新聞 title2 跟 title1 的關係(同意、反對或無關)
(<a href="https://leemeng.tw/shortest-path-to-the-nlp-world-a-gentle-guide-of-natural-language-processing-and-deep-learning-for-everyone.html" target="_blank">圖片來源</a>)
<br/>
<br/>
</center>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>fine tune BERT 來解決新的下游任務有 5 個簡單步驟:</p>
<ol>
<li><a href="#1.-準備原始文本數據">準備原始文本數據</a></li>
<li><a href="#2.-將原始文本轉換成-BERT-相容的輸入格式">將原始文本轉換成 BERT 相容的輸入格式</a></li>
<li><a href="#3.-在-BERT-之上加入新-layer-成下游任務模型">在 BERT 之上加入新 layer 成下游任務模型</a></li>
<li><a href="#4.-訓練該下游任務模型">訓練該下游任務模型</a></li>
<li><a href="#5.-對新樣本做推論">對新樣本做推論</a></li>
</ol>
<p>對,就是那麼直覺。而且你應該已經看出步驟 1、4 及 5 都跟訓練一般模型所需的步驟無太大差異。跟 BERT 最相關的細節事實上是步驟 2 跟 3:</p>
<ul>
<li>如何將原始數據轉換成 <strong>BERT 相容</strong>的輸入格式?</li>
<li>如何在 BERT 之上建立 layer(s) 以符合下游任務需求?</li>
</ul>
<p>事不宜遲,讓我們馬上以假新聞分類任務為例回答這些問題。<a href="https://leemeng.tw/shortest-path-to-the-nlp-world-a-gentle-guide-of-natural-language-processing-and-deep-learning-for-everyone.html">我在之前的文章已經說明過</a>,這個任務的輸入是兩個句子,輸出是 3 個類別機率的多類別分類任務(multi-class classification task),跟 NLP 領域裡常見的<a href="https://paperswithcode.com/task/natural-language-inference/latest">自然語言推論(Natural Language Inference)</a>具有相同性質。</p>
</div>
</div>
</div>
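<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>在進入各個步驟之前,先用一個概念性的極簡 sketch 回答上面兩個問題(注意:兩個範例標題是我杜撰的,這段程式碼沿用前面載入的 <code>tokenizer</code> 與 <code>PRETRAINED_MODEL_NAME</code>,也不是本文接下來實際使用的完整流程,細節會在後面的步驟中一一說明):transformers 的 <code>BertForSequenceClassification</code> 會自動在 BERT 之上接一個分類用的線性 layer,我們只需指定類別數。</p>
<div class="highlight"><pre><span></span>import torch
from transformers import BertForSequenceClassification

# 步驟 2 的精神:把兩個句子串接成 BERT 相容的輸入格式
# ([CLS] 句子 A [SEP] 句子 B [SEP],並附上 segment ids)
inputs = tokenizer.encode_plus("蘋果宣布終止 iPhone 生產",   # 杜撰的 title1
                               "蘋果否認將停產 iPhone",      # 杜撰的 title2
                               return_tensors="pt", add_special_tokens=True)

# 步驟 3 的精神:在 BERT 之上加一個 3 類別(同意、反對、無關)的分類 layer
model = BertForSequenceClassification.from_pretrained(
    PRETRAINED_MODEL_NAME, num_labels=3)

# 還沒 fine tune 之前,輸出的 3 個類別機率只是亂猜;步驟 4 要做的就是用標注數據訓練它
with torch.no_grad():
    logits = model(inputs["input_ids"],
                   token_type_ids=inputs["token_type_ids"])[0]
print(torch.softmax(logits, dim=-1))
</pre></div>
</div>
</div>
</div>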
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="1.-準備原始文本數據">1. 準備原始文本數據<a class="anchor-link" href="#1.-準備原始文本數據">¶</a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>為了最大化再現性(reproducibility)以及幫助有興趣的讀者深入研究,我會列出所有的程式碼,你只要複製貼上就能完整重現文中所有結果並生成能提交到 Kaggle 競賽的預測檔案。你當然也可以選擇直接閱讀,不一定要下載數據。</p>
<p>因為 Kaggle 網站本身的限制,我無法直接提供數據載點。如果你想要跟著本文練習以 BERT fine tune 一個假新聞的分類模型,可以先<a href="https://www.kaggle.com/c/fake-news-pair-classification-challenge/data">前往該 Kaggle 競賽下載資料集</a>。下載完數據你的資料夾裡應該會有兩個壓縮檔,分別代表訓練集和測試集:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class="highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">glob</span>
<span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*.csv.zip"</span><span class="p">)</span>