<!DOCTYPE html>
<!--[if lt IE 9 ]><html class="no-js oldie" lang="zh-hant-tw"> <![endif]-->
<!--[if IE 9 ]><html class="no-js oldie ie9" lang="zh-hant-tw"> <![endif]-->
<!--[if (gte IE 9)|!(IE)]><!-->
<html class="no-js" lang="zh-hant-tw">
<!--<![endif]-->
<head>
<!--- basic page needs
================================================== -->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="author" content="Lee Meng" />
<title>LeeMeng - Let AI Write Some Jin Yong: How to Write Demi-Gods and Semi-Devils with TensorFlow 2.0 and TensorFlow.js</title>
<!--- article-specific meta data
================================================== -->
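<meta name="description" content="This article presents a text-generation application built with TensorFlow 2.0 and TensorFlow.js. It also walks readers through the 7 steps common to deep learning projects to show how such an application is built. After reading it, you will have a basic understanding of the workflow for developing AI applications." />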
<meta name="keywords" content="TensorFlow, TensorFlow.js, 自然語言處理" />
<meta name="tags" content="TensorFlow" />
<meta name="tags" content="TensorFlow.js" />
<meta name="tags" content="自然語言處理" />
<!--- Open Graph Object metas
================================================== -->
<meta property="og:image" content="https://leemeng.tw/theme/images/background/text-generation-cover.jpg" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://leemeng.tw/how-to-generate-interesting-text-with-tensorflow2-and-tensorflow-js.html" />
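<meta property="og:title" content="Let AI Write Some Jin Yong: How to Write Demi-Gods and Semi-Devils with TensorFlow 2.0 and TensorFlow.js" />
<meta property="og:description" content="This article presents a text-generation application built with TensorFlow 2.0 and TensorFlow.js. It also walks readers through the 7 steps common to deep learning projects to show how such an application is built. After reading it, you will have a basic understanding of the workflow for developing AI applications." />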
<!-- mobile specific metas
================================================== -->
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- CSS
================================================== -->
<!--for customized css in individual page-->
<link rel="stylesheet" type="text/css" href="https://leemeng.tw/theme/css/bootstrap.min.css">
<!--for showing toc navigation which slide in from left-->
<link rel="stylesheet" type="text/css" href="https://leemeng.tw/theme/css/toc-nav.css">
<!--for responsive embed youtube video-->
<link rel="stylesheet" type="text/css" href="https://leemeng.tw/theme/css/embed_youtube.css">
<!--for prettify dark-mode result-->
<link rel="stylesheet" type="text/css" href="https://leemeng.tw/theme/css/darkmode.css">
<link rel="stylesheet" type="text/css" href="https://leemeng.tw/theme/css/base.css">
<link rel="stylesheet" type="text/css" href="https://leemeng.tw/theme/css/vendor.css">
<link rel="stylesheet" type="text/css" href="https://leemeng.tw/theme/css/main.css">
<link rel="stylesheet" type="text/css" href="https://leemeng.tw/theme/css/ipython.css">
<link rel="stylesheet" type="text/css" href='https://leemeng.tw/theme/css/progress-bar.css' />
<!--TiqueSearch-->
<link href="https://fonts.googleapis.com/css?family=Roboto:100,300,400">
<link rel="stylesheet" href="https://leemeng.tw/theme/tipuesearch/css/normalize.css">
<link rel="stylesheet" href="https://leemeng.tw/theme/tipuesearch/css/tipuesearch.css">
<!-- script
================================================== -->
<script src="https://leemeng.tw/theme/js/modernizr.js"></script>
<script src="https://leemeng.tw/theme/js/pace.min.js"></script>
<!-- favicons
================================================== -->
<link rel="shortcut icon" href="../theme/images/favicon.ico" type="image/x-icon"/>
<link rel="icon" href="../theme/images/favicon.ico" type="image/x-icon"/>
<!-- Global Site Tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-106559980-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments)};
gtag('js', new Date());
gtag('config', 'UA-106559980-1');
</script>
</head>
<body id="top">
<!-- header
================================================== -->
<header class="s-header">
<div class="header-logo">
<a class="site-logo" href="../index.html"><img src="https://leemeng.tw/theme/images/logo.png" alt="Homepage"></a>
</div>
<!--navigation bar ref: http://jinja.pocoo.org/docs/2.10/tricks/-->
<nav class="header-nav-wrap">
<ul class="header-nav">
<li>
<a href="../index.html#home">Home</a>
</li>
<li>
<a href="../index.html#about">About</a>
</li>
<li>
<a href="../index.html#projects">Projects</a>
</li>
<li class="current">
<a href="../blog.html">Blog</a>
</li>
<li>
<a href="https://demo.leemeng.tw">Demo</a>
</li>
<li>
<a href="../books.html">Books</a>
</li>
<li>
<a href="../index.html#contact">Contact</a>
</li>
</ul>
<!--<div class="search-container">-->
<!--<form action="../search.html">-->
<!--<input type="text" placeholder="Search.." name="search">-->
<!--<button type="submit"><i class="im im-magnifier" aria-hidden="true"></i></button>-->
<!--</form>-->
<!--</div>-->
</nav>
<a class="header-menu-toggle" href="#0"><span>Menu</span></a>
</header> <!-- end s-header -->
<!--TOC navigation displayed when clicked from left-navigation button-->
<div id="tocNav" class="overlay" onclick="closeTocNav()">
<div class="overlay-content">
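<div id="toc"><ul><li><a class="toc-href" href="#" title="Let AI Write Some Jin Yong: How to Write Demi-Gods and Semi-Devils with TensorFlow 2.0 and TensorFlow.js">Let AI Write Some Jin Yong: How to Write Demi-Gods and Semi-Devils with TensorFlow 2.0 and TensorFlow.js</a><ul><li><a class="toc-href" href="#生成新的天龍八部橋段" title="Generating New Demi-Gods and Semi-Devils Scenes">Generating New Demi-Gods and Semi-Devils Scenes</a></li><li><a class="toc-href" href="#模型是怎麼被訓練的" title="How the Model Was Trained">How the Model Was Trained</a></li><li><a class="toc-href" href="#TensorFlow-2.0-開發" title="Developing with TensorFlow 2.0">Developing with TensorFlow 2.0</a></li><li><a class="toc-href" href="#深度學習專案步驟" title="Steps of a Deep Learning Project">Steps of a Deep Learning Project</a><ul><li><a class="toc-href" href="#1.-定義問題及要解決的任務" title="1. Define the problem and the task to solve">1. Define the problem and the task to solve</a></li><li><a class="toc-href" href="#2.-準備原始數據、資料清理" title="2. Prepare and clean the raw data">2. Prepare and clean the raw data</a></li><li><a class="toc-href" href="#3.-建立能丟入模型的資料集" title="3. Build a dataset the model can consume">3. Build a dataset the model can consume</a></li><li><a class="toc-href" href="#4.-定義能解決問題的函式集" title="4. Define a set of functions that can solve the problem">4. Define a set of functions that can solve the problem</a></li><li><a class="toc-href" href="#5.-定義評量函式好壞的指標" title="5. Define a metric for how good a function is">5. Define a metric for how good a function is</a></li><li><a class="toc-href" href="#6.-訓練並選擇出最好的函式" title="6. Train and select the best function">6. Train and select the best function</a></li><li><a class="toc-href" href="#7.-將函式-/-模型拿來做預測" title="7. Use the function / model to make predictions">7. Use the function / model to make predictions</a></li></ul></li><li><a class="toc-href" href="#如何使用-TensorFlow.js-跑模型並生成文章_1" title="How to Run the Model and Generate Text with TensorFlow.js">How to Run the Model and Generate Text with TensorFlow.js</a></li><li><a class="toc-href" href="#結語" title="Closing Remarks">Closing Remarks</a></li><li><a class="toc-href" href="#致敬" title="Acknowledgments">Acknowledgments</a></li></ul></li></ul></div>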
</div>
</div>
<!--custom images with icon shown on left nav-->
<!--the details are set in `pelicanconf.py` as `LEFT_NAV_IMAGES`-->
<article class="blog-single">
<!-- page header/blog hero, use custom cover image if available
================================================== -->
<div class="page-header page-header--single page-hero" style="background-image:url(https://leemeng.tw/theme/images/background/text-generation-cover.jpg)">
<div class="row page-header__content narrow">
<article class="col-full">
<div class="page-header__info">
<div class="page-header__cat">
<a href="https://leemeng.tw/tag/tensorflow.html" rel="tag">TensorFlow</a>
<a href="https://leemeng.tw/tag/tensorflowjs.html" rel="tag">TensorFlow.js</a>
<a href="https://leemeng.tw/tag/zi-ran-yu-yan-chu-li.html" rel="tag">Natural Language Processing</a>
</div>
</div>
<h1 class="page-header__title">
<a href="https://leemeng.tw/how-to-generate-interesting-text-with-tensorflow2-and-tensorflow-js.html" title="">
Let AI Write Some Jin Yong: How to Write Demi-Gods and Semi-Devils with TensorFlow 2.0 and TensorFlow.js
</a>
</h1>
<ul class="page-header__meta">
<li class="date">2019-03-27 (Wed)</li>
<li class="page-view">
27,020 views
</li>
</ul>
</article>
</div>
</div> <!-- end page-header -->
<div class="KW_progressContainer">
<div class="KW_progressBar"></div>
</div>
<div class="row blog-content" style="position: relative">
<div id="left-navigation">
<div id="search-wrap">
<i class="im im-magnifier" aria-hidden="true"></i>
<div id="search">
<form action="../search.html">
<div class="tipue_search_right"><input type="text" name="q" id="tipue_search_input" pattern=".{2,}" title="What would you like to search for? (Enter at least two characters)" required></div>
</form>
</div>
</div>
<div id="toc-wrap">
<a title="Show/hide article sections">
<i class="im im-menu" aria-hidden="true" onclick="toggleTocNav()"></i>
</a>
</div>
<div id="social-wrap" style="cursor: pointer">
<a class="open-popup" title="Subscribe to new articles">
<i class="im im-newspaper-o" aria-hidden="true"></i>
</a>
</div>
<div id="social-wrap">
<a href="https://www.facebook.com/sharer/sharer.php?u=https%3A//leemeng.tw/how-to-generate-interesting-text-with-tensorflow2-and-tensorflow-js.html" target="_blank" title="Share on Facebook">
<i class="im im-facebook" aria-hidden="true"></i>
</a>
</div>
<div id="social-wrap">
<a href="https://www.linkedin.com/shareArticle?mini=true&url=https%3A//leemeng.tw/how-to-generate-interesting-text-with-tensorflow2-and-tensorflow-js.html&title=%E8%AE%93%20AI%20%E5%AF%AB%E9%BB%9E%E9%87%91%E5%BA%B8%EF%BC%9A%E5%A6%82%E4%BD%95%E7%94%A8%20TensorFlow%202.0%20%E5%8F%8A%20TensorFlow.js%20%E5%AF%AB%E5%A4%A9%E9%BE%8D%E5%85%AB%E9%83%A8&summary=%E9%80%99%E7%AF%87%E6%96%87%E7%AB%A0%E5%B1%95%E7%A4%BA%E4%B8%80%E5%80%8B%E7%94%B1%20TensorFlow%202.0%20%E4%BB%A5%E5%8F%8A%20TensorFlow.js%20%E5%AF%A6%E7%8F%BE%E7%9A%84%E6%96%87%E6%9C%AC%E7%94%9F%E6%88%90%E6%87%89%E7%94%A8%E3%80%82%E6%9C%AC%E6%96%87%E4%B9%9F%E6%9C%83%E9%80%8F%E9%81%8E%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92%E5%B0%88%E6%A1%88%E5%B8%B8%E8%A6%8B%E7%9A%84%207%20%E5%80%8B%E6%AD%A5%E9%A9%9F%EF%BC%8C%E5%B8%B6%E9%A0%98%E8%AE%80%E8%80%85%E4%B8%80%E6%AD%A5%E6%AD%A5%E4%BA%86%E8%A7%A3%E5%A6%82%E4%BD%95%E5%AF%A6%E7%8F%BE%E4%B8%80%E5%80%8B%E9%80%99%E6%A8%A3%E7%9A%84%E6%87%89%E7%94%A8%E3%80%82%E9%96%B1%E8%AE%80%E5%AE%8C%E6%9C%AC%E6%96%87%EF%BC%8C%E4%BD%A0%E5%B0%87%E5%B0%8D%E9%96%8B%E7%99%BC%20AI%20%E6%87%89%E7%94%A8%E7%9A%84%E6%B5%81%E7%A8%8B%E6%9C%89%E4%BA%9B%E5%9F%BA%E7%A4%8E%E7%9A%84%E4%BA%86%E8%A7%A3%E3%80%82&source=https%3A//leemeng.tw/how-to-generate-interesting-text-with-tensorflow2-and-tensorflow-js.html" target="_blank" title="分享到 LinkedIn">
<i class="im im-linkedin" aria-hidden="true"></i>
</a>
</div>
<div id="social-wrap">
<a href="https://twitter.com/intent/tweet?text=%E8%AE%93%20AI%20%E5%AF%AB%E9%BB%9E%E9%87%91%E5%BA%B8%EF%BC%9A%E5%A6%82%E4%BD%95%E7%94%A8%20TensorFlow%202.0%20%E5%8F%8A%20TensorFlow.js%20%E5%AF%AB%E5%A4%A9%E9%BE%8D%E5%85%AB%E9%83%A8&url=https%3A//leemeng.tw/how-to-generate-interesting-text-with-tensorflow2-and-tensorflow-js.html&hashtags=tensorflow,tensorflowjs,zi-ran-yu-yan-chu-li" target="_blank" title="分享到 Twitter">
<i class="im im-twitter" aria-hidden="true"></i>
</a>
</div>
<!--custom images with icon shown on left nav-->
</div>
<div class="col-full blog-content__main">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<link href="https://leemeng.tw/tfjs-apps/lstm-text-generation/index.css" rel="stylesheet"/>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<blockquote style="margin-bottom: 1rem">
<p>
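Mu Wanqing turned toward him, her back to the South Sea Crocodile God, and said softly, "You are the first man in the world to see my face!" She slowly drew aside her veil. Duan Yu shuddered from head to toe: the face before him was like the clear glow of a new moon, like snow heaped on a flowering tree, exquisite beyond compare.
<br/>
<span style="float:right;margin-right: 1.5rem">Chapter 4: 崖高人遠</span>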
<br/>
</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><br/></p>
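<p><a href="https://bit.ly/2TUycBQ">《天龍八部》</a> (<em>Demi-Gods and Semi-Devils</em>) has always been one of my favorite <a href="https://zh.wikipedia.org/wiki/%E9%87%91%E5%BA%B8%E4%BD%9C%E5%93%81">Jin Yong novels</a>. Rereading it recently left me with many new impressions.</p>
<p>Halfway through, I had a sudden idea: use <a href="https://leemeng.tw/deep-learning-resources.html">deep learning</a> and <a href="https://www.tensorflow.org/alpha">TensorFlow 2.0</a> to train a <a href="https://leemeng.tw/shortest-path-to-the-nlp-world-a-gentle-guide-of-natural-language-processing-and-deep-learning-for-everyone.html#%E6%9C%89%E8%A8%98%E6%86%B6%E7%9A%84%E5%BE%AA%E7%92%B0%E7%A5%9E%E7%B6%93%E7%B6%B2%E8%B7%AF">recurrent neural network</a> that generates text in the style of the novel. The results are far from perfect, but I find them entertaining, and at times the model produces passages that are either astonishing or laugh-out-loud funny.</p>
<p>So I decided to put the trained model online with <a href="https://www.tensorflow.org/js">TensorFlow.js</a>, so that you can see for yourself what this AI has been smoking.</p>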
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<center>
<img src="https://leemeng.tw/images/lstm-text-generation/dali-old-castle.jpg"/>
</center>
<center>
A corner of Dali Old Town, where Duan Yu comes from
<br/>
<br/>
</center>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
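<p>After the demo, I will use this article's AI application as an example and walk you through the 7 steps common to deep learning projects with TensorFlow 2.0:</p>
<ol>
<li><a href="#1.-定義問題及要解決的任務">Define the problem and the task to solve</a></li>
<li><a href="#2.-準備原始數據、資料清理">Prepare and clean the raw data</a></li>
<li><a href="#3.-建立能丟入模型的資料集">Build a dataset the model can consume</a></li>
<li><a href="#4.-定義能解決問題的函式集">Define a set of functions that can solve the problem</a></li>
<li><a href="#5.-定義評量函式好壞的指標">Define a metric for how good a function is</a></li>
<li><a href="#6.-訓練並選擇出最好的函式">Train and select the best function</a></li>
<li><a href="#7.-將函式-/-模型拿來做預測">Use the function / model to make predictions</a></li>
</ol>
<p>I hope this article teaches you something, sparks a few ideas, and encourages you to use your own imagination to create something new.</p>
<p>That is enough of a preface. Let's jump straight into the demo!</p>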
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="生成新的天龍八部橋段">Generating New Demi-Gods and Semi-Devils Scenes<a class="anchor-link" href="#生成新的天龍八部橋段">¶</a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
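<p>This article uses a very simple <a href="https://leemeng.tw/shortest-path-to-the-nlp-world-a-gentle-guide-of-natural-language-processing-and-deep-learning-for-everyone.html#%E8%A8%98%E6%86%B6%E5%8A%9B%E5%A5%BD%E7%9A%84-LSTM-%E7%B4%B0%E8%83%9E">long short-term memory (LSTM) RNN</a> to generate text. After "reading" Demi-Gods and Semi-Devils many times, the model can take a piece of text and continue it one character at a time in the style of the novel.</p>
<p>For example, given a passage from the book:</p>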
<div class="highlight"><pre><span></span>烏老大偏生要考一考慕容復,說道:「慕容公子,你瞧這不是大大的
</pre></div>
<p>How would you continue it?</p>
<p>Here is one continuation that the model in this article generated from the text above:</p>
<div class="highlight"><pre><span></span>不算?」馬夫人道:「不錯,咱們非要尋死不可。」
段譽大喜,說道:「小姑娘,你待我這麼好,鬼鬼祟祟,一切又不聽你的話,你管甚麼老兄弟不相干,我去幫過彥之。」
王夫人哼了一聲,說道:「這裏是甚麼話?」段譽道:「不行!你別過來。用真蠻子,我便將這件事了,一大惡人擠在地下,立時便會斃命,那便如何是好?」
</pre></div>
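<p>The content is goofy and laughable, but the wording itself is very Demi-Gods and Semi-Devils. (I, at least, could not write something like this myself.)</p>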
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<center>
<img src="https://leemeng.tw/images/lstm-text-generation/antony-xia-522590-unsplash.jpg"/>
</center>
<center>
Suzhou, home of the Murong family of Gusu
<br/>
<br/>
</center>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
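<p>Now let's generate some new scenes! First, load the trained model into your browser.</p>
<p>(To cut the wait, load the model on a fast connection, or click load, read <a href="#模型是怎麼被訓練的">How the Model Was Trained</a> first, and come back later.)</p>
<p>Once the model has loaded, you can use it to generate new scenes over and over:</p>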
<section style="margin-bottom: 3rem">
<button id="load-model" style="display:inline-block">Load model</button>
<div id="app-status" style="display:inline-block"></div>
</section>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
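<p>You will also find 2 parameters you can adjust:</p>
<section style="margin-bottom: 3rem">
<div>
<span class="input-title">Generation length (in characters)</span>
<input id="generate-length" value="150"/>
</div>
<div>
<span class="input-title">Generation temperature (randomness)</span>
<input id="temperature" value="0.6"/>
</div>
</section><p>The first time, you can simply use the defaults. Now click <strong>Generate text</strong> to produce a brand-new Demi-Gods and Semi-Devils scene:</p>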
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<section style="margin-bottom: 3rem">
<div>
<button disabled="true" id="generate-text">Generate text</button>
<button disabled="true" id="initialize-seed">Reset input</button>
</div>
</section>
<section style="margin-bottom: 3rem">
<div>
<span class="input-title">Seed sentence:</span>
<span id="text-generation-status" style="display: none"></span>
<textarea id="seed-text" rows="1" style="min-height: 6em" value="">蕭峯吃了一驚,心想:「哥哥大喜之餘,說話有些忘形了,眼下亂成</textarea>
</div>
</section>
<section style="margin-bottom: 3rem">
<div>
<span class="input-title">Generated result:</span>
<textarea id="generated-text" readonly="true" rows="10" value=""></textarea>
</div>
</section>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
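<p>How was it? I hope the model's output put a smile on your face. It nearly had me dying of laughter the first time.</p>
<p>Now you can try a few things:</p>
<ul>
<li>Click <strong>Generate text</strong> to have the model produce a new scene from the same input</li>
<li>Click <strong>Reset input</strong> to get a new random seed sentence</li>
<li>Increase the <strong>generation length</strong></li>
<li>Adjust the <strong>generation temperature</strong> to change how varied the text is</li>
</ul>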
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<img src="https://leemeng.tw/images/lstm-text-generation/chris-rhoads-254898-unsplash.jpg"/>
<br/>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
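<p>The generation temperature is a real number. The higher the temperature, the more random and unpredictable (and the goofier) the output. The lower the temperature, the more the output resembles the original novel: the text reads as more authentic, but words and phrases also repeat more often.</p>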
</div>
</div>
</div>
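<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To make the effect of temperature concrete, here is a minimal pure-Python sketch. It assumes a common formulation: divide the log-probabilities by the temperature, then renormalize. The text above does not spell out a formula, so treat this as an illustration, not the demo's exact code. The function name <code>apply_temperature</code> and the three-value distribution are hypothetical.</p>

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a probability distribution by a temperature (assumed formulation).

    All probabilities must be strictly positive and temperature must be positive.
    """
    # Divide the log-probabilities by the temperature.
    scaled = [math.log(p) / temperature for p in probs]
    # Subtract the maximum before exponentiating, for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # Renormalize so the result sums to 1 again.
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical next-character distribution over three candidate characters.
probs = [0.6, 0.3, 0.1]

cold = apply_temperature(probs, 0.5)  # sharper: the top candidate dominates
same = apply_temperature(probs, 1.0)  # temperature 1 leaves p unchanged
hot = apply_temperature(probs, 2.0)   # flatter: rare candidates gain probability

for name, dist in (("T=0.5", cold), ("T=1.0", same), ("T=2.0", hot)):
    print(name, [round(x, 3) for x in dist])
```

<p>Under this formulation, a temperature below 1 pushes probability toward the most likely character, which fits the more repetitive output described above, and a temperature above 1 flattens the distribution so that unlikely characters are drawn more often.</p>
</div>
</div>
</div>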
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<blockquote>
<p>
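Machines have no feelings; only humans can give things meaning. We cannot have the machine find the best generation temperature automatically, because human perception is highly subjective. Find the temperature that you yourself think works best for generating text.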
<br/>
<br/>
</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
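<p>If you don't plan to dig into the technical details, just remember that the model in this article is a character-based language model: given the sequence of characters seen so far, it tries to predict the next character likely to appear.</p>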
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<img src="https://leemeng.tw/images/lstm-text-generation/raychan-1229841-unsplash.jpg"/>
<br/>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
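<p>Note that we do not simply take the most probable character as the output; that would be too boring.</p>
<p>Each time it makes a prediction, the model holds a probability distribution p over a large number of Chinese characters. To decide which character to emit, it samples from p, picking one character at random according to those probabilities.</p>
<p>So, just as you saw in the demo above, the model generates a completely different passage each time, even when the input sentence is the same.</p>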
</div>
</div>
</div>
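<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The difference between always taking the most probable character and sampling from p can be sketched with Python's standard library alone. The three candidate characters and their probabilities below are made up for illustration; a real model's distribution covers thousands of characters.</p>

```python
import random

# Hypothetical next-character distribution p (characters and values are made up).
chars = ["道", "說", "笑"]
p = [0.6, 0.3, 0.1]

# Greedy choice: always return the single most probable character.
greedy = chars[p.index(max(p))]

# Sampling: draw a character at random, weighted by p.
rng = random.Random(0)  # fixed seed only so this sketch is reproducible
samples = [rng.choices(chars, weights=p, k=1)[0] for _ in range(1000)]
counts = {c: samples.count(c) for c in chars}

print("greedy:", greedy)
print("sampled counts:", counts)
```

<p>The greedy choice is the same on every run, whereas the sampled characters vary while still following p: the likely character appears most often, yet the unlikely ones still turn up occasionally.</p>
</div>
</div>
</div>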
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<center>
<img src="https://leemeng.tw/images/lstm-text-generation/max-felner-448887-unsplash.jpg"/>
</center>
<center>
Sampling is like rolling dice: some outcomes come up more easily, but you still have a chance of rolling triples
<br/>
<br/>
</center>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
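<p>Because of random sampling, practically every result the model produces is one of a kind.</p>
<p>If you get any amusing fictional scenes while generating text, feel free to share them with me :)</p>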
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<img src="https://leemeng.tw/images/lstm-text-generation/chris-ried-512801-unsplash.jpg"/>
<br/>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
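<p>The rest of this article explains in detail how this application was built. If you don't plan to read it right now, you can skip straight to the <a href="#結語">Closing Remarks</a>.</p>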
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="模型是怎麼被訓練的">How the Model Was Trained<a class="anchor-link" href="#模型是怎麼被訓練的">¶</a></h2><p>After the demo, you may be curious how this model was trained.</p>
<p>The actual development process falls roughly into two parts:</p>
<ul>
<li><a href="https://www.tensorflow.org/alpha/tutorials/sequences/text_generation">Train an LSTM model with TensorFlow 2.0</a></li>
<li><a href="https://github.com/tensorflow/tfjs-examples/tree/master/lstm-text-generation">Deploy the model with TensorFlow.js</a></li>
</ul>
<p>The official TensorFlow and TensorFlow.js sites both provide detailed tutorials and code for these steps.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<center>
<img src="https://leemeng.tw/images/lstm-text-generation/tf-demo.png"/>
</center>
<center>
This article borrows a good deal of code from the TensorFlow site (left) and the TensorFlow.js online demo (right)
<br/>
<br/>
</center>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>If you want to build a similar application yourself, the most direct route is to read the official tutorial for the language you know (Python / JavaScript):</p>
<ul>
<li><a href="https://www.tensorflow.org/alpha/tutorials/sequences/text_generation">TensorFlow 2.0 Alpha - Text generation with an RNN</a></li>
<li><a href="https://github.com/tensorflow/tfjs-examples/tree/master/lstm-text-generation">TensorFlow.js Example: Train LSTM to Generate Text</a></li>
</ul>
<p>Because there is already an official <a href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/r2/tutorials/sequences/text_generation.ipynb">tutorial notebook for training the LSTM</a> on a GPU in <a href="https://colab.research.google.com/">Google Colab</a>, this article does not provide a separate one.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<img src="https://leemeng.tw/images/lstm-text-generation/simon-abrams-286276-unsplash.jpg"/>
<br/>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The following background will also make the rest of the article easier to read:</p>
<ul>
<li>Familiarity with <a href="https://www.python.org/">Python</a></li>
<li>Some experience with <a href="https://keras.io/">Keras</a> or <a href="https://www.tensorflow.org/">TensorFlow</a></li>
<li>A <a href="https://leemeng.tw/deep-learning-resources.html#courses">foundation in machine learning and deep learning</a></li>
<li>An understanding of <a href="https://leemeng.tw/shortest-path-to-the-nlp-world-a-gentle-guide-of-natural-language-processing-and-deep-learning-for-everyone.html#%E6%9C%89%E8%A8%98%E6%86%B6%E7%9A%84%E5%BE%AA%E7%92%B0%E7%A5%9E%E7%B6%93%E7%B6%B2%E8%B7%AF">recurrent neural networks</a> and <a href="https://leemeng.tw/shortest-path-to-the-nlp-world-a-gentle-guide-of-natural-language-processing-and-deep-learning-for-everyone.html#%E8%A8%98%E6%86%B6%E5%8A%9B%E5%A5%BD%E7%9A%84-LSTM-%E7%B4%B0%E8%83%9E">long short-term memory</a></li>
</ul>
<p>If you prefer to get the basics down first, start with the resources linked above.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="TensorFlow-2.0-開發">Developing with TensorFlow 2.0<a class="anchor-link" href="#TensorFlow-2.0-開發">¶</a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
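<p>Readers who follow deep learning probably already know that TensorFlow recently launched its <a href="https://www.tensorflow.org/alpha">2.0 Alpha preview</a>, which aims to make it easier for more people to build machine learning and deep learning applications through a new API.</p>
<p>One of my reasons for writing this article was to use this major release as a chance to get familiar with the TensorFlow 2.0 way of developing.</p>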
<div class="resp-container">
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" class="resp-iframe" frameborder="0" src="https://www.youtube.com/embed/TTQQiJ-mHYA"></iframe>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>TensorFlow 2.0 has many noteworthy <a href="https://youtu.be/YzLnnGiLNRE?list=PLQY2H8rRoyvzoUYI26kHmKSJBedn3SQuB">updates</a>, but the following are the most relevant to the typical ML developer:</p>
<ul>
<li><a href="https://www.tensorflow.org/alpha/guide/keras/overview">tf.keras</a> is established as the official high-level API</li>
<li><a href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/guide/eager.ipynb">Eager Execution</a>, which makes debugging easier, becomes the default</li>
<li>The <a href="https://www.tensorflow.org/alpha/guide/data_performance">tf.data</a> API reads and processes large amounts of data</li>
<li><a href="https://youtu.be/Up9CvRLIIIw?list=PLQY2H8rRoyvzoUYI26kHmKSJBedn3SQuB">tf.function</a> builds the computation graph for you automatically</li>
</ul>
<p>You will see the first three in this article. All the code in the following sections was run on <a href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/r2/tutorials/sequences/text_generation.ipynb">Google Colab</a> with the latest TensorFlow 2.0 Nightly build.</p>
<div class="highlight"><pre><span></span>pip<span class="w"> </span>install<span class="w"> </span>tf-nightly-gpu-2.0-preview
</pre></div>
<p>If you have a GPU, I strongly recommend installing the GPU build of TF Nightly; training can be more than 10 times faster than with the CPU build.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="深度學習專案步驟">Steps of a Deep Learning Project<a class="anchor-link" href="#深度學習專案步驟">¶</a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Now for the main event.</p>
<p>As in most deep learning projects, training an LSTM-based language model takes you through roughly the following steps:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<center>
<img src="https://leemeng.tw/images/lstm-text-generation/deep-learning-pj-steps-menglee.jpg"/>
</center>
<center>
The workflow I usually follow when developing a DL project
<br/>
<br/>
</center>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
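<p>This workflow is a general guide. You may need to adapt it to your own situation, and many of the steps have to be repeated.</p>
<p>This article uses TensorFlow 2.0 to walk you briefly through all of the steps.</p>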
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="1.-定義問題及要解決的任務">1. Define the problem and the task to solve<a class="anchor-link" href="#1.-定義問題及要解決的任務">¶</a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
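<p>Before training a model, we first have to pin down our problem and the task we want to hand to the machine.</p>
<p>As mentioned earlier, our goal is to obtain a language model of Demi-Gods and Semi-Devils: a model that, when fed a piece of text, produces text resembling the novel.</p>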
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><div class="resp-container">
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" class="resp-iframe" frameborder="0" src="https://www.youtube.com/embed/f1KUUz7v8g4?list=PLJV_el3uVTsPMxPbjeX7PicgWbY7F8wW9"></iframe>
</div></p>
<center>
I highly recommend Professor Hung-yi Lee's video on sequence generation
<br/>
<br/>
</center>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This is really a <a href="https://youtu.be/f1KUUz7v8g4?list=PLJV_el3uVTsPMxPbjeX7PicgWbY7F8wW9">sequence generation</a> problem, which makes the machine's task clear: given a sequence of tokens, it must output a plausible next token.</p>
<p>A token here can be</p>
<ul>
<li>a character (e.g. 劍, 寺, 雲)</li>
<li>a word (e.g. 吐蕃, 師弟, 阿修羅)</li>
</ul>
<p>This article uses the character as its token. Now suppose we have a sentence from the novel:</p>
<div class="highlight"><pre><span></span>『六脈神劍經』乃本寺鎮寺之寶,大理段氏武學的至高法要。
</pre></div>
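<p>Every character in this sentence, punctuation included, is a token, and the whole sentence forms a token sequence. We can take a portion of the sentence:</p>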
<div class="highlight"><pre><span></span>『六脈神劍經』乃本寺鎮寺之寶,大理段氏武
</pre></div>
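<p>Then, during training, we ask the model to read this text and predict the next character in the original: <code>學</code>.</p>
<p>Once training is done, you get the language model you saw at the beginning of this article.</p>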
</div>
</div>
</div>
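<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The input/target split described above can be written out in a few lines of plain Python. This is only a sketch of the idea: the helper name <code>make_example</code> and the context length of 20 are chosen here to reproduce the excerpt above, and a real training pipeline would build many such pairs from the whole novel.</p>

```python
sentence = "『六脈神劍經』乃本寺鎮寺之寶,大理段氏武學的至高法要。"

# Character-level tokenization: every character, punctuation included, is a token.
tokens = list(sentence)


def make_example(tokens, context_len):
    """Use the first context_len tokens as input and the following token as target."""
    context = "".join(tokens[:context_len])
    target = tokens[context_len]
    return context, target


context, target = make_example(tokens, 20)
print(context)  # 『六脈神劍經』乃本寺鎮寺之寶,大理段氏武
print(target)   # 學
```

<p>Sliding the split point along the text yields one training example per position, which is how a single novel can supply a very large number of examples.</p>
</div>
</div>
</div>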
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
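<h3 id="2.-準備原始數據、資料清理">2. Prepare and clean the raw data<a class="anchor-link" href="#2.-準備原始數據、資料清理">¶</a></h3><p>Even the cleverest cook cannot make a meal without rice: without data, there is nothing to talk about.</p>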
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<img src="https://leemeng.tw/images/lstm-text-generation/caroline-attwood-243834-unsplash.jpg"/>
<br/>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
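<p>I collected the original text of 天龍八部 from the web. After some simple cleaning, the whole novel comes to about 1.2 million characters (punctuation included), a truly monumental work. For copyright reasons I won't provide a download link, but you can search for any text that interests you.</p>
<p>Now suppose the full text is stored in a Python string called <code>text</code>. Part of it might look like this:</p>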
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class="highlight hl-ipython3"><pre><span></span><span class="c1"># Print an arbitrary slice: the characters at indices 9505 through 9701</span>
<span class="nb">print</span><span class="p">(</span><span class="n">text</span><span class="p">[</span><span class="mi">9505</span><span class="p">:</span><span class="mi">9702</span><span class="p">])</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>咱們見敵方人多,不得師父號令,沒敢隨便動手。」左子穆道:「嗯,來了多少人?」干光豪道:「大約七八十人。」左子穆嘿嘿冷笑,道:「七八十人,便想誅滅無量劍了?只怕也沒這麼容易。」
龔光傑道:「他們用箭射過來一封信,封皮上寫得好生無禮。」說著將信呈上。
左子穆見信封上寫著:「字諭左子穆」五個大字,便不接信,說道:「你拆來瞧瞧。」龔光傑道:「是!」拆開信封,抽出信箋。
那少女在段譽耳邊低聲道:
</pre>
</div>
</div>
</div>
</div>
</div>
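<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The cleaning itself is not shown in this article. As a rough sketch of what "simple cleaning" could involve, the snippet below normalizes line endings, trims indentation (including full-width spaces) and drops empty lines. These particular steps and the <code>clean_text</code> helper are assumptions for illustration, not the exact procedure used here.</p>

```python
def clean_text(raw):
    """Apply a few generic clean-up steps to raw novel text.

    Hypothetical helper: the steps are assumptions, not the
    article's actual cleaning procedure.
    """
    # Normalize Windows / old Mac line endings to "\n".
    raw = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Trim spaces, tabs and full-width spaces (U+3000) around each line.
    lines = [line.strip(" \t\u3000") for line in raw.split("\n")]
    # Drop lines that are empty after trimming.
    lines = [line for line in lines if line]
    return "\n".join(lines)


# A tiny in-memory sample standing in for the downloaded file.
raw = "\u3000\u3000左子穆道:「嗯,來了多少人?」\r\n\r\n  干光豪道:「大約七八十人。」  \r\n"
text = clean_text(raw)
print(text)
```

<p>With a real file you would first read it into <code>raw</code>, for example with <code>open(path, encoding="utf-8").read()</code>, where <code>path</code> is wherever you saved your own text.</p>
</div>
</div>
</div>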
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>We can also check how many characters the whole novel contains:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class="highlight hl-ipython3"><pre><span></span><span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
<span class="n">w</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">text</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Total characters in the novel: </span><span class="si">{</span><span class="n">n</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Distinct characters: </span><span class="si">{</span><span class="n">w</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Total characters in the novel: 1235431
Distinct characters: 4330
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>English gets by with just 26 letters, whereas Chinese uses thousands of characters: this novel alone contains 4,330 distinct ones.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<img src="https://leemeng.tw/images/lstm-text-generation/raychan-1061280-unsplash.jpg"/>
<br/>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
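<p>As explained in <a href="https://leemeng.tw/shortest-path-to-the-nlp-world-a-gentle-guide-of-natural-language-processing-and-deep-learning-for-everyone.html">my introductory guide to natural language processing and deep learning for everyone</a>, neural networks only understand numbers, so text data needs some preprocessing before we can feed it in.</p>
<p>Specifically, each character has to be mapped to an integer index or a vector.</p>
<p>We can use the <code>Tokenizer</code> in <code>tf.keras</code> to build a vocabulary from the whole novel and map every occurrence of a character to the same index:</p>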
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class="highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="nn">tf</span>
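<span class="c1"># Initialize a character-level Tokenizer.</span>
<span class="c1"># `num_words` (the vocabulary size limit) must be defined before this cell runs.</span>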
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">keras</span>\
<span class="o">.</span><span class="n">preprocessing</span>\
<span class="o">.</span><span class="n">text</span>\
<span class="o">.</span><span class="n">Tokenizer</span><span class="p">(</span>
<span class="n">num_words</span><span class="o">=</span><span class="n">num_words</span><span class="p">,</span>
<span class="n">char_level</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">filters</span><span class="o">=</span><span class="s1">''</span>
<span class="p">)</span>
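<span class="c1"># Let the tokenizer read the full text of the novel, adding each</span>
<span class="c1"># new character to its vocabulary, then convert the text into</span>
<span class="c1"># the corresponding sequence of integer indices.</span>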
<span class="n">tokenizer</span><span class="o">.</span><span class="n">fit_on_texts</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
<span class="n">text_as_int</span> <span class="o">=</span> <span class="n">tokenizer</span>\
<span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">([</span><span class="n">text</span><span class="p">])[</span><span class="mi">0</span><span class="p">]</span>
<span class="c1"># Pick a short excerpt to use in the explanations that follow</span>
<span class="n">s_idx</span> <span class="o">=</span> <span class="mi">21004</span>
<span class="n">e_idx</span> <span class="o">=</span> <span class="mi">21020</span>
<span class="n">partial_indices</span> <span class="o">=</span> \
<span class="n">text_as_int</span><span class="p">[</span><span class="n">s_idx</span><span class="p">:</span><span class="n">e_idx</span><span class="p">]</span>
<span class="n">partial_texts</span> <span class="o">=</span> <span class="p">[</span>
<span class="n">tokenizer</span><span class="o">.</span><span class="n">index_word</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span> \
<span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="n">partial_indices</span>
<span class="p">]</span>
<span class="c1"># Display the result (can be skipped)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Original character sequence:"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="n">partial_texts</span><span class="p">)</span>
<span class="nb">print</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"-"</span> <span class="o">*</span> <span class="mi">20</span><span class="p">)</span>
<span class="nb">print</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Converted index sequence:"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="n">partial_indices</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Original character sequence:
['司', '空', '玄', '雙', '掌', '飛', '舞', ',', '逼', '得', '牠', '無', '法', '近', '前', '。']
--------------------
Converted index sequence:
[557, 371, 215, 214, 135, 418, 1209, 1, 837, 25, 1751, 49, 147, 537, 111, 2]
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
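<p>The whole novel has now been turned into one huge sequence of integers, each standing for a single character.</p>
<p>Here is the same excerpt again, shown side by side:</p>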
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Character (human view)    Index (model input)
------------------------------
司 557
空 371
玄 215
雙 214
掌 135
飛 418
舞 1209
, 1
逼 837
得 25
牠 1751
無 49
法 147
近 537
前 111
。 2
</pre>
</div>
</div>
</div>
</div>
</div>
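<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Notice that the very common punctuation marks received the smallest indices (1 and 2). As far as I understand it, that is because the Keras <code>Tokenizer</code> numbers tokens by descending frequency, starting from 1. The sketch below imitates that idea in plain Python. It is not the Keras implementation, and <code>build_char_index</code> is a hypothetical helper name.</p>

```python
from collections import Counter


def build_char_index(text):
    """Map each character to an integer, most frequent first, starting at 1.

    Illustrative stand-in for a character-level tokenizer, not Keras code.
    """
    counts = Counter(text)
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked = [ch for ch, _ in counts.most_common()]
    char_to_index = {ch: i for i, ch in enumerate(ranked, start=1)}
    index_to_char = {i: ch for ch, i in char_to_index.items()}
    return char_to_index, index_to_char


sample = "司空玄雙掌飛舞,逼得牠無法近前。左子穆道:「嗯,來了多少人?」"
char_to_index, index_to_char = build_char_index(sample)

# Encode the text as integers, then decode it back to check the round trip.
encoded = [char_to_index[ch] for ch in sample]
decoded = "".join(index_to_char[i] for i in encoded)

print(encoded[:8])
print(decoded == sample)
```

<p>In this small sample the comma is the only character that appears twice, so it gets index 1. Index 0 is never assigned, which leaves it free for uses such as padding.</p>
</div>
</div>
</div>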
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
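<h3 id="3.-建立能丟入模型的資料集">3. Build a dataset the model can consume<a class="anchor-link" href="#3.-建立能丟入模型的資料集">¶</a></h3><p>With basic preprocessing done, we need to convert the huge integer sequence <code>text_as_int</code> into a format and size that a neural network can digest easily.</p>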
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class="highlight hl-ipython3"><pre><span></span><span class="n">text_as_int</span><span class="p">[:</span><span class="mi">10</span><span class="p">]</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>[1639, 148, 3, 3, 280, 5, 192, 819, 374, 800]</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class="highlight hl-ipython3"><pre><span></span><span class="n">_type</span> <span class="o">=</span> <span class="nb">type</span><span class="p">(</span><span class="n">text_as_int</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">text_as_int</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"text_as_int is a </span><span class="si">{</span><span class="n">_type</span><span class="si">}</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Sequence length of the novel: </span><span class="si">{</span><span class="n">n</span><span class="si">}</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"First 5 indices:"</span><span class="p">,</span> <span class="n">text_as_int</span><span class="p">[:</span><span class="mi">5</span><span class="p">])</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>text_as_int is a &lt;class 'list'&gt;
Sequence length of the novel: 1235431
First 5 indices: [1639, 148, 3, 3, 280]
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<blockquote>
<p>
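When building a dataset, first picture what the data finally handed to the model should look like. That makes it much easier to choose the right transformations.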
<br/>
<br/>
</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
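<p>The data format you feed a model depends on the machine learning task at hand.</p>
<p>In our sequence generation task, the ideal model predicts the next character from the preceding text. So the model's input is a sequence of integers, each representing a character:</p>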
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class="highlight hl-ipython3"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="s2">"Integer sequence actually fed to the model:"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">partial_indices</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="nb">print</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"The same sequence as text, for our own reading:"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">partial_texts</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Integer sequence actually fed to the model:
[557, 371, 215, 214, 135, 418, 1209, 1, 837, 25, 1751, 49, 147, 537, 111]
The same sequence as text, for our own reading:
['司', '空', '玄', '雙', '掌', '飛', '舞', ',', '逼', '得', '牠', '無', '法', '近', '前']
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The ideal output from the model is the same sequence shifted left by one character:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class="highlight hl-ipython3"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="s2">"Integer sequence the model should output:"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">partial_indices</span><span class="p">[</span><span class="mi">1</span><span class="p">:])</span>
<span class="nb">print</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"The same sequence as text, for our own reading:"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">partial_texts</span><span class="p">[</span><span class="mi">1</span><span class="p">:])</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Integer sequence the model should output:
[371, 215, 214, 135, 418, 1209, 1, 837, 25, 1751, 49, 147, 537, 111, 2]
The same sequence as text, for our own reading:
['空', '玄', '雙', '掌', '飛', '舞', ',', '逼', '得', '牠', '無', '法', '近', '前', '。']
</pre>
</div>
</div>
</div>
</div>
</div>
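<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The two slices above can be wrapped in a small helper. This is only a sketch of the input/target split on a plain Python list. The function name <code>split_input_target</code> is an assumption for illustration, and the indices are the ones printed earlier for the excerpt.</p>

```python
def split_input_target(chunk):
    """Return (inputs, targets) for next-token prediction.

    inputs  = the chunk without its last item
    targets = the chunk shifted left by one position
    """
    return chunk[:-1], chunk[1:]


# Indices of the 16-character excerpt shown earlier.
partial_indices = [557, 371, 215, 214, 135, 418, 1209, 1,
                   837, 25, 1751, 49, 147, 537, 111, 2]
inputs, targets = split_input_target(partial_indices)

# At every position t, the target is the item that follows
# inputs[t] in the original chunk.
for t, (x, y) in enumerate(zip(inputs, targets)):
    assert x == partial_indices[t]
    assert y == partial_indices[t + 1]

print(len(inputs), len(targets))
```

<p>A chunk of 16 tokens therefore yields 15 aligned input/target positions, one prediction per input character.</p>
</div>
</div>
</div>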
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Why this particular pairing?</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<img src="https://leemeng.tw/images/lstm-text-generation/bruce-mars-559223-unsplash.jpg"/>
<br/>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Let's line up the input sequence and the output sequence:</p>
<div class="highlight"><pre><span></span>司 空 玄 雙 掌 飛 舞 , 逼 得 牠 無 法 近
空 玄 雙 掌 飛 舞 , 逼 得 牠 無 法 近 前
</pre></div>
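<p>Reading from left to right, a model that can produce this output is one that:</p>
<ul>
<li>outputs <code>空</code> correctly on seeing the first input character <code>司</code></li>
<li>outputs <code>玄</code> when it has already seen <code>司</code> and the new input character is <code>空</code></li>
<li>outputs <code>雙</code> when it has already seen <code>司空</code> and the new input character is <code>玄</code></li>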