378 changes: 378 additions & 0 deletions week20_nlp_hw_김나현.ipynb
@@ -0,0 +1,378 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "week20_nlp_hw_김나현.ipynb",
"provenance": [],
"collapsed_sections": [],
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/github/nowionlyseedaylight/2022-1-Euron-Study-Assignments/blob/Week_20/week20_nlp_hw_%EA%B9%80%EB%82%98%ED%98%84.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
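{
"cell_type": "markdown",
"source": [
"📌 The week 20 assignment consists of **hands-on practice for week 18's Constituency Parsing and TreeRNNs**.\n",
"\n",
"📌 This assignment is built from the exercises in the Wikidocs textbook *Introduction to Natural Language Processing Using Deep Learning*, related blog posts, and other written materials.\n",
"\n",
"📌 Follow the linked materials and **type the code out yourself (transcription)** to get familiar with the basic libraries and methods for this NLP task 😊 Parts marked as required must be included in your submission; parts marked as optional are for self-study.\n",
"\n",
"📌 If you have questions, feel free to share them through GitHub issues, the KakaoTalk chat room, or the time before the session presentation starts!"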
],
"metadata": {
"id": "BX3ac8Ag1RPC"
}
},
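{
"cell_type": "markdown",
"source": [
"🥰 **Work through the examples below.**\n",
"\n",
"**Sections 1 and 2 are required.**\n",
"\n",
"This unit is mainly about understanding concepts rather than hands-on practice, so there are few exercises. That is because TreeRNNs are rarely used in practice. Why are they rarely used? Let's start this assignment by thinking about that question."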
],
"metadata": {
"id": "Kq8aMYKGPQR0"
}
},
{
"cell_type": "markdown",
"source": [
"`your answer here`"
],
"metadata": {
"id": "ZJLIfgQ9vlNe"
}
},
{
"cell_type": "markdown",
"source": [
"### **1️⃣ Probabilistic Parsing Practice**"
],
"metadata": {
"id": "SHTPAk95iNtP"
}
},
{
"cell_type": "markdown",
"source": [
"📌 [Probabilistic Context Free Grammars](https://lost-contact.mit.edu/afs/cs.pitt.edu/projects/nltk/docs/tutorial/pcfg/nochunks.html#pcfg) \n",
"\n",
"Applying PCFG rules to a TreeRNN gives the Syntactically-Untied RNN. Let's practice working with PCFG rules."
],
"metadata": {
"id": "9L-jAHPkiBV0"
}
},
{
"cell_type": "markdown",
"source": [
"Note: the linked tutorial appears to target an old NLTK release. Its `PCFG_Rule` class is not available in current NLTK (the original cell failed with `NameError: name 'PCFG_Rule' is not defined`), so the cells below use `ProbabilisticProduction` and `PCFG` from `nltk.grammar` instead."
],
"metadata": {
"id": "80M7kAPWWk0P"
}
},
{
"cell_type": "code",
"source": [
"from nltk import Tree\n",
"import nltk\n",
"from nltk import Nonterminal\n",
"from nltk import induce_pcfg\n",
"from nltk.parse import pchart\n",
"from nltk.grammar import PCFG, ProbabilisticProduction\n",
"\n",
"# Define some nonterminals\n",
"S, NP, V, VP = [Nonterminal(s) for s in 'S NP V VP'.split()]\n",
"\n",
"# Create some PCFG rules: ProbabilisticProduction(lhs, rhs, prob=p)\n",
"# replaces the tutorial's PCFG_Rule(p, lhs, *rhs)\n",
"rule1 = ProbabilisticProduction(VP, [V, NP], prob=0.23)\n",
"rule2 = ProbabilisticProduction(V, ['saw'], prob=0.12)\n",
"rule3 = ProbabilisticProduction(NP, ['cookie'], prob=0.04)"
],
"metadata": {
"id": "Klj6gjETZdgS",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 251
},
"outputId": "28e21044-3b03-43d2-df5d-b300de7e12eb"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"print(rule1.prob(), rule2.prob(), rule3.prob())"
],
"metadata": {
"id": "xxZb8lvFS_ER"
},
"execution_count": null,
"outputs": []
},
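{
"cell_type": "code",
"source": [
"from nltk import Nonterminal\n",
"from nltk.grammar import PCFG, ProbabilisticProduction\n",
"\n",
"S, NP, VP = [Nonterminal(s) for s in 'S NP VP'.split()]\n",
"# Probabilities of productions sharing a left-hand side must sum to 1\n",
"rules = [ProbabilisticProduction(S, [VP, NP], prob=1.0),\n",
"         ProbabilisticProduction(VP, ['saw', NP], prob=0.4),\n",
"         ProbabilisticProduction(VP, ['ate'], prob=0.3),\n",
"         ProbabilisticProduction(VP, ['gave', NP, NP], prob=0.3),\n",
"         ProbabilisticProduction(NP, ['the', 'cookie'], prob=0.8),\n",
"         ProbabilisticProduction(NP, ['Jack'], prob=0.2)]\n",
"grammar = PCFG(S, rules)"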
],
"metadata": {
"id": "MsJZ9ldFiARK"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Print both values; a notebook cell only displays its last expression\n",
"print(grammar.start())\n",
"print(grammar.productions())"
],
"metadata": {
"id": "kHnDXc72UPwb"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### **2️⃣ Tree Based Algorithm**"
],
"metadata": {
"id": "HfTr_BPwGc8D"
}
},
{
"cell_type": "markdown",
"source": [
"📌 [Decision Tree](https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/ml-decision-tree/tutorial/) practice\n",
"\n",
"Reference: [Embedding Graphs with Deep Learning (reading material)](https://towardsdatascience.com/embedding-graphs-with-deep-learning-55e0c66d7752)\n",
"\n",
"Decision trees are a tree algorithm commonly used in machine learning. Practice with a decision tree to understand the tree structure and the algorithm.\n"
],
"metadata": {
"id": "0HnXNyCAwSHJ"
}
},
{
"cell_type": "code",
"source": [
"#Importing required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.datasets import load_iris\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.model_selection import train_test_split"
],
"metadata": {
"id": "svtikbdZatBY"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"#Loading the iris data (load_iris returns a Bunch with .data, .target and .target_names)\n",
"data = load_iris()\n",
"print('Classes to predict: ', data.target_names)"
],
"metadata": {
"id": "EcPIZc8GXCsW"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"#Extracting data attributes\n",
"X = data.data\n",
"### Extracting target/ class labels\n",
"y = data.target\n",
"\n",
"print('Number of examples in the data:', X.shape[0])"
],
"metadata": {
"id": "s2l9Pw0vXyDd"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"#First four rows in the variable 'X'\n",
"X[:4]\n",
"\n",
"#Output:\n",
"# array([[5.1, 3.5, 1.4, 0.2],\n",
"#        [4.9, 3. , 1.4, 0.2],\n",
"#        [4.7, 3.2, 1.3, 0.2],\n",
"#        [4.6, 3.1, 1.5, 0.2]])"
],
"metadata": {
"id": "O7lpcCxBXz9R"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"#Using the train_test_split to create train and test sets.\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 47, test_size = 0.25)"
],
"metadata": {
"id": "sndI6Bx5X3ip"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"#Importing the Decision tree classifier from the sklearn library.\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"clf = DecisionTreeClassifier(criterion = 'entropy')"
],
"metadata": {
"id": "eE6drUsBX4XW"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"#Training the decision tree classifier. \n",
"clf.fit(X_train, y_train)\n",
"\n",
"#Output (as shown in the tutorial; the repr depends on the scikit-learn version):\n",
"# DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,\n",
"#             max_features=None, max_leaf_nodes=None,\n",
"#             min_impurity_decrease=0.0, min_impurity_split=None,\n",
"#             min_samples_leaf=1, min_samples_split=2,\n",
"#             min_weight_fraction_leaf=0.0, presort=False, random_state=None,\n",
"#             splitter='best')"
],
"metadata": {
"id": "P0shnzAqX7aN"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"#Predicting labels on the test set.\n",
"y_pred = clf.predict(X_test)"
],
"metadata": {
"id": "wZz6zOMNX94S"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"#Importing the accuracy metric from sklearn.metrics library\n",
"\n",
"from sklearn.metrics import accuracy_score\n",
"print('Accuracy Score on train data: ', accuracy_score(y_true=y_train, y_pred=clf.predict(X_train)))\n",
"print('Accuracy Score on test data: ', accuracy_score(y_true=y_test, y_pred=y_pred))\n",
"\n",
"#Output (as reported in the tutorial; exact values may vary between runs):\n",
"# Accuracy Score on train data: 1.0\n",
"# Accuracy Score on test data: 0.9473684210526315"
],
"metadata": {
"id": "GYoWWt5IX_jN"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"#Requiring at least 50 samples to split a node limits tree depth and reduces overfitting\n",
"clf = DecisionTreeClassifier(criterion='entropy', min_samples_split=50)\n",
"clf.fit(X_train, y_train)\n",
"print('Accuracy Score on train data: ', accuracy_score(y_true=y_train, y_pred=clf.predict(X_train)))\n",
"print('Accuracy Score on the test data: ', accuracy_score(y_true=y_test, y_pred=clf.predict(X_test)))\n",
"\n",
"#Output (as reported in the tutorial; exact values may vary between runs):\n",
"# Accuracy Score on train data: 0.9553571428571429\n",
"# Accuracy Score on test data: 0.9736842105263158"
],
"metadata": {
"id": "IFNHxKjZYBCz"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### References\n",
"\n",
"- [RNTN](https://ratsgo.github.io/deep%20learning/2017/06/24/RNTN/)\n",
"\n",
"- [When and Why Tree-Based Models (Often) Outperform Neural Networks](https://towardsdatascience.com/when-and-why-tree-based-models-often-outperform-neural-networks-ceba9ecd0fd8)\n",
"\n",
"- The paper related to last week's material, [Parsing with Compositional Vector Grammars](https://nlp.stanford.edu/pubs/SocherBauerManningNg_ACL2013.pdf), should also be helpful! :)\n",
"\n",
"- It is also worth reading [this column on why RNNs and LSTMs are falling out of use](https://medium.com/towards-data-science/the-fall-of-rnn-lstm-2d1594c74ce0)."
],
"metadata": {
"id": "pNDTxHF3sxo8"
}
}
]
}