Ewha-Euron · nowionlyseedaylight · Jul 18, 2022
diff --git a/week20_nlp_hw_김나현.ipynb b/week20_nlp_hw_김나현.ipynb
@@ -0,0 +1,378 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "name": "week20_nlp_hw_김나현.ipynb",
+      "provenance": [],
+      "collapsed_sections": [],
+      "include_colab_link": true
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "view-in-github",
+        "colab_type": "text"
+      },
+      "source": [
+        "<a href=\"https://colab.research.google.com/github/nowionlyseedaylight/2022-1-Euron-Study-Assignments/blob/Week_20/week20_nlp_hw_%EA%B9%80%EB%82%98%ED%98%84.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
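+        "📌 The week 20 assignment consists of **the week 18 Constituency Parsing and TreeRNNs exercises**.\n",
+        "\n",
+        "📌 The assignment is built from written materials: exercises from the WikiDocs textbook *Introduction to Natural Language Processing Using Deep Learning* and related blog posts.\n",
+        "\n",
+        "📌 Follow the linked materials and **type the code out yourself (transcription)** to get familiar with the basic libraries and methods for this NLP task 😊 Parts marked required must be included in your submission; parts marked optional are for self-study.\n",
+        "\n",
+        "📌 If you have questions, feel free to share them through GitHub issues, the KakaoTalk chat room, or the time before the session presentations begin."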
+      ],
+      "metadata": {
+        "id": "BX3ac8Ag1RPC"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
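+        "🥰 **Work through the examples below.**\n",
+        "\n",
+        "**Parts 1 and 2 are required.**\n",
+        "\n",
+        "This unit is mainly about understanding concepts, so there are few hands-on exercises. That is because TreeRNNs are rarely used in practice. Why are they rarely used? Let's start this assignment by thinking about that question."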
+      ],
+      "metadata": {
+        "id": "Kq8aMYKGPQR0"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "`your answer here`"
+      ],
+      "metadata": {
+        "id": "ZJLIfgQ9vlNe"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### **1️⃣ Probabilistic Parsing Exercise**"
+      ],
+      "metadata": {
+        "id": "SHTPAk95iNtP"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "📌 [Probabilistic Context Free Grammars](https://lost-contact.mit.edu/afs/cs.pitt.edu/projects/nltk/docs/tutorial/pcfg/nochunks.html#pcfg) \n",
+        "\n",
+        "Applying PCFG rules to a TreeRNN gives the Syntactically-Untied RNN. Let's practice PCFG rules."
+      ],
+      "metadata": {
+        "id": "9L-jAHPkiBV0"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Note: `PCFG_Rule` from the linked tutorial belongs to an old NLTK API and raises a `NameError` in current NLTK, so the next cell uses `ProbabilisticProduction` from `nltk.grammar` instead."
+      ],
+      "metadata": {
+        "id": "80M7kAPWWk0P"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from nltk import Tree\n",
+        "import nltk\n",
+        "from nltk import Nonterminal\n",
+        "from nltk import induce_pcfg\n",
+        "from nltk.parse import pchart\n",
+        "from nltk.grammar import ProbabilisticProduction, PCFG\n",
+        "\n",
+        "# Define some nonterminals (one variable per symbol, in the same order)\n",
+        "S, NP, V, VP = [Nonterminal(s) for s in 'S NP V VP'.split()]\n",
+        "\n",
+        "# Create some PCFG rules: ProbabilisticProduction(lhs, rhs, prob=...)\n",
+        "rule1 = ProbabilisticProduction(VP, [V, NP], prob=0.23)\n",
+        "rule2 = ProbabilisticProduction(V, ['saw'], prob=0.12)\n",
+        "rule3 = ProbabilisticProduction(NP, ['cookie'], prob=0.04)"
+      ],
+      "metadata": {
+        "id": "Klj6gjETZdgS"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "print(rule1.prob(), rule2.prob(), rule3.prob())"
+      ],
+      "metadata": {
+        "id": "xxZb8lvFS_ER"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
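+        "from nltk.grammar import Nonterminal, ProbabilisticProduction, PCFG\n",
+        "\n",
+        "S, NP, VP = [Nonterminal(s) for s in 'S NP VP'.split()]\n",
+        "\n",
+        "# The probabilities of all rules sharing a left-hand side must sum to 1\n",
+        "rules = [ProbabilisticProduction(S, [VP, NP], prob=1.0),\n",
+        "         ProbabilisticProduction(VP, ['saw', NP], prob=0.4),\n",
+        "         ProbabilisticProduction(VP, ['ate'], prob=0.3),\n",
+        "         ProbabilisticProduction(VP, ['gave', NP, NP], prob=0.3),\n",
+        "         ProbabilisticProduction(NP, ['the', 'cookie'], prob=0.8),\n",
+        "         ProbabilisticProduction(NP, ['Jack'], prob=0.2)]\n",
+        "grammar = PCFG(S, rules)"
+      ],
+      "metadata": {
+        "id": "MsJZ9ldFiARK"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "print(grammar.start())\n",
+        "print(grammar.productions())"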
+      ],
+      "metadata": {
+        "id": "kHnDXc72UPwb"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### **2️⃣ Tree Based Algorithm**"
+      ],
+      "metadata": {
+        "id": "HfTr_BPwGc8D"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "📌 [Decision Tree](https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/ml-decision-tree/tutorial/) exercise\n",
+        "\n",
+        "Reference: [Embedding Graphs with Deep Learning (reading material)](https://towardsdatascience.com/embedding-graphs-with-deep-learning-55e0c66d7752)\n",
+        "\n",
+        "A Decision Tree is a tree algorithm commonly used in machine learning. Let's work through a Decision Tree exercise to understand tree structures and algorithms.\n"
+      ],
+      "metadata": {
+        "id": "0HnXNyCAwSHJ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "#Importing required libraries\n",
+        "import pandas as pd\n",
+        "import numpy as np\n",
+        "from sklearn.datasets import load_iris\n",
+        "from sklearn.tree import DecisionTreeClassifier\n",
+        "from sklearn.model_selection import train_test_split"
+      ],
+      "metadata": {
+        "id": "svtikbdZatBY"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "#Loading the iris data\n",
+        "data = load_iris()\n",
+        "print('Classes to predict: ', data.target_names)"
+      ],
+      "metadata": {
+        "id": "EcPIZc8GXCsW"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "#Extracting data attributes\n",
+        "X = data.data\n",
+        "### Extracting target/ class labels\n",
+        "y = data.target\n",
+        "\n",
+        "print('Number of examples in the data:', X.shape[0])"
+      ],
+      "metadata": {
+        "id": "s2l9Pw0vXyDd"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "#First four rows in the variable 'X'\n",
+        "X[:4]\n",
+        "\n",
+        "#Output\n",
+        "#Out: array([[5.1, 3.5, 1.4, 0.2],\n",
+        "#       [4.9, 3. , 1.4, 0.2],\n",
+        "#       [4.7, 3.2, 1.3, 0.2],\n",
+        "#       [4.6, 3.1, 1.5, 0.2]])"
+      ],
+      "metadata": {
+        "id": "O7lpcCxBXz9R"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "#Using the train_test_split to create train and test sets.\n",
+        "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 47, test_size = 0.25)"
+      ],
+      "metadata": {
+        "id": "sndI6Bx5X3ip"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "#Importing the Decision tree classifier from the sklearn library.\n",
+        "from sklearn.tree import DecisionTreeClassifier\n",
+        "clf = DecisionTreeClassifier(criterion = 'entropy')"
+      ],
+      "metadata": {
+        "id": "eE6drUsBX4XW"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "#Training the decision tree classifier. \n",
+        "clf.fit(X_train, y_train)\n",
+        "\n",
+        "#Output (as shown in the tutorial; the printed repr depends on the scikit-learn version):\n",
+        "#Out:DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,\n",
+        "#            max_features=None, max_leaf_nodes=None,\n",
+        "#            min_impurity_decrease=0.0, min_impurity_split=None,\n",
+        "#            min_samples_leaf=1, min_samples_split=2,\n",
+        "#            min_weight_fraction_leaf=0.0, presort=False, random_state=None,\n",
+        "#            splitter='best')"
+      ],
+      "metadata": {
+        "id": "P0shnzAqX7aN"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "#Predicting labels on the test set.\n",
+        "y_pred =  clf.predict(X_test)"
+      ],
+      "metadata": {
+        "id": "wZz6zOMNX94S"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "#Importing the accuracy metric from sklearn.metrics library\n",
+        "\n",
+        "from sklearn.metrics import accuracy_score\n",
+        "print('Accuracy Score on train data: ', accuracy_score(y_true=y_train, y_pred=clf.predict(X_train)))\n",
+        "print('Accuracy Score on test data: ', accuracy_score(y_true=y_test, y_pred=y_pred))\n",
+        "\n",
+        "#Output (as reported in the tutorial):\n",
+        "#Out: Accuracy Score on train data:  1.0\n",
+        "#    Accuracy Score on test data:  0.9473684210526315"
+      ],
+      "metadata": {
+        "id": "GYoWWt5IX_jN"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "clf = DecisionTreeClassifier(criterion='entropy', min_samples_split=50)\n",
+        "clf.fit(X_train, y_train)\n",
+        "print('Accuracy Score on train data: ', accuracy_score(y_true=y_train, y_pred=clf.predict(X_train)))\n",
+        "print('Accuracy Score on the test data: ', accuracy_score(y_true=y_test, y_pred=clf.predict(X_test)))\n",
+        "\n",
+        "#Output (as reported in the tutorial):\n",
+        "#Out: Accuracy Score on train data:  0.9553571428571429\n",
+        "#    Accuracy Score on test data:  0.9736842105263158"
+      ],
+      "metadata": {
+        "id": "IFNHxKjZYBCz"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### References\n",
+        "\n",
+        "- [RNTN](https://ratsgo.github.io/deep%20learning/2017/06/24/RNTN/)\n",
+        "\n",
+        "- [When and Why Tree-Based Models (Often) Outperform Neural Networks](https://towardsdatascience.com/when-and-why-tree-based-models-often-outperform-neural-networks-ceba9ecd0fd8)\n",
+        "\n",
+        "- The paper related to last week's material, [Parsing with Compositional Vector Grammars](https://nlp.stanford.edu/pubs/SocherBauerManningNg_ACL2013.pdf), should also be helpful! :)\n",
+        "\n",
+        "- It is also worth reading [this column on why RNNs and LSTMs have fallen out of use](https://medium.com/towards-data-science/the-fall-of-rnn-lstm-2d1594c74ce0)."
+      ],
+      "metadata": {
+        "id": "pNDTxHF3sxo8"
+      }
+    }
+  ]
+}