wri · yirogue · Nov 4, 2019 · Nov 4, 2019 · Nov 15, 2019 · Nov 20, 2019
diff --git a/columbia_codes/README.md b/columbia_codes/README.md
@@ -0,0 +1,44 @@
+Columbia-Project
+==============================
+
+This folder contains the code and data for the Columbia team's improvements to the original WRI project.
+
+## Code
+
+The code we developed is organized into the following four folders.
+1. feature engineering:
+   * initial data exploration: [comparison in gold_standard.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/feature%20engineering/comparison%20in%20gold_standard.ipynb); [extraction+tokens+sentiment+entity+similarity.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/feature%20engineering/extraction%2Btokens%2Bsentiment%2Bentity%2Bsimilarity.ipynb)
+   * named entity recognition: [Entity Recognition.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/feature%20engineering/Entity%20Recognition.ipynb)
+   * minimum spanning tree: [distance calculation + minimum spanning tree.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/feature%20engineering/distance%20calculation%20%2B%20minimum%20spanning%20tree.ipynb)
+   * finding and testing new rules for Snorkel: [feature engineering_for snorkel.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/feature%20engineering/feature%20engineering_for%20snorkel.ipynb)
+   * POS-tag-related features: [pos tag and other features.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/feature%20engineering/pos%20tag%20and%20other%20features.ipynb)
+   * topic modeling and sentiment score: [topic modeling & sentiment score_clean version.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/feature%20engineering/topic%20modeling%20%26%20sentiment%20score_clean%20version.ipynb)
+
+2. snorkel:
+   * original Snorkel model with three-class labels: [snorkel_first version.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/snorkel/snorkel_first%20version.ipynb)
+   * original Snorkel model with 4 new functions added: [snorkel_second version.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/snorkel/snorkel_second%20version.ipynb)
+   * updated Snorkel model with binary labels and the 4 new functions: [snorkel_third version_binary.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/snorkel/snorkel_third%20version_binary.ipynb)
+
+3. babble labble:
+   * Babble Labble implementation: [babble labble.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/babble%20labble/babble%20labble.ipynb)
+
+4. roberta & neural nets:
+   * RoBERTa code for word embeddings: [roberta-embedding-1028.py](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/roberta%20%26%20nerual%20nets/roberta-classification-1028.py)
+   * generation of the POS-tag and n-gram features that were added to the neural network model: [feature_engineering.py](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/roberta%20%26%20nerual%20nets/feature_engineering.py)
+   * final neural network model with an 80-20 train-validation split: [neural-networks-1201.py](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/roberta%20%26%20nerual%20nets/neural-networks-1201.py)
+   * final neural network model with k-fold cross-validation: [neural-networks-1204.py](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/roberta%20%26%20nerual%20nets/neural-networks-1204.py)
+   * alternative RoBERTa classification method: [roberta_direct_classification.py](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/roberta%20%26%20nerual%20nets/roberta_direct_classification.py); [utils.py](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/roberta%20%26%20nerual%20nets/utils.py) (helper utilities for classification)
+
+## Data
+
+The `data` folder contains the new data generated by the Snorkel model, along with the new features.
+1. New gold standard data and noisy data with binary labels generated by Snorkel:
+   * gold standard data: [snorkel_proba_updated.npy](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/data/snorkel_proba_updated.npy)
+   * noisy data: [snorkel_noisy_proba_updated.npy](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/data/snorkel_noisy_proba_updated.npy)
+
+2. Features:
+   * named entity recognition: [NER_results_gold_file.csv](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/data/NER_results_gold_file.csv) (gold_standard); [NER_results_noisy_file.csv](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/data/NER_results_noisy_file.csv) (noisy data)
+   * minimum spanning tree: [tfidf_matrix.csv](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/data/tfidf_matrix.csv) (similarity computed with TF-IDF); [word2vec_matrix.csv](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/data/word2vec_matrix.csv) (similarity computed with word2vec)
+   * topic modeling and sentiment score: [alldata_with_topic_sentiscore.csv](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/data/alldata_with_topic_sentiscore.csv)
+   * features added to the final model (POS tags, top 200 unigrams, topic modeling, sentiment score): [gs_allFeatures_new.csv](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/data/gs_allFeatures_new.csv) (gold_standard); [noisy_allFeatures_new.csv](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/data/noisy_allFeatures_new.csv) (noisy data)
+
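To inspect the Snorkel outputs listed under Data, a minimal sketch of loading a `.npy` file and turning it into binary labels might look like the following. It assumes the array holds one positive-class probability per sentence, which the README does not state. The file written here is a synthetic stand-in with a hypothetical name, not the real `snorkel_proba_updated.npy`, and the 0.5 threshold is only illustrative.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for snorkel_proba_updated.npy (assumed layout: one
# positive-class probability per sentence; the real file may differ).
proba = np.array([0.91, 0.12, 0.55, 0.40])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "snorkel_proba_demo.npy")  # hypothetical file name
    np.save(path, proba)
    loaded = np.load(path)  # fully read into memory before the directory is removed

# Threshold the probabilities into hard binary labels (0.5 is an assumption).
binary_labels = (loaded >= 0.5).astype(int)
print(binary_labels.tolist())
```

If the real array turns out to be two-dimensional (one column per class), the thresholding step would need to select the relevant column first.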
diff --git a/columbia_codes/babble labble/babble labble.ipynb b/columbia_codes/babble labble/babble labble.ipynb
@@ -0,0 +1,126 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%load_ext autoreload\n",
+    "%autoreload 2\n",
+    "\n",
+    "import nltk\n",
+    "import metal\n",
+    "import metal.contrib.info_extraction.mentions\n",
+    "import numpy as np\n",
+    "\n",
+    "\n",
+    "\n",
+    "# load sentence\n",
+    "sentence_1 = 'with international standards. international and regional markets are more lucrative than local markets and accessing them will increase returns.'\n",
+    "sentence_2 = 'developing and operationalizing internal data management within the subsector and among the agricultural sector ministries and agencies will enhance efficiency in service delivery.'\n",
+    "\n",
+    "# tokenize each sentence and keep only the POS tag of each token\n",
+    "token1 = nltk.word_tokenize(sentence_1)\n",
+    "token2 = nltk.word_tokenize(sentence_2)\n",
+    "\n",
+    "pos1 = [tag for _, tag in nltk.pos_tag(token1)]\n",
+    "pos2 = [tag for _, tag in nltk.pos_tag(token2)]\n",
+    "\n",
+    "# entity-type flags and NER tags per token; '1' / 'PERSON' mark the two entities (X and Y)\n",
+    "ent1 = ['0','0','0','0','0','0','1','0','0','0','0','0','0','0','0','0','0','0','1']\n",
+    "ent2 = ['0','0','0','0','0','0','0','0','1','0','0','0','0','0','0','0','0','0','0','1', '0','0','0','0']\n",
+    "\n",
+    "ner1 = ['0','0','0','0','0','0','PERSON','0','0','0','0','0','0','0','0','0','0','0','PERSON']\n",
+    "ner2 = ['0','0','0','0','0','0','0','0','PERSON','0','0','0','0','0','0','0','0','0','0','PERSON', '0','0','0','0']\n",
+    "\n",
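+    "# RelationMention arguments used below: document id, sentence text, and the\n",
+    "# (start, end) character offsets of the two entities (X and Y), followed by\n",
+    "# the per-token POS tags, NER tags, and entity-type flags built above.\n",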
+    "\n",
+    "# wrap each sentence in a RelationMention; spans are (start, end) character offsets, end exclusive\n",
+    "ex1 = metal.contrib.info_extraction.mentions.RelationMention(1,\n",
+    "            sentence_1, [(57, 64), (136, 143)], pos_tags=pos1, ner_tags=ner1, entity_types=ent1)\n",
+    "ex2 = metal.contrib.info_extraction.mentions.RelationMention(2,\n",
+    "            sentence_2, [(68, 77), (149, 159)], pos_tags=pos2, ner_tags=ner2, entity_types=ent2)\n",
+    "train_data = [[ex1], [ex2]]\n",
+    "\n",
+    "# prepare labels into desired format\n",
+    "label1 = np.array([1])\n",
+    "label2 = np.array([2])\n",
+    "label = [label1,label2]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import babble\n",
+    "from babble.utils import display_candidate\n",
+    "\n",
+    "babbler = babble.BabbleStream(train_data, label, balanced=False, shuffled=False, seed=456)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "candidate = babbler.next()\n",
+    "display_candidate(candidate)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "explanation1 = babble.explanation.Explanation(\n",
+    "    name='LF_lucrative_between',\n",
+    "    label=2,\n",
+    "    condition='The word \"lucrative\" is between X and Y',\n",
+    "    candidate=train_data[0][0]\n",
+    ")\n",
+    "\n",
+    "explanations = [explanation1]\n",
+    "\n",
+    "parses, filtered = babbler.apply(explanations)\n",
+    "print(parses)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "babbler.commit()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
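The entity spans passed to `RelationMention` in the notebook are hard-coded character offsets, which are easy to get wrong by one. As a rough sketch, they could instead be derived from the sentence text. The helper `word_span` below is hypothetical (it is not part of `metal` or `babble`), and it assumes the end offset is exclusive, following Python slice semantics; whether `RelationMention` uses the same convention should be checked against the library.

```python
import re


def word_span(text, word, occurrence=0):
    """Return (start, end) character offsets of the n-th whole-word match.

    The end offset is exclusive, so text[start:end] == word.
    Raises IndexError if the word does not occur that many times.
    """
    matches = list(re.finditer(r"\b" + re.escape(word) + r"\b", text))
    match = matches[occurrence]
    return match.start(), match.end()


sentence_1 = ("with international standards. international and regional markets "
              "are more lucrative than local markets and accessing them will increase returns.")

x_span = word_span(sentence_1, "markets")  # first occurrence of "markets"
y_span = word_span(sentence_1, "returns")

print(x_span, sentence_1[x_span[0]:x_span[1]])
print(y_span, sentence_1[y_span[0]:y_span[1]])
```

Slicing the sentence with each computed span is a cheap sanity check that the offsets cover exactly the intended word before they are passed on.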