44 changes: 44 additions & 0 deletions columbia_codes/README.md
@@ -0,0 +1,44 @@
Columbia-Project
==============================

This folder contains the code and data for the Columbia team's improvements to the original WRI project.

## Code

The code that we developed is located in the following four folders.
1. feature engineering:
* initial data exploration: [comparison in gold_standard.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/feature%20engineering/comparison%20in%20gold_standard.ipynb); [extraction+tokens+sentiment+entity+similarity.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/feature%20engineering/extraction%2Btokens%2Bsentiment%2Bentity%2Bsimilarity.ipynb)
* named entity recognition: [Entity Recognition.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/feature%20engineering/Entity%20Recognition.ipynb)
* minimum spanning tree: [distance calculation + minimum spanning tree.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/feature%20engineering/distance%20calculation%20%2B%20minimum%20spanning%20tree.ipynb)
* finding new rules and testing: [feature engineering_for snorkel.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/feature%20engineering/feature%20engineering_for%20snorkel.ipynb)
* pos-tag-related features: [pos tag and other features.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/feature%20engineering/pos%20tag%20and%20other%20features.ipynb)
* topic modeling and sentiment score: [topic modeling & sentiment score_clean version.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/feature%20engineering/topic%20modeling%20%26%20sentiment%20score_clean%20version.ipynb)

2. snorkel:
* original snorkel model with three-class labels: [snorkel_first version.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/snorkel/snorkel_first%20version.ipynb)
* original snorkel model with 4 new functions added: [snorkel_second version.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/snorkel/snorkel_second%20version.ipynb)
* updated model with binary labels and 4 new functions: [snorkel_third version_binary.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/snorkel/snorkel_third%20version_binary.ipynb)

3. babble labble:
* code for the babble labble implementation: [babble labble.ipynb](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/babble%20labble/babble%20labble.ipynb)

4. roberta & neural nets:
* roberta code for word embeddings: [roberta-embedding-1028.py](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/roberta%20%26%20nerual%20nets/roberta-classification-1028.py)
* generation of the pos tags and n-gram features that were added to the neural nets model: [feature_engineering.py](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/roberta%20%26%20nerual%20nets/feature_engineering.py)
* final neural nets model with 80-20 train-valid split: [neural-networks-1201.py](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/roberta%20%26%20nerual%20nets/neural-networks-1201.py)
* final neural nets model with k-fold cross validation: [neural-networks-1204.py](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/roberta%20%26%20nerual%20nets/neural-networks-1204.py)
* alternative roberta classification method: [roberta_direct_classification.py](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/roberta%20%26%20nerual%20nets/roberta_direct_classification.py); [utils.py](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/roberta%20%26%20nerual%20nets/utils.py) (helper tools used in classification)
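
The final neural nets model is listed above in two variants: one with an 80-20 train-valid split and one with k-fold cross validation. The sketch below illustrates the two index schemes in plain Python. The function names, the fixed seed, and the choice of five folds are illustrative assumptions; they are not taken from `neural-networks-1201.py` or `neural-networks-1204.py`, which may split the data differently.

```python
import random


def train_valid_split(n_samples, valid_fraction=0.2, seed=0):
    """Shuffle the indices 0..n_samples-1 and hold out a fraction for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_valid = int(round(n_samples * valid_fraction))
    # The first n_valid shuffled indices become the validation set.
    return indices[n_valid:], indices[:n_valid]


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, valid) index lists; every sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, valid in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


if __name__ == "__main__":
    train, valid = train_valid_split(10)
    print(len(train), len(valid))  # 8 2
    for train, valid in k_fold_indices(10, k=5):
        print(len(train), len(valid))  # 8 2 on each of the five folds
```

A single split is cheaper to run, while k-fold uses every sample for validation once, which tends to give a steadier estimate on a small gold standard set.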

## Data

The `data` folder contains the new data generated by the snorkel model, along with the new features.
1. New gold standard data and noisy data with binary labels generated by snorkel:
* gold standard data: [snorkel_proba_updated.npy](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/data/snorkel_proba_updated.npy)
* noisy data: [snorkel_noisy_proba_updated.npy](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/data/snorkel_noisy_proba_updated.npy)

2. Features:
* named entity recognition: [NER_results_gold_file.csv](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/data/NER_results_gold_file.csv) (gold_standard); [NER_results_noisy_file.csv](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/data/NER_results_noisy_file.csv) (noisy data)
* minimum spanning tree: [tfidf_matrix.csv](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/data/tfidf_matrix.csv) (similarity calculated with tfidf); [word2vec_matrix.csv](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/data/word2vec_matrix.csv) (similarity calculated with word2vec)
* topic modeling and sentiment score: [alldata_with_topic_sentiscore.csv](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/data/alldata_with_topic_sentiscore.csv)
* features added into the final model (POS tags, top 200 unigrams, topic modeling, sentiment score): [gs_allFeatures_new.csv](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/data/gs_allFeatures_new.csv) (gold_standard); [noisy_allFeatures_new.csv](https://github.com/yg2619/policy-toolkit/blob/columbia_team_codes/columbia_codes/data/noisy_allFeatures_new.csv) (noisy data)
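
As a rough guide to using the `.npy` files listed above, the sketch below loads a saved array with NumPy and turns probabilities into 0/1 labels. The array layout is an assumption: the helper accepts either a 1-D array of positive-class probabilities or a 2-D array with one column per class, and the files in `data` may be shaped differently. The function names and the 0.5 threshold are illustrative only.

```python
import numpy as np


def load_proba(path):
    """Load a saved NumPy array, e.g. data/snorkel_proba_updated.npy."""
    return np.load(path)


def to_binary_labels(proba, threshold=0.5):
    """Convert probabilities to 0/1 labels.

    Assumption: `proba` is either a 1-D array of positive-class probabilities
    or a 2-D array with one column per class. Check the real file's shape first.
    """
    proba = np.asarray(proba, dtype=float)
    if proba.ndim == 2:
        # One column per class: take the most probable class for each row.
        return proba.argmax(axis=1)
    # 1-D case: label 1 when the probability reaches the threshold.
    return (proba >= threshold).astype(int)


if __name__ == "__main__":
    # Synthetic stand-in for the real file, which is not read here.
    demo = np.array([0.1, 0.9, 0.5])
    print(to_binary_labels(demo))  # [0 1 1]
```

With the repository checked out, `to_binary_labels(load_proba("data/snorkel_proba_updated.npy"))` would be the corresponding call, after inspecting `.shape` to confirm which layout applies.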

126 changes: 126 additions & 0 deletions columbia_codes/babble labble/babble labble.ipynb
@@ -0,0 +1,126 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load_ext autoreload\n",
"%autoreload 2\n",
"\n",
"import nltk\n",
"import metal\n",
"import metal.contrib.info_extraction.mentions\n",
"import numpy as np\n",
"\n",
"# example sentences\n",
"sentence_1 = 'with international standards. international and regional markets are more lucrative than local markets and accessing them will increase returns.'\n",
"sentence_2 = 'developing and operationalizing internal data management within the subsector and among the agricultural sector ministries and agencies will enhance efficiency in service delivery.'\n",
"\n",
"# tokenize and compute POS tags\n",
"token1 = nltk.word_tokenize(sentence_1)\n",
"token2 = nltk.word_tokenize(sentence_2)\n",
"\n",
"pos1 = [i[1] for i in nltk.pos_tag(token1)]\n",
"pos2 = [i[1] for i in nltk.pos_tag(token2)]\n",
"\n",
"# build entity_types and ner_tags\n",
"ent1 = ['0','0','0','0','0','0','1','0','0','0','0','0','0','0','0','0','0','0','1']\n",
"ent2 = ['0','0','0','0','0','0','0','0','1','0','0','0','0','0','0','0','0','0','0','1', '0','0','0','0']\n",
"\n",
"ner1 = ['0','0','0','0','0','0','PERSON','0','0','0','0','0','0','0','0','0','0','0','PERSON']\n",
"ner2 = ['0','0','0','0','0','0','0','0','PERSON','0','0','0','0','0','0','0','0','0','0','PERSON', '0','0','0','0']\n",
"\n",
"# wrap each sentence in a RelationMention, passing the character spans of its two entities\n",
"ex1 = metal.contrib.info_extraction.mentions.RelationMention(1,\n",
" sentence_1, [(57,64),(136,143)],pos_tags = pos1, ner_tags = ner1,entity_types = ent1 )\n",
"ex2 = metal.contrib.info_extraction.mentions.RelationMention(2,\n",
" sentence_2, [(68,77),(149,159)],pos_tags = pos2, ner_tags = ner2,entity_types = ent2 )\n",
"train_data = [[ex1],[ex2]]\n",
"\n",
"# prepare labels into desired format\n",
"label1 = np.array([1])\n",
"label2 = np.array([2])\n",
"label = [label1,label2]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import babble\n",
"from babble.utils import display_candidate\n",
"\n",
"babbler = babble.BabbleStream(train_data, label, balanced=False, shuffled=False, seed=456)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"candidate = babbler.next()\n",
"display_candidate(candidate)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"explanation1 = babble.explanation.Explanation(\n",
" name='LF_lucrative_between',\n",
" label=2,\n",
" condition='The word \"lucrative\" is between X and Y',\n",
" candidate=train_data[0][0]\n",
")\n",
"\n",
"explanations = [explanation1]\n",
"\n",
"parses, filtered = babbler.apply(explanations)\n",
"print(parses)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"babbler.commit()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}