hvkone · harisalam · Nov 8, 2021
diff --git a/.gitignore b/.gitignore
diff --git a/PRD.md b/PRD.md
@@ -0,0 +1,29 @@
+## Software development document
+
+
+### 1 Train Corpus Acquisition
+#### 1.1 Wikipedia
+
+Open the following link:
+- https://dumps.wikimedia.org/frwiki/20210301/
+- Select a file in xml and bz2 format, such as
+ frwiki-20210301-pages-articles-multistream3.xml-p2550823p2977214.bz2 138.8 MB
+
+ The following mirror link is faster to download from:
+ https://ftp.acc.umu.se/mirror/wikimedia.org/dumps/frwiki/20210320/
+- For another language, replace `fr` in the links above with that language's code.
+
+#### 1.2 How to extract the corpus
+
+- First, download the source code of wikiextractor.
+- cd into the wikiextractor directory.
+- Run this command:
+  python -m wikiextractor.WikiExtractor -o . --process 2 -b 512K --json 
+  /home/zglg/SLU/psd/corpus/english/enwiki-20210301-pages-articles-multistream11.xml-p6899367p7054859.bz2
+
+- This produces JSON files.
+- Then run the following script:
+  wordfinder/src/corpusget/extractwiki.py
+
+### 2 Train Corpus
+
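Note on section 1.2: with `--json`, wikiextractor writes one JSON object per line, with fields that include `title` and `text`. The sketch below shows one way such output could be read back. It is not the contents of `extractwiki.py`, which is not part of this diff; the function name `iter_articles` and the `wiki_*` file layout under the output directory are assumptions for illustration.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple


def iter_articles(output_dir: str) -> Iterator[Tuple[str, str]]:
    """Yield (title, text) pairs from JSON-lines files under output_dir."""
    # Assumption: extracted files are named wiki_00, wiki_01, ... in subfolders.
    for path in sorted(Path(output_dir).rglob("wiki_*")):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                record = json.loads(line)
                text = record.get("text", "").strip()
                if text:  # skip pages with no extracted text
                    yield record.get("title", ""), text
```

Reading lazily with a generator keeps memory use flat even when the extracted corpus is large.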
diff --git a/README.md b/README.md
@@ -196,7 +196,7 @@ Sprint #3 planning
 - 1 methods to get corpus for many languages
     - 1.1 wikipedia : language abbreviation: https://zh.wikipedia.org/wiki/ISO_639-1
     - 1.2 how to get via wikipedia https://jdhao.github.io/2019/01/10/two_chinese_corpus/
-        
+
 2. database, tables structures
     - current tables structure
     - wordpos table and sentence table
@@ -214,4 +214,54 @@ Sprint #3 planning
 
 7. deploy to hopper.slu.edu
 
-8. alpha version release
+8. alpha version release
+
+### before beta version
+
+Our repository has a problem that deserves our attention: everyone uses their own local file paths, and they differ from each other, for example the paths to the train corpus, the cluster model, and the database config.
+
+These file paths must not be pushed to our base repository!
+
+I propose the following solution. We should agree on one common relative path and put all data files and config data inside it. Also, do not push the corpus files or pre-trained models to our base repository. Instead, we should store them on a shared remote disk and share a link with everyone in our group.
+
+I have created a folder named input with three subfolders inside it: corpus, udpipemodel, and word2vecmodel. All files in them are hosted at:
+
+download: https://pan.baidu.com/s/14RzwuGjTZwsUhiyVSe-Pgg 
+password: td3e
+
+Download them and put them in the root directory of the wordfinder folder.
+
+
+### Features
+
+The beta version supports the following features:
+
+1. Query in 10+ languages
+2. Select a language, enter a word, and display the word's parts of speech
+3. Click a part of speech of the looked-up word to show all the corresponding examples
+4. Show examples in KWIC (keyword in context) format
+5. Choose the number of clusters
+6. Click cluster sentences to get examples containing the word
+7. Examples showing all words are supported
+
+
+
+Update features:
+
+1. KWIC: place the keyword in the middle of the line.
+
+2. Only part of each sentence is shown now; it would be better to show the whole sentence on click.
+
+   <a href="">a point on the bank hidden by brush where </a>
+
+3. In the cluster web interface, group the sentences by cluster label and sort them.
+
+4. .gitignore files 
+
+5. French clustering 3: 
+
+   ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)
+
+   Chinese
+
+6. There are bugs in the cluster function.
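Note on item 5: the message appears to match the check that scikit-learn's cluster-quality metrics (such as `silhouette_score`) apply, which require between 2 and n_samples - 1 distinct labels. If that is the source, the clustering assigned every sentence to a single cluster. A minimal guard, assuming the labels are available as a plain sequence; `can_score_clusters` is a hypothetical helper name, not something in this repository:

```python
from typing import Sequence


def can_score_clusters(labels: Sequence[int]) -> bool:
    """Return True if labels have between 2 and n_samples - 1 distinct values."""
    n_samples = len(labels)
    n_labels = len(set(labels))
    return 2 <= n_labels <= n_samples - 1


# All sentences fell into one cluster: scoring would raise the ValueError.
print(can_score_clusters([0, 0, 0, 0]))  # False
# Two clusters over four sentences: scoring is valid.
print(can_score_clusters([0, 1, 0, 1]))  # True
```

Calling such a check before scoring would let the web interface report "not enough distinct clusters" instead of failing with a traceback.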
diff --git a/corpus.session.sql b/corpus.session.sql
diff --git a/corpus/database_cnx/cnx_db.py b/corpus/database_cnx/cnx_db.py
@@ -0,0 +1,46 @@
+# Prerequisites: install MySQL, then the Python connector:
+# pip install mysql-connector-python
+from mysql.connector import errorcode
+import mysql.connector
+
+
+config = {
+  'host': 'localhost',
+  'database': 'wordfinder_corpora',
+  'user': 'user',
+  'password': 'project',
+  'raise_on_warnings': True
+}
+try:
+  cnx = mysql.connector.connect(**config)
+  if cnx.is_connected():
+    print('Connected to MySQL database')
+
+  mycursor = cnx.cursor()
+  #mycursor.execute('CREATE DATABASE wordfinder_corpora')
+  mycursor.execute('SHOW DATABASES')
+
+  for db in mycursor:
+    print(db)
+
+
+  #mycursor.execute('CREATE TABLE english (text VARCHAR(255))')
+  #mycursor.execute('CREATE TABLE latin (text VARCHAR(255))')
+  #mycursor.execute('CREATE TABLE french (text VARCHAR(255))')
+  mycursor.execute('SHOW TABLES')
+  for tb in mycursor:
+    print(tb)
+
+
+except mysql.connector.Error as err:
+  if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
+    print("Something is wrong with your user name or password")
+  elif err.errno == errorcode.ER_BAD_DB_ERROR:
+    print("Database does not exist")
+  else:
+    print(err)
+else:
+  cnx.close()
+
+
+
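Note on `cnx_db.py`: the connection settings are hard-coded, which is the per-developer configuration problem the README change in this commit describes. One possible approach, sketched below, reads them from environment variables and falls back to the values used in this file. The `WORDFINDER_DB_*` variable names and the `load_db_config` helper are assumptions for illustration, not existing project conventions.

```python
import os
from typing import Mapping, Optional


def load_db_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build connection keyword arguments, preferring environment variables."""
    env = os.environ if env is None else env
    return {
        # Hypothetical variable names; defaults mirror the values in cnx_db.py.
        "host": env.get("WORDFINDER_DB_HOST", "localhost"),
        "database": env.get("WORDFINDER_DB_NAME", "wordfinder_corpora"),
        "user": env.get("WORDFINDER_DB_USER", "user"),
        "password": env.get("WORDFINDER_DB_PASSWORD", "project"),
        "raise_on_warnings": True,
    }
```

The result could then be passed to the existing call as `mysql.connector.connect(**load_db_config())`, so no personal credentials need to be committed.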
diff --git a/corpus/english/semeval_english/__MACOSX/4A-English/._README.txt b/corpus/english/semeval_english/__MACOSX/4A-English/._README.txt
diff --git a/...h/semeval_english/__MACOSX/4A-English/._SemEval2017-task4-dev.subtask-A.english.INPUT.txt b/...h/semeval_english/__MACOSX/4A-English/._SemEval2017-task4-dev.subtask-A.english.INPUT.txt
diff --git a/...s/english/semeval_english/__MACOSX/4A-English/._SemEval2017_task4_test_scorer_subtaskA.pl b/...s/english/semeval_english/__MACOSX/4A-English/._SemEval2017_task4_test_scorer_subtaskA.pl
diff --git a/corpus/english/semeval_english/__MACOSX/4A-English/._baseline-A-english.txt b/corpus/english/semeval_english/__MACOSX/4A-English/._baseline-A-english.txt
diff --git a/corpus/english/semeval_english/__MACOSX/4A-English/._twitter-2016test-A-English.txt b/corpus/english/semeval_english/__MACOSX/4A-English/._twitter-2016test-A-English.txt