2 changes: 0 additions & 2 deletions .gitignore

This file was deleted.

29 changes: 29 additions & 0 deletions PRD.md
@@ -0,0 +1,29 @@
## software development document


### 1 Train Corpus Acquisition
#### 1.1 wikipedia

Open the following link:
- https://dumps.wikimedia.org/frwiki/20210301/
- Select a file in xml and bz2 format, such as
frwiki-20210301-pages-articles-multistream3.xml-p2550823p2977214.bz2 138.8 MB

The following mirror link is faster to download from:
https://ftp.acc.umu.se/mirror/wikimedia.org/dumps/frwiki/20210320/
- Replace the `fr` in the links above with the code of another language to get that language's dump.
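
The language substitution described above can be sketched as a small helper. This is only an illustration: the function name `dump_index_url` and its parameters are hypothetical, and it assumes simple alphabetic language codes such as `fr` or `en` and that the requested dump date actually exists on the server.

```python
def dump_index_url(lang: str, date: str, mirror: bool = False) -> str:
    """Build the dump index URL for one Wikipedia language edition.

    lang   -- language code such as "fr" or "en" (assumed alphabetic)
    date   -- dump date as it appears in the URL, e.g. "20210301"
    mirror -- use the ftp.acc.umu.se mirror instead of dumps.wikimedia.org
    """
    if not lang.isalpha():
        raise ValueError(f"unexpected language code: {lang!r}")
    if mirror:
        base = "https://ftp.acc.umu.se/mirror/wikimedia.org/dumps"
    else:
        base = "https://dumps.wikimedia.org"
    return f"{base}/{lang.lower()}wiki/{date}/"


if __name__ == "__main__":
    print(dump_index_url("fr", "20210301"))
    print(dump_index_url("en", "20210320", mirror=True))
```

The helper only builds the index page URL; choosing a concrete `.xml ... .bz2` file from that page is still a manual step.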

#### 1.2 how to extract corpus

- First, download the source code of wikiextractor.
- cd into the wikiextractor directory.
- Run this command:
python -m wikiextractor.WikiExtractor -o . --process 2 -b 512K --json
/home/zglg/SLU/psd/corpus/english/enwiki-20210301-pages-articles-multistream11.xml-p6899367p7054859.bz2

- This produces JSON files.
- Then run the following script:
wordfinder/src/corpusget/extractwiki.py
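
The contents of `extractwiki.py` are not shown here, so the following is only a sketch of what reading the extractor output might look like, not the project's actual script. It assumes that the `--json` option writes one JSON object per line with a `text` field, into files named `wiki_NN` under subdirectories of the output directory; the function names `iter_articles` and `iter_texts` are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_articles(extract_dir: str) -> Iterator[dict]:
    """Yield one dict per article from the extractor's JSON-lines files."""
    # Assumed layout: <extract_dir>/AA/wiki_00, <extract_dir>/AA/wiki_01, ...
    for path in sorted(Path(extract_dir).rglob("wiki_*")):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines between records
                    yield json.loads(line)


def iter_texts(extract_dir: str, min_chars: int = 1) -> Iterator[str]:
    """Yield the article bodies, dropping articles with too little text."""
    for article in iter_articles(extract_dir):
        text = article.get("text", "").strip()
        if len(text) >= min_chars:
            yield text


if __name__ == "__main__":
    # "." matches the "-o ." output directory used in the command above.
    for index, text in enumerate(iter_texts(".", min_chars=200)):
        print(text[:80].replace("\n", " "))
        if index >= 4:
            break
```

Reading the files lazily with generators keeps memory use flat even when the dump is large.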

### 2 Train Corpus

54 changes: 52 additions & 2 deletions README.md
@@ -196,7 +196,7 @@ Sprint #3 planning
- 1 methods to get corpus for many languages
- 1.1 wikipedia : language abbreviation: https://zh.wikipedia.org/wiki/ISO_639-1
- 1.2 how to get via wikipedia https://jdhao.github.io/2019/01/10/two_chinese_corpus/

2. database, tables structures
- current tables structure
- wordpos table and sentence table
@@ -214,4 +214,54 @@ Sprint #3 planning

7. deploy to hopper.slu.edu

8. alpha version release
8. alpha version release

### before beta version

Our repository currently has a problem that deserves our attention. Everyone uses their own local file paths, and they all differ from each other: the file path of the training corpus, the file path of the cluster model, the file path of the database config, and so on. These personal file paths must not be pushed to our base repository!

We should find a clean way to solve this issue, and I have an idea: maintain one common relative path and put all data files and config data inside it. There is another important thing to remember: don't push the corpora and pre-trained models to our base repository. Instead, we should store them on a common remote disk and share a link so that everyone in our group can use them.

I have created a folder named input with three folders inside it: corpus, udpipemodel, and word2vecmodel. All files in them are hosted at

download: https://pan.baidu.com/s/14RzwuGjTZwsUhiyVSe-Pgg
password: td3e

Download them and put them in the root directory of the wordfinder folder.
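
One way the common relative path could be used from code is sketched below. This is an assumption-laden illustration, not existing project code: the helper names `input_dir` and `data_path` and the `WORDFINDER_INPUT` environment variable are hypothetical, and the sketch assumes the `input` folder sits directly under the project root as described above.

```python
import os
from pathlib import Path
from typing import Optional

# Folder names taken from the description above.
DATA_FOLDERS = ("corpus", "udpipemodel", "word2vecmodel")


def input_dir(project_root: Optional[str] = None) -> Path:
    """Return the shared input folder, relative to the project root."""
    # Hypothetical override for people who keep the data elsewhere.
    override = os.environ.get("WORDFINDER_INPUT")
    if override:
        return Path(override).expanduser()
    root = Path(project_root) if project_root is not None else Path.cwd()
    return root / "input"


def data_path(name: str, project_root: Optional[str] = None) -> Path:
    """Return the path of one of the known data folders."""
    if name not in DATA_FOLDERS:
        raise ValueError(f"unknown data folder: {name!r}")
    return input_dir(project_root) / name


if __name__ == "__main__":
    for folder in DATA_FOLDERS:
        print(folder, "->", data_path(folder))
```

With a helper like this, no module needs a personal absolute path, and the `input` folder itself can be listed in `.gitignore` so the corpora and models are never pushed.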


### Features

The beta version supports the following features:

1. Queries in 10+ languages
2. Select a language, enter a word in that language, and see the parts of speech the word can take
3. Click one of the word's parts of speech to show all corresponding examples
4. Examples are displayed in KWIC (keyword in context) format
5. Choose the number of clusters
6. Click a cluster's sentences to get the examples containing the word
7. Examples can be shown for all words
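
For readers unfamiliar with KWIC, here is a minimal sketch of the idea: the keyword is placed in a fixed column with a window of context on each side. It is an illustration only and not the project's implementation; the function name `kwic` and the `width` parameter are hypothetical, and the `\b` word-boundary matching assumes space-separated languages, so unsegmented text such as Chinese would need a different matcher.

```python
import re
from typing import List


def kwic(sentence: str, keyword: str, width: int = 30) -> List[str]:
    """Return one KWIC line per occurrence of keyword in sentence.

    Each line has `width` characters of left context (right-aligned),
    the keyword itself, then `width` characters of right context, so the
    keyword always starts in the same column.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(sentence):
        left = sentence[:match.start()][-width:]
        right = sentence[match.end():][:width]
        lines.append(f"{left:>{width}}{match.group(0)}{right:<{width}}")
    return lines


if __name__ == "__main__":
    example = "a point on the bank hidden by brush where the bank curves"
    for line in kwic(example, "bank", width=20):
        print(line)
```

Padding both sides to the same width is what keeps the keyword vertically aligned when several lines are printed one under another.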



Update features:

1. KWIC: place the keyword in the middle of the line.

2. Currently only part of the sentence is shown; it would be better to show the whole sentence when it is clicked.

<a href="">a point on the bank hidden by brush where </a>

3. In the cluster web interface, we should group the sentences by cluster label and sort them.

4. .gitignore files

5. French clustering 3:

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)

Chinese

6. There are bugs in the cluster function.
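
The `ValueError` text in item 5 matches the label-count check that scikit-learn applies in metrics such as `silhouette_score`, so one plausible cause is that every sentence ended up in a single cluster. That diagnosis is an assumption, since the traceback is not shown. Under that assumption, a guard like the following sketch could be run before scoring; the function names `valid_label_count` and `clamp_n_clusters` are hypothetical.

```python
from typing import Sequence


def valid_label_count(labels: Sequence[int]) -> bool:
    """Check the condition from the error message.

    The number of distinct labels must be between 2 and n_samples - 1
    (inclusive) for the score to be defined.
    """
    n_samples = len(labels)
    n_labels = len(set(labels))
    return 2 <= n_labels <= n_samples - 1


def clamp_n_clusters(requested: int, n_samples: int) -> int:
    """Clamp a user-supplied cluster count into the range 2..n_samples - 1."""
    if n_samples < 3:
        raise ValueError("need at least 3 sentences to cluster and score")
    return max(2, min(requested, n_samples - 1))


if __name__ == "__main__":
    print(valid_label_count([0, 0, 0, 0]))  # all sentences in one cluster
    print(valid_label_count([0, 1, 0, 2]))
    print(clamp_n_clusters(50, n_samples=10))
```

Clamping the requested count only prevents impossible requests; a clustering run can still return a single label, which is why the label check is done separately on the actual result.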
Empty file added corpus.session.sql
Empty file.
46 changes: 46 additions & 0 deletions corpus/database_cnx/cnx_db.py
@@ -0,0 +1,46 @@
# install mysql
# pip install mysql-connector-python
from mysql.connector import errorcode
import mysql.connector


config = {
    'host': 'localhost',
    'database': 'wordfinder_corpora',
    'user': 'user',
    'password': 'project',
    'raise_on_warnings': True
}
try:
    cnx = mysql.connector.connect(**config)
    if cnx.is_connected():
        print('Connected to MySQL database')

    mycursor = cnx.cursor()
    #mycursor.execute('CREATE DATABASE wordfinder_corpora')
    mycursor.execute('SHOW DATABASES')

    for db in mycursor:
        print(db)


    #mycursor.execute('CREATE TABLE english (text VARCHAR(255))')
    #mycursor.execute('CREATE TABLE latin (text VARCHAR(255))')
    #mycursor.execute('CREATE TABLE french (text VARCHAR(255))')
    mycursor.execute('SHOW TABLES')
    for tb in mycursor:
        print(tb)


except mysql.connector.Error as err:
    if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
        print("Something is wrong with your user name or password")
    elif err.errno == errorcode.ER_BAD_DB_ERROR:
        print("Database does not exist")
    else:
        print(err)
else:
    cnx.close()
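
Since the README in this change asks that personal database config not be pushed, the hard-coded `config` dict could instead be built from the environment. The following is a sketch under that assumption, not part of the committed file: the function name `load_db_config` and the `WORDFINDER_DB_*` variable names are hypothetical, while the dict keys are the ones used in `cnx_db.py` above.

```python
import os
from typing import Mapping, Optional


def load_db_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build the connection settings from environment variables.

    The user and password are required (a KeyError is raised if they are
    missing); host and database fall back to the values used in cnx_db.py.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("WORDFINDER_DB_HOST", "localhost"),
        "database": env.get("WORDFINDER_DB_NAME", "wordfinder_corpora"),
        "user": env["WORDFINDER_DB_USER"],
        "password": env["WORDFINDER_DB_PASSWORD"],
        "raise_on_warnings": True,
    }


if __name__ == "__main__":
    # Example values only; real credentials would come from the shell.
    demo_env = {"WORDFINDER_DB_USER": "user", "WORDFINDER_DB_PASSWORD": "project"}
    config = load_db_config(demo_env)
    print({key: value for key, value in config.items() if key != "password"})
```

The resulting dict can be passed to `mysql.connector.connect(**config)` exactly as the hard-coded one is, so the rest of the script would not need to change.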



Binary file not shown.