Add Japanese string prior #21

Open
wants to merge 1 commit into
base: master
1 change: 1 addition & 0 deletions src/distributions/lmparams/kanji_probabilities.csv

Large diffs are not rendered by default.

1,903 changes: 1,903 additions & 0 deletions src/distributions/lmparams/kanji_transition_matrix.csv

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions src/distributions/lmparams/kanjis.csv
@@ -0,0 +1 @@
携,帯, ,電,話,プ,リ,ペ,イ,ド,カ,ー,布,教,も,は,や,今,さ,ら,だ,が,と,い,う,接,頭,句,で,始,め,る,し,か,な,ほ,ど,私,を,ず,っ,使,て,犯,罪,に,用,れ,よ,り,メ,ジ,悪,化,せ,ま,た,一,ユ,ザ,あ,つ,こ,の,友,人,振,料,金,親,払,別,べ,答,え,返,く,ば,日,お,単,中,高,生,遊,び,的,ツ,ル,緊,急,時,連,絡,実,現,利,目,認,知,傾,向,強,ろ,そ,自,分,身,銭,わ,ざ,得,所,少,或,部,思,限,定,エ,ン,ト,ス,タ,す,け,最,近,原,因,不,明,黒,横,線,入,v,o,d,a,f,n,e,s,t,b,k,ち,ん,液,晶,切,修,理,間,困,上,完,璧,壊,他,手,関,情,報,仕,代,替,機,出,早,弱,点,画,面,見,前,年,買,換,変,結,局,サ,ビ,圧,倒,値,段,有,解,落,着,き,選,択,肢,迫,月,円,状,維,持,超,フ,ウ,仲,論,確,w,通,公,衆,会,社,謳,オ,ク,げ,受,誕,割,特,必,要,能,加,ハ,ッ,ピ,方,大,程,度,開,言,意,味,瑣,末,対,マ,ナ,考,み,貧,乏,臭,怪,ぽ,類,僕,影,響,与,正,直,想,仮,後,契,約,じ,ゃ,可,性,ラ,何,逃,避,注,m,i,x,コ,ミ,ュ,楽,決,好,ょ,レ,ロ,嫌,感,似,以,戒,素,朴,違,和,ァ,設,京,都,観,光,雨,殿,行,ぐ,嵐,山,へ,ィ,シ,体,験,残,念,閉,館,場,旅,待,合,全,気,過,ぎ,発,寮,バ,乗,際,路,車,ぜ,ひ,試,福,鉄,転,三,条,口,西,御,池,j,r,二,駅,道,歩,者,昔,走,止,世,田,谷,雰,囲,東,区,市,勤,珍,客,率,足,当,細,地,洒,亭,外,周,負,級,ャ,デ,下,辿,任,天,堂,資,本,百,首,博,物,語,心,奥,神,経,衰,姿,初,許,諾,写,真,禁,様,徹,底,荷,鍵,付,進,む,階,町,探,索,多,書,無,空,散,取,参,技,術,精,尽,印,象,深,総,紹,介,ダ,競,回,挑,戦,パ,ェ,抜,畳,張,展,示,々,軍,国,飾,眺,幸,土,産,枚,板,販,売,瞬,迷,冷,静,香,セ,食,ぶ,同,ゆ,帰,途,誰,除,価,純,製,殊,ボ,新,聞,半,種,更,悩,衝,撃,水,濡,モ,ノ,扱,雑,密,械,頑,丈,誤,数,十,猛,学,校,舎,隙,彼,飛,元,驚,減,丹,波,ガ,記,旧,北,施,在,右,名,鉱,跡,ポ,歴,史,案,内,採,掘,坑,飯,模,型,動,員,朝,鮮,労,働,作,業,従,事,権,育,研,環,訪,団,究,養,粋,洞,窟,来,問,題,交,便,遠,隠,低,苦,済,況,運,営,良,界,激,ォ,ソ,番,号,略,称,忘,u,絵,文,字,改,善,圏,法,吉,塔,建,噂,詳,由,戻,増,消,費,独,占,態,益,寡,安,欲,磁,優,ア,主,流,ぱ,ね,春,宿,阪,寺,伝,統,テ,致,夏,休,閣,ヶ,耳,住,頃,毎,慣,染,検,討,請,求,ケ,額,質,ブ,放,音,屋,相,談,騙,壺,配,調,臓,専,ヤ,聴,購,c,込,著,保,護,存,力,録,駄,紅,葉,盛,構,授,個,疑,女,計,台,撮,浮,巡,銀,清,肩,並,醸,覚,共,予,g,y,孫,長,ネ,講,ふ,小,ぼ,色,慈,照,鹿,苑,終,続,キ,愛,折,角,ご,ゴ,品,店,舗,唯,既,ぞ,達,集,投,肝,ぴ,火,再,室,否,海,絶,逆,ぁ,非,常,届,萎,厳,遅,刻,魔,謝,束,普,守,疎,づ,活,難,到,源,チ,家,族,次,第,県,魅,了,敵,仏,古,甘,系,秋,ズ,溢,反,省,期,立,漫,然,景,南,至,離,贅,沢,鴨,糺,森,鞄,端,鬱,陶,鳴,我,固,厄,抑,揚,失,甲,声,快,重,寝,坊,起,愚,痴,恐,皆,美,林,舞,妓,紋,袴,男,協,応,援,練,習,件,射,郷,愁,湧,悲,先,差,微,妙,ム,閲,覧,損,登,士,及,痛,勢,壮,指,笑,降,紀,征,夷,将,坂,村,麻,呂,尊,置,羽,滝,興,挙,係,憎,ゲ,貴,読,邪,卒,犬,飼,城,週,疲,庭,園,ヨ,宮,巨,植,左,工,噴,湖,縮,尺,版,表,鯉,亀,グ,烏,丸,六,蕪,庵,風,箱,積,隔,陳,列,木,曜,商,袋,格,夕,河,居,酒,輩,軒,晩,飽,欠,司,僚,冬,季,白,ホ,飲,議,牛,乳,濃,務,平,ぇ,昼,量,則,華,l,券,勧,米,節,栗,刀,魚,旬,ニ,栽,培,輸,簡,四,辛,余,触,刺,拾,剥,万,遍,綺,麗,子,ョ,暗,側,繰,広,机,宝,辰,樫,材,是,借,座,席,門,評,基,準,載,h,p,裏,支,造,ギ,ベ,壁,映,像,演,奏,打,柳,馬,腹,頼,猫,ぺ,判,断,夜,樽,八,背,油,干,移,式,歌,弊,害,赤,塾,帳,充,軽,狭,憶,儚,異,ゅ,太,寄,厚,測,玉,曲,秀,逸,押,容,
去,健,康,ワ,管,縁,撰,組,堺,雅,隣,秘,含,昨,服,凄,極,埋,顔,街,説,例,橋,川,豊,臣,柱,石,武,突,両,喜,送,惜,瀬,伏,境,俗,抵,院,添,穴,信,徴,戸,滋,賀,宅,宣,琵,琶,怒,緒,趣,旨,符,津,午,呼,控,井,母,繋,鬼,洗,茶,黄,職,混,湯,辺,渡,揃,比,桃,溯,葛,標,千,倉,鑿,菩,提,弔,嵯,峨,創,峡,岸,殆,追,越,看,導,挟,幅,叡,拝,潰,供,桜,ぬ,抹,菓,暖,羅,蜜,豪,也,勉,奇,愉,畏,抱,才,漕,暦,浅,令,勝,望,備,死,艇,速,徐,曇,炎,候,喋,倖,未,蹴,涙,ゥ,揺,諦,詰,塵,凱,旋,嬉,努,恋,爽,床,眠,襲,晴,肌,寒,畿,編,成,盤,夫,喰,寸,秒,陸,順,位,短,悔,紳,引,退,暇,野,満,狛,駐,輪,輝,命,竜,仁,五,暮,竹,伸,各,掴,推,効,搭,潮,唱,適,勿,譲,矜,骨,遡,憂,遭,閑,役,滅,裂,章,偶,吊,告,淋,納,伴,英,兎,柄,片,仙,墳,複,喫,粘,餌,富,藤,賢,偏,伊,國,庫,較,図,稿,謗,怖,徳,枯,凡,哲,叱,永,幻,髄,忙,拘,鎮,脳,梗,塞,祖,父,救,花,層,震,耐,慢,算,願,根,希,桂,申,庁,轄,制,吹,装,怠,惰,果,紛,詮,為,若,倍,描,祭,収,隆,拡,煎,餅,媚,仰,恒,久,識,視,助,祀,威,堕,醜,氏,愕,祈,謐,包,粛,奉,繁,栄,具,互,域,貨,巷,形,規,属,尋,暢,易,丁,寧,鯖,脱,民,封,鎖,律,委,砂,僅,項,薄,祇,隈,留,給,盗,泥,棒,龍,穏,弁,絆,縛,承,即,暴,赦,距,隅,殺,淀,駆,徒,遇,綿,蔵,税,款,吸,己,防,衛,把,握,継,挿,査,架,献,宛,併,惑,財,政,革,促,義,破,慮,誘,拐,功,捨,遂,胸,批,傷,警,鐘,岐,息,概,般,懐,故,陣,菅,補,梅,爆,礼,覗,ぅ,髪,粧,紙,恩,恵,碁,温,器,航,危,険,踏,循,拠,夢,乱,策,築,弟,欒,葵,覇,衣,緑,懸,俺,頻,錯,薦,網,募,貼,蕎,麦,杯,処,株,堅,叩,叔,諭,志,廊,被,棟,削,掛,暑,汗,涼,陰,亡,ゼ,証,柔,軟,賛,辞,鵜,呑,惹,整,荘,虚,彩,沈,没,殻,拍,球,些,松,樋,嗣,廃,墟,劣,屁,屈,煩,痒,抗,訳,唄,幾,沿,苔,郵,熟,z,捉,締,渋,滞,巻,免,票,憩,歳,ぉ,往,復,慎,如,瞑,老,婦,墓,浸,狙,泊,札,幌,船,詐,欺,郎,帆,澤,蛇,虫,筋,肉,麩,兵,瀟,迎,履,腰,舌,鼓,薬,膳,粥,氷,癒,執,蒸,喧,騒,傍,江,尾,皇,揮,据,浴,冴,茂,虜,青,錦,織,操,縦,焼,酎,ヒ,渇,汚,奮,訴,喉,漱,煙,草,奪,咲,嬢,裕,芸,群,雄,争,娯,島,沸,熱,児,敗,泣,毛,監,督,菜,幹,冠,汲,炭,酸,粕,灘,憧,沖,縄,弓,剣,稼,就,七,掲,羨,泰,嫉,妬,惨,漂,酷,焦,鳥,岩,楼,禅,瓢,岡,敷,丼,椒,鋭,鶏,卵,艶,彦,典,熊,暁,頂,芋,釈,迦,智,誌,鞍,察,屏,翌,輿,派,宗,旦,泡,唐,塩,奴,妹,闘,誇,q,毒,依,障,玄,幡,延,排,桁,均,央,践,敏,騰,泉,杖,停,露,ぷ,炊,猿,症,枝,豆,藻,鍋,拓,餡,馳,血,糖,弾,垂,曰,謎,礎,剰,述,矢,煽,恥,担,遺,槽,泳,賞,霊,笠,宇,治,ぢ,灯,需,聳,脇,荒,葬,兄,胃,腸,呆,責,敬,堀,呪,塁,浪,躍,凍,黙,農,煮,筆,粉,尚,姓,斤,君,侮,辱,恰,祉,某,稲,穫,随,豚,噛,歯,噌,汁,碗,等,浄,煤,絨,毯,皮,卓,偉,病,医,療,綻,欧,酢,漬,酵,腐,蓄,偽,洋,凝,唖,企,謂,芳,緩,悟,陥,雲,拭,醤,敢,袖,還,括,銅,塘,鈴,佐,塚,嘉,諸,刈,筑,励,劇,王,堪,膨,償,斗,貢,獲,佇,脂,潜,濁,颯,姉,戚,馴,ヴ,垢,鉢,哀,遮,耗,預,墾,嘩,奈,胎,綱,闇,聖,寂,官,捗,辻,ξ,斐,餃,浜,姜,九,州,匹,祝,絞,麺,讃,茸,贈,賃,柿,凶,謀,舶,挽,襄,邸,譜,箇,蚊,縫,董,沌,駱,駝,闊,催,遥,吐,燃,儀,峠,磨,鴬,鶯,斂,蓮,粟,孝,紫,宸,師,ぃ,芯,赴,脊,慨,窓,幕,輔,昇,党,崩,里,庇,透,烈,癖,扉,塗,遣,漢,漏,兆,妥,伺,遵,阻,審,宜,醍,醐,糸,澄,薩,珠,叶,婚,招,杏,沙,汰,韓,膝,靭,撤,坦,
64 changes: 64 additions & 0 deletions src/distributions/string_prior_japanese.jl
@@ -0,0 +1,64 @@
using CSV
using DataFrames: DataFrame

struct StringPrior <: PCleanDistribution end

# Language-model parameters shipped with the package under src/distributions/lmparams.
letter_probs_file = joinpath(dirname(pathof(PClean)), "distributions", "lmparams", "kanji_probabilities.csv")
letter_trans_file = joinpath(dirname(pathof(PClean)), "distributions", "lmparams", "kanji_transition_matrix.csv")
kanjis_file = joinpath(dirname(pathof(PClean)), "distributions", "lmparams", "kanjis.csv")
# Probability of each character appearing first in a string.
const initial_letter_probs = CSV.File(letter_probs_file; header=false) |> DataFrame |> Array{Float64}
# Column i is the distribution over the next character given that the previous one has index i.
const japanese_character_transitions = CSV.File(letter_trans_file; header=false) |> DataFrame |> Matrix{Float64}
# The model's characters, and the reverse map from character to index.
const alphabet = CSV.File(kanjis_file; header=false) |> DataFrame |> Array{String}
const alphabet_lookup = Dict(l => i for (i, l) in enumerate(alphabet))

has_discrete_proposal(::StringPrior) = true

# Assume proposal_atoms are unique.
# Returns the atoms plus a dummy value, each with a log-probability. The dummy
# value receives the prior mass that the atoms do not cover.
function discrete_proposal(::StringPrior, min_length::Int, max_length::Int, proposal_atoms::Vector{String})::Tuple{Vector{Union{String, ProposalDummyValue}}, Vector{Float64}}
    options = Union{String, ProposalDummyValue}[proposal_atoms..., proposal_dummy_value]
    probs = map(s -> logdensity(StringPrior(), s, min_length, max_length, proposal_atoms), proposal_atoms)
    total = logsumexp(probs)
    probs = Float64[probs..., log1p(-exp(total))]
    return (options, probs)
end

discrete_proposal_dummy_value(::StringPrior, min_length::Int, max_length::Int, proposal_atoms::Vector{String}) = begin
    repeat("*", fld(min_length + max_length, 2))
end

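random(::StringPrior, min_length::Int, max_length::Int, proposal_atoms::Vector{String}) = begin
    len = rand(DiscreteUniform(min_length, max_length))
    letters = Int[]  # indices into `alphabet`
    for i = 1:len
        # The first character is drawn from the initial distribution, later ones
        # from the transition column of the previous character.
        dist = (i == 1) ? vec(initial_letter_probs) : vec(japanese_character_transitions[:, letters[end]])
        if !isprobvec(dist)
            # Rescale so the entries sum to 1 (L1 norm); the default L2 norm
            # would not yield a probability vector.
            dist = normalize(dist, 1)
        end
        push!(letters, rand(Categorical(dist)))
    end
    join([alphabet[letter] for letter in letters])
end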

const UNUSUAL_LETTER_PENALTY = 1000
# Cache of log densities, keyed by (string, min_length, max_length).
const string_prior_density_dict = Dict{Tuple{String, Int, Int}, Float64}()
function logdensity(::StringPrior, observed::String, min_length::Int, max_length::Int, proposal_atoms::Vector{String})
    get!(string_prior_density_dict, (observed, min_length, max_length)) do
        if length(observed) < min_length || length(observed) > max_length
            return -Inf
        end
        # Uniform prior over the allowed lengths.
        score = -log(max_length - min_length + 1)
        if length(observed) == 0
            return score
        end

        prev_letter = nothing
        for letter in observed
            dist = isnothing(prev_letter) ? initial_letter_probs : vec(japanese_character_transitions[:, prev_letter])
            # alphabet_lookup is keyed by String, so convert the Char before the lookup.
            prev_letter = get(alphabet_lookup, string(lowercase(letter)), nothing)
            # Characters outside the alphabet get a flat -log(28); known ones get
            # their log-probability, floored at -UNUSUAL_LETTER_PENALTY.
            score += isnothing(prev_letter) ? -log(28) : max(log(dist[prev_letter]), -UNUSUAL_LETTER_PENALTY)
        end
        score
    end
end

export StringPrior
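
For review purposes, the scoring in `logdensity` can be checked by hand on a small case. The sketch below is not part of this PR: it rebuilds the same first-order (bigram) character model on a made-up three-letter alphabet, and every name and number in it (`toy_alphabet`, `toy_initial`, `toy_trans`, `toy_logdensity`) is hypothetical. Unknown characters are given a uniform fallback over the toy alphabet, which is an assumption for illustration rather than the constant used in the file above.

```julia
# Toy bigram character model; all names and probabilities here are made up.
toy_alphabet = ["a", "b", "c"]
toy_lookup = Dict(l => i for (i, l) in enumerate(toy_alphabet))

# toy_initial[i] = P(first character has index i)
toy_initial = [0.5, 0.25, 0.25]

# toy_trans[j, i] = P(next = j | previous = i); each column sums to 1,
# following the column indexing used in string_prior_japanese.jl.
toy_trans = [0.5  0.25 0.25;
             0.25 0.5  0.25;
             0.25 0.25 0.5]

function toy_logdensity(s::String, min_length::Int, max_length::Int)
    n = length(s)
    (n < min_length || n > max_length) && return -Inf
    score = -log(max_length - min_length + 1)  # uniform prior over length
    prev = nothing
    for ch in s
        dist = isnothing(prev) ? toy_initial : toy_trans[:, prev]
        prev = get(toy_lookup, string(ch), nothing)
        # Assumed fallback for unknown characters: uniform over the toy alphabet.
        score += isnothing(prev) ? -log(length(toy_alphabet)) : log(dist[prev])
    end
    return score
end

# "ab" with lengths 1..4: (1/4) * P(a first) * P(b | a) = 0.25 * 0.5 * 0.25
println(exp(toy_logdensity("ab", 1, 4)))
```

With real parameters, the same hand calculation on a two-character string is a quick way to confirm that the transition matrix is being read column-wise as intended.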