Foundations of Statistical Natural Language Processing 5. Collocations

Yonezawa Laboratory (米澤研究室), M1: 増山隆

Outline
What is a collocation?
Statistical methods for finding collocations
  Frequency
  Mean and variance
  Hypothesis testing
    The t test
    Hypothesis testing of differences (using the t test)
    Pearson's chi-square test
    Likelihood ratios

What is a collocation?

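Collocation: an expression in which several words have become conventionally bound together into a single unit (e.g. New York).
Not necessarily compositional (compositional: the meaning of the whole can be derived from its parts). Example: kick the bucket (to die).
Words differ in how readily they combine. Example: strong tea / powerful tea.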

Firth vs. Saussure & Chomsky
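Saussure & Chomsky: collocations were neglected; the emphasis was on the structure of sentences and clauses.
Firth (Contextual Theory of Meaning): emphasis on context, namely the social setting, the flow of the conversation, and collocation.
Firthian example: "strong tea" is acceptable, but "powerful tea" is not.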

Statistical methods for finding collocations

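5.1 Frequency
Count how often two words occur next to each other. This is the naive approach.
Applied as is, it returns uninteresting pairs such as "of the" and "in the" (Table 5.1).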
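The naive frequency count above can be sketched in a few lines of Python. This is only an illustration: the function name `bigram_counts` and the toy sentence are assumptions made for the example and do not come from the slides or the corpus they used.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent word pairs (bigrams) in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical toy text, not the corpus used in the slides.
text = "the cat sat in the hat and the cat slept in the sun"
tokens = text.split()
counts = bigram_counts(tokens)

# The most frequent pairs are dominated by function words such as "in the".
for pair, freq in counts.most_common(3):
    print(pair, freq)
```

Even on this tiny input, pairs built from function words rank at the top, which is the weakness the slide points out.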

Frequency + POS filter (Justeson and Katz 1995)
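cf. Tables 5.2 and 5.3
Example: strong tea and powerful tea
  They did not appear in the New York Times corpus.
  In an experiment on the Web, the counts were 799 (strong) and 19 (powerful).
Words that can combine with both strong and powerful require a more refined analysis.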
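A part-of-speech filter on top of the frequency count might look like the sketch below. It assumes the input is already tagged with a simplified tagset (A = adjective, N = noun, and so on) and keeps only the two-word patterns A N and N N; the tagged sentence, the tag letters, and the function name are all hypothetical choices for illustration.

```python
from collections import Counter

# Two-word tag patterns kept by the filter: adjective-noun and noun-noun.
# The single-letter tags are an assumed simplified tagset.
PATTERNS = {("A", "N"), ("N", "N")}


def filtered_bigrams(tagged):
    """Count adjacent pairs whose tag sequence matches one of PATTERNS."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1, t2) in PATTERNS:
            counts[(w1, w2)] += 1
    return counts


# Hypothetical hand-tagged sentence (not from the slides).
tagged = [
    ("she", "O"), ("drank", "V"), ("strong", "A"), ("tea", "N"),
    ("in", "P"), ("the", "D"), ("New", "A"), ("York", "N"), ("office", "N"),
]
kept = filtered_bigrams(tagged)
print(kept)
```

Pairs such as "in the" are dropped because their tag sequence matches no pattern, while "strong tea" and "New York" survive.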

5.2 Mean and Variance(1/2) (Smadja 1993)
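Analyze the distance between two words when they co-occur.
  Example: in "knock on his door", the distance of door from knock is 3.
Compute the mean and variance of the distances.
  The smaller the variance, the better.
Limit the span that is examined: the collocation window.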
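As a rough sketch of the mean-and-variance idea, the code below collects the signed offsets between two words inside a window and summarizes them. The sentences, the function name `offsets`, and the window of 5 positions on each side are assumptions for this example; the sample variance (dividing by n - 1) is used.

```python
from statistics import mean, variance


def offsets(tokens, w1, w2, window=5):
    """Signed distances from each occurrence of w1 to each w2 within the window."""
    found = []
    for i, tok in enumerate(tokens):
        if tok != w1:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] == w2:
                found.append(j - i)
    return found


# Hypothetical example sentences.
sentences = [
    "she knocked on his door",
    "they knocked at the door",
    "a man knocked on the metal front door",
    "someone knocked twice on the old door",
]
distances = []
for s in sentences:
    distances.extend(offsets(s.split(), "knocked", "door"))

print(distances, mean(distances), variance(distances))
```

A small variance around a stable mean offset suggests a fixed phrase; a large variance suggests a looser association between the two words.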

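Mean and Variance (2/2)
Results are in Table 5.2, 5.4.
Window size: 9.
When the variance is small, the mean distance is close to 0 (uninteresting collocations).
Smadja extracted only the sharp peaks; the results were about 80% accurate.
The method also reveals relations looser than collocations. Example: knock and door.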

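5.3 Hypothesis Testing
We want to know whether two words are adjacent by chance or systematically.
  For "new companies", if both new and companies are frequent, the probability that they end up adjacent is also high.
H0, the null hypothesis: the proposition whose statistical validity we want to test.
  Here: "the two words w1 w2 are adjacent by chance".
  P(w1 w2) = P(w1) P(w2), i.e. independence is assumed.
(This slide covers hypothesis testing in general.)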

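The t test
Commonly used for tests about a mean.
Significance level α: the threshold for rejecting or accepting H0 (here 0.05).
Testing whether w1 w2 are adjacent by chance:
Step 1) Compute the t score with the following formula (here the test is one-sided):
  t = (\bar{x} - \mu) / \sqrt{s^2 / N}
  where \bar{x} is the sample mean, \mu the mean expected under H0, s^2 the sample variance, and N the sample size.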

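The t test
Step 2) Look up the t distribution table. If t is larger than the value in the table, reject H0.
  The table value is the point beyond which the integral of the distribution equals α.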

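Worked example of the t test: new companies
C(new) = 15828, C(companies) = 4675
The approximation s^2 = p(1 - p) ≈ p is used.
For α = 0.005 the critical value is 2.576 (from the table).
H0 cannot be rejected ⇒ "new companies" occurred together by chance.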
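The calculation can be reproduced with the short sketch below. The slide only gives C(new) and C(companies); the corpus size N = 14,307,668 and the bigram count C(new companies) = 8 are taken from the textbook's version of this example and should be treated as assumptions here, as should the function name `t_score`.

```python
from math import sqrt


def t_score(c1, c2, c12, n):
    """t score for a bigram under the independence hypothesis H0."""
    mu = (c1 / n) * (c2 / n)   # expected P(w1 w2) under H0
    x_bar = c12 / n            # observed P(w1 w2)
    s2 = x_bar                 # s^2 = p(1 - p) is approximated by p, since p is tiny
    return (x_bar - mu) / sqrt(s2 / n)


N = 14_307_668          # assumption: corpus size from the textbook example
C_NEW_COMPANIES = 8     # assumption: bigram count from the textbook example
CRITICAL = 2.576        # critical value for alpha = 0.005

t = t_score(15828, 4675, C_NEW_COMPANIES, N)
reject_h0 = t > CRITICAL
print(round(t, 4), reject_h0)
```

With these inputs t comes out close to 1, well below 2.576, which matches the slide's conclusion that H0 cannot be rejected.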

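Results and characteristics of the t test
Results are in Table 5.6.
In 5.6, H0 (the independence hypothesis) could be rejected for most bigrams, including those containing stop words.
  ⇒ In language, unpredictable things rarely happen. This supports the feasibility of word sense disambiguation and probabilistic parsing.
The significance level α is not very important.
The t score can also be used to rank collocations.
Note: the t test example is on p. 164.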

Hypothesis testing of differences
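Used to find collocations that differ subtly.
  Example: to see the difference between strong and powerful, look at the words that frequently occur right after each of them.
Two-sample t test, using Welch's approximation:
  t = (\bar{x}_1 - \bar{x}_2) / \sqrt{s_1^2 / n_1 + s_2^2 / n_2}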

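Hypotheses and the t score
The null hypothesis H0 is that there is no difference between the two: \mu_1 - \mu_2 = 0.
The sample size is the same for both, N (N Bernoulli trials).
With this in mind, t can be expressed in terms of word counts (s^2 is again approximated by p):
  t ≈ (C(v_1 w) - C(v_2 w)) / \sqrt{C(v_1 w) + C(v_2 w)}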
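Expressed in counts, the t score for the difference between two words v1 and v2 before the same word w is roughly (C(v1 w) - C(v2 w)) / sqrt(C(v1 w) + C(v2 w)). The sketch below implements that approximation; the function name and the counts passed to it are illustrative assumptions, not values from the slides.

```python
from math import sqrt


def t_difference(c_v1w, c_v2w):
    """Approximate t score for the difference between C(v1 w) and C(v2 w)."""
    total = c_v1w + c_v2w
    if total == 0:
        raise ValueError("w never follows either word, so t is undefined")
    return (c_v1w - c_v2w) / sqrt(total)


# Illustrative counts: w follows v1 ten times and v2 never.
t = t_difference(10, 0)
print(round(t, 4))
```

A large positive value indicates that w prefers v1, a large negative value that it prefers v2, and a value near 0 that the counts give no evidence of a difference.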

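Hypothesis testing of differences: results and applications
Results are in Table 5.7.
Church & Hanks (1989): intrinsic versus extrinsic qualities
  strong: may not actually wield power; intrinsic.
  powerful: actually wields power; extrinsic.
There are subtleties, such as cultural factors.
  Example: the contrast between strong tea and powerful drug comes from the difference between tea and drug.
Application: dictionary compilation
  Capturing subtle nuances of words.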

Pearson’s chi-square test
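A test of dispersion (variance).
Applicable more widely than the t test.
  The t test assumes that the samples are normally distributed.
Question asked: does the table of observed counts match the table expected under independence?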

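The χ2 value and the test procedure
  X^2 = \sum_{i,j} (O_{ij} - E_{ij})^2 / E_{ij}
  For a 2-by-2 table (equation 5.7):
  \chi^2 = N (O_{11} O_{22} - O_{12} O_{21})^2 / ((O_{11} + O_{12})(O_{11} + O_{21})(O_{12} + O_{22})(O_{21} + O_{22}))
Apart from the formula and the table that is consulted, the procedure is the same as for the t test.
For "new companies", H0 cannot be rejected.
For the derivation of equation 5.7, see the references.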
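For a 2-by-2 table of observed counts, the χ2 statistic has a closed form, sketched below for the "new companies" example. The slides give only C(new) and C(companies); the corpus size N = 14,307,668 and C(new companies) = 8 are assumptions taken from the textbook's example, and the function name and the critical value 3.841 (α = 0.05, one degree of freedom) are likewise part of this illustration.

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson's chi-square statistic for a 2-by-2 contingency table."""
    n = o11 + o12 + o21 + o22
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator


N = 14_307_668                   # assumption: corpus size from the textbook example
c_new, c_companies = 15828, 4675
c_both = 8                       # assumption: C(new companies) from the textbook example

o11 = c_both                             # new, followed by companies
o12 = c_companies - c_both               # companies, not preceded by new
o21 = c_new - c_both                     # new, not followed by companies
o22 = N - c_new - c_companies + c_both   # neither

chi2 = chi_square_2x2(o11, o12, o21, o22)
reject_h0 = chi2 > 3.841
print(round(chi2, 2), reject_h0)
```

The statistic comes out near 1.55, below the critical value, so independence is not rejected, in line with the slide.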

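Properties and applications of the χ2 test
Applicable more widely than the t test.
Application 1: finding the translation of a word (Church & Gale 1991)
  Example: vache (French) and cow (English)
  If H0 can be rejected, the two words can be considered translations of each other.
Application 2: a measure of similarity between two corpora (Kilgarriff & Rose 1998)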

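Likelihood ratios (likelihood ratio test)
A method that matches intuition (?)
Assume that "the sample actually observed is the realization of the most probable outcome" (maximum likelihood principle).
Hypotheses about the bigram w1 w2:
  H1: P(w2|w1) = p = P(w2|¬w1)
  H2: P(w2|w1) = p1 ≠ p2 = P(w2|¬w1)
H1 is the independence hypothesis.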

Intuition for likelihood
The closer the parameter is to the true probability p, the higher the likelihood.

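Computing the likelihood (1/2)
p, p1, and p2 are computed from the observed data.
A binomial distribution (Bernoulli trials) is assumed.
The resulting value indicates how well the hypothesis fits the data.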

Computing the likelihood (2/2)
  \log \lambda = \log (L(H_1) / L(H_2))
-2 log λ is (reportedly) asymptotically χ2-distributed.
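One way to carry out this calculation is sketched below. It assumes the usual estimates p = c2/N, p1 = c12/c1 and p2 = (c2 - c12)/(N - c1), where c1 = C(w1), c2 = C(w2) and c12 = C(w1 w2), and binomial likelihoods whose coefficients cancel in the ratio. The function names, the clamping constant, and the example counts are assumptions made for this illustration.

```python
from math import log

EPS = 1e-12  # assumed clamp to keep log() away from 0 and 1


def log_l(k, n, x):
    """log of x^k (1 - x)^(n - k); the binomial coefficient cancels in the ratio."""
    x = min(max(x, EPS), 1 - EPS)
    return k * log(x) + (n - k) * log(1 - x)


def neg2_log_lambda(c1, c2, c12, n):
    """-2 log(lambda) for H1 (independence) against H2 (dependence)."""
    p = c2 / n
    p1 = c12 / c1
    p2 = (c2 - c12) / (n - c1)
    log_lambda = (
        log_l(c12, c1, p) + log_l(c2 - c12, n - c1, p)
        - log_l(c12, c1, p1) - log_l(c2 - c12, n - c1, p2)
    )
    return -2 * log_lambda


# Hypothetical counts: an exactly independent pair and a strongly associated pair.
independent = neg2_log_lambda(100, 100, 10, 1000)
associated = neg2_log_lambda(100, 100, 90, 1000)
print(independent, associated)
```

Because -2 log λ is asymptotically χ2-distributed, the result can be compared against a χ2 critical value such as 3.841 (α = 0.05, one degree of freedom).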

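Results and characteristics of likelihood ratios
Results are in Table 5.12.
The results can be interpreted intuitively.
  The value e^{0.5 * (-2 log λ)} shows how strongly H1 was rejected, as a "how many times more likely" comparison.
Also applicable to bigrams that occur only a few times.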

Relative frequency ratios
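Find the collocations that characterize a corpus by comparing it with other corpora.
  Example: the New York Times of 1990 and of 1989, cf. Table 5.13
    Bigrams frequent in 1989 that occur twice in 1990.
    Events of 1989; columns that ended in 1990.
Finding collocations specific to a particular domain:
  Compare general text with text from that domain.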
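A relative frequency ratio divides the relative frequency of a bigram in one corpus by its relative frequency in another. The sketch below uses made-up counts and corpus sizes; the function name and all numbers are assumptions for illustration only.

```python
def relative_frequency_ratio(count_a, size_a, count_b, size_b):
    """Ratio of a bigram's relative frequency in corpus A to that in corpus B."""
    if count_b == 0:
        raise ValueError("the bigram must occur in corpus B")
    return (count_a / size_a) / (count_b / size_b)


# Hypothetical numbers: 2 occurrences in a 1000-bigram corpus A,
# 40 occurrences in a 2000-bigram corpus B.
r = relative_frequency_ratio(2, 1000, 40, 2000)
print(r)
```

A ratio far below 1 marks a bigram that is characteristic of corpus B, and a ratio far above 1 one that is characteristic of corpus A.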

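References
基礎統計学I 統計学入門 (Basic Statistics I: Introduction to Statistics)
自然科学の統計学 (Statistics for the Natural Sciences); the derivation of equation 5.7 is on p. 155
Both edited by 東京大学教養学部統計学教室 (Statistics Section, College of Arts and Sciences, University of Tokyo)
Rough notes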
