Research on Natural Language Processing and Understanding

Jun'ichi Tsujii
Department of Information Science, Graduate School of Science, The University of Tokyo

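Project Objectives
1. Academic objectives
    Integrating structural language processing with probabilistic and statistical language processing
        Contributing to engineering from a theory-driven approach
    Language processing and knowledge processing
2. Social impact
    Language processing for the network era
        Knowledge acquisition from text, information retrieval, dialogue systems
3. International dissemination
    Active international collaboration
        Focused international workshops with concrete goals
        Building a close research cooperation framework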

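Grammar frameworks that are sound from the standpoint of theoretical linguistics
Processing efficiency
Robustness
Grammar framework based on typed feature structures
Bias in grammar descriptions: application to real-world text
    Systematic extension of the grammar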

Processing Efficiency
Abstract Machine for Unification (T. Makino et al.)
    Prolog with Typed Feature Structure (LiLFeS)
    Coling 98, JNE-00
CFG Approximation (K. Torisawa et al.)
    Multi-staged Parsing (TNT)
    Coling 98, JNE-00
Preventing Combinatorial Explosion (Y. Miyao)
    Packing of FSs
    ACL 99

Abstract Machine
(Carpenter and Qu, 1995)
[Figure: a TFS of type nelist with features FIRST and REST; its abstract machine code (PUSH FIRST, ADDNEW list, UNIFYVAR 1, POP, PUSH REST, UNIFYVAR 1, POP); and three snapshots of TFS data on memory, shown as numbered cells tagged STR, VAR, and PTR.]
This figure shows the basic idea of the abstract machine architecture. A typed feature structure (TFS) can be regarded as a rooted directed graph. The abstract machine represents a TFS in two distinct ways: as a graph stored directly in a memory area, and as a sequence of abstract machine instructions. A graph is "compiled" into code that performs unification. The instruction set includes instructions for following a feature, retreating from a feature, and unifying types. Applying such code to a TFS stored in memory yields the feature structure obtained by unifying the two. The key point is that one of the two structures being unified does not have to be "interpreted" at run time, and this is the source of a significant speed-up.
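The compile-then-execute idea can be illustrated with a minimal Python sketch. It is not LiLFeS or its actual instruction set: the instruction names `UNIFYTYPE`, `PUSH`, and `POP`, the `Node` class, and the small type hierarchy are assumptions made for illustration, and structure sharing (the `UNIFYVAR` and `PTR` parts of the figure) is left out.

```python
from dataclasses import dataclass, field

# Hypothetical type hierarchy: child -> parent; "bot" is the most general type.
PARENT = {"list": "bot", "nelist": "list", "foo": "bot"}

def ancestors(t):
    """Return t followed by all of its supertypes up to 'bot'."""
    chain = [t]
    while t in PARENT:
        t = PARENT[t]
        chain.append(t)
    return chain

def unify_types(a, b):
    """Return the more specific of two comparable types, or None on a clash."""
    if a in ancestors(b):
        return b
    if b in ancestors(a):
        return a
    return None

@dataclass
class Node:
    """One node of a TFS graph held in memory."""
    type: str = "bot"
    feats: dict = field(default_factory=dict)

def compile_tfs(tfs):
    """Flatten a nested (type, {feature: sub-TFS}) tuple into instructions."""
    type_name, feats = tfs
    code = [("UNIFYTYPE", type_name)]
    for feat, sub in feats.items():
        code.append(("PUSH", feat))      # follow a feature
        code.extend(compile_tfs(sub))
        code.append(("POP",))            # retreat from the feature
    return code

def run(code, root):
    """Execute compiled code against an in-memory TFS; False means failure.

    The target is updated in place, so it may be partly modified on failure.
    """
    stack = [root]
    for instr in code:
        node = stack[-1]
        if instr[0] == "UNIFYTYPE":
            unified = unify_types(node.type, instr[1])
            if unified is None:
                return False
            node.type = unified
        elif instr[0] == "PUSH":
            stack.append(node.feats.setdefault(instr[1], Node()))
        elif instr[0] == "POP":
            stack.pop()
    return True

code = compile_tfs(("nelist", {"FIRST": ("foo", {}), "REST": ("list", {})}))
target = Node("list")
print(run(code, target), target.type)
```

The source structure is traversed once, at compile time; at run time the loop only dispatches on instructions, which is the saving the slide describes.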

LiLFeS: Performance (2/2)
Comparison with other inference engines for typed feature structures
[Figure: speed comparison, with an arrow marked FASTER, of the following systems]
    LiLFeS: native code compiler
    LiLFeS: byte code emulator
    ProFIT on SICStus emulator
    ALE 3.1 on SICStus emulator
Machine: Intel Pentium II, 400 MHz
Grammar: a small grammar distributed with ALE

Filtering with CFG (1/5)
2-phased parsing
Approximate HPSG with a CFG while keeping the important constraints. The obtained CFG might over-generate, but it can be used for filtering. Rewriting in CFG is far less expensive than applying rule schemata, principles, and so on.
[Figure: HPSG feature structures are compiled into a CFG; input sentences pass through the built-in CFG parser and then LiLFeS unification parsing; the output is complete parse trees.]
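A minimal sketch of the two-phase idea, assuming a toy CFG in Chomsky normal form and a caller-supplied `expensive_check` standing in for unification-based parsing. The function names, the grammar, and the lexicon are all hypothetical; the real system compiles the CFG from the HPSG grammar rather than taking it as input.

```python
from itertools import product

def cky_chart(words, lexicon, rules):
    """Phase 1: cheap CFG chart; chart[(i, j)] holds nonterminals over words[i:j]."""
    n = len(words)
    chart = {(i, i + 1): set(lexicon.get(w, ())) for i, w in enumerate(words)}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = set()
            for k in range(i + 1, j):
                for left, right in product(chart[(i, k)], chart[(k, j)]):
                    cell |= rules.get((left, right), set())
            chart[(i, j)] = cell
    return chart

def two_phase_parse(words, lexicon, rules, expensive_check, start="S"):
    """Run the expensive phase only for inputs the CFG filter lets through."""
    chart = cky_chart(words, lexicon, rules)
    if start not in chart.get((0, len(words)), set()):
        return False                      # filtered out; phase 2 never runs
    return expensive_check(words)         # phase 2: stand-in for unification

# Toy grammar (illustrative only).
LEXICON = {"she": {"NP"}, "him": {"NP"}, "sees": {"V"}, "sleeps": {"VP"}}
RULES = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}

print(two_phase_parse("she sees him".split(), LEXICON, RULES, lambda ws: True))
```

Because the CFG may over-generate, a sentence that passes phase 1 can still be rejected in phase 2; the filter only removes work, it does not decide grammaticality.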

Evaluation of HPSG Parsers (DFKI, Stanford, U-Tokyo)
Processing time per sentence (sec)

Grammar   Corpus (average length in words)   Naïve parser   TNT parser   LKB parser (Stanford; DFKI)
LinGO     csli (5.8)                                 0.68         0.12          0.23
LinGO     aged (8.4)                                 1.72         0.31          0.61
LinGO     blend (11)                                14.71         1.90          3.10
XHPSG     ATIS (7.42)                               14.27         0.30          -
SLUNG     EDR (20.5)                                 0.88         0.38          -

Machine: Sun UltraSparc, 336 MHz, 6 GB main memory

Packed Feature Structure
[Figure: a set of feature structures for verb entries, with features VMODE, PASSIVE, and TENSE and values such as indicative, past_part, true, false, past, and tense, is converted into a single packed feature structure.]
A set of feature structures is represented as one packed feature structure. Each dependency function stands for one of the input feature structures.
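The packing idea can be sketched on flat feature structures. This is a simplification under stated assumptions, not the algorithm from the ACL 99 paper: structures are plain dictionaries without nesting or types, the feature `CAT` and the function names are hypothetical, and the three entries are only loosely modeled on the values shown in the figure.

```python
def pack(structures):
    """Split flat feature structures into a shared part and per-structure rests."""
    first = structures[0]
    shared = {
        feat: value
        for feat, value in first.items()
        if all(feat in s and s[feat] == value for s in structures)
    }
    dependencies = [
        {feat: value for feat, value in s.items() if feat not in shared}
        for s in structures
    ]
    return shared, dependencies

def unpack(shared, dependencies):
    """Rebuild the original structures from the packed form."""
    return [{**shared, **dep} for dep in dependencies]

def unify_flat(a, b):
    """Unify two flat structures; None signals a value clash."""
    for feat in a.keys() & b.keys():
        if a[feat] != b[feat]:
            return None
    return {**a, **b}

def unify_packed(shared, dependencies, constraint):
    """Check the shared part once, then only the small differing parts."""
    if unify_flat(shared, constraint) is None:
        return []
    return [
        {**shared, **merged}
        for dep in dependencies
        if (merged := unify_flat(dep, constraint)) is not None
    ]

entries = [
    {"CAT": "verb", "VMODE": "indicative", "PASSIVE": "false", "TENSE": "past"},
    {"CAT": "verb", "VMODE": "past_part", "PASSIVE": "true", "TENSE": "tense"},
    {"CAT": "verb", "VMODE": "past_part", "PASSIVE": "false", "TENSE": "tense"},
]
shared, deps = pack(entries)
print(shared, len(unify_packed(shared, deps, {"PASSIVE": "false"})))
```

The larger the shared part relative to the differing parts, the more work is done once instead of once per lexical entry.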

Experimental Results (1)
Execution time for unification
Packing achieved a considerable speed-up in unification.

Test data   # of LEs   Unpacked (msec.)   Packed (msec.)   Improvement (factor)
credited          37               36.5              5.7                    6.4
walked            79               77.2              9.2                    8.4

The Test data column gives the word whose lexical entries were used in the experiment, and # of LEs gives the number of lexical entries for that word. Compared with unifying the unpacked feature structures, unifying the packed feature structure was faster by a factor of 6.4 to 8.4.

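Building Large-Scale Grammars
English grammars
    With Stanford University and DFKI: LinGO grammar (HPSG)
    With the University of Pennsylvania: conversion of the XTAG grammar
        Conversion involving manual work (XHPSG)
        Automatic conversion between the two grammar frameworks
Japanese grammars
    SLUNG: an underspecified Japanese grammar
    KNP: dependency analysis, a highly robust Japanese grammar (Kyoto University)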

Overview of GENIA Project
[Figure: GENIA system overview. (1) A researcher with a question issues (2) a query; (4) information is extracted and (5) an answer to the question is returned; step (3) is unlabeled. Components: Information Retrieval over the WWW; Information Extraction (pre-processing, named entity, template element, scenario template); Learning; Terminology; Ontology; Thesaurus; Links; Databases; Corpora.]

CSNDB (National Institute of Health Sciences)
A data- and knowledge-base for signaling pathways of human cells. It compiles information on biological molecules, sequences, structures, functions, and the biological reactions that transfer cellular signals. Signaling pathways are compiled as binary relationships between biomolecules and represented by automatically drawn graphs. CSNDB is built on ACEDB and the inference engine CLIPS, and is linked to TRANSFAC. The final goal is to build a computerized model of various biological phenomena.

Example 3: A Polymerization Reaction
Signal_Reaction: "Ah receptor + HSP90"
    Component: "Ah receptor", "HSP90"
    Effect: "activation dissociation"
    Interaction: "PAS domain" "of Ah receptor"
    Activity: "inactivation of Ah receptor"
    Reference: [Powell-Coffman_1998]

Syntax/Semantics
An active phorbol ester must therefore, presumably by activation of protein kinase C, cause dissociation of a cytoplasmic complex of NF-kappa B and I kappa B by modifying I kappa B.
E1: An active phorbol ester activates protein kinase C.
E2: The active phorbol ester modifies I kappa B.
E3: It dissociates a cytoplasmic complex of NF-kappa B and I kappa B.
Part-Whole

Language and Knowledge Processing: Toward Understanding

Event Ontology
ACTIVATE
    substance ACTIVATE substance
    substance ACTIVATE protein
    protein ACTIVATE pathway
Other event types: PHOSPHORYLATE, INHIBIT, REGULATE
[Figure: database records REACTION1 to REACTION5, each with attribute slots, linked to the event types above.]

Example of NE Annotation
UI -
TI - Characterization of <NE ti="3" class="protein" nm="aldosterone binding site" mt="SV" subclass="family_or_group" unsure="Class" cmt="">aldosterone binding sites</NE ti="3"> in circulating <NE ti="2" class="cell_type" nm="human mononuclear leukocyte" mt="SV" unsure="OK" cmt="">human mononuclear leukocytes</NE ti="2">.
AB - <NE ti="4" class="protein" nm="Aldosterone binding sites" mt="SV" subclass="family_or_group" unsure="Class" cmt="">Aldosterone binding sites</NE ti="4"> in <NE ti="1" class="cell_type" nm="human mononuclear leukocyte" mt="SV" unsure="OK" cmt="">human mononuclear leukocytes</NE ti="1"> were characterized after separation of cells from blood by a Percoll gradient. After washing and resuspension in <NE ti="5" class="other_organic_compounds" nm="RPMI-1640 medium" mt="SV" unsure="OK" cmt="">RPMI-1640 medium</NE ti="5">, cells were incubated at 37 degrees C for 1 h with different concentrations of <NE ti="6" class="other_organic_compounds" nm="[3H]aldosterone" mt="SV" unsure="OK" cmt="">[3H]aldosterone</NE ti="6"> plus a 100-fold concentration of <NE ti="7" class="other_organic_compounds" nm="RU-26988" mt="SV" unsure="OK" cmt="">RU </NE ti="7">(<NE ti="17" class="other_organic_compounds" nm="11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one" mt="SV" unsure="OK" cmt="">11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one</NE ti="17">), with or without an excess of unlabeled <NE ti="8" class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK" cmt="">aldosterone</NE ti="8">. <NE ti="9" class="other_organic_compounds" nm="Aldosterone" mt="SV" unsure="OK" cmt="">Aldosterone</NE ti="9"> binds to a single class of <NE ti="10" class="protein" nm="receptor" mt="SV" subclass="family_or_group" unsure="OK" cmt="">receptors</NE ti="10"> with an affinity of 2.7 +/- 0.5 nM (means +/- SD, n = 14) and a capacity of 290 +/- 108 sites/cell (n = 14).
The specificity data show a hierarchy of affinity of <NE ti="11" class="other_organic_compounds" nm="desoxycorticosterone" mt="SV" unsure="OK" cmt="">desoxycorticosterone</NE ti="11"> = <NE ti="12" class="other_organic_compounds" nm="corticosterone" mt="SV" unsure="OK" cmt="">corticosterone</NE ti="12"> = <NE ti="13" class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK" cmt="">aldosterone</NE ti="13"> greater than <NE ti="14" class="other_organic_compounds" nm="hydrocortisone" mt="SV" unsure="OK" cmt="">hydrocortisone</NE ti="14"> greater than <NE ti="15" class="other_organic_compounds" nm="dexamethasone" mt="SV" unsure="OK" cmt="">dexamethasone</NE ti="15">. The results indicate that <NE ti="17" class="cell_type" nm="mononuclear leukocyte" mt="SV" unsure="OK" cmt="">mononuclear leukocytes</NE ti="17"> could be useful for studying the physiological significance of these <NE ti="16" class="protein" nm="mineralocorticoid receptor" mt="SV" subclass="family_or_group" unsure="OK" cmt="">mineralocorticoid receptors</NE ti="16"> and their regulation in humans.
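Annotations in this style can be pulled out with a short script. The sketch below uses regular expressions rather than an XML parser because the closing tags in the example repeat an attribute (`</NE ti="3">`), which a strict XML parser would reject. The function name and the output fields are assumptions for illustration, and nested `NE` elements are not handled.

```python
import re

# Opening tag with attributes, annotated text, then a closing tag that may carry attributes.
NE_PATTERN = re.compile(r'<NE\s+([^>]*)>(.*?)</NE[^>]*>', re.DOTALL)
ATTR_PATTERN = re.compile(r'(\w+)="([^"]*)"')

def extract_entities(text):
    """Return one dict per annotated span: surface text, class, and tag id."""
    entities = []
    for match in NE_PATTERN.finditer(text):
        attrs = dict(ATTR_PATTERN.findall(match.group(1)))
        entities.append({
            "text": match.group(2),
            "class": attrs.get("class"),
            "ti": attrs.get("ti"),
        })
    return entities

SAMPLE = (
    'Characterization of <NE ti="3" class="protein" nm="aldosterone binding site" '
    'mt="SV" subclass="family_or_group" unsure="Class" cmt="">aldosterone binding '
    'sites</NE ti="3"> in circulating <NE ti="2" class="cell_type" '
    'nm="human mononuclear leukocyte" mt="SV" unsure="OK" cmt="">human mononuclear '
    'leukocytes</NE ti="2">.'
)

for entity in extract_entities(SAMPLE):
    print(entity["class"], "|", entity["text"])
```

Counting the `class` values returned by such a function over a corpus gives per-class tallies of the kind reported on the slides that follow.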

TIMS – Tag Information Management System –
[Figure: TIMS architecture. Tools exchange XML data with an XML database through TIMS.]
    Will/TreeEdit: XML Tree Viewer / XML Tree Editor
    LiLFeS/XHPSG: HPSG-based Syntactic/Semantic Parser
    JTAG: Manual Tagging Aid Interface
    VTAG: Automatic Tagging Workbench
    XML Data Mining
    XML Document Management
    XML Database

Tagging of 400 Abstracts
Sentences: about 4,000
Words: about 100,000
Tagged items: about 12,000 in total
    SOURCE    3123
    DNA        945
    RNA        100
    PROTEIN   2639
    Other     5180

36 semantic subclasses

TAG NAME                    Subclass               Count
organism                    multi-cell organism      477
organism                    mono-cell organism        20
organism                    virus                    153
tissue                      -                        213
cell type                                           1478
sub-location of cells                                 79
other (natural source)                                 1
cell line                                            695
other (artificial source)                              7
protein                     family or group         1172
protein                     complex                  170
protein                     molecule                1181
protein                     subunit                   65
protein                     substructure              29
protein                     domain or region          77
protein                     N/A                       98
peptide                                               40
amino acid monomer                                    27
DNA                         family or group           29
DNA                         complex
DNA                         molecule                  81
DNA                         subunit
DNA                         substructure              41
DNA                         domain or region         770
DNA                         N/A                       24
RNA                         (subclass labels lost)    13, 80, 1, 2, 4
other polymer               -                         43
nucleic acid monomer                                  47
lipid                                               1113
carbohydrate                                          10
other organic compounds                              829
inorganic
atom
other name                                          2850

Frequency Distribution of CLASS

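Verbs That Occur Frequently in Abstracts
Occurrence counts in 925 abstracts from CSNDB (National Institute of Health Sciences), excluding have and be:
    show       375
    bind       226
    indicate   195
    suggest    183
    induce     162
    inhibit    148
    mediate    140
    report     139
    activate   135
    require    130
Counted mechanically with ENGCG. be occurs 1623 times and have 193 times. The count for the top 40 verbs is …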

show
    NP show that-clause
        researcher show conclusion
        experiment show conclusion
    NP show NP
        structure show property
    NP be shown to-infinitive
        substance be shown to reaction
    it is shown that-clause
        it is shown conclusion

inhibit
    NP inhibit NP
        substance inhibit reaction
        substance inhibit pathway
        substance inhibit substance
        substance inhibit source
        reaction inhibit substance
        reaction inhibit reaction
        structure inhibit pathway
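One way to use such patterns is as a lookup table keyed by the semantic classes of subject and object. The sketch below encodes the seven inhibit patterns listed above; the is-a link from protein to substance and the function names are assumptions added for illustration, not part of the slides' dictionary format.

```python
# The seven (subject class, object class) patterns listed for "inhibit".
INHIBIT_FRAMES = {
    ("substance", "reaction"),
    ("substance", "pathway"),
    ("substance", "substance"),
    ("substance", "source"),
    ("reaction", "substance"),
    ("reaction", "reaction"),
    ("structure", "pathway"),
}

# Hypothetical is-a links between semantic classes (an assumption).
ISA = {"protein": "substance"}

def generalizations(semantic_class):
    """Return the class itself followed by its ancestors in ISA."""
    chain = [semantic_class]
    while semantic_class in ISA:
        semantic_class = ISA[semantic_class]
        chain.append(semantic_class)
    return chain

def matching_frames(subject_class, object_class, frames=INHIBIT_FRAMES):
    """List the frames licensed for the given subject and object classes."""
    return [
        (s, o)
        for s in generalizations(subject_class)
        for o in generalizations(object_class)
        if (s, o) in frames
    ]

print(matching_frames("protein", "reaction"))
```

An empty result means no listed pattern covers the argument pair, which can be used either to reject a parse or to flag a gap in the dictionary.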

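Syntactic and Semantic Patterns of Frequent Verbs
How many kinds of dictionary entries are needed?
    show       5
    bind       4
    indicate
    suggest
    induce
    inhibit    7
    mediate    6
    report
    activate   9
    require
Each count is the number of second-level indented items on the per-verb pattern slides, that is, patterns distinguished down to the semantic classes of subject and object. Derived adjectives (such as X-activating Y) are excluded.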

Example Semantic Representations for "indicate" (LiLFeS)
semantic_primitive(Tnx0Vnx1, indicate, SYNSEM\LOCAL\CONX\IND\(transitive & ARG1\chem_struct & ARG2\mechanism)).
semantic_primitive(Tnx0Vs1, indicate, … ARG1\research & ARG2\$OBJ)).
Example sentences:
    the structure indicates mechanisms
    these findings indicate an unexpected role of …
    the data indicate that …

Experiment (A. Yakushiji et al., PSB2001)
XHPSG: an HPSG-like grammar translated from the XTAG grammar of U-Penn (Y. Tateishi, TAG+ workshop 98)
Terms (compound nouns) are chunked beforehand.
180 sentences from abstracts in MEDLINE
Average parse time per sentence: 2.7 sec with a naïve parser (the multi-stage parser has since improved this by a factor of 10)

Argument Frame Extractor
133 argument structures, marked by a domain specialist, in 97 of the 180 sentences

Extracted uniquely             31    68%
Extracted with ambiguity       32
Extractable from pp's          26
Parsing failures
    Not extractable            27
    Memory limitation, etc.    17

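KNP
Accuracy: about 90%
Example sentence: 企業や金融機関に不良債権の早急な処理を促し、特に金融機関には「この過程で従来のような横並びの決算や配当が維持されるのではなく、経営格差を顕在化させる覚悟を求めたい」としている。
(Gloss: It urges companies and financial institutions to dispose of bad loans promptly and, for financial institutions in particular, says that "in this process we want them to be prepared to let differences in management performance become visible, rather than maintaining the lockstep settlements and dividends of the past.")
[Figure: KNP dependency analysis of this sentence, with PARA marking coordinate structures.]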

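System Overview
Example user question: "How do I send mail with Mew?"
[Figure: system components]
    User
    User interface (WWW browser)
    Mail sending unit
    Input analysis unit
    Knowledge database
    Dialogue management unit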

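Evaluation of Dialogue Data
Success: 38%
Failure due to knowledge: about 30% (decreasing)
Failure due to dialogue management: about 5% (increasing)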

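Research Results (Tokyo Institute of Technology)
Improving recall
    Query expansion using multiple thesauri (1998-1999)
    Cluster-based information retrieval (1997)
    Large-scale text clustering
Improving precision
    Information retrieval using case frames (1996)
    Refinement and selective use of index terms
Balancing recall and precision
    Multi-stage retrieval model

To improve recall, we pursued these three themes. One cause of low recall is the mismatch between the vocabulary used by users and the vocabulary used by document authors. For example, … Query expansion is a well-known way to resolve this mismatch. However, conventional query expansion has relied on either a hand-built thesaurus or a thesaurus built automatically from a corpus, and the gains have not always been satisfactory. The Tokyo Tech group analyzed the characteristics of each kind of thesaurus and showed that combining several thesauri of different types improves the performance gained from query expansion. Using this approach at TREC in 1997, the group ranked as high as 4th among roughly 100 systems.

As another approach to improving recall, we proposed cluster-based information retrieval. The documents to be searched are clustered in advance so that similar documents are grouped together, and retrieval is run against these clusters. Whereas query expansion aims at a smoothing effect on the query side, cluster-based retrieval aims at smoothing on the document side. To make cluster-based retrieval practical, we also studied methods for clustering large document collections accurately and efficiently.

To improve precision, the basic direction was to move from single-word index terms to richer ones such as phrases and predicate-argument structures. First, instead of using single words, we parsed documents and queries, extracted case frames, and tried retrieval by matching case frames. This gave some positive results, but it also turned out that the processing cost is high and that the method is not effective for every query. To address this, we selectively used conventional single words and more complex index terms, such as phrases and predicate-argument structures, depending on the query, which improved precision further.

To balance recall and precision, we proposed a two-stage retrieval scheme: the first phase aims at high recall, and a second phase then aims at high precision on the results of the first. This allows deeper linguistic analysis in the later phase without excessive processing cost.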
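Query expansion with several thesauri can be sketched as follows. The weighting scheme is an assumption made for illustration, not the scheme used by the Tokyo Tech group: original terms keep weight 1.0, and every time a thesaurus suggests a new term for a query term, that term gains a fixed increment, so terms supported by more than one thesaurus end up weighted higher. The thesaurus contents and the function name are hypothetical.

```python
def expand_query(terms, thesauri, increment=0.5):
    """Return {term: weight} for the original terms plus thesaurus suggestions."""
    weights = {term: 1.0 for term in terms}
    for thesaurus in thesauri:
        for term in terms:
            for related in thesaurus.get(term, ()):
                if related not in terms:
                    # Each suggestion from each thesaurus adds to the weight.
                    weights[related] = weights.get(related, 0.0) + increment
    return weights

# Hypothetical thesauri of different types.
hand_built = {"car": ["automobile"]}
corpus_based = {"car": ["automobile", "engine"]}

print(expand_query(["car"], [hand_built, corpus_based]))
```

The expanded query would then drive the first, recall-oriented retrieval phase; a second phase can rerank the smaller result set with costlier analysis.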

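Scenario (Tokyo Institute of Technology)
This figure is a conceptual diagram summarizing all of the research results above.
[Figure: retrieval pipeline with the labels Query, Thesauri, Query Expansion, Expanded Query, Initial Retrieval, Document Collection, Intermediate Result, Index Term Refinement, Revised Query, Second Retrieval, and Final Result.]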

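Workshops, Tutorials, and Joint Research
Workshops
    Year 1: closed workshop to launch the project
    Year 2: focus on applications such as IR (co-sponsored with Hitachi Advanced Research Laboratory)
    Year 3: relationship between theory and applications (co-sponsored with Hitachi Central Research Laboratory)
    Year 4: Parsing Strategy (Germany)
        (co-sponsored with DFKI and Stanford University)
        (journal special issue, book from CSLI)
Tutorials
    NLP for Bio-Informatics: jointly with the Eureka Group (PSB2001)
    Jointly with Eureka and TIDES (ISMB2001)
Joint research
    Stanford, DFKI, Pennsylvania, UMIST, University of Rome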

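Future Research Topics
1. Structural processing and probabilistic processing
    Processing in a rich probability space that also covers the semantic space
2. Mutual conversion between grammar descriptions and the theoretical foundations of their equivalence
    Sharing of language resources, contribution to theoretical linguistics
3. Large-scale feature structure databases
    Interrelation with XML databases
4. Controlled mechanisms for unsupervised learning
    Identification of semantic classes, grammar learning from data
5. Context processing across texts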
