Minimally Supervised Learning of Semantic Knowledge from Query Logs

Presentation transcript:

Minimally Supervised Learning of Semantic Knowledge from Query Logs
Mamoru Komachi (†) and Hisami Suzuki (‡)
(†) Nara Institute of Science and Technology, Japan
(‡) Microsoft Research, USA
IJCNLP-08, Hyderabad, India
2018/11/21
Notes: The slot is 30 minutes, but it is better to finish in about 22 minutes, 25 at the longest. Mention that this work was done during an internship at MSR.

Task
Learn semantic categories from web search query logs by bootstrapping with minimal supervision
Semantic category: a set of interrelated words
  Named entities, technical terms, paraphrases, …
  Can be useful for search ads, etc.
[Figure: Darjeeling, Chai (Indian tea), and Kombucha (Japanese tea), connected by "similar" links]
Notes: The example is not good. This is only one of the applications. The slide says we learn semantic categories, but the talk then turns to instance acquisition, which is confusing. It is unclear what a category actually is; an example is needed.

Our Contribution
First to use Japanese query logs for the task of learning named entities
Propose an efficient method suited for query logs, based on the general-purpose Espresso algorithm (Pantel and Pennacchiotti 2006)
Notes: Isn't it obvious that no word segmentation is needed when using search logs? There is no need to say that the work is specific to Japanese.

Table of Contents
Related work
  Bootstrapping techniques for relation extraction
  Scoring metrics
The Tchai algorithm
  Problems of Espresso
  Extension to Espresso
Experiment
  System performance and comparison to other algorithms
  Samples of extracted instances and patterns

Bootstrapping
Iteratively conduct pattern induction and instance extraction, starting from seed instances
Can grow a small set of seed instances
[Figure: instances, query log (corpus), and contextual patterns; "#" marks the slot]
Instance | Query log (corpus) | Contextual pattern
vaio | Compare vaio laptop | Compare # laptop
Toshiba satellite | Compare toshiba satellite laptop |
HP xb3000 | Compare HP xb3000 laptop |
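The loop on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the lowercased toy queries, and the plain substring matching are all assumptions, and no scoring or filtering is applied.

```python
def induce_patterns(queries, instances):
    """Replace a known instance inside a query with the slot marker '#'."""
    patterns = set()
    for query in queries:
        for inst in instances:
            if inst in query and query != inst:
                patterns.add(query.replace(inst, "#", 1))
    return patterns


def extract_instances(queries, patterns):
    """Return whatever fills the '#' slot of a pattern in some query."""
    found = set()
    for query in queries:
        for pattern in patterns:
            prefix, _, suffix = pattern.partition("#")
            long_enough = len(query) > len(prefix) + len(suffix)
            if long_enough and query.startswith(prefix) and query.endswith(suffix):
                found.add(query[len(prefix):len(query) - len(suffix)])
    return found


def bootstrap(queries, seeds, iterations=1):
    """Alternate pattern induction and instance extraction (unscored)."""
    instances, patterns = set(seeds), set()
    for _ in range(iterations):
        patterns = induce_patterns(queries, instances)
        instances |= extract_instances(queries, patterns)
    return instances, patterns


# Toy query log modelled on the slide (hypothetical data).
toy_queries = [
    "compare vaio laptop",
    "compare toshiba satellite laptop",
    "compare hp xb3000 laptop",
    "weather tokyo",
]
instances, patterns = bootstrap(toy_queries, {"vaio"})
```

Starting from the single seed "vaio", one iteration induces the pattern "compare # laptop" and picks up the other two laptop names, while the unrelated query is ignored.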

Instance lookup and pattern induction
[Figure: instance "ANA" → query log "ANA 予約" → extracted pattern "# 予約"]
Instance | Query log | Pattern | Count
ANA (All Nippon Airways) | ANA 予約 (reservation) | # 予約 | 644
ラスベガス (Las Vegas) | ラスベガス格安航空券 (discount flight ticket) | #格安航空券 | 140
 | ラスベガスホテル | #ホテル (hotel) | 114
…
Restaurant reservation? Flight reservation?
  Broad coverage, noisy patterns
Use every string in the query other than the instance as the pattern, so no segmentation is required
Semantic drift
Computational efficiency
Generic patterns
Notes: The order of instance and pattern does not match between the upper figure and the lower table. Explain generic patterns.
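As a rough illustration of pattern induction without segmentation, the sketch below treats everything in a query other than the instance as the pattern and sums query frequencies per pattern. The helper name `count_patterns` is hypothetical, and using the slide's counts as query frequencies is an assumption made only for this toy example.

```python
from collections import Counter


def count_patterns(query_counts, instance):
    """Count the patterns obtained by replacing `instance` with '#'.

    No word segmentation is performed: the rest of the query string,
    whatever it contains, becomes the pattern.
    """
    counts = Counter()
    for query, freq in query_counts.items():
        if instance in query and query != instance:
            counts[query.replace(instance, "#", 1)] += freq
    return counts


# Toy log: query string -> frequency (illustrative numbers from the slide).
toy_log = {
    "ANA 予約": 644,
    "ラスベガス格安航空券": 140,
    "ラスベガスホテル": 114,
}
ana_patterns = count_patterns(toy_log, "ANA")
vegas_patterns = count_patterns(toy_log, "ラスベガス")
```

Because "# 予約" (reservation) would match restaurant queries just as well as flight queries, a pattern like this is a candidate generic pattern.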

Instance/Pattern Scoring Metrics
Sekine & Suzuki (2007)
  Starts from a large named entity dictionary
  Assign low scores to generic patterns and ignore them
Basilisk (Thelen and Riloff, 2002)
  Balance the recall and precision of generic patterns
Espresso (Pantel and Pennacchiotti, 2006)
  PMI is normalized by the maximum over all P and I
  P: patterns in corpus
  I: instances in corpus
  PMI: pointwise mutual information
  r: reliability score
  The reliability of an instance and that of a pattern are mutually defined
Notes: It is better not to show the formulas. While the audience is still trying to understand them, the talk moves on to the next slide.
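The mutually defined reliability scores can be sketched as follows. This follows the general shape of the Espresso scores (PMI divided by the global maximum PMI, weighted by the other side's reliability, averaged over all instances or patterns). The PMI estimate from raw counts, the treatment of unseen pairs as contributing zero, and all names and toy data are assumptions of this sketch.

```python
import math


def pmi_table(cooc):
    """PMI for every observed (instance, pattern) pair, from raw counts."""
    total = sum(cooc.values())
    inst_total, pat_total = {}, {}
    for (inst, pat), c in cooc.items():
        inst_total[inst] = inst_total.get(inst, 0) + c
        pat_total[pat] = pat_total.get(pat, 0) + c
    return {
        (inst, pat): math.log(c * total / (inst_total[inst] * pat_total[pat]))
        for (inst, pat), c in cooc.items()
    }


def pattern_reliability(pmi, inst_rel):
    """r(p): average over instances of PMI / max PMI times r(i)."""
    max_pmi = max(pmi.values())  # global maximum; assumed positive
    instances = {inst for inst, _ in pmi}
    patterns = {pat for _, pat in pmi}
    return {
        pat: sum(pmi.get((inst, pat), 0.0) / max_pmi * inst_rel.get(inst, 0.0)
                 for inst in instances) / len(instances)
        for pat in patterns
    }


def instance_reliability(pmi, pat_rel):
    """r(i): the same formula with the roles of instance and pattern swapped."""
    max_pmi = max(pmi.values())
    instances = {inst for inst, _ in pmi}
    patterns = {pat for _, pat in pmi}
    return {
        inst: sum(pmi.get((inst, pat), 0.0) / max_pmi * pat_rel.get(pat, 0.0)
                  for pat in patterns) / len(patterns)
        for inst in instances
    }


# Toy counts (hypothetical): each instance co-occurs with one pattern.
toy_cooc = {("ANA", "# 予約"): 1, ("ラスベガス", "#ホテル"): 1}
toy_pmi = pmi_table(toy_cooc)
pat_rel = pattern_reliability(toy_pmi, {"ANA": 1.0})  # the seed has reliability 1
inst_rel = instance_reliability(toy_pmi, pat_rel)
```

In this toy case only the pattern that co-occurs with the seed receives a non-zero score, and that score then flows back to the instances it co-occurs with.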

The Tchai Algorithm
Filter generic patterns/instances
  Do not select generic patterns and instances
Replace the scaling factor in the reliability scores
  Take the maximum PMI for a given instance/pattern rather than the maximum over all instances and patterns
  This modification has a large impact on the effectiveness of our algorithm
Only induce patterns at the beginning
  Tchai runs 400 times faster than Espresso
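The change of scaling factor can be illustrated by computing pattern reliability twice, once normalized by the global maximum PMI and once by the maximum PMI among the pairs that involve the pattern being scored. Reading "for a given instance/pattern" as a per-pattern maximum here is an assumption, as are the function name, the toy PMI values, and the requirement that the maxima be positive.

```python
def pattern_reliability(pmi, inst_rel, local_max=True):
    """Pattern reliability with a local (per-pattern) or global PMI scale."""
    instances = {inst for inst, _ in pmi}
    global_max = max(pmi.values())
    scores = {}
    for pat in {p for _, p in pmi}:
        # PMI values of the pairs that involve this pattern only.
        row = {inst: v for (inst, p), v in pmi.items() if p == pat}
        scale = max(row.values()) if local_max else global_max
        scores[pat] = sum(
            v / scale * inst_rel.get(inst, 0.0) for inst, v in row.items()
        ) / len(instances)
    return scores


# Hypothetical PMI values and reliabilities, chosen for easy arithmetic.
toy_pmi = {
    ("ANA", "# 予約"): 1.0,
    ("JAL", "# 予約"): 0.5,
    ("ANA", "#格安航空券"): 4.0,
}
toy_rel = {"ANA": 1.0, "JAL": 1.0}
local_scores = pattern_reliability(toy_pmi, toy_rel, local_max=True)
global_scores = pattern_reliability(toy_pmi, toy_rel, local_max=False)
```

With the global scale, the single large PMI value shrinks the score of "# 予約" to 0.1875; with the local scale it is 0.75, so one extreme pair no longer dominates every other score.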

Experiments
Japanese query logs from 2007/01-02
  1 million unique queries (166 million tokens)
Target categories
  Manually classified 10,000 most frequent search words (in the log of 2006/12), hereafter referred to as the 10K list
  23 categories
  Travel: the largest category (712 words)
  Finance: the smallest category (240 words)
Category | Seeds (with English translation)
Travel | JAL (Japan Airlines), ANA (All Nippon Airways), JR (Japan Railways), じゃらん (travel information site), HIS (travel agency)
Finance | みずほ銀行 (Mizuho Bank), 三井住友銀行 (Sumitomo Mitsui Banking Corporation), JCB, 新生銀行 (Shinsei Bank), 野村證券 (Nomura Securities)
Notes: Explain JAL, ANA, and so on.

Results
Travel
  High precision (92.1%)
  Learned 251 novel words
  Columns: 10K list (Travel, Not Travel), Not in 10K list
  Travel: 280, 17, 251
  Not Travel: 7, 125
  Due to the ambiguity of hand labeling (e.g. Tokyo Disney Land)
  Include common nouns related to Travel (e.g. Rental car)
Finance
  Columns: 10K list (Finance, Not Finance), Not in 10K list
  Values in slide order: 41, 30, 5, 99
Notes: The text overlaps with the 10K list. The 10K list still seems hard to understand. Which part is the gold standard and which part is what we evaluated? Where can one see that low-frequency words were acquired?
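The Travel figures can be checked with simple arithmetic. The reading used below is an assumption: the first row (280, 17, 251) is taken to be extracted instances judged to be Travel, the second row (7, 125) those judged not Travel, and 280, 17, and 7 are taken to fall inside the 10K list. Under that reading the slide's 92.1% and the 80.6% overall Travel precision reported for Tchai both come out.

```python
# Counts from the Travel table, under the assumed reading described above.
in_list_labeled_travel = 280   # in the 10K list, hand-labeled Travel
in_list_judged_travel = 17     # in the 10K list, labeled otherwise, judged Travel
novel_travel = 251             # not in the 10K list, judged Travel
in_list_not_travel = 7         # in the 10K list, judged not Travel
novel_not_travel = 125         # not in the 10K list, judged not Travel

in_list_total = in_list_labeled_travel + in_list_judged_travel + in_list_not_travel
extracted_total = in_list_total + novel_travel + novel_not_travel

# Agreement with the hand-labeled 10K list.
precision_vs_10k = 100 * in_list_labeled_travel / in_list_total

# Precision over everything extracted, counting all instances judged Travel.
correct = in_list_labeled_travel + in_list_judged_travel + novel_travel
precision_overall = 100 * correct / extracted_total
```

The total of 680 extracted instances matches the instance count reported for Tchai on Travel in the system comparison.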

Sample of Instances (Travel category)
Type | Examples (with translation)
Place | トルコ (Turkey), ラスベガス (Las Vegas), バリ島 (Bali Island)
Travel agency | Jtb, トクー (www.tocoo.jp), yahoo (Yahoo! Travel), net cruiser
Attraction | ディズニーランド (Disneyland), usj (Universal Studio Japan)
Hotel | 帝国ホテル (Imperial Hotel), リッツ (Ritz Hotel)
Transportation | 京浜急行 (Keihin Express), 奈良交通 (Nara Kotsu Bus Lines)
Able to learn several sub-categories for which no seed words were given

System Performance
Travel
System | Instances | Precision | Rel. Recall
Basilisk | 651 | 63.4% | 1.26
Espresso | 500 | 65.6% | 1.00
Tchai | 680 | 80.6% | 1.67
High precision and recall
Finance
System | Instances | Precision | Rel. Recall
Basilisk | 278 | 27.3% | 0.70
Espresso | 704 | 15.2% | 1.00
Tchai | 223 | 35.0% | 0.73
High precision but low relative recall due to strict filtering
Relative Recall (Pantel et al., 2004)
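Relative recall compares the number of correct instances found by one system with the number found by a baseline, where each count is estimated as precision times the number of extracted instances. The sketch below applies that ratio to the Travel rows with Espresso as the baseline; the function name is hypothetical. Because the slide's precisions are rounded, values recomputed this way can differ from the table in the last digit.

```python
def relative_recall(precision_a, size_a, precision_b, size_b):
    """Estimated correct instances of system A divided by those of baseline B."""
    return (precision_a * size_a) / (precision_b * size_b)


# Travel rows from the slide; Espresso is the baseline (relative recall 1.00).
espresso = (0.656, 500)
basilisk_travel = relative_recall(0.634, 651, *espresso)
tchai_travel = relative_recall(0.806, 680, *espresso)
```

For Travel this gives about 1.26 for Basilisk and 1.67 for Tchai, in line with the table.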

Cumulative precision: Travel
Tchai achieved the best precision

Sample Extracted Patterns
Basilisk and Espresso extracted location names as context patterns, which may be too generic for the Travel domain
System | Sample patterns (with English translation)
Basilisk | #東日本 (east_japan), #西日本 (west_japan), p#sonic, #時刻表 (timetable), #九州 (Kyushu), #+マイレージ (mileage), #バス (bus), google+#lytics, #+料金 (fare), #+国内 (domestic), #ホテル (hotel)
Espresso | #バス (bus), 日本# (Japan), #ホテル (hotel), #道路 (road), #イン (inn), フジ# (Fuji), #東京 (Tokyo), #料金 (fare), #九州 (Kyushu), #時刻表 (timetable), #+旅行 (travel), #+名古屋 (Nagoya)
Tchai | #+ホテル (hotel), #+ツアー (tour), #+旅行 (travel), #予約 (reserve), #+航空券 (flight_ticket), #+格安航空券 (discount_flight_ticket), #マイレージ (mileage), 羽田空港+# (Haneda Airport)
Tchai found context patterns that are characteristic of the domain

Conclusion and future work
Use of query logs for semantic category learning
Improved the Espresso algorithm in both precision and performance
Future work
  Generalize the bootstrapping method by graph-based matrix calculation

Thank you for listening!
Tchai