Minimally Supervised Learning of Semantic Knowledge from Query Logs

Mamoru Komachi (†) and Hisami Suzuki (‡)
(†) Nara Institute of Science and Technology, Japan
(‡) Microsoft Research, USA
IJCNLP-08, Hyderabad, India
2018/11/21
Speaker notes: The slot is 30 minutes, but aim to finish in about 22 minutes, 25 at most. Mention that this work was done during an internship at MSR.

Task
Learn semantic categories from web search query logs by bootstrapping with minimal supervision.
Semantic category: a set of interrelated words (named entities, technical terms, paraphrases, …).
Figure: Darjeeling, Chai (Indian tea), and Kombucha (Japanese tea), connected by "similar" links.
Can be useful for search ads, etc.
Speaker notes: The example is weak, and search ads are only one of many applications. The talk says it learns semantic categories but then moves on to instance acquisition, so it is unclear what a category actually is; an example is needed.

Our Contribution
First to use Japanese query logs for the task of learning named entities.
Propose an efficient method suited to query logs, based on the general-purpose Espresso algorithm (Pantel and Pennacchiotti 2006).
Speaker notes: Isn't it self-evident that no word segmentation is needed when using search logs? There is no need to say that this is specific to Japanese.

Table of Contents
Related work: bootstrapping techniques for relation extraction; scoring metrics
The Tchai algorithm: problems of Espresso; extension to Espresso
Experiment: system performance and comparison to other algorithms; samples of extracted instances and patterns

Bootstrapping
Iteratively conduct pattern induction and instance extraction, starting from seed instances.
Can grow a small set of seed instances into a much larger one.
Example (# marks the slot):
Seed instance: vaio
Query log (corpus): Compare vaio laptop; Compare toshiba satellite laptop; Compare HP xb3000 laptop
Contextual pattern: Compare # laptop
Extracted instances: Toshiba satellite, HP xb3000
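The loop on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `induce_patterns`, `extract_instances`, and `bootstrap` are hypothetical, matching is plain substring matching, and the scoring and filtering of patterns and instances is left out entirely.

```python
def induce_patterns(queries, instances):
    """Replace a known instance inside a query with the slot marker '#'."""
    patterns = set()
    for query in queries:
        for inst in instances:
            if inst in query and query != inst:
                patterns.add(query.replace(inst, "#", 1))
    return patterns


def extract_instances(queries, patterns):
    """Return the strings that fill the '#' slot of any known pattern."""
    found = set()
    for query in queries:
        for pattern in patterns:
            prefix, suffix = pattern.split("#", 1)
            if (query.startswith(prefix) and query.endswith(suffix)
                    and len(query) > len(prefix) + len(suffix)):
                found.add(query[len(prefix):len(query) - len(suffix)])
    return found


def bootstrap(queries, seeds, iterations=2):
    """Alternate pattern induction and instance extraction (no scoring)."""
    instances, patterns = set(seeds), set()
    for _ in range(iterations):
        patterns |= induce_patterns(queries, instances)
        instances |= extract_instances(queries, patterns)
    return instances, patterns


# Toy query log taken from the slide's example.
log = [
    "Compare vaio laptop",
    "Compare toshiba satellite laptop",
    "Compare HP xb3000 laptop",
]
instances, patterns = bootstrap(log, {"vaio"})
print(sorted(instances), sorted(patterns))
```

Starting from the single seed `vaio`, the sketch induces the pattern `Compare # laptop` and then picks up the other two laptop names through it. Without scoring, every matching string is accepted, which is exactly where generic patterns and semantic drift become a problem.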

Instance lookup and pattern induction
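Look up each instance in the query log and take the rest of the query as the extracted pattern. Example: instance ANA, query log entry ANA 予約, extracted pattern # 予約.

Instance | Query log | Pattern | Count
ANA (All Nippon Airways) | ANA 予約 (reservation) | # 予約 | 644
ラスベガス (Las Vegas) | ラスベガス格安航空券 (discount flight ticket) | #格安航空券 | 140
ラスベガス (Las Vegas) | ラスベガスホテル (hotel) | #ホテル | 114
…

Generic patterns are ambiguous: is # 予約 a restaurant reservation or a flight reservation? They give broad coverage but are noisy.
Use all strings other than the instance as the pattern, so no segmentation is required.
Issues: generic patterns, semantic drift, computational efficiency.
Speaker notes: The order of instance and pattern does not match between the upper and lower figures. Explain generic patterns.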
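Because a pattern is simply "everything in the query except the instance", induction works directly on unsegmented Japanese strings. The sketch below illustrates that idea with a frequency table; the helper name `count_patterns` is hypothetical, and the query frequencies reuse the counts shown on the slide purely as toy input.

```python
from collections import Counter


def count_patterns(query_freq, instance):
    """Count patterns obtained by replacing `instance` with '#' in each query.

    query_freq maps a raw query string to its frequency in the log.
    No word segmentation is applied: the match is a plain substring test.
    """
    counts = Counter()
    for query, freq in query_freq.items():
        if instance in query and query != instance:
            counts[query.replace(instance, "#", 1)] += freq
    return counts


# Toy log (frequencies borrowed from the slide's table for illustration).
toy_log = {
    "ANA 予約": 644,
    "ラスベガス格安航空券": 140,
    "ラスベガスホテル": 114,
}
print(count_patterns(toy_log, "ラスベガス").most_common())
print(count_patterns(toy_log, "ANA").most_common())
```

A pattern such as `# 予約` would also match restaurant names, which is why a later step has to score or filter generic patterns.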

Instance/Pattern Scoring Metrics
Sekine & Suzuki (2007): starts from a large named entity dictionary; assigns low scores to generic patterns and ignores them.
Basilisk (Thelen and Riloff, 2002): balances the recall and precision of generic patterns.
Espresso (Pantel and Pennacchiotti, 2006): the reliability of an instance and the reliability of a pattern are mutually defined; PMI is normalized by the maximum over all P and I.
Notation: P = patterns in the corpus; I = instances in the corpus; PMI = pointwise mutual information; r = reliability score.
Speaker notes: Better not to show the equations; while the audience is still trying to understand them, the talk moves on.
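The equations themselves are not preserved in this transcript. For reference, the mutually defined reliability scores of Espresso, as given by Pantel and Pennacchiotti (2006) and adapted here to single-slot instances, can be written as follows. The subscripted symbols r_π (pattern reliability) and r_ι (instance reliability) are notation assumed from that paper rather than taken from the slide, and |i,p| denotes the frequency with which pattern p occurs with instance i, with * as a wildcard.

```latex
\[
r_\pi(p) = \frac{\displaystyle\sum_{i \in I} \frac{\mathrm{pmi}(i,p)}{\max \mathrm{pmi}} \, r_\iota(i)}{|I|},
\qquad
r_\iota(i) = \frac{\displaystyle\sum_{p \in P} \frac{\mathrm{pmi}(i,p)}{\max \mathrm{pmi}} \, r_\pi(p)}{|P|}
\]
\[
\mathrm{pmi}(i,p) = \log \frac{|i,p|}{|i,*| \, |*,p|}
\]
```

Here max pmi is the maximum PMI over all instances in I and all patterns in P, which is the normalization the slide refers to.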

The Tchai Algorithm
Filter generic patterns/instances: do not select generic patterns and instances.
Replace the scaling factor in the reliability scores: take the maximum PMI for a given instance/pattern rather than the maximum over all instances and patterns. This modification has a large impact on the effectiveness of the algorithm.
Only induce patterns at the beginning: Tchai runs 400X faster than Espresso.
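The effect of the scaling factor can be illustrated with a small sketch. It reflects one plausible reading of the slide, not the authors' code: the function `pattern_reliability` is hypothetical, the PMI values are made-up positive numbers, and every instance reliability is fixed at 1.0 to keep the arithmetic visible. With `local_max=True` each pattern is scaled by its own maximum PMI; with `local_max=False` the single global maximum is used.

```python
def pattern_reliability(pmi, inst_rel, local_max=True):
    """Score each pattern as the average of scaled PMI times instance reliability.

    pmi      : {(instance, pattern): PMI value}, assumed positive here
    inst_rel : {instance: reliability score}
    """
    instances = {i for i, _ in pmi}
    patterns = {p for _, p in pmi}
    global_max = max(pmi.values())
    scores = {}
    for p in patterns:
        row = {i: v for (i, q), v in pmi.items() if q == p}
        # Scaling factor: per-pattern maximum, or the global maximum.
        scale = max(row.values()) if local_max else global_max
        total = sum(v / scale * inst_rel.get(i, 0.0) for i, v in row.items())
        scores[p] = total / len(instances)
    return scores


# Made-up PMI values for illustration only.
toy_pmi = {
    ("ANA", "#予約"): 2.0,
    ("JAL", "#予約"): 1.0,
    ("ANA", "#ホテル"): 4.0,
}
toy_rel = {"ANA": 1.0, "JAL": 1.0}

local_scores = pattern_reliability(toy_pmi, toy_rel, local_max=True)
global_scores = pattern_reliability(toy_pmi, toy_rel, local_max=False)
print(local_scores, global_scores)
```

In this toy case the global maximum (4.0) comes from one pattern and pushes the score of `#予約` down to 0.375, while per-pattern scaling gives it 0.75, so a single large PMI value elsewhere no longer dominates every score.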

Experiments
Japanese query logs from 2007/01-02: one million unique queries (166 million in token count).
Target categories: manually classified the 10,000 most frequent search words (in the log of 2006/12), hereafter referred to as the 10K list. 23 categories.
Travel: the largest category (712 words)
Finance: the smallest category (240 words)

Category | Seeds (with English translation)
Travel | JAL (Japan Airlines), ANA (All Nippon Airways), JR (Japan Railways), じゃらん (travel information site), HIS (travel agency)
Finance | みずほ銀行 (Mizuho Bank), 三井住友銀行 (Sumitomo Mitsui Banking Corporation), JCB, 新生銀行 (Shinsei Bank), 野村證券 (Nomura Securities)

Speaker notes: Explain JAL, ANA, and the other seeds.

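Results
Travel: high precision (92.1%); learned 251 novel words.
Travel table (column groups: 10K list with Travel / Not Travel, and Not in 10K list; cell values in slide order): 280, 17, 251, 7, 125
Annotations on the Travel table: due to the ambiguity of hand labeling (e.g. Tokyo Disney Land); includes common nouns related to Travel (e.g. rental car).
Finance table (column groups: 10K list with Finance / Not Finance, and Not in 10K list; cell values in slide order): 41, 30, 5, 99
Speaker notes: Text overlaps the "10K list" label. The 10K list still seems hard to understand. Which part is the gold standard and which part is what was evaluated? Where can one see that low-frequency words were acquired?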

Sample of Instances (Travel category)
Type | Examples (with translation)
Place | トルコ (Turkey), ラスベガス (Las Vegas), バリ島 (Bali Island)
Travel agency | Jtb, トクー ( yahoo (Yahoo! Travel), net cruiser
Attraction | ディズニーランド (Disneyland), usj (Universal Studios Japan)
Hotel | 帝国ホテル (Imperial Hotel), リッツ (Ritz Hotel)
Transportation | 京浜急行 (Keihin Express), 奈良交通 (Nara Kotsu Bus Lines)

Able to learn several sub-categories for which no seed words were given.

System Performance

Travel
System | Instances | Precision | Rel. Recall
Basilisk | 651 | 63.4% | 1.26
Espresso | 500 | 65.6% | 1.00
Tchai | 680 | 80.6% | 1.67
Tchai: high precision and recall.

Finance
System | Instances | Precision | Rel. Recall
Basilisk | 278 | 27.3% | 0.70
Espresso | 704 | 15.2% | 1.00
Tchai | 223 | 35.0% | 0.73
Tchai: high precision but low relative recall due to strict filtering.

Relative recall follows Pantel et al. (2004).
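Relative recall (Pantel et al., 2004) compares the number of correct instances found by system A with that of a baseline B: R(A|B) = (P_A × |A|) / (P_B × |B|), where P is precision and |·| is the number of extracted instances. The short sketch below recomputes the Travel column with Espresso as the baseline; the function name `relative_recall` is an illustrative choice, and the inputs are the rounded figures from the table.

```python
def relative_recall(prec_a, n_a, prec_b, n_b):
    """Relative recall of system A against baseline B.

    prec_* are precisions in [0, 1]; n_* are numbers of extracted instances.
    """
    return (prec_a * n_a) / (prec_b * n_b)


# Travel category, Espresso as the baseline (500 instances, 65.6% precision).
basilisk = relative_recall(0.634, 651, 0.656, 500)
tchai = relative_recall(0.806, 680, 0.656, 500)
print(round(basilisk, 2), round(tchai, 2))
```

Since the inputs are rounded precisions, recomputed values can differ from the reported ones in the last digit.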

Cumulative precision: Travel
Tchai achieved the best precision.

Sample Extracted Patterns
Basilisk and Espresso extracted location names as context patterns, which may be too generic for the Travel domain.

System | Sample Patterns (with English translation)
Basilisk | #東日本 (east_japan), #西日本 (west_japan), p#sonic, #時刻表 (timetable), #九州 (Kyushu), #+マイレージ (mileage), #バス (bus), google+#lytics, #+料金 (fare), #+国内 (domestic), #ホテル (hotel)
Espresso | #バス (bus), 日本# (Japan), #ホテル (hotel), #道路 (road), #イン (inn), フジ# (Fuji), #東京 (Tokyo), #料金 (fare), #九州 (Kyushu), #時刻表 (timetable), #+旅行 (travel), #+名古屋 (Nagoya)
Tchai | #+ホテル (hotel), #+ツアー (tour), #+旅行 (travel), #予約 (reserve), #+航空券 (flight_ticket), #+格安航空券 (discount_flight_ticket), #マイレージ (mileage), 羽田空港+# (Haneda Airport)

Tchai found context patterns that are characteristic of the domain.

Conclusion and future work
Use of query logs for semantic category learning.
Improved the Espresso algorithm in both precision and performance.
Future work: generalize the bootstrapping method by graph-based matrix calculation.

Thank you for listening!
