Example-based Machine Translation without Saying Inferable Predicate


Presentation transcript:

Example-based Machine Translation without Saying Inferable Predicate
Eiji Aramaki* **, Sadao Kurohashi* **, Hideki Kashioka**, Hideki Tanaka***
* University of Tokyo ** ATR *** NHK
I am Eiji Aramaki from the University of Tokyo and ATR. The title of my talk is “Example-based machine translation without saying inferable predicate.”

EBMT Framework
EBMT using Structural Translation Examples
Input: カナダで (Canada) 開かれた (be held) 会議で (conference)
Translation Examples: カナダで 開かれた 五月初めに / to be held in Canada; 会議で 議論されました / conference for negotiations
For each phrase in an input, a plausible TE is retrieved
English phrases are combined to generate output
Output:
conference
└ held
  └ in Canada
We are working on example-based machine translation (EBMT) using structural translation examples (TEs). First I’ll explain our system framework. For each phrase in an input sentence, a plausible translation example is retrieved. Then the English phrases in the retrieved translation examples are combined to generate an output sentence. In this slide, the system generates a sentence like this.

Learning Inferable Expression
Purpose
Redundant:
INPUT : カナダで 開かれる 会議で (Canada) (be held) (conference)
OUTPUT: a conference held in Canada
HUMAN : a conference in Canada
Failure:
INPUT : 観光シーズンを 迎えた 北海道では (tourism season) (start) (Hokkaido)
OUTPUT: in Hokkaido greeted tourism season
HUMAN : in Hokkaido as the tourism season
In the previous slide, the system generated “a conference held in Canada.” However, a human translation is sometimes different from the system output. For example, a human translates this simply as “a conference in Canada.” As this example shows, machine translations tend to be redundant. This is another example. The verb phrase “greeted” makes this sentence strange and unreadable. Such redundant verb phrases often cause translation failures like this. So, in this research, our purpose is to learn inferable predicates and avoid this redundancy.

Outline
Verb phrase (VP) aligned corpus
  How to build VP-aligned-corpus
  Investigation of VP-aligned-corpus
Method
Experiment
Conclusion
This is the outline of this talk. We made a verb-phrase-aligned corpus to investigate how verb phrases are translated. First, I will explain how we built the corpus and what our investigation of it showed. Then I will explain the proposed method and the experiments. Finally, I will conclude my talk.

VP-aligned-corpus
Motivation: Verb-phrase translation is difficult ⇒ To learn how to translate verb-phrases, we annotated verb-phrases and their correspondences.
Process: Estimate alignment automatically, then modify alignment by hand
Target: NHK-News-Corpus
Verb-phrase translation is difficult. As I’ve mentioned before, redundant verb phrases cause translation errors. Therefore, in order to learn how to translate verb phrases, we annotated verb phrases and their correspondences in a corpus. We call this corpus the “verb-phrase-aligned corpus.” To build it, we first estimated the alignment automatically. Then we modified the alignment by hand. As the annotation target, we used the NHK-News-Corpus.

NHK-News-Corpus
NHK is a Japanese broadcasting corporation
Japanese articles are translated to be natural as English news ⇒ not parallel corpus
Example article:
田植えフェスティバル石川県輪島市で外国の大使や一般の参加者など千人あまりが急な斜面の棚田で田植えを体験する催しが行われました。
Ambassadors and diplomats from 37 countries took part in a rice planting festival on Sunday in small paddies on steep hillsides in Wajima, central Japan.
輪島市白米町には(しろよねまち)千枚田と呼ばれる(せんまいだ)大小二千百枚の棚田が急な斜面から海に向かって拡がっています。
About one-thousand people gathered at the hill, where some two-thousand 100 miniature paddies, called Senmaida, stretch toward the Sea of Japan.
田植え体験は農作業を通して米作りの意義などを考えていこうという地球環境平和財団の呼び掛けで開かれたもので、海外三十四ヵ国の大使や書記官、それに一般の参加者ら合わせておよそ千人が集まりました。
The event was organized by the private Foundation for Global Peace and Environment.
田植えに使われた苗は去年の秋、天皇陛下が皇居で収穫された稲籾から育てたものです。
The rice seedlings are grown from grain harvested by the Emperor at the Imperial Palace in Tokyo last autumn.
参加者たちは裸足になって水田に足を踏み入れ地元に伝わる田植え歌に合わせて慣れない手つきで苗を植えていました。
Barefoot participants waded into the paddies to plant the seedlings by hand while singing a local folk song about the practice of rice planting.
きょうの輪島市は雲が広がったもののまずまずの天気となり、出席された高円宮さまも海からの風に吹かれながら田植えに加わっていました。
地球環境平和財団では今年の夏休みに全国の子どもたちを対象に草刈りや生きものの観察会を開く他、秋には稲刈体験を行なう予定にしています。
NHK is a Japanese broadcasting corporation that provides services mainly in Japanese and also in other languages such as English. The original news is written in Japanese, and it is translated so that it reads naturally as English news. So it is not a parallel corpus.

NHK-News-Corpus
Some phrases have no parallel expressions in the other language.
Some phrases have no parallel expression in the other language. In the article, such expressions are shown in red.

Alignment Estimation
Sentence alignment using translation dictionaries (DP matching)
Extract 1-to-1 sentence pairs
Parse and phrasal alignment [Aramaki et al., 2003]
First, we estimated the sentence alignment. We used a conventional DP matching method with 5 translation dictionaries. Then we extracted the 1-to-1 sentence pairs. After that, we parsed the sentences with Japanese and English parsers and estimated the phrasal alignment.
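
The sentence-alignment step can be illustrated with a small sketch. This is not the authors' implementation: it assumes sentences are already tokenized, uses a single toy dictionary instead of the 5 translation dictionaries, scores a sentence pair by simple dictionary overlap, and only considers 1-to-1 matches or skips. The names `overlap_score` and `align_sentences` are hypothetical.

```python
def overlap_score(ja_words, en_words, dictionary):
    """Count Japanese words that have a dictionary translation in the English sentence."""
    en_set = set(en_words)
    return sum(1 for w in ja_words if dictionary.get(w, set()) & en_set)


def align_sentences(ja_sents, en_sents, dictionary):
    """Monotone DP alignment; returns (ja_index, en_index) for 1-to-1 pairs."""
    n, m = len(ja_sents), len(en_sents)
    # best[i][j]: best total score for the first i Japanese and first j English sentences
    best = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:  # leave Japanese sentence i-1 unaligned
                candidates.append((best[i - 1][j], "skip_ja"))
            if j > 0:  # leave English sentence j-1 unaligned
                candidates.append((best[i][j - 1], "skip_en"))
            if i > 0 and j > 0:
                score = overlap_score(ja_sents[i - 1], en_sents[j - 1], dictionary)
                if score > 0:  # pair only when the dictionary supports it
                    candidates.append((best[i - 1][j - 1] + score, "pair"))
            best[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    # Trace back from the end to recover the 1-to-1 pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "pair":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_ja":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


# Toy data, invented for this sketch.
toy_dictionary = {"会議": {"conference"}, "カナダ": {"canada"},
                  "苗": {"seedlings"}, "天皇": {"emperor"}}
ja = [["会議", "カナダ"], ["天気"], ["苗", "天皇"]]
en = [["conference", "in", "canada"], ["seedlings", "emperor"]]
pairs = align_sentences(ja, en, toy_dictionary)
```

In this toy run the second Japanese sentence has no English counterpart and is skipped, which mirrors the point that the NHK articles are not strictly parallel.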

Alignment Modification
We modified alignment by hand (5,500 sentence-pairs)
(1) VP or not. VP := a phrase that includes a verb or an adjective that has an argument
(2) For each VP, its corresponding phrases, or a mark that the VP has no corresponding phrases
The automatic estimation includes some errors, so we modified the alignment by hand. We modified about 5,500 sentence pairs on the following two points. First, whether a phrase is a verb phrase or not: we regard a phrase that includes a verb or an adjective with an argument as a verb phrase. Second, for each verb phrase, we annotated its corresponding phrases in the other language. This is the annotation tool. Some verb phrases do not have corresponding phrases. In such a case, we annotated them with this mark, which means that there is no corresponding phrase.

Investigation of Corpus
Where does a Japanese phrase correspond in English?

Japanese : English                 #
VP : VP                         9779
VP : φ                          6831
  Condensed Alignment Pattern   1149
  Others                        5682
VP : NP or PP                    716
VP : Others                      319

Example: カナダで (Canada) 開かれた (be held) 通商会議で (trade conference) ⇔ at a trade conference in Canada
Through this process, we obtained a verb-phrase-aligned corpus. Then we counted the correspondences from the viewpoint of where a Japanese verb phrase corresponds in English. The result is shown in this table. Many Japanese verb phrases have no corresponding phrase in English; their number is shown in the VP : φ row. Some of them occur in this alignment pattern, which consists of three Japanese phrases and two English phrases. We call this alignment pattern a “Condensed Alignment Pattern,” or simply a “CAP.”

Investigation of Corpus
VP-φ often occurs with Condensed Alignment Pattern (or CAP)
CAP: C カナダで (Canada), VP 開かれた (be held), P 通商会議で (trade conference) ⇔ P’ at a trade conference, C’ in Canada
If a VP in a CAP is always inferable ⇒ We can achieve a compact translation by omitting the VP.
If a verb phrase in a CAP is always inferable, and therefore redundant to translate, we can achieve a compact translation by omitting that verb phrase.

VP in CAP is inferable
We randomly extracted 80 CAPs and checked whether their VPs are inferable or not.

Inferable                    56
Not inferable
  Parse Error                 3
  Alignment Error            11
  Phrase Chunking Error       1
  Others                      9

In order to examine this assumption, we randomly extracted 80 CAPs from the corpus. Then we manually checked whether the verb phrases in the CAPs are inferable or not. As shown in this table, apart from some errors (for example, parse errors, alignment errors, and so on), almost all verb phrases in CAPs are inferable: 56 of the 80 were judged inferable, and 15 of the remaining 24 are parse, alignment, or chunking errors.

VP in CAP is inferable
We randomly extracted 80 CAPs and checked whether their VPs are inferable or not.

Inferable
  P-CONTEXT                  21
  C-CONTEXT                  16
  PC-CONTEXT                 19
Not inferable
  Parse Error                 3
  Alignment Error            11
  Phrase Chunking Error       1
  Others                      9

Furthermore, we can classify the inferable CAPs into 3 types: P-context, C-context, and PC-context. I will explain these three classes.

P (Parent)-CONTEXT
カナダで 開かれた 会議で ~ (Canada) (be held) (conference)
at a conference in Canada …
C カナダで (Canada), VP 開かれた (be held), P 会議で (conference) ⇔ P’ at a conference, C’ in Canada
This is a P-context example. As I’ve mentioned before, the verb phrase “be held” is inferable in this context. We think that this verb phrase can be inferred from its parent phrase, “conference.” We call such a pattern P-context.

C (Child)-CONTEXT
肺の病気に かかった 男性の ~ (lung disease) (suffer) (man)
the man with a lung disease …
肺の (lung) C 病気に (disease), VP かかった (suffer), P 男性の (man) ⇔ P’ the man, C’ with a lung disease
This is a C-context example. The Japanese sentence means “the man who is suffering from a lung disease,” and the English sentence is simply “the man with a lung disease.” In this case, the verb phrase “suffer” can be inferred from its child phrase, “disease.” We call such a pattern C-context.

PC-CONTEXT
各国から 派遣された 救助チームの ~ (countries) (sent) (rescue team)
rescue teams from countries …
C 各国から (countries), VP 派遣された (sent), P 救助チームの (rescue team) ⇔ P’ rescue teams, C’ from countries
This is a PC-context example. The Japanese sentence means “rescue teams which were sent from countries,” and the English sentence is simply “the rescue teams from countries.” In this case, the verb phrase “sent” can be inferred from both “rescue team” and “countries.” So we call this pattern PC-context.

VP in CAP is inferable
These three classifications are sometimes subjective. So I have to say that the numbers are not rigid and could be slightly different.

Outline
Verb phrase aligned corpus
Method
Experiments
Conclusion
Next, I will explain our method for realizing compact translation using condensed alignment patterns.

Basic Idea
Incorporate CAPs into translation examples
Generalize TEs: P-CONTEXT ⇒ Generalize C; C-CONTEXT ⇒ Generalize P
Alignment: カナダで (Canada) 開かれた (be held) 会議で (conference) ⇔ in Canada, at a conference
Previous translation examples: カナダで (Canada) ⇔ in Canada; 会議で (conference) ⇔ at a conference
Translation example with CAP: カナダで (Canada) 開かれた (be held) 会議で (conference) ⇔ at a conference in Canada
Generalized translation example: X 開かれた (be held) 会議で (conference) ⇔ at a conference X
Our translation system generates translation examples from alignments like this. Our previous system could not handle an entire condensed alignment pattern: because this verb phrase is not aligned, two small translation examples were generated. But based on the investigation mentioned before, we can regard the whole pattern as a correspondence. So we newly incorporate CAPs into translation examples like this. Furthermore, we generalize this translation example. As I’ve mentioned before, the verb phrase “be held” can be inferred from “conference,” so we can generalize the other phrase like this. For this generalization, we have to classify condensed alignment patterns automatically.
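
The generalization rule on this slide can be sketched as follows. This is only an illustration under stated assumptions: the `CAP` container, the `SLOT` marker, and the choice to replace a whole phrase (including its particle or preposition) by the variable are hypothetical, and leaving PC-CONTEXT patterns unchanged is an assumption that the talk does not spell out.

```python
from typing import NamedTuple

SLOT = "X"  # marker for a generalized phrase (name chosen for this sketch)


class CAP(NamedTuple):
    """Hypothetical container for one condensed alignment pattern."""
    c: str     # Japanese child phrase, e.g. "カナダで"
    vp: str    # Japanese verb phrase with no English counterpart
    p: str     # Japanese parent phrase, e.g. "会議で"
    p_en: str  # English phrase P'
    c_en: str  # English phrase C'


def generalize(cap: CAP, context: str) -> CAP:
    """P-CONTEXT generalizes the child side; C-CONTEXT generalizes the parent side."""
    if context == "P-CONTEXT":
        return cap._replace(c=SLOT, c_en=SLOT)
    if context == "C-CONTEXT":
        return cap._replace(p=SLOT, p_en=SLOT)
    return cap  # assumption: PC-CONTEXT patterns are kept as they are


cap = CAP("カナダで", "開かれた", "会議で", "at a conference", "in Canada")
generalized = generalize(cap, "P-CONTEXT")
```

Here the verb phrase is inferable from the parent “conference,” so only the child slot becomes a variable and the pattern can match conferences held in other places.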

Automatic Classification
Divide CAPs into 2 fragments and count frequencies
C ロンドンで (London), VP 開かれた (be held), P 会議で (conference) ⇔ C’ in London, P’ at a conference
C シンガポールで (Singapore), VP 開かれた (be held), P 会議で (conference) ⇔ C’ in Singapore, P’ at a conference
We classify the patterns automatically. First, we divide a CAP into two fragments, the parent fragment and the child fragment, and we count their frequencies in the corpus. We think that if a CAP is P-CONTEXT, we can collect many occurrences of its parent fragment, as in this example: “conference held in Singapore” and “conference held in London.”
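
The fragment counting can be sketched like this. It is a simplified illustration, not the authors' code: each CAP is reduced to a (child, verb, parent) triple of Japanese base forms, the English sides are dropped, and the toy CAP list is invented for the example.

```python
from collections import Counter


def count_fragments(caps):
    """Split each CAP into a child fragment (verb, C) and a parent fragment (verb, P)."""
    child_freq = Counter()
    parent_freq = Counter()
    for child, verb, parent in caps:
        child_freq[(verb, child)] += 1
        parent_freq[(verb, parent)] += 1
    return child_freq, parent_freq


# Toy CAPs, invented for this sketch: (child C, verb, parent P)
caps = [
    ("ロンドン", "開く", "会議"),
    ("シンガポール", "開く", "会議"),
    ("カナダ", "開く", "会合"),
]
child_freq, parent_freq = count_fragments(caps)
print(parent_freq[("開く", "会議")], child_freq[("開く", "ロンドン")])
```

The parent fragment is shared by the London and Singapore CAPs, so its count grows faster than that of either child fragment.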

Fragment Frequency
Fragments which include “開く (held)”

Child fragments
V     C             C’          Freq
開く  カナダ        Canada      4
開く  リヨン        Lyon
開く  シンガポール  Singapore   3
開く  ロンドン      London

Parent fragments
V     P             P’          Freq
開く  会議          conference  17
開く  会合          meeting     2
開く  集会          gathering   1

This shows the frequencies of fragments that include “開く (held).” Because conferences are held in many places, the parent fragment occurs more often. In this example, the parent fragment frequency is 17, and the child fragment frequency is only 4.

Automatic Classification
Divide CAPs into 2 fragments and count frequencies
freq(C) > freq(P) × C ⇒ C-CONTEXT
freq(P) > freq(C) × C ⇒ P-CONTEXT
Otherwise ⇒ PC-CONTEXT
(The multiplier C is a parameter, set to 2.)
Example: freq(C) = 4, freq(P) = 17
So this pattern is classified as P-CONTEXT by this rule, because 17 > 4 × 2. In the following experiments, we set the parameter C to 2.
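
The rule on this slide is small enough to write down directly. In this sketch the function name `classify_cap` and the argument names are hypothetical, and the multiplier is called `threshold` to avoid confusion with the child fragment C.

```python
def classify_cap(freq_c: int, freq_p: int, threshold: int = 2) -> str:
    """Classify a CAP from its child and parent fragment frequencies."""
    if freq_c > freq_p * threshold:
        return "C-CONTEXT"
    if freq_p > freq_c * threshold:
        return "P-CONTEXT"
    return "PC-CONTEXT"


# Frequencies from the slide: freq(C) = 4, freq(P) = 17
label = classify_cap(freq_c=4, freq_p=17)
print(label)  # P-CONTEXT, because 17 > 4 * 2
```

Because both comparisons are strict, a pattern whose frequencies differ by exactly the threshold factor falls into PC-CONTEXT.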

Translation Framework
Input: 東京で (Tokyo) 開かれた (be held) 会議で (conference)
Translation Examples: 東京 (Tokyo) ⇔ Tokyo
Translation Examples (=CAP): X 開かれた (be held) *会議で (* conference), with “in Canada” generalized to X
Output: at a conference in Tokyo
This is the translation framework. The system uses both conventional translation examples and generalized patterns. Then they are combined into an output sentence.
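
How a generalized pattern and a conventional translation example might be combined can be sketched as below. This is a toy illustration under strong assumptions: matching is exact string matching on phrases, the pattern is a hypothetical pair of phrase tuples with a single slot, and the conventional example maps the whole phrase 東京で to “in Tokyo.” The function name `translate_with_cap` is invented for this sketch.

```python
SLOT = "X"  # marker for the generalized phrase (name chosen for this sketch)


def translate_with_cap(input_phrases, pattern, examples):
    """Match the input against a generalized CAP and fill its slot from ordinary examples."""
    ja_pattern, en_pattern = pattern
    if len(input_phrases) != len(ja_pattern):
        return None
    filler = None
    for phrase, expected in zip(input_phrases, ja_pattern):
        if expected == SLOT:
            filler = examples.get(phrase)  # translate the slot phrase separately
            if filler is None:
                return None
        elif phrase != expected:
            return None
    words = []
    for part in en_pattern:
        if part == SLOT:
            if filler is None:
                return None
            words.append(filler)
        else:
            words.append(part)
    return " ".join(words)


# Generalized P-CONTEXT pattern: the verb phrase has no English counterpart.
pattern = ((SLOT, "開かれた", "会議で"), ("at a conference", SLOT))
examples = {"東京で": "in Tokyo"}  # conventional translation example (assumed form)
output = translate_with_cap(["東京で", "開かれた", "会議で"], pattern, examples)
print(output)  # at a conference in Tokyo
```

The verb phrase 開かれた is matched on the Japanese side but contributes nothing to the English side, which is what makes the output compact.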

Outline
Verb phrase aligned corpus
Method
Experiments
Conclusion
Next, I’ll talk about the experimental results.

Preparation for Experiments
Extract 30,000 1-to-1 sentence pairs
Automatically estimate phrasal alignments
Extract and classify 4,219 CAPs

                 #
P-CONTEXT     1120
C-CONTEXT      297
PC-CONTEXT    2802

To prepare for the experiments, we extracted 30,000 1-to-1 sentence pairs from the NHK-News-Corpus. Then we automatically estimated their phrasal alignments, and we extracted and classified 4,219 CAPs. This table shows the result.

Experiments
Judgments of CAP classifications are a subjective task. ⇒ Evaluation by full translation using generalized CAPs
Translation direction: Japanese-English translation
Test set: 240 lead (top) sentences of NHK-News-Corpus
Evaluation: BLEU score (4 references)
Judging the classifications directly is too subjective. So we evaluated them through a full translation task using BLEU. We prepared a test set of 240 lead sentences, extracted randomly from the NHK-News-Corpus, together with four references made by NHK's professional translators.
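
To show what evaluation against several references involves, here is a sketch of clipped (modified) n-gram precision, the core quantity in BLEU. It is not the evaluation script used in the talk: full BLEU also combines several n-gram orders over the whole test set and applies a brevity penalty, and the whitespace tokenization and lowercasing here are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against multiple references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


candidate = "air show in maryland".split()
references = ["an air show in the us state of maryland".split()]
p1 = modified_precision(candidate, references, 1)
p2 = modified_precision(candidate, references, 2)
```

With four references per sentence, as in this test set, a candidate n-gram is credited if it appears in any of the references, up to the highest count found in a single one.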

Methods & Results
BASELINE : EBMT without CAPs
CAPMT : EBMT using CAPs
CAPMT+ : EBMT using generalized CAPs

Methods     BLEU
BASELINE    24.6
CAPMT       24.8
CAPMT+      25.0

We compared 3 methods. BASELINE is the EBMT system without CAPs, CAPMT is the system that uses un-generalized CAPs, and CAPMT+ is the system that uses generalized CAPs. The result is shown in this table. The proposed method achieved a small improvement, from 24.6 to 25.0 BLEU. Because our method changes only small parts of a translation, we think that this demonstrates the basic feasibility of our approach.

Example
INPUT    アメリカのメリーランド州で十四日に行われた航空ショーで・・・
REF      An air show in the US state of Maryland on the 14th
BASELINE Air show was held in maryland of the united states on the 14th ...
CAP+     Air show in maryland of the united states on the 14th ...
This is an output example. Our proposed method does not translate the inferable verb 行われた (rendered as “was held” in the baseline output), and its output comes closer to the human reference.

Example
INPUT    二十五日に行われる日韓首脳会談に ...
REF      ... summit due to be held on the 25th
BASELINE ... summit meeting conducted on 25th
CAP+     ... summit meeting on 25th
This is an error example. In this example, our system does not translate the verb “held.” On the other hand, the human reference translates it. One of the reasons is that this news describes the schedule of the summit meeting, which will be held in the future. In that case, a news translation says it like this: “summit due to be held on the 25th.”

Outline
Verb phrase aligned corpus
Method
Experiments
Conclusion
Let me conclude this talk.

Conclusion
VP-aligned-Corpus
  How to build the corpus
  Investigations of the corpus
Method for translations without saying inferable predicate
Future work
  A human translates inferable expressions with a certain context
  We have to deal with larger context
In this research, we described how to build the verb-phrase-aligned corpus and our investigations of the corpus. Then we proposed a method for translation without saying inferable predicates. The proposed method works well, as shown in the experimental results. For future work, we need a more detailed study of larger context.