Presentation is loading. Please wait.

Presentation is loading. Please wait.

前回の練習問題 12ページの例において,3次,4次のエントロピーを求めよ.可能であれば,n 次エントロピーを計算するプログラムを書け.

Similar presentations


Presentation on theme: "前回の練習問題 12ページの例において,3次,4次のエントロピーを求めよ.可能であれば,n 次エントロピーを計算するプログラムを書け."— Presentation transcript:

1 前回の練習問題 12ページの例において,3次,4次のエントロピーを求めよ.可能であれば,n 次エントロピーを計算するプログラムを書け.
以下を示せ I(X; Y) = H(X) – H(X | Y) I(X; Y) = I(Y; X) I(X; Y) = H(X) + H(Y) – H(X, Y) に解答例,プログラム掲載 questions which were given at the end of the previous class. compute the third and the fourth order entropies of the Markov source given in the slide 12 (of the slides used in the previous class). If possible, try to write a program which computes the n-th order entropy of a given Markov source. Probe those equations which are related to the entropy and information. Answers to these questions can be found in the web page, but sorry that the answers are given in Japanese...

2 前回の補足 マルコフ情報源の(極限)エントロピー: 各状態について,その状態を記憶のない情報源と考え,
極限エントロピー(一次エントロピー)を計算すればよい ... 厳密には「なぜこれで良いか」を考える必要がある 説明資料(証明概要)を に掲載しました some additional comments to the previous class. In the previous class, I said that the entropy of a Markov information source is computable as a weighted average of entropies of each state. Formally saying, we need to discuss the validness of this computation,. Because the proof can be lengthy, the summary of the proof is given in the web page.

3 comments to foreign students
To help understanding this class, I try to provide English notes in the PowerPoint slide file. Download ppt files from and check notes in the file. English notes will be provided here.

4 第二部:情報をコンパクトに表現する 情報に含まれるムダを削り,情報を圧縮する 情報源から発生した通報を記録するときには...
計算機で扱いやすい文字を使って表現しないといけない ⇒ 適切に符号化 (encode)する必要がある どんな符号化でもOK? できるだけ正確に記録しないとダメ (⇒ 第二部 + 第三部) できるだけ効率的に記録しないとダメ 記録に必要となる手間,時間,データ量... を節約したい Chapter 2 of this class: represent information “compactly” We would like to eliminate redundancy in information, and to “compress” the information. When we record messages generated from an information source, we need to represent the message using symbols which are convenient for computer systems. To give an appropriate symbol representation is often called as “encoding”. There are so many ways of “encoding”, some are good, some are bad. the encoding must be as “precise” as possible (this is the subject of the Chapters 2 and 3 of this class) the encoding must be as “efficient” as possible, where efficiency may refer computational cost and time for encoding/decoding, and the amount of data needed to represent the message. The amount of data is the main subject of this Chapter 2 of the class.

5 情報源符号化 情報源符号化 (source coding): 情報源から発生した通報(または通報の系列)を,
記録系や処理系で利用可能な記号系列に変換すること 正確で手間のかからない符号化 一意復号可能性,瞬時復号可能性 効率の良い(圧縮率の高い)符号化 ハフマン符号 ハフマン符号の拡張,その他の符号化法 符号化効率の限界 関連する話題 source coding; is one of encoding techniques which mainly aim for translating messages (generated from an information source) to a sequence of symbols which are available in a computer system. There are several techniques to develop good source coding: precise and easy-to-handle...we will consider properties such as uniquely decodable and promptly decodable efficient (i.e. high compression) coding - Huffman codes (the above are discussed in today’s class, and the below will be discussed in the next class)--- - extensions of Huffman codes, other source coding techniques theoretical limits of the efficiency of source codings related advanced topics 本日の講義

6 情報源符号化と用語の定義 符号 ≠符合 例:天気予報(通報は「晴」「曇」「雨」)の符号化
記録系で利用できる記号(アルファベット)は 0 または 1 通報 符号語 00 10 11 符号 ≠符合 一個ずつが, それぞれ符号語 preliminaries related to source coding we define several terms using an example of weather forecasting which may generate three types of messages; “fine, F (晴)”, “cloudy, C(曇)” or “rainy, R(雨)”. We also consider that we can use only 0 and 1 in the system which we are going to use. The set {0, 1} is said to be an “alphabet” of the system. The table shows one way to represent the three messages in a currently investigating system. We are defining a one-to-one relation between messages and sequences of the alphabet. A “codeword” is a sequence of the alphabet which is related to one of messages. In the above example, 00, 10 and 11 are codewords, but 01 is not. A “code” is the set of codewords. In this case, {00, 10, 11} is the code. In the case the alphabet contains just two characters, the code is said to be a “binary” code. 通報とアルファベット系列の一対一関係を考える 符号語 (codeword):通報に対応するアルファベット系列 00 は符号語,10 は符号語,01 は符号語でない 符号 (code):符号語の集合,上例では {00, 10, 11} 2元符号... アルファベットが2種類の記号

7 符号化と復号 符号化 (encode):通報を符号語に変換すること 復号 (decode): 符号語を通報に変換すること 注意:
符号化の際,符号語と符号語の間には,分離記号を挟まない ○ 晴曇晴 ⇒ × 晴曇晴 ⇒ 分離記号が欲しい場合... 分離記号も,アルファベットや符号の定義に含めること 「空白文字でも,一文字は一文字」 “encode” is the operation to translate a message to a corresponding codeword “decode” is the operation to translate a corresponding codeword to amessage Remark: we cannot separate codewords by white spaces. To record the sequence of messages 晴曇晴 (FCF, fine-cloudy-fine), we MUST write “001000” rather than “ ”. You know, there is no “white space” in the logic-gate of the usual computer memory. If white spaces are used, then we are considering a “trinary” codes, not a binary code. In this case, we have the room to investigate the usage of the third character (i.e. white space); we might be able to use the third character much smarter than using it as a naive separator. (空白文字など)

8 符号に求められる性質 情報源符号化に求められる第一条件: 符号化した結果が,正しく一意に復号できること
符号語を,すべて異なるものにすればOK? C1 00 10 01 11 C2 011 111 C3 C4 needed property the most essential property which is required to a source code is that it is “uniquely decodable”. to achieve this requirement, we need to assign different codewords to different messages, obviously. The table shows several example of source codes which are to represent four weather messages; fine (F), cloudy (C), rainy (R) and snow (S). We can understand that C_4 is NOT “uniquely decodable”. How about the other codes? Consider to use C_3 for example; encode fine-rainy-fine (FRF) and we have 0110 encode SC and we also have 0110 Even though all codewords are different, we cannot distinguish the two message sequences from the sequence of the codewords. Thus C_3 is NOT “uniquely decodable”. C4 は明らかにダメ 他はOK? C3を使って「晴,雨,晴」を符号化すると 0110 C3を使って「雪,曇」を符号化すると 0110 ⇒ 両者の区別がつかない...一意には復号できない

9 一意復号可能性 符号が一意復号可能である (uniquely decodable): 異なる通報系列が,同一の符号語系列に符号化されないこと
C1 00 10 01 11 C2 011 111 C3 C4 C1, C2 は一意復号可能 C3, C4 は一意復号可能でない (特異符号,singular code) Uniquely decodable: should be defined with the “sequence” of codewords taken into considerations. A code is uniquely decodable if and only if the sequences of codewords which represent different sequences of messages always result in different alphabet sequences. C_1 and C_2 are uniquely decodable (the proof is needed but omitted here). C_3 and C_4 are NOT uniquely decodable. Codes which are not uniquely decodable are sometimes said to be “singular”. Both of C_1 and C_2 are uniquely decodable, but which shall be better? We will see in the next slide that C_2 is not easy-to-use. C1, C2 はどちらも一意復号可能だが...C2 はちょっと使いにくい

10 符号 C2 の問題点 C2で「晴,雪,雪,晴」を符号化 ⇒ 符号語系列は 01111110 この系列を一文字ずつ送る場合を考える
7文字目まで受け取った時点では ...復号結果として,2通りの候補が存在 C2 01 011 111 What’s wrong with C_2? Use C_2 and encode FCCF. The result is (this is a concatenation of four codewords , but we do not give separating white spaces). Assume that you receive the sequence character-by-character, and now you have received the seventh character of the sequence. You now know that the first seven characters of the sequence is but you cannot determine even the first message of the encoded weather sequences. If you receive 0 as the eighth character and this is the end of the sequence, then you can separate the eight characters into four codewords which represent FCCF. On the other hand, you me receive 1, not 0, as the eighth character. In this case the received sequence becomes , which is decomposed to representing RSS. The problem is that you cannot predict what the eight character is until you actually receive it. You need to keep the first seven characters in a certain “buffer” in the system. This makes the system complicated, and produce undesired decoding “delay”, which can cause problem in a real-time system. 晴 雪 雪 晴 次が... 雨 雪 雪 次が... 1 7文字目 8文字目以降を受け取るまで,最初の復号結果すら確定できない ⇒ 復号器内に大きなバッファが必要,大きな復号遅延の発生

11 瞬時復号可能性 符号が瞬時復号可能である (immediately decodable): 符号語系列の先読みをしなくても,復号を行えること
(符号語パターンが出現すれば,すぐに復号してよい) A B C D 01 010 011 100 左例:瞬時復号可能でない 0110 0 ⇒ (AD) 1 ⇒ (CA) The problem considered in the previous slide is related to the property of codes which is usually referred as the “immediately decodable” property. A code is immediately decodable (or promptly decodable) if we can decode a sequence of codewords (precisely, a sequence of alphabets which represents the sequence of codewords) without no look-ahead of the sequence. In this case, we can translate a codeword into a corresponding message immediately we find a codeword pattern. The table in the middle of the slide is an example of codes which is NOT immediately decodable. Even if we have 01 as a prefix of the encoded sequence, we cannot decode it to A because there is possibility that 101 is followed after 01 and in that case, the first character should be C. We have 01 at our hand, but we cannot decode it unless we look-ahead the sequence. The table bottom is another example of codes which is NOT immediately decodable. A B C D 01 10 11 011 右例:瞬時復号可能でない 0 ⇒ ? (ACB?) 1 ⇒ (DCA)

12 瞬時復号可能性と語頭条件 瞬時復号可能性は,工学上,非常に重要な性質 でも,すべての系列についてチェックするのは大変
どんなときに,瞬時復号可能でなくなるか? ⇒ ある符号語が,他の符号語の最初の部分と一致するとき 語頭となる A B C D 01 010 011 100 10 11 The immediately decodable property is significant in engineering aspects. We will be happy if there is an easy way to check if a given code satisfies this property or not, and we do have such a good criterion. A code is NOT immediately decodable if and only if there is a pair of codewords c_1 and c_2 such that c_1 is a prefix of c_2. Equivalently, a code is immediately decodable if and only if there is no pair of codewords c_1 and c_2 such that c_1 is a prefix of c_2. This is often said to be a prefix condition. The tables review the two example given in the previous slides. They are NOT immediately decodable, and we can see that they both violate the prefix condition. 符号Cが瞬時復号可能となる ⇔ Cのどの符号語も,他の符号語の語頭とならないこと (語頭条件,prefix condition)

13 余談:語頭条件といえば... 語頭条件の確保は,システム設計全般にわたって重要 PDAの手書き入力用文字ストローク Graffiti 1
break... The prefix condition is important in many applications, and we should always think about it, especially when we design a new “coding” scheme. In a decade ago, PDA terminals such as Palm terminals come to the market. To help users input characters rapidly, they define a special user interface which is similar to a shorthand writing. A user gives these strokes using a pen-like stylus at a designated input area. In the beginning, the strokes are so simple and all characters are associated with only one-stroke patterns. Later, they version-up the graffiti system, and several characters have two-stroke patterns as in this figure. We can see that the stroke for the “minus (dash)” is a prefix of the stroke for the “equal symbol”. To input two minus symbols in a row, we need to draw this pattern, wait a while, and then draw this pattern again. If the pause is too short, then the device recognizes the two strokes are for single equal symbol, not for two minus symbols. A user needs to control the waiting time, which is very painful when we would like to input characters in haste. This is a good example that the violation of the prefix condition make the system terrible one. Graffiti 1 すべて一筆書き prefix condition を満たす Graffiti 2 二筆文字も出現 prefix condition を満たさない

14 語頭条件を満たす符号の作り方 語頭条件を満たす符号の単純な作り方: すべての符号語を,同じ長さを持つ相違なる系列とする
(等長符号,equal-length code) 符号語の最後(最初)に特別なパターンを配置する (コンマ符号,comma code) ⇒ どちらも,取り扱いは楽だが効率が悪い 一工夫して,必要最小限の長さを持つ符号語を選んでくる (非等長符号,unequal-length code) 符号木を利用した構成法が便利 how to construct an immediately decodable code there are two simple algorithms for constructing immediately decodable codes: let codewords be different sequences with the same length (equal-length code) give a special pattern at the end of each codewords (comma code) both easy, but not efficient. We will see later that it is advantageous to allow codewords to have different length (such a code is called an unequal-length code). To construct an unequal-length immediately decodable code, make use of code trees.

15 符号木 a元符号の符号木 (code tree): 各節点が,最大 a 個の子を持つことのできる木構造 根...親を持たない節点
葉...子を持たない節点 code tree of an a-ary code a rooted-tree with each node can have “a” children at most the root is a special node which has no parent leaves are nodes which do not have a child

16 符号木を利用した符号の構成法 M 個の符号語を持つ符号の構成手順: 葉の数がM個であるような符号木 T を任意に構成する
Tの各枝に,0 から a の記号をラベル(名前)付けする ただし,一つの親が同一ラベルの枝を複数持ってはならない 根から葉までの経路を考える 経路上のラベルを順に連接したものが,一個の符号語となる construction of a code using a code tree Assume that we would like to construct an immediately decodable code which has M codewords. In this case, follow the three steps below. step 1. arbitrary construct a code tree T which has M leaves step 2. give labels between 0 and (a – 1) to the edges of T so that sibling edges have different labels. step 3. for each leaf node, define a codeword by concatenating labels on the edges from the root to the leaf

17 符号の構成例 符号語数が4の2元瞬時復号可能符号を構成する Step 1 Step 2 Step 3 1 1 00 01 10 11
1 1 00 01 10 11 example we construct a code with four codewords, which is immediately decodable. the figure illustrates the steps in the previous slides, and we have a code {00, 01, 10, 11}. this is immediately decodable. 構成された符号は {00, 01, 10, 11}

18 符号の構成例 符号語数が4の瞬時復号可能符号を構成する 符号木の選び方,ラベルの付け方には自由度がある 1 1
1 1 C1={0, 10, 110, 111} We can choose any code tree. These code tress all have four leaves, and we can choose any tree we like. Evan for a single code tree, we may label in different ways. Codes constructed in this way obviously satisfies the prefix condition, and they are all immediately decodable. C2={0, 11, 101, 100} 1 C3={01, 000, 1011, 1010} どのようにしても語頭条件を満たす ⇒ 瞬時復号可能な符号

19 符号語の長さと符号の効率 C1={0, 10, 110, 111} 1 1 C3={01, 000, 1011, 1010}
1 1 C1={0, 10, 110, 111} C3={01, 000, 1011, 1010} Now consider two of code examples in the previous slide. They are both immediately decodable and have four codewords. If you can choose one of C_1 or C_3, which do you choose? Maybe you will choose C_1, because it gives “more compact” representation than C_3. Here are simple questions: can we make an immediately decodable code whose codewords are very short, say, all codewords have length one? No, we cannot. The number of sequences with length one is just two, which is too short to accommodate four codewords. -the same question but we allow codewords to have length 1, 2, 2, 3 bits, which is slightly better than bits of C_1. ... Can we make it? Is there any criterion with which we can construct an immediately decodable codes? C1 vs. C3...どちらも,符号語数4の瞬時復号可能な符号 個々の符号語が短いほうが,効率が良い(コンパクトな表現) 4つの符号語の長さを,すべて 1 にできるか? 4つの符号語,長さを 1, 2, 2, 3 にできるか? 瞬時復号可能な符号が構成できる条件は?

20 クラフトの不等式 C = {w1, ..., wM}, |wi| = li であるa元符号とする
:クラフトの不等式 (Kraft’s inequality) Kraft’s inequality Let C = {w_1, ..., w_M} be an a-ary code and assume that the codeword w_i has l_i bits in length. Theorem: If C satisfies the prefix condition, then the inequality at the middle of this slide holds. The inequality is called the Kraft’s inequality. The converse of the theorem does not always hold, but... if l_1, ..., l_M satisfy the Kraft’s inequality, then we can construct a code which is immediately decodable and whose codewords have length l_1, ..., l_M bits in length. 定理の逆は必ずしも成立しないが... クラフトの不等式を満たす符号語長l1, ..., lMが与えられたとき, 語頭条件を満たし,この符号長をもつ符号を構成可能

21 符号語長からの符号構成 符号語長が1, 2, 3, 3 である瞬時復号可能な符号を作りたい
⇒ 深さ 1, 2, 3, 3 に葉がくるように符号木 T を構成すれば良い 1 C1={0, 10, 110, 111} example 1: construct an immediately decodable code whose codewords have length 1, 2, 3, 3 bits. ==> first construct a code tree whose leaves are positioned at the desired depth, and define a code using this code tree. In this case, C_1 = {0, 10, 110, 111} is a one of desired codes. example 2: construct an immediately decodable code whose codewords have length 1, 2, 2, 2 bits. ==> impossible. Put these code length to the left-hand side of Kraft’s inequality, and it goes more than 1. 符号語長が1, 2, 2, 2 である瞬時復号可能な符号を作りたい 2–1 + 2–2 + 2–2 + 2–2 = = 1.25 > 1 ⇒ 無理

22 符号の効率について ここまでの議論...取り扱い上の利点を持つ符号の構成方法 ここからの議論...取り扱い上の利点 + 効率の良さに着目 晴
確率 0.4 0.3 0.2 0.1 C1 10 110 111 C2 C1とC2,どちらが良い符号? 「晴晴曇雨晴」を符号化してみる C1: C2: We have discussed the codes which are “easy-to-handle”, and now we move our attention to the efficiency, where the efficiency here means that the “compactness”. Consider two codes C_1 and C_2. They have the same set of codewords, but codewords are associated to messages in a different way. In C_1, the codeword “0” is given to 晴 (Fine) which can occur with probability 0.4. For the same message, C_2 gives the codeword “111”. Assume that we have a typical sequence of messages which follow these probability distribution. The sequence should contain many 晴 (Fine), whose codeword representation is short in C_1, long in C_2. In this case, C_1 gives 8-bit representation, while C_2 gives 14-bit representation. If we need one-dollar to transmit one bit symbol, then we can save six dollars by using C_1. C_1 can compactly represent messages which are generated from this information source. A quantitative measure of the compactness is an “average codeword length” (which is sometimes referred as “average code length”, though this terminology is slightly confusing and misleading). C_1 uses a codeword with length one (i.e. codeword “0”) with probability 0,4, with length 2 (i.e. codeword “10”) with probability 0.3, and so on. In average, the length of the codeword is 1.9 bits. For C_2, the value is 2.6, which suggest that C_1 gives more copact representation than C_2. 8記号 14記号 C1のほうが,通報をコンパクトに表現可能 コンパクトさの定量的指標... 符号長の加重平均(平均符号長) C1:0.4 × × × ×3 = 1.9 C2:0.4 × × × ×1 = 2.6

23 情報源符号化の指標 良い符号の三条件: 一意復号可能であること 瞬時復号可能であること 平均符号長ができるだけ短いこと
最初2つの条件だけなら,符号木を使うだけで構成可能 最後の条件を満たすため,符号木の作り方に一工夫する ⇒ ハフマン符号 (Huffman code) とくに断らない限り,記憶のない定常情報源を考える Summary: three conditions for a code to be good: uniquely decodable immediately decodable short average code length To pursuit the last condition is the subject of the following discussion. For this sake, we construct a code tree according to a special algorithm. The obtained code is called a Huffman code. Meanwhile we consider memoryless information source only.

24 ハフマン符号 一般的な性質ではなく,符号木の構成法によって定式化: 各通報に対して節点を準備し,その発生確率を付与する
節点はそれぞれ,大きさ1の木であると考える 木の中で発生確率最小のものを2つ選出し,以下を行う 2つの木の根節点を子とするような,新しい根節点を作る 新しい根節点と古い根節点を結ぶ枝に,0, 1 をラベル付け 新しい根節点に,2つの木の発生確率の和を与える すべての節点がつながるまで,2 の操作を繰り返す Huffman code is not a name of a specific code. It is a name of codes whose code tree is generated according to a special algorithm (Huffman algorithm). The Huffman algorithm is an iterative algorithm whose basic computation has just three steps. step 1: Prepare a node for each message. The node is associated with a probability of the message. A node can be regarded as a tree with size one, and we have prepared a set of trees. step 2: Find two trees which are associated with the smallest probability and the second smallest probability. Adjoin the two trees by introducing a new node which has the two trees as its children. Give labels 0 and 1 to the newly defined edges (from the new node), and associate this new tree with the probability which is the sum of the probabilities of the two trees. step 3: continue the step 2 until all the trees are adjoined to one big tree.

25 ハフマン符号の構成例 0.05 D 0.1 C 0.25 B 0.6 A 0.05 D 0.1 C 0.15 0.25 B 0.6 A 0.05 D 0.1 C 0.15 0.25 B 0.4 0.6 A example. There maybe a certain analogy to the consolidation of companies. Two smallest companies in the market are consolidated, and become one company whose capital money is the sum of the capitals of the original two companies. The consolidation continues until big one company monopolies the market. The obtained tree is regarded as the “code tree” which we have consider to construct immediately decodable codes. 0.05 D 0.1 C 0.15 0.25 B 0.4 0.6 A 1.0 1 資本金による合併劇,と考えるとわかりやすいかも...

26 練習問題 A B C D E 確率 0.2 0.1 0.3 符号語 A B C D E F 確率 0.3 0.2 0.1 符号語
exercise construct Huffman codes for these information source, and compare its average codeword length to the equal-length code. 等長符号の場合と平均符号長を比較すると...

27 ハフマン符号構成上の自由度 ハフマン符号の構成においては, ラベル割当の自由度 確率の等しい節点選択の自由度 が存在するが...
⇒ どのように選択しても,平均符号長は変わらない Huffman codes are not unique: We have some degree of freedom in constructing a Huffman code. the choice of edge labels the choice of trees which have the same probability The average codeword length is NOT affected by these choices. A B C D E F 確率 0.3 0.2 0.1 符号語

28 本日のまとめ 情報源符号化に関する基本概念の整理 一意復号可能性 瞬時復号可能性 クラフトの不等式 符号木利用による符号構成 ハフマン符号
加重平均符号長による評価 具体的な符号構成法 Summary of today’s class: basic properties of source coding - uniquely decodable - immediately decodable - Kraft’s inequality - code construction using code trees -Huffman codes - average codeword length as a quantitative measure of the compactness - code construction algorithm

29 練習問題 右の情報源を効率よく符号化できる, 2元ハフマン符号を構成せよ 上で構成した符号の平均符号長を 求めよ
4元ハフマン符号の構成法を検討し, 右の情報源に適した符号を構成せよ A B C D E F G H 確率 0.363 0.174 0.143 0.098 0.087 0.069 0.045 0.021 quiz to check your understanding construct a Huffman code which is designed for the given information source. compute the average codeword length for the constructed Huffman code investigate an algorithm to construct 4-ary Huffman codes, and construct a 4-ary Huffman code for the above information source


Download ppt "前回の練習問題 12ページの例において,3次,4次のエントロピーを求めよ.可能であれば,n 次エントロピーを計算するプログラムを書け."

Similar presentations


Ads by Google