
Exercises from the last lecture (answers)

Message      A      B      C      D      E      F      G      H
Probability  0.363  0.174  0.143  0.098  0.087  0.069  0.045  0.021

Q1) Construct a binary Huffman code for the messages in the table, where the numbers are the probabilities with which the messages occur.
A1) One possible code, read off the code tree shown on the slide, is:

Message    A  B    C    D     E     F     G      H
Codeword   0  100  110  1010  1011  1110  11110  11111

Note that a Huffman code is not unique; you may have obtained a code different from the one above.

Q2) Compute the average codeword length of the constructed code.
A2) 0.363×1 + 0.174×3 + ... + 0.021×5 = 2.660 bits per message. Even if you constructed a different Huffman code, its average codeword length must still come out to 2.660.
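As a cross-check of A2), a heap-based construction along the following lines reproduces the 2.660 figure. This is a minimal sketch; the function name huffman_lengths and the list-of-indices bookkeeping are illustrative choices, not from the lecture materials.

```python
import heapq
from itertools import count

def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code for the given probabilities."""
    tie = count()                      # tie-breaker so tuples never compare the lists
    heap = [(p, next(tie), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        for i in a + b:                # each merge adds one bit to every leaf below it
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, next(tie), a + b))
    return lengths

probs = [0.363, 0.174, 0.143, 0.098, 0.087, 0.069, 0.045, 0.021]
lengths = huffman_lengths(probs)
print(sum(p * l for p, l in zip(probs, lengths)))   # ≈ 2.660 bit per message
```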

Exercises from the last lecture (answers, continued)

Message      A      B      C      D      E      F      G      H
Probability  0.363  0.174  0.143  0.098  0.087  0.069  0.045  0.021

Q3) Construct a 4-ary Huffman code.
- Can we simply merge the four lowest-probability trees each time? ⇒ we may run short of nodes at the very end.
- One merge removes 4 – 1 = 3 trees, so to end with exactly one tree we need 3k + 1 nodes at the start.
- Add two dummy nodes of probability 0, so that the number of messages becomes 10 = 3×3 + 1.

A3) Basically we construct a code tree of degree four (each node can have at most four children). The point to watch is the number of trees handled during the execution of the Huffman algorithm. At the beginning we have eight trees (eight nodes), one per message. If we simply select four trees and adjoin them, the number of trees becomes five (= 8 – 4 + 1), and after the next iteration it becomes two (= 5 – 4 + 1). The last iteration then merges only two trees, producing a code tree whose root has only two children. A code tree obtained this way is not good, because it does not make full use of the "shallow node positions": a leaf at a shallow position yields a short codeword, so leaving shallow positions unused is a poor strategy. To avoid this, we add dummy nodes before starting the Huffman algorithm; here we add two dummy nodes so that the number of trees has the form 3k + 1. The number of trees then decreases by three in each iteration, and in the final iteration exactly four trees remain, which lets us build a code tree whose shallow positions are fully occupied. With code symbols {a, b, c, d}, the resulting code from the slide's tree is:

Message    A  B  C  D   E   F   G    H
Codeword   a  b  c  da  db  dc  dda  ddb
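The number of dummy messages needed can be computed directly from the number of messages and the code alphabet size; a one-function sketch (the helper name is hypothetical):

```python
def num_dummies(num_messages, arity):
    """Dummy messages (probability 0) needed so that repeated merges of `arity`
    trees, each removing (arity - 1) trees, end with exactly one tree."""
    r = (num_messages - 1) % (arity - 1)
    return 0 if r == 0 else (arity - 1) - r

print(num_dummies(8, 4))   # 2, as in the slide
```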

Today's lecture
- Performance of the Huffman code and its limits
- Extensions of Huffman coding
  - Block coding and the average codeword length (blocking helps reduce the average codeword length)
  - Shannon's source coding theorem: the theoretical limit of source coding (data compression)
  - Block Huffman coding with unequal block lengths
  - Run-length Huffman coding

About the average codeword length

Properties required of a source code:
- it should be immediately decodable (which automatically implies unique decodability);
- its average codeword length should be as small as possible.
⇒ The Huffman code is a strong candidate. (Immediately decodable codes form a subclass of the uniquely decodable codes, and Huffman codes belong to this subclass.)

A Huffman code seems to be a good code, but just "how good" is it? To answer this question, we investigate a theoretical bound on the average codeword length.

A bound on the average codeword length

Consider encoding the messages generated by a stationary information source S, one message at a time, with an immediately decodable code C. There are M messages, occurring with probabilities p_1, ..., p_M.

[Theorem]
Part 1: for any such code, the average codeword length satisfies L ≥ H_1(S); no matter how efficient the code is, its average codeword length is at least the first-order entropy.
Part 2: a code with average codeword length L < H_1(S) + 1 can be constructed; that is, the average codeword length can be brought to within one bit of the first-order entropy.

(A proof is given later.) The theorem says that we cannot make the average codeword length smaller than the entropy, and, interestingly, this lower bound is essentially achievable: we can construct a code whose average codeword length is smaller than the entropy plus one.

Shannon's lemma

To prove the theorem, we introduce one auxiliary result, named after Shannon (Shannon's lemma). In the lemma, p_1, ..., p_M are the probabilities of the messages of S.

[Lemma] For any positive numbers q_1, ..., q_M satisfying q_1 + ... + q_M ≤ 1, the inequality shown below holds, with equality if and only if p_1 = q_1, ..., p_M = q_M.
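The inequality itself appeared as an image on the slide; in the usual statement of this lemma, which is exactly the form used in the proofs that follow, it reads:

```latex
-\sum_{i=1}^{M} p_i \log_2 q_i \;\ge\; -\sum_{i=1}^{M} p_i \log_2 p_i .
```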

Proof of Shannon's lemma

Proof: consider (left-hand side) – (right-hand side) of the inequality and convert the logarithms to base e. As the graph of y = –log_e x against the line y = 1 – x shows, –log_e x ≥ 1 – x for all x > 0, so each log term is bounded from below, and we conclude that (lhs – rhs) ≥ 0. Equality holds if and only if q_i / p_i = 1 for i = 1, ..., M.
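Spelled out, the computation sketched on the slide uses ln x ≤ x – 1 (equivalently –ln x ≥ 1 – x) together with Σ p_i = 1 and Σ q_i ≤ 1:

```latex
\text{(lhs)} - \text{(rhs)}
= \sum_{i=1}^{M} p_i \log_2 \frac{p_i}{q_i}
= \frac{1}{\ln 2}\sum_{i=1}^{M} p_i\!\left(-\ln\frac{q_i}{p_i}\right)
\;\ge\; \frac{1}{\ln 2}\sum_{i=1}^{M} p_i\!\left(1-\frac{q_i}{p_i}\right)
= \frac{1}{\ln 2}\!\left(1-\sum_{i=1}^{M} q_i\right)
\;\ge\; 0 .
```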

Proof of the bound (part 1)

We return to the theorem of slide 5 and prove part 1 first. Let l_1, ..., l_M be the codeword lengths of C and define q_i = 2^(–l_i). Because C is immediately decodable, Kraft's inequality gives q_1 + ... + q_M ≤ 1. Since l_i = –log2 q_i, the average codeword length L can be written in terms of the q_i, and applying Shannon's lemma the right-hand side is bounded from below by the entropy, so L ≥ H_1(S), with equality if and only if p_1 = q_1, ..., p_M = q_M. (End of the proof of part 1.)
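The chain of equalities and inequalities that the slide displayed as an image is the standard one:

```latex
L = \sum_{i=1}^{M} p_i\, l_i
  = -\sum_{i=1}^{M} p_i \log_2 q_i
  \;\ge\; -\sum_{i=1}^{M} p_i \log_2 p_i
  = H_1(S) .
```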

Proof of the bound (part 2)

Choose integers l_1, ..., l_M so that –log2 p_i ≤ l_i < –log2 p_i + 1 for each i. Then 2^(–l_i) ≤ 2^(log2 p_i) = p_i, so the lengths l_1, ..., l_M satisfy Kraft's inequality, and an immediately decodable code with these codeword lengths can be constructed. The average codeword length of this code is bounded above by H_1(S) + 1, as shown below. (End of the proof of part 2.)
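Written out, the construction and the final bound (reconstructed from the surrounding text) are:

```latex
-\log_2 p_i \;\le\; l_i \;<\; -\log_2 p_i + 1
\quad\Longrightarrow\quad
\sum_{i=1}^{M} 2^{-l_i} \le \sum_{i=1}^{M} p_i = 1 ,
\qquad
L = \sum_{i=1}^{M} p_i\, l_i \;<\; \sum_{i=1}^{M} p_i\,(-\log_2 p_i + 1) = H_1(S) + 1 .
```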

The bound and the Huffman code

[Theorem, restated] Part 1: for any code, the average codeword length satisfies L ≥ H_1(S). Part 2: a code with L < H_1(S) + 1 can be constructed.

The Huffman code always satisfies part 2 of the theorem, and in fact an even stronger statement can be proved: no immediately decodable code has a smaller average codeword length than the Huffman code. Because of this property, the Huffman code is said to be a compact code. The proof is by induction on the size of the code tree and is omitted here.

Huffman coding and "blocks"

So far, Huffman coding converts one message into one codeword. Since every codeword has length at least one, the average codeword length is always at least 1, so this scheme is not suited to encoding a binary information source. For example, with messages A and B occurring with probabilities 0.8 and 0.2, the only choices are codes such as C1 (A→0, B→1) or C2 (A→1, B→0), and both have average codeword length 1.0.

If we encode several messages together as one block (block coding), we may be able to bring the average codeword length per message below 1, which makes the scheme usable for binary information sources as well: instead of assigning a codeword to a single symbol, we assign one codeword to a block of messages such as "ABA".

Overview of block coding

  message sequence:          ABCBCBBCAA...
  blocking (equal-length, or unequal-length: block partitioning, run-length, ...)
  blocked message sequence:  AB CBC BB CAA...
  Huffman coding
  codeword sequence:         01 10 001 1101...

This slide sketches the overall process. We first separate a sequence of messages into blocks. Many algorithms can do this, but they fall into two types: equal-length and unequal-length blocking; the unequal-length algorithms come in several variations, such as the block-partitioning scheme and the run-length scheme. Either way, we obtain a sequence of blocks of messages. If we know the statistics of the original information source, we can compute the statistics of this blocked information source, and from those statistics we can construct a Huffman code optimized for it.

Equal-length block coding: example (1)

Message      A    B    C
Probability  0.6  0.3  0.1
Codeword     0    10   11

Encoding one message at a time with the Huffman code above gives average codeword length 0.6×1 + 0.3×2 + 0.1×2 = 1.4 bits per message.

Now consider blocks of length 2, treating e.g. "AA" as a single message. Assuming the source is memoryless, P(AA) = 0.6×0.6 = 0.36, and the probabilities of the other block patterns are computed in the same way:

Block        AA    AB    AC    BA    BB    BC     CA    CB      CC
Probability  0.36  0.18  0.06  0.18  0.09  0.03   0.06  0.03    0.01
Codeword     0     100   1100  101   1110  11110  1101  111110  111111

A Huffman code for these nine blocks has average codeword length 0.36×1 + 0.18×3 + ... + 0.01×6 = 2.67 bits, but this covers two message symbols, so the cost per message is 2.67 / 2 = 1.335 bits: the efficiency has improved slightly.
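The figures 1.4 and 1.335 above (and 0.78 and 0.728 on the next two slides) can be reproduced with a short script. This is a sketch assuming a memoryless source; the function names are illustrative:

```python
import heapq
from itertools import count, product

def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code for the given probabilities."""
    tie = count()
    heap = [(p, next(tie), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        for i in a + b:                       # each merge adds one bit to every leaf below
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, next(tie), a + b))
    return lengths

def rate_per_message(source, n):
    """Average codeword length per message for length-n block Huffman coding
    of a memoryless source given as {symbol: probability}."""
    probs = []
    for block in product(source.keys(), repeat=n):
        p = 1.0
        for sym in block:
            p *= source[sym]
        probs.append(p)
    lengths = huffman_lengths(probs)
    return sum(p * l for p, l in zip(probs, lengths)) / n

print(rate_per_message({'A': 0.6, 'B': 0.3, 'C': 0.1}, 1))   # 1.4
print(rate_per_message({'A': 0.6, 'B': 0.3, 'C': 0.1}, 2))   # ≈ 1.335
print(rate_per_message({'A': 0.8, 'B': 0.2}, 3))             # ≈ 0.728 (next slides)
```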

Equal-length block coding: example (2-1)

Message      A    B
Probability  0.8  0.2
Codeword     0    1

For this binary information source, encoding one message at a time gives average codeword length 0.8×1 + 0.2×1 = 1.0, as we saw on the previous slides.

Now consider blocks of length 2: P(AA) = 0.8×0.8 = 0.64, and so on.

Block        AA    AB    BA    BB
Probability  0.64  0.16  0.16  0.04
Codeword     0     10    110   111

The average codeword length becomes 0.64×1 + 0.16×2 + ... + 0.04×3 = 1.56, i.e. 1.56 / 2 = 0.78 bits per message. Even for a binary source, whose output alphabet is again binary, blocking lets us "compress" the data.

Equal-length block coding: example (2-2)

Continuing from the previous slide, consider blocks of length 3: P(AAA) = 0.8^3 = 0.512, and so on.

Block        AAA    AAB    ABA    ABB    BAA    BAB    BBA    BBB
Probability  0.512  0.128  0.128  0.032  0.128  0.032  0.032  0.008
Codeword     0      100    101    11100  110    11101  11110  11111

The average codeword length is 0.512×1 + ... + 0.008×5 = 2.184, i.e. 2.184 / 3 = 0.728 bits per message — reduced once again.

Block length                    1    2     3
Avg. codeword length / message  1.0  0.78  0.728

Making the block length larger makes the average codeword length per message smaller (the efficiency improves), so we may conjecture that using longer blocks keeps reducing the average codeword length.

Average codeword length of block codes

Making the block length larger makes the per-message average codeword length smaller — but how small can it get? A block of n messages of S can be regarded as a single message of the n-th order extended information source S^n. In other words, we are defining one codeword for each message of S^n, so the bound introduced in the first half of this lecture (the theorem on slide 5) applies in this context. Letting L_n be the average codeword length of the Huffman code defined for blocks of length n, we have H_1(S^n) ≤ L_n < H_1(S^n) + 1. Converting to the average codeword length per message gives the inequality shown below.
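Dividing the inequality H_1(S^n) ≤ L_n < H_1(S^n) + 1 through by n gives the per-message form that the slide refers to:

```latex
\frac{H_1(S^n)}{n} \;\le\; \frac{L_n}{n} \;<\; \frac{H_1(S^n)}{n} + \frac{1}{n} .
```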

Shannon's source coding theorem

H_1(S^n) / n is the n-th order entropy H_n(S) of the source S. As n → ∞, H_1(S^n) / n → H(S), the limiting entropy of S, while 1 / n → 0. This is summarized in the famous result:

[Shannon's source coding theorem] The messages generated by a source S can be encoded with an immediately decodable code whose average codeword length per message satisfies H(S) ≤ L < H(S) + e, where e is an arbitrarily small positive value.

What the source coding theorem means

[Shannon's source coding theorem, restated] The messages generated by a source S can be encoded with an immediately decodable code whose average codeword length per message satisfies H(S) ≤ L < H(S) + e, with e an arbitrarily small positive value.

In terms of efficiency, taking longer and longer blocks and using a Huffman code is the best we can do; but no matter how hard we try, the "great wall" of the limiting entropy cannot be crossed. For the source with P(A) = 0.8, P(B) = 0.2 (codewords A→0, B→1 without blocking):

Block length                    1    2     3     ...
Avg. codeword length / message  1.0  0.78  0.728 ...  → H(S) + e, where H(S) = 0.722

Practicality of block coding

From the efficiency point of view, blocking plus Huffman coding is the best approach: by making the block length large enough we can get arbitrarily close to the theoretical limit. That is a theoretical statement, however; in practice there are drawbacks to using long blocks:
- the probabilities of the messages generated by the source must be known in advance (this issue is discussed in the next lecture);
- the definition of the code (the table mapping messages to codewords) becomes huge, since encoders and decoders must store the correspondence between blocks and codewords. If one byte is needed to store one entry of the table, then for a binary source the table size is roughly:
  block length 8  ... 256 bytes
  block length 16 ... 64 kilobytes
  block length 32 ... 4 gigabytes
The table size directly affects the size of the implementation, such as circuits or programs, so in practice it is difficult to obtain good codes with anything but short block lengths.

Unequal-length block coding

The block coding considered so far uses patterns of a single fixed length, so a codeword must be defined even for patterns whose probability is very small — this is what makes the table huge. As long as we enumerate all blocks of a fixed length, the table unavoidably contains such "useless" entries.

We can get around this by allowing the encoded patterns to have different lengths: unequal-length blocking. In defining the set of patterns we must take care that
- the probabilities of the patterns are reasonably well balanced, and
- every message sequence can be split into patterns from the set.
How to define the patterns is the key question.

How to define the patterns: block partitioning

Strategy 1: block partitioning (refining the most frequent patterns).
1. Start with the patterns of length one.
2. Take the pattern with the highest probability and split it by appending one symbol at its end; this replaces it with several patterns of smaller probability.
3. Repeat step 2 until the desired number of patterns is reached (i.e. as long as the table is allowed to grow).

Example: P(A) = 0.8, P(B) = 0.2, desired number of patterns 4:

  {A, B}            with probabilities 0.8, 0.2
  {AA, AB, B}       with probabilities 0.64, 0.16, 0.2
  {AAA, AAB, AB, B} with probabilities 0.512, 0.128, 0.16, 0.2

Any sequence of A's and B's can be split into blocks taken from {AAA, AAB, AB, B}; a sketch of this refinement loop is shown below.
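A small sketch of the refinement loop (illustrative names; it assumes a memoryless source, so the probability of an extended pattern is simply the product of symbol probabilities):

```python
def partition_patterns(source, table_size):
    """Block-partitioning strategy: repeatedly split the most probable pattern
    by appending each source symbol, until the table has (about) table_size entries."""
    patterns = dict(source)                       # start from patterns of length one
    while len(patterns) < table_size:
        best = max(patterns, key=patterns.get)    # most probable pattern
        p = patterns.pop(best)
        for sym, q in source.items():             # split: best followed by each symbol
            patterns[best + sym] = p * q
    return patterns

print(partition_patterns({'A': 0.8, 'B': 0.2}, 4))
# {'B': 0.2, 'AB': 0.16, 'AAA': ≈0.512, 'AAB': ≈0.128}, matching the slide
```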

Average codeword length with unequal-length blocks

Pattern      AAA    AAB    AB    B
Probability  0.512  0.128  0.16  0.2
Codeword     0      100    101   11

Define a Huffman code for the patterns of the previous slide, as in the table above (read it so that the pattern "AAB", which occurs with probability 0.128, is assigned the codeword "100", and so on). Does unequal-length blocking actually give short codes? To answer this we must compute the average codeword length, which is slightly more involved than in the equal-length case.

Suppose the source emits many message symbols and the sequence is split into n blocks; each block is encoded into one codeword, giving n codewords. Since the block patterns occur with the probabilities in the table, we can expect about 0.512n of the n codewords to be "0", a codeword of length one, and similarly for the others, so the codeword sequence has expected length
  0.512n×1 + 0.128n×3 + 0.16n×3 + 0.2n×2 = 1.776n bits.
On the input side, the encoder receives those n blocks, so the message sequence has expected length
  0.512n×3 + 0.128n×3 + 0.16n×2 + 0.2n×1 = 2.44n symbols.
In summary, the encoder receives 2.44n symbols and outputs 1.776n bits, i.e. 1.776n / 2.44n = 0.728 bits per message symbol — about the same efficiency as the equal-length code with block length 3 (whose table has 8 rows, double the size; cf. p.15).
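The 0.728 figure can be checked mechanically; a few lines in the same Python style (the dictionary simply restates the table above):

```python
patterns = {'AAA': (0.512, 1), 'AAB': (0.128, 3), 'AB': (0.16, 3), 'B': (0.2, 2)}
# each entry: pattern -> (probability of the pattern, codeword length in bits)

bits_per_block    = sum(p * clen       for p, clen in patterns.values())
symbols_per_block = sum(p * len(block) for block, (p, _) in patterns.items())
print(bits_per_block / symbols_per_block)   # ≈ 0.728 bit per message symbol
```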

How to define the patterns: using run lengths

Strategy 2: define the patterns by runs of a symbol (run-length coding). A run is a subsequence consisting of a single repeated symbol. Consider the sequence

  A B B A A A A A B A A A B

which contains many A's and only a few B's. The first two characters "AB" can be regarded as a run of A of length 1, because one A is followed by B; the "B" in the third position is a run of A of length 0 (zero A's followed by B); then come a run of A of length 5 and a run of A of length 3. If the length of every run of the chosen symbol is given, the original sequence can be reconstructed — so let us encode the run lengths instead of the symbols themselves.

An upper limit on the run length

Very long runs can occur, and it is not practical to define a codeword for every possible run length, because very long runs are rarely used. We therefore put an upper limit on the run length and express longer runs as concatenations of shorter ones. With the limit set to 3, we use four run patterns of A: length 0, length 1, length 2, and "3+" (three A's seen, with the character after them not yet read). Runs are then represented as:

  Run length      1  2  3      4      5      6         7         ...
  Representation  1  2  3+, 0  3+, 1  3+, 2  3+, 3+, 0  3+, 3+, 1  ...

For example, ABBAAAAABAAAB is expressed as: one A then B; zero A's then B; three or more A's; two A's then B; three or more A's; zero A's then B. Any string can be represented as a concatenation of such runs of A.

Run-length Huffman code

A run-length Huffman code is a Huffman code defined for the run lengths of a particular symbol in the message sequence. Viewed differently, each run of the previous slide is itself a block pattern, so this is one realization of unequal-length block coding. It is especially effective when the message probabilities are strongly biased. For p(A) = 0.9, p(B) = 0.1:

  Run length   0    1     2      3 or more
  Pattern      B    AB    AAB    AAA
  Probability  0.1  0.09  0.081  0.729
  Codeword     10   110   111    0

Encoding examples with this code:

  ABBAAAAABAAAB:  1, 0, 3+, 2, 3+, 0  ⇒ 110 10 0 111 0 10
  AAAABAAAAABAAB: 3+, 1, 3+, 2, 2     ⇒ 0 110 0 111 111
  AAABAAAAAAAAB:  3+, 0, 3+, 3+, 2    ⇒ 0 10 0 0 111
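A compact sketch of this encoder, reproducing the three examples above (the cap of 3 and the codeword table restate the slide; run_symbols is an illustrative name):

```python
def run_symbols(msg, cap=3):
    """Split a message string into run-length symbols for runs of 'A', capped at cap."""
    out, run = [], 0
    for c in msg:
        if c == 'A':
            run += 1
            if run == cap:            # cap reached: emit "3+" and keep reading
                out.append(f"{cap}+")
                run = 0
        else:                         # a 'B' terminates the current run of A's
            out.append(str(run))
            run = 0
    return out

codeword = {'0': '10', '1': '110', '2': '111', '3+': '0'}   # table from the slide

for msg in ["ABBAAAAABAAAB", "AAAABAAAAABAAB", "AAABAAAAAAAAB"]:
    syms = run_symbols(msg)
    print(msg, syms, ' '.join(codeword[s] for s in syms))
```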

Encoding examples

Let S be a memoryless stationary binary source with p(A) = 0.9, p(B) = 0.1. Its entropy is H(S) = –0.9 log2 0.9 – 0.1 log2 0.1 = 0.469. To summarize today's lecture, we compare several encoding techniques for this source.

Plain Huffman coding (one message at a time): A→0, B→1, so the average codeword length is 1.

Equal-length block Huffman coding (n = 3):

  Block        AAA    AAB    ABA    ABB    BAA    BAB    BBA    BBB
  Probability  0.729  0.081  0.081  0.009  0.081  0.009  0.009  0.001
  Codeword     0      100    110    1010   1110   1011   11110  11111

Average codeword length 1.661, i.e. 1.661 / 3 = 0.55 per message: a clear improvement, at the cost of managing a fairly large code table.

Encoding examples (continued)

Run-length Huffman coding (with the number of codewords set to 8): we allow run lengths 0 to 6 plus "7+", using the capping technique of p.22, so this is again unequal-length block coding.

  Run length   0     1     2      3      4      5      6      7+
  Probability  0.1   0.09  0.081  0.073  0.066  0.059  0.053  0.478
  Codeword     110   1000  1001   1010   1011   1110   1111   0

For a sequence of n runs, the codeword sequence has expected length 0.1n×3 + ... + 0.478n×1 = 2.466n bits, while the message sequence has expected length 0.1n×1 + ... + 0.053n×7 + 0.478n×7 = 5.216n symbols (a run of length k consists of k A's followed by one B, hence k + 1 message symbols; the pattern "7+" consumes 7 symbols). The average codeword length per message is therefore 2.466n / 5.216n = 0.47 — slightly better than the equal-length block code of the previous slide, with a code table of the same size.
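The 0.47 figure can be recomputed from the source statistics; a short sketch using the exact run probabilities 0.1×0.9^k (for k ≤ 6) and 0.9^7 for "7+", which round to the values in the table:

```python
p_A, p_B, cap = 0.9, 0.1, 7
run_prob  = [p_B * p_A**k for k in range(cap)] + [p_A**cap]   # lengths 0..6, then "7+"
code_len  = [3, 4, 4, 4, 4, 4, 4, 1]                          # codeword lengths from the table
msg_count = [k + 1 for k in range(cap)] + [cap]               # source symbols consumed per run

bits    = sum(p * l for p, l in zip(run_prob, code_len))      # ≈ 2.465 bits per run
symbols = sum(p * m for p, m in zip(run_prob, msg_count))     # ≈ 5.217 symbols per run
print(bits / symbols)                                         # ≈ 0.47 bit per message symbol
```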

Summary of today's lecture
- the bound on the average codeword length (p.5)
- Shannon's source coding theorem (p.17)
- extended uses of the Huffman code:
  - (equal-length) block coding
  - unequal-length block coding
    - block partitioning
    - run-length coding (which can be regarded as a special case of block partitioning)

Exercises
- Write a program that constructs a Huffman code from a given set of messages and their probabilities.
- Modify the program so that it can also handle (equal-length) block coding.
- Choose some messages and probabilities, and use the modified program to observe how the average codeword length per message changes with the block length.