VLIW (Very Long Instruction Word) & Multi-Thread Processor

Presentation transcript:

VLIW (Very Long Instruction Word) & Multi-Thread Processor
VLIW: instruction-level parallel processing, as in a superscalar.
MT (Multi-Thread) processor: thread-level parallel processing.
Simultaneous (throughput-oriented) MT: thread-level parallel processing layered on top of a superscalar processor.
Mixed designs (SMT & VLIW).
Chikara Fukunaga (福永 力)

Multi-Thread Processor
MT = Multi-Thread. A thread here is a unit of parallel processing within a program, such as a loop or a function. This form of MT uses a conventional scalar processor (number of ways = 1) rather than a superscalar. The hardware itself performs the context switch between threads.

Fine-Granularity or Interleaved MT
Fine-granularity (interleaved) MT switches threads on every clock cycle or every instruction, so the threads take turns one after another.
Source: MIPS® MT Principles of Operation, Document Number MD00452, 2007.

Coarse-Granularity or Blocked MT
Coarse-granularity (blocked) MT switches threads only at a long stall, such as a cache miss or an I/O access. It is also known as Switch-on-Event MT (SoEMT).
Source: MIPS® MT Principles of Operation, Document Number MD00452, 2007.
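
The two switching policies (every cycle versus only at a long stall) can be compared with a small simulation. This is a minimal sketch, not the MIPS MT mechanism: the function `simulate`, the constant `MISS_PENALTY`, and the encoding of instructions as `"op"` or `"miss"` are all assumptions made for illustration, and the core is assumed to issue one instruction per cycle.

```python
MISS_PENALTY = 3  # hypothetical stall length, in cycles, after a cache miss


def simulate(threads, fine_grained):
    """Return the per-cycle issue trace of a single-issue multi-thread core.

    threads: list of instruction lists; each instruction is "op" or "miss".
    fine_grained=True  -> move on to the next ready thread every cycle.
    fine_grained=False -> stay on one thread until it stalls or finishes.
    The trace holds the index of the issuing thread, or "-" for an idle cycle.
    """
    n = len(threads)
    pc = [0] * n          # next instruction index per thread
    ready_at = [0] * n    # first cycle in which each thread may issue again
    trace = []
    current = 0
    cycle = 0
    while any(pc[t] < len(threads[t]) for t in range(n)):
        # Fine-grained starts the search one past the last issuing thread.
        start = (current + 1) % n if (fine_grained and trace) else current
        order = [(start + i) % n for i in range(n)]
        chosen = next((t for t in order
                       if pc[t] < len(threads[t]) and ready_at[t] <= cycle),
                      None)
        if chosen is None:
            trace.append("-")              # every unfinished thread is stalled
        else:
            if threads[chosen][pc[chosen]] == "miss":
                ready_at[chosen] = cycle + 1 + MISS_PENALTY
            pc[chosen] += 1
            trace.append(str(chosen))
            current = chosen
        cycle += 1
    return "".join(trace)


example = [["op", "miss", "op"], ["op", "op", "op"]]
fine = simulate(example, fine_grained=True)         # "01011-0"
coarse = simulate(example, fine_grained=False)      # "001110"
alone = simulate([example[0]], fine_grained=False)  # "00---0"
```

With a single thread the miss leaves three idle cycles; with a second thread available, either policy fills most or all of them with useful work.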

Inefficiency of Superscalar Processors
Note: the superscalar processors discussed so far exploit instruction-level parallelism (ILP). However, studies report that even a processor able to issue n instructions in parallel (n ways) reaches an IPC of only about n/2, so roughly half of its resources stay unused and utilization stays near 50%.

Throughput-Oriented MT Processor
If a superscalar is modified to process many threads in parallel, this problem could be solved and throughput should improve as well. A thread context (PC and registers) is provided for each thread, up to the maximum number of threads expected.

Ideas of Throughput-Oriented MT Processors
Simultaneous MT (SMT) executes multiple threads at the same time. Issue slots that one thread leaves empty in the superscalar are filled with instructions from other threads, which raises utilization.
Source: MIPS® MT Principles of Operation, Document Number MD00452, 2007.
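
The slot-filling idea can be made concrete with a little arithmetic. The sketch below is a deliberately simple model under stated assumptions: the function name `slot_utilization` and the per-cycle counts of ready instructions are invented for illustration, threads are served in list order, and instructions that do not fit in a cycle are simply dropped rather than carried over.

```python
def slot_utilization(ready_per_cycle, width):
    """Fraction of issue slots used by a `width`-way core.

    ready_per_cycle: one list per cycle, holding the number of independent
    instructions each thread could issue in that cycle (first thread first).
    """
    used = 0
    for ready in ready_per_cycle:
        free = width
        for count in ready:          # later threads fill what earlier ones leave
            free -= min(count, free)
        used += width - free
    return used / (width * len(ready_per_cycle))


# One thread on a 4-way superscalar: half of the slots stay empty.
single = slot_utilization([[2], [1], [3], [2]], width=4)             # 0.5
# A second thread fills most of the remaining slots.
smt = slot_utilization([[2, 2], [1, 2], [3, 1], [2, 3]], width=4)    # 0.9375
```

The single-thread case reproduces the "about n/2" figure quoted for superscalars, and adding one more thread already brings utilization close to 100% in this toy workload.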

SMT Processor Structure
An SMT processor extends a simple superscalar structure in three ways: logic selects which thread's instructions to take, the number of registers is increased, and the renaming structure is extended to handle multiple threads.

Problems of Throughput-Oriented MT Processors
Growth in the number of registers: some performance must be sacrificed, in IPC, in clock period, or both.
Growth in memory references: each thread refers to its own independent memory region, so the cache available to each thread shrinks to about 1/(number of threads) of the total and cache misses increase.
Limit on the number of threads: if an ordinary superscalar already uses half of its resources, two threads at most are enough to use them fully, and more threads bring no benefit. Some studies indicate that resources are used most effectively with only two threads.
Demand: is there strong demand for processing multi-threaded programs? Candidate applications include numerical computation and media processing.

MIPS32® 34Kf™ Processor
A multi-thread processor developed by MIPS Technologies (2006). It has a two-level thread structure (OS level and user level).
OS-level threads map to VPEs (Virtual Processing Elements), 2 at most.
User-level threads map to TCs (Thread Contexts), 9 at most.
Each TC has its own PC and register file (RF).
A single OS can be used, or a different OS can be deployed on each VPE.
Several TCs are assigned to each VPE.
The QoS mechanism contains the TC selection algorithm.
Source: MIPS Technologies website.

TC Selection with QoS in the MT ASE
The core combines a 9-stage pipeline as its basic architecture with the MT ASE and the DSP ASE (ASE = Application Specific Extension). Customers can tailor the ASE to their own purpose. TC priorities are set up through the policy manager in the QoS part of the MT ASE. Examples of policies: prioritized or round-robin TC selection, and assignment of cycles to TCs.
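
The two example policies named here, round-robin and prioritized TC selection, can be sketched as follows. This is an illustration only and does not reflect the actual policy manager interface of the 34K core: the function names `round_robin` and `prioritized`, the `ready` flags, and the tie-breaking rule are all assumptions.

```python
def round_robin(ready, last):
    """Pick the next ready TC after `last`, wrapping around; None if none is ready."""
    n = len(ready)
    for step in range(1, n + 1):
        tc = (last + step) % n
        if ready[tc]:
            return tc
    return None


def prioritized(ready, priority):
    """Pick the ready TC with the highest priority (ties go to the lowest index)."""
    candidates = [tc for tc, is_ready in enumerate(ready) if is_ready]
    if not candidates:
        return None
    return max(candidates, key=lambda tc: (priority[tc], -tc))


ready = [True, False, True, True]          # TC1 is stalled in this example
next_rr = round_robin(ready, last=0)       # 2 (TC1 is skipped)
next_pr = prioritized(ready, [1, 9, 5, 5]) # 2 (TC1 has top priority but is not ready)
```

Round-robin shares cycles evenly among ready TCs, while the prioritized policy lets one TC dominate whenever it can run, which is the kind of trade-off a QoS policy has to settle.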

MIPS32® 34Kf™ Block Diagram
The Fetch Unit receives instructions for all TCs and contains the branch prediction logic (512 entries). Each TC has an instruction buffer (IBF) that holds 8 instructions. The TC Dispatch Unit decides which TC's instructions to take next, following the decision of the policy manager in the MT control block.
Source: MIPS Technologies, MIPS32® 34Kf™ Processor Core Datasheet (2008).

EyeQ2 System with MIPS 34Kf
Source: Embedded System Europe, 8/9 2006 issue.

EyeQ2
A real-time visual analysis system for road traffic conditions, taking input data from up to two cameras.

EyeQ2 Block Diagram with 2 MIPS 34Kfs
Two MIPS 34Kf cores are used (one is reserved for future extension).
8 Visual Computing Engines:
CE (Classifier Engine): image scaling and preprocessing unit, pattern classifier
Tracker: image warping and motion analysis
PW (Pre-process Window): image converter and pyramid unit, computation of vertical and horizontal edge maps
Filter: feature-based classifier unit
Dfinder (Display Finder): stereo engine, programmable scan, 2 pixels/clock
3 Vector Microcode machines (VLIW)

VLIW or EPIC
VLIW = Very Long Instruction Word; EPIC = Explicitly Parallel Instruction Computing. The idea is to pack several instructions into one very long instruction word and process them in parallel. The word (100 to 200 bits) is divided into several blocks, each corresponding to one processing unit of the processor. Each block is a few tens of bits long and holds the instruction (plus operands) for its unit, much like the instruction word of a conventional processor. In short, as many instructions as possible are packed into the long word, and because each block has its own execution unit, the instructions in a word can execute in parallel.

VLIW or EPIC (2)
Various LIW/VLIW processors have been developed so far. One is the i860, an LIW processor with a 64-bit instruction word that executes integer and floating-point operations in parallel. Later, Intel's Itanium (developed mainly by HP, released in May 2001) and Transmeta's Crusoe (January 2000) reached the market. Both can run x86 code, Crusoe by translating it in software and Itanium through an IA-32 compatibility mode, and so both have been used in Windows-based machines.

Ideas of VLIW
Take, for example, an instruction word of 148 bits. It is divided into several blocks, each used for a particular kind of instruction. Matching instructions are taken from the program and placed into the blocks, and they are processed in parallel. Because there is no scheduling hardware, the circuit is simple and power consumption is low. Arranging the program's instructions into the blocks, while accounting for parallelism and hazards, is the job of software (a compiler or a translator).

Packing Instructions into VLIW Blocks
If no suitable instruction is found for a block, a NOP (No OPeration) must be placed in it. Efficient parallel processing therefore cannot be expected in general, although VLIW may be effective for repetitive operations such as decompressing, displaying, and compressing video data. VLIW demands a great deal of the compiler and depends heavily on it. By construction, instructions issue in order and complete in order.
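
How NOPs arise when software packs instructions into long words can be shown with a toy packer. This is a minimal sketch under stated assumptions, not a real compiler algorithm: the 4-block layout `SLOTS`, the function `pack`, and the sample program are invented for illustration, and the packer is greedy and strictly in order (it closes the current word as soon as one instruction does not fit).

```python
SLOTS = ("ALU", "ALU", "MEM", "FPU")   # hypothetical layout of one long word


def pack(program):
    """Pack (name, unit, dest, sources) tuples, in order, into long words.

    An instruction joins the current word only if a block of its unit type is
    still free and it has no register dependence on instructions already in
    the word. Blocks that remain empty hold "NOP".
    """
    words = []
    word, written, read = None, set(), set()
    for name, unit, dest, srcs in program:
        free = None
        if word is not None:
            hazard = (set(srcs) & written) or dest in written or dest in read
            if not hazard:
                free = next((i for i, u in enumerate(SLOTS)
                             if u == unit and word[i] == "NOP"), None)
        if free is None:               # start a new long word
            word, written, read = ["NOP"] * len(SLOTS), set(), set()
            words.append(word)
            free = next(i for i, u in enumerate(SLOTS) if u == unit)
        word[free] = name
        written.add(dest)
        read.update(srcs)
    return words


program = [
    ("add",  "ALU", "r1", ("r2", "r3")),
    ("ld",   "MEM", "r4", ("r5",)),
    ("sub",  "ALU", "r6", ("r1", "r4")),   # needs the results of add and ld
    ("fmul", "FPU", "f1", ("f2", "f3")),
    ("add2", "ALU", "r7", ("r6", "r2")),   # needs the result of sub
]
words = pack(program)
nops = sum(block == "NOP" for w in words for block in w)   # 7 of 12 blocks
```

Five instructions occupy three words here, so 7 of the 12 blocks are NOPs. Dependences between neighboring instructions, not a shortage of execution units, are what force the empty blocks in this example.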

VLIW and Superscalar
The hardware structure of VLIW is simple, but its scheduling is rigid: instructions issue in order and complete in order.
[Figure: compiler, instructions, scheduler, processors]

Example of VLIW: Fujitsu FR-V (FR-V550)
http://jp.fujitsu.com/microelectronics/products/assp/frv/
Parallel processing of up to 8 instructions and 28 calculations per clock.
32 kB I-cache and 32 kB D-cache, both 4-way set associative.
90 nm CMOS.
Frequency 440 MHz (upgraded in 2006).
The single-chip FR-V550 was revised in 2006 into a multi-core chip, sold under the part number MB93577.

FR-V 8-way VLIW architecture

Applications Targeted by FR-V550
Multimedia processing with low power consumption.
Concurrent processing of movie decoding (MPEG4/H.264, or MPEG2 decoding with 3D graphics).
Bioinformatics simulation: molecular dynamics of proteins.
High performance at low energy consumption and low price.

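Problems of VLIW
Inefficient parallel processing because of the growing number of NOPs.
New hardware extensions in later generations (that is, more bits per word) are not easy to accommodate; programs must be recompiled or modified.
In-order completion wastes cycles, since instructions that finish early sit idle.
Scheduling estimates for load/store instructions are uncertain.
Poor fit for programs with fine-grained, complicated conditional branches.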

EPIC (Explicitly Parallel Instruction Computing)
Intel and HP jointly developed an architecture (IA-64) based on the concept called EPIC. It is VLIW in principle but tries to overcome VLIW's drawbacks. The two companies brought the resulting processor, Itanium, to market in 2001.

Itanium Architecture
A 128-bit instruction word holds three instructions (41 bits each) and a 5-bit field called the "template". This 128-bit instruction word is called a "bundle". The predicate field of each instruction specifies a register that records whether the instruction is executed (1) or not (0).
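
The field sizes add up exactly: 3 × 41 + 5 = 128 bits. The sketch below packs and unpacks such a bundle with plain integer arithmetic. The helper names are hypothetical, the slot and template values in the example are arbitrary numbers rather than real instruction encodings, and the bit positions (template in the lowest 5 bits, 6-bit predicate field in the lowest bits of each instruction) are assumptions of this sketch, since the slide gives only the field widths.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3
PREDICATE_BITS = 6


def pack_bundle(template, slots):
    """Combine a 5-bit template and three 41-bit instructions into one integer."""
    assert 0 <= template < (1 << TEMPLATE_BITS)
    assert len(slots) == SLOTS_PER_BUNDLE
    assert all(0 <= s < (1 << SLOT_BITS) for s in slots)
    bundle = template
    for i, slot in enumerate(slots):
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
             for i in range(SLOTS_PER_BUNDLE)]
    return template, slots


def predicate_field(instruction):
    """Return the predicate register number held in a 41-bit instruction."""
    return instruction & ((1 << PREDICATE_BITS) - 1)


bundle = pack_bundle(0b01100, [0x123, 0x456, 0x789])   # arbitrary example values
total_bits = TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS   # 128
```

Filling every field with ones yields the largest 128-bit value, which confirms that the fields tile the bundle with no gaps or overlap.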

Itanium Instructions and Template
Instructions are classified into several types according to the hardware unit they use. The template indicates a combination of instructions that can execute in parallel, given the number of hardware resource units, and that can be placed in the three slots of a bundle. Instructions in a combination specified by a template are guaranteed to be independent.
Template examples: MII, MI|I, MLX, MMI, M|MI, MFI, MMF, MIB, MBB, BBB, MFB, …
MFI means an M instruction in slot 0, an F instruction in slot 1, and an I instruction in slot 2.
| marks a stop: the instructions after it are deferred to the next cycle in case a conflict may occur.
Type A instructions can be handled by either an I-unit or an M-unit.

Instruction type; Description; Execution unit
A; Integer ALU; I-unit or M-unit
I; Non-ALU integer; I-unit
M; Memory; M-unit
F; Floating-point; F-unit
B; Branch; B-unit
L+X; Extended (long integer or long branch); I-unit or B-unit
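
The template notation and the instruction-type table can be turned into two small helpers. This is a sketch of the notation used on the slide, not of the binary template encoding: the names `split_at_stops` and `can_fill` are hypothetical, and the extended L+X case (template MLX) is left out because one instruction spans two slots there.

```python
# Execution units that accept each instruction type, as listed in the table.
EXECUTION_UNITS = {
    "A": {"I", "M"},   # integer ALU instructions run on I- or M-units
    "I": {"I"},
    "M": {"M"},
    "F": {"F"},
    "B": {"B"},
}


def split_at_stops(template):
    """Split slide-style notation such as 'M|MI' into groups separated by stops."""
    return [list(group) for group in template.split("|")]


def can_fill(template, instruction_types):
    """Check whether instructions of the given types fit the template's slots."""
    slots = [unit for unit in template if unit != "|"]
    return (len(slots) == len(instruction_types)
            and all(slot in EXECUTION_UNITS[kind]
                    for slot, kind in zip(slots, instruction_types)))


groups = split_at_stops("M|MI")                # [['M'], ['M', 'I']]
fits = can_fill("MII", ["M", "A", "I"])        # True: the A type may use an I slot
rejected = can_fill("MFI", ["M", "I", "I"])    # False: slot 1 needs an F instruction
```

Because type A instructions fit both I and M slots, the packing software has some freedom when choosing a template for integer-heavy code.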

Itanium Block Diagram
Itanium 1: branch units (B) × 3, integer (I) × 4, memory (M) × 4, floating point (F) × 2.
Itanium 2: branch units (B) × 3, integer (I) × 6, memory (M) × 6, floating point (F) × 2.

Itanium Bundle Rotation and Parallel Processing
Example of instruction execution (normally two bundles execute at the same time): templates MFI & MIB.
Stop handling and bundle rotation (deferring a bundle to the next round): templates MII & M|MI.
In this example nine instructions are executed in parallel over two cycles, so utilization = 9/12 = 75%.

Concept of Predication
Each instruction specifies a predicate register with a 6-bit field (so there are 64 predicate registers in total, 1 bit each). If the register's value is 1 the instruction executes; if it is 0 the instruction becomes a NOP. Instructions therefore take the following form:
    (p1) ADD R1,R2,1
If p1 == 1 the ADD executes; otherwise it is a NOP.
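
The effect of a predicated instruction can be modeled in a few lines. This is a minimal sketch: the function `predicated_add` and the dictionary-based register file are hypothetical, and the operand order (destination first, so that `(p1) ADD R1,R2,1` computes R1 = R2 + 1) is an assumption about the slide's notation.

```python
def predicated_add(preds, regs, p, dst, src, imm):
    """Model of '(p) ADD dst, src, imm': executes only if predicate p is 1."""
    if preds[p] == 1:
        regs[dst] = regs[src] + imm
    # Otherwise the instruction behaves as a NOP and regs is left unchanged.


preds = [0] * 64                 # 64 one-bit predicate registers
regs_on = {"R1": 0, "R2": 5}
regs_off = {"R1": 0, "R2": 5}

preds[1] = 1
predicated_add(preds, regs_on, 1, "R1", "R2", 1)    # executes: R1 becomes 6

preds[1] = 0
predicated_add(preds, regs_off, 1, "R1", "R2", 1)   # NOP: R1 stays 0
```

The instruction is always fetched and issued; only its effect is suppressed, which is what lets predication replace a conditional branch.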

Predication Examples
If example, for the statement if (x==10) c=c+1; with x held in R5:
    CMP.EQ P1, 10, R5
    (P1) ADD C,1
If-else example:
    CMP.EQ P1,P2, 10, R5
    (P1) inst1
    (P1) inst2
    (P1) …
    (P2) inst3
    (P2) inst4
    …

Code Comparison between x86 and Itanium
x86 code example:
    CMP AX, 0
    JE L1
    CMP BX, 0
    JE L1
    ADD J, 1
    JMP L3
    L1: CMP CX, 0
    JE L2
    ADD K, 1
    JMP L3
    L2: SUB K, 1
    L3: ADD I, 1
Itanium code example:
    // Compare R1 with 0; if they are equal then P1=1 and P2=0,
    // otherwise P1=0 and P2=1
    CMP.EQ P1, P2, 0, R1
    (P2) CMP.EQ P1, P3, 0, R2
    (P3) ADD J, 1
    (P1) CMP.NE P4, P5, 0, R3
    (P4) ADD K, 1
    (P5) SUB K, 1
    ADD I, 1
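
That the branch-free Itanium listing computes the same result as the x86 listing can be checked by modeling both. The sketch below is an illustration under stated assumptions: AX, BX, CX are taken to correspond to R1, R2, R3; all predicate registers are assumed to be 0 on entry; and a compare whose own predicate is 0 is assumed to leave its target predicates untouched. The function names are hypothetical.

```python
from itertools import product


def branch_version(r1, r2, r3, i, j, k):
    """Control flow of the x86 listing (AX, BX, CX modeled as r1, r2, r3)."""
    if r1 == 0 or r2 == 0:      # either JE L1 is taken
        if r3 == 0:             # L1: JE L2
            k -= 1              # L2: SUB K, 1
        else:
            k += 1              # ADD K, 1
    else:
        j += 1                  # ADD J, 1
    i += 1                      # L3: ADD I, 1
    return i, j, k


def predicated_version(r1, r2, r3, i, j, k):
    """Straight-line model of the Itanium listing; there are no branches."""
    p = [0] * 6                 # p[1]..p[5]; assumed cleared on entry

    def compare(qp, equal, pt, pf, a, b):
        # A compare guarded by a false predicate writes nothing.
        if qp:
            result = (a == b) if equal else (a != b)
            p[pt], p[pf] = int(result), int(not result)

    compare(1, True, 1, 2, 0, r1)        # CMP.EQ P1, P2, 0, R1
    compare(p[2], True, 1, 3, 0, r2)     # (P2) CMP.EQ P1, P3, 0, R2
    if p[3]:
        j += 1                           # (P3) ADD J, 1
    compare(p[1], False, 4, 5, 0, r3)    # (P1) CMP.NE P4, P5, 0, R3
    if p[4]:
        k += 1                           # (P4) ADD K, 1
    if p[5]:
        k -= 1                           # (P5) SUB K, 1
    i += 1                               # ADD I, 1
    return i, j, k


# Compare both versions for every zero/non-zero combination of the inputs.
agree = all(branch_version(a, b, c, 0, 0, 0) == predicated_version(a, b, c, 0, 0, 0)
            for a, b, c in product((0, 1), repeat=3))
```

The predicated version executes all seven instructions on every path, trading a few wasted issue slots for the removal of four branches and their misprediction risk.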