Multi-Core (マルチコア), Chikara Fukunaga (福永力)


Slide 1: Multi-Core
Outline:
Background of multi-core
CMOS transistors
Typical structures of multi-core processors
Technical issues
Cache coherence control
The difficulty of parallelization

Slide 2: Background of Multi-Core
Limits of transistor miniaturization: leakage current, which is inherent in the CMOS structure, grows to a level that can no longer be ignored, and this sets an upper limit on the operating frequency. Core 2 is made with CMOS of gate length 45 nm → 22 nm.
Limits of power consumption: at present operating frequencies the TDP (Thermal Design Power, the maximum heat that can be dissipated) is reached. Heat generation far exceeds heat dissipation capacity, so such a processor can no longer be adopted for mobile devices.
Limits of single-core design: hardware design complexity cannot be pushed much beyond superscalar and pipelined execution. IPC is unlikely to exceed four.

Slide 3: CMOS Structure and Principle
CMOS = Complementary Metal Oxide Semiconductor.
Logic circuits are built from pMOS and nMOS transistors working together.
Figure labels: pMOS, nMOS, gate (poly-silicon), oxide, well, source, drain, substrate, metal wiring, insulator, guard.

Slide 4: Towards Multi-Core
As the design rule shrinks, many cores can be integrated on one chip.
With multiple cores, performance can keep improving at the same pace as before.
The performance required of each core is reduced to 1/(number of cores).
A lower supply voltage can be used, which lowers power consumption.
The process gains margin: for example, the gate oxide can be made thicker to reduce leakage current.
Figure (source: Nikkei Electronics): processor performance versus design rule (gate width), from Pentium 4 at 180 nm (2000) to Pentium D at 90 nm (2005). In the first region a smaller design rule gives a speed-up; in the second region it no longer does. Curves shown: single core; multi-core, parallel with frequency 30% lower; multi-core, parallel with frequency 50% lower.
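The trade-off on this slide can be illustrated with a rough estimate. The sketch below is not from the slides: it assumes dynamic power scales as C·V²·f, that the supply voltage can be lowered in proportion to frequency, and that the workload parallelizes perfectly. The function names are hypothetical.

```python
def relative_power(n_cores: int, freq_scale: float) -> float:
    """Total dynamic power relative to one core at full frequency.

    Assumption: P ~ C * V^2 * f per core, with V proportional to f,
    so per-core power scales with freq_scale ** 3.
    """
    return n_cores * freq_scale ** 3


def relative_throughput(n_cores: int, freq_scale: float) -> float:
    """Throughput relative to one core, assuming perfect parallelism."""
    return n_cores * freq_scale


# Two cores at the "30% lower" and "50% lower" frequencies from the figure.
for scale in (0.7, 0.5):
    print(scale, relative_throughput(2, scale), relative_power(2, scale))
```

Under these assumptions, two cores at half frequency match the throughput of one full-speed core at about a quarter of the dynamic power. Real chips deviate from this because voltage cannot be scaled freely and leakage power is ignored here.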

Slide 5: Issues of Multi-Core Implementation
Multi-core designs bring not only advantages but also many technical problems that need attention.
Software technology for multi-core (parallel programming) also still has many open problems.
Original figure: Nikkei Electronics, 30 August 2004.

Slide 6: Multi-Core Configurations
Shared-bus coupling: centralized shared-memory type, or distributed-memory type.
Interconnection network: for example, the TPcore network. TPcore is the flagship processor developed by Fukunaga's lab since 2005.

Slide 7: Shared-Bus Coupling (1): Centralized Shared Memory
Programming is relatively easy because data are shared.
The load on the shared bus grows, which makes scheduling and the arbitration of bus mastership among cores difficult.
Cache coherency (agreement of data among the core caches and shared memory) is difficult to maintain.
If MPU1 to MPUn are processors of the same kind, this is called a Symmetric Multi-Processor (SMP) configuration, or UMA (Uniform Memory Access).

Slide 8: Shared-Bus Coupling (2): Distributed Memory
Each core has its own memory space, which reduces access contention on the shared bus.
The programming burden increases somewhat. The distributed memories are handled as if they were parts of one virtually unified memory.
This is also called NUMA (Non-Uniform Memory Access).
Most real chips are a mixture of the shared-memory and distributed-memory schemes.

Slide 9: Examples of Multi-Core Bus Architecture (1)
Renesas SH-4 (RISC) multi-core SH7786
SH-4A core × 2 (SMP or Anti-SMP configurable)
Mixed architecture with local memory and shared memory
Figure labels: 26-bit address and 32-bit data bus; external memories; 533 MHz.

Slide 10: Examples of Multi-Core Bus Architecture (2)
CELL chip (IBM, Toshiba, Sony, Sony Computer Entertainment; SCEI)
Power Processor Element, PPE (main, PowerPC-based) × 1
Synergistic Processor Element, SPE (sub) × 8
Asymmetric Multi-Processor (ASMP) configuration
EIB (Element Interconnect Bus), 128 bit × 4

Slide 11: CELL Chip Processor Elements
PPE (64-bit PowerPC): runs the OS and the main part of applications; controls the external main memory, I/O and the SPEs; in-order, 2-way superscalar, 2-way multi-threaded.
SPE: for arithmetic and multimedia processing; 128-bit SIMD RISC, in-order, 2-way.
Figure labels (partly lost in extraction): kB, 32 kB, 512 kB, 256 kB; local memory for access of other SPE data.

Slide 12: Cache Problems with Multi-Core
Assume m1 and m2 are cached copies, held by two different processors, of a block m in main storage (MS).
(1) MPU1 changes m1 to "a" (store). What should MPU2 do with m2?
(2) MPUn wants to refill the original m from shared memory into its own cache. After (1), which copy should it refer to?
(3) MPUn misses in its cache on a write to the address of the original m and, because the cache is write-through, wants to rewrite the data directly in (shared) memory. What should it do?
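Cases (1) and (2) can be reproduced with a toy model. The sketch below is an illustration only: `NaiveCache` and `main_storage` are hypothetical names, and the cache deliberately has no coherence mechanism so that the stale reads become visible.

```python
main_storage = {"m": 0}  # the original block m in MS


class NaiveCache:
    """Write-back cache with no coherence mechanism at all."""

    def __init__(self):
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:              # miss: refill from main storage
            self.lines[addr] = main_storage[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value                # only the local copy changes


mpu1, mpu2, mpun = NaiveCache(), NaiveCache(), NaiveCache()
mpu1.read("m")
mpu2.read("m")                   # m1 and m2 are now copies of m
mpu1.write("m", "a")             # case (1): MPU1 stores "a" into m1
stale_m2 = mpu2.read("m")        # MPU2 still sees the old value
stale_refill = mpun.read("m")    # case (2): the refill from MS is stale too
print(stale_m2, stale_refill, mpu1.read("m"))
```

Both MPU2 and MPUn observe the old value while MPU1 holds the new one, which is exactly the inconsistency a coherence protocol has to prevent.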

Slide 13: Cache Coherency
Whenever a processor makes a read access to any memory (shared or distributed), it must always obtain the newest data. This must be guaranteed by the processor hardware.
Cache write control mechanisms:
1. Write-back is normally adopted for multi-core caches, because it keeps the load on the shared bus low.
2. At refill after a cache read/write miss, one of two policies is used.
Write update: an "update" request is sent to all processors that hold the block in their caches.
Write invalidate: an "invalidate" request is sent to all processors that hold the block in their caches.
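The difference between the two policies can be sketched as follows. This is a simplified model under stated assumptions: each cache is a plain dictionary, a missing entry means the block is invalid, and the names `Cache` and `write` are hypothetical.

```python
class Cache:
    def __init__(self, name):
        self.name = name
        self.lines = {}  # addr -> value; a missing entry means "invalid"


def write(writer, caches, addr, value, policy):
    """Store a value and notify every other cache that holds the block."""
    writer.lines[addr] = value
    for cache in caches:
        if cache is writer or addr not in cache.lines:
            continue                      # only sharers receive a request
        if policy == "update":
            cache.lines[addr] = value     # write update: push the new data
        elif policy == "invalidate":
            del cache.lines[addr]         # write invalidate: drop the copy
        else:
            raise ValueError(f"unknown policy: {policy}")


a, b = Cache("MPU1"), Cache("MPU2")
a.lines["m"] = b.lines["m"] = 0
write(a, [a, b], "m", 1, "update")
print(b.lines)        # MPU2 holds the new value
write(a, [a, b], "m", 2, "invalidate")
print(b.lines)        # MPU2 no longer holds the block
```

With write update the sharers stay valid at the cost of sending data on every store; with write invalidate later stores by the same processor need no further messages, but a sharer must refill on its next access.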

Slide 14: Communicating and Confirming Cache Changes with a Directory
Directory method (centralized management)
Each processor has a table that records which processors share a copy of its memory block.
When a processor modifies a block, it can quickly identify the processors that must be notified of the change.
However, this method is effective mainly for distributed-memory architectures. For shared-bus architectures, distributed management by snooping (next slide) is mainly used.
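A minimal sketch of the directory idea is given below. For brevity it keeps one table for all blocks rather than one table per processor as on the slide, and the class and method names are hypothetical.

```python
from collections import defaultdict


class Directory:
    """Tracks which processors hold a copy of each block."""

    def __init__(self):
        self.sharers = defaultdict(set)   # block -> set of processor ids

    def register(self, block, cpu):
        """Record that cpu has loaded a copy of block."""
        self.sharers[block].add(cpu)

    def on_write(self, block, writer):
        """Return the processors to notify; the writer becomes the sole holder."""
        targets = self.sharers[block] - {writer}
        self.sharers[block] = {writer}
        return targets


d = Directory()
for cpu in (1, 2, 3):
    d.register("m", cpu)
print(d.on_write("m", 1))   # processors 2 and 3 must be notified
print(d.on_write("m", 1))   # nobody else holds the block now
```

The lookup returns exactly the sharers, so no broadcast is needed. That is why the method suits systems without a shared bus.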

Slide 15: Checking Cache State by Snooping
Broadcast snoop (snoop = to pry, to sniff around)
On a cache read/write miss, a coherent request is issued to all processors over the shared bus.
Every processor snoops its own cache to check whether it holds the block to be refilled and, if so, whether the block is clean or dirty.
If the block is found and is dirty, the data should be returned by a write-back, and the block moves to the Owner state.
If the block is found and is clean, it moves to the Shared or Invalid state.
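The snoop response described above can be sketched as a single function. Several details are assumptions not stated on the slide: the request kinds "read" and "write", the rule that a write request invalidates the snooped copy, and the dictionary layout of a cache line.

```python
def snoop(cache, addr, request):
    """React to a coherent request seen on the bus.

    cache maps addr -> {"state": one of "M", "O", "E", "S", "I", "data": value}.
    request is "read" or "write" (assumed request kinds).
    Returns the data this cache supplies, or None if it supplies nothing.
    """
    line = cache.get(addr)
    if line is None or line["state"] == "I":
        return None                       # block not held here
    dirty = line["state"] in ("M", "O")
    if request == "write":
        line["state"] = "I"               # another processor will modify it
        return line["data"] if dirty else None
    if dirty:
        line["state"] = "O"               # supply the data, become owner
        return line["data"]
    line["state"] = "S"                   # clean copy is now shared
    return None


dirty_cache = {"m": {"state": "M", "data": "a"}}
print(snoop(dirty_cache, "m", "read"), dirty_cache["m"]["state"])
```

A dirty holder answers a read with its data and becomes the owner, while a clean holder only changes state and lets main storage supply the data.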

Slide 16: State Transitions of a Store-In (Write-Back) Cache (Single Core)
Figure: a local cache with direct-mapped architecture and the main storage. Dark blue cells are dirty; light blue cells are clean.
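The behaviour in the figure can be sketched with a small model. It assumes one word per block and an index computed as the address modulo the number of lines; the class name and fields are hypothetical.

```python
class DirectMappedCache:
    """Single-core write-back cache, one word per block."""

    def __init__(self, n_lines, memory):
        self.n = n_lines
        self.mem = memory                  # list standing in for main storage
        self.lines = [None] * n_lines      # each entry: [tag, data, dirty]
        self.writebacks = 0

    def _fill(self, addr):
        idx, tag = addr % self.n, addr // self.n
        line = self.lines[idx]
        if line is None or line[0] != tag:             # miss
            if line is not None and line[2]:           # evict a dirty line
                self.mem[line[0] * self.n + idx] = line[1]
                self.writebacks += 1
            line = self.lines[idx] = [tag, self.mem[addr], False]
        return line

    def read(self, addr):
        return self._fill(addr)[1]

    def write(self, addr, value):
        line = self._fill(addr)
        line[1] = value
        line[2] = True                                  # mark dirty, no bus write


mem = [0] * 8
cache = DirectMappedCache(4, mem)
cache.write(1, 11)
print(mem[1])          # main storage is not updated yet
cache.read(5)          # address 5 maps to the same line and forces a write-back
print(mem[1], cache.writebacks)
```

Main storage is updated only when a dirty line is evicted, which is the property that keeps bus traffic low.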

Slide 17: Managing Multi-Core Caches with a State Transition Diagram
M (Modified): data differ between main storage (MS) and the cache (dirty); the block is not found in the caches of other processors.
S (Shared): data match between MS and the cache (clean); the block is found in the caches of other processors.
E (Exclusive): data match between MS and the cache (clean); the block is not found in the caches of other processors.
I (Invalid): the state right after reset or after an "invalidate" command (no data available).
O (Owner or Owned): data differ between MS and the cache (dirty); the block is found in the caches of other processors.
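The five definitions reduce to three yes/no questions about a block. The sketch below encodes that table directly; the function name and argument names are hypothetical.

```python
def moesi_state(valid: bool, dirty: bool, shared_elsewhere: bool) -> str:
    """Map the three properties from the slide to a state letter."""
    if not valid:
        return "I"
    if dirty:
        return "O" if shared_elsewhere else "M"
    return "S" if shared_elsewhere else "E"


for dirty in (False, True):
    for shared in (False, True):
        print(dirty, shared, moesi_state(True, dirty, shared))
```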

Slide 18: MSI Protocol
The protocol does not distinguish whether a clean block is shared with the caches of other processors or not.
A bus snoop is required on both read and write misses.
Even when this processor holds the only copy of a block and no other cache has it, a bus snoop is still issued, so unnecessary snoops (bus traffic) occur.
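The wasted traffic can be counted with a toy MSI model. This is a simplified sketch, not the slide's diagram: data values and write-backs are omitted, only states are tracked, and `MSIBus` is a hypothetical name.

```python
class MSIBus:
    """Tracks per-CPU MSI states and counts bus transactions."""

    def __init__(self, n_cpus):
        # per CPU: addr -> "M" or "S"; a missing entry means "I"
        self.state = [dict() for _ in range(n_cpus)]
        self.transactions = 0

    def _broadcast(self, cpu, addr, exclusive):
        self.transactions += 1
        for other, lines in enumerate(self.state):
            if other == cpu or addr not in lines:
                continue
            if exclusive:
                del lines[addr]        # invalidate the other copy
            else:
                lines[addr] = "S"      # a Modified copy is downgraded

    def read(self, cpu, addr):
        if addr not in self.state[cpu]:            # read miss
            self._broadcast(cpu, addr, exclusive=False)
            self.state[cpu][addr] = "S"

    def write(self, cpu, addr):
        if self.state[cpu].get(addr) != "M":       # S or I: must use the bus
            self._broadcast(cpu, addr, exclusive=True)
            self.state[cpu][addr] = "M"


bus = MSIBus(2)
bus.read(0, "x")     # only CPU 0 ever touches x
bus.write(0, "x")
print(bus.transactions)
```

CPU 0 is the only holder of x, yet the read followed by the write costs two bus transactions, because a clean block in state S might be shared.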

Slide 19: MESI Protocol
The clean state is split into two states: Shared and Exclusive.
Most clean blocks are expected to be Exclusive. For an Exclusive block no further snoop is needed, so no unnecessary traffic is generated on the bus.
The protocol is adopted by many multi-core processors, for example PowerPC and Intel Core 2.
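The saving can be shown with a toy MESI model that counts bus transactions. It is a simplified sketch under assumptions: only states are tracked, data and write-backs are omitted, and the silent Exclusive-to-Modified upgrade follows the commonly described MESI behaviour rather than anything spelled out on the slide. `MESIBus` is a hypothetical name.

```python
class MESIBus:
    """Tracks per-CPU MESI states and counts bus transactions."""

    def __init__(self, n_cpus):
        # per CPU: addr -> "M", "E" or "S"; a missing entry means "I"
        self.state = [dict() for _ in range(n_cpus)]
        self.transactions = 0

    def _broadcast(self, cpu, addr, exclusive):
        """Snoop all other caches; return True if any of them held the block."""
        self.transactions += 1
        found = False
        for other, lines in enumerate(self.state):
            if other == cpu or addr not in lines:
                continue
            found = True
            if exclusive:
                del lines[addr]        # invalidate the other copy
            else:
                lines[addr] = "S"      # M or E is downgraded to Shared
        return found

    def read(self, cpu, addr):
        if addr not in self.state[cpu]:            # read miss
            shared = self._broadcast(cpu, addr, exclusive=False)
            self.state[cpu][addr] = "S" if shared else "E"

    def write(self, cpu, addr):
        current = self.state[cpu].get(addr)
        if current == "E":
            self.state[cpu][addr] = "M"            # silent upgrade, no bus use
        elif current != "M":
            self._broadcast(cpu, addr, exclusive=True)
            self.state[cpu][addr] = "M"


bus = MESIBus(2)
bus.read(0, "x")     # nobody else holds x, so CPU 0 gets it Exclusive
bus.write(0, "x")    # upgrade without touching the bus
print(bus.transactions)
```

The same read-then-write sequence on private data now costs one bus transaction: the refill itself. The later write stays off the bus.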

Slide 20: Review of Simultaneous Multi-Threading (SMT)
In SMT, the OS or dedicated hardware controls the execution of multiple threads within a single core.
SMT makes more effective use of a superscalar core: execution resources that would otherwise sit idle can work simultaneously and independently, which raises IPC.
Advances in thread-level parallelism (TLP) are expected to raise the efficiency of SMT processors further.

Slide 21: From SMT to Multi-Core
Multi-process (multi-task) systems have traditionally been controlled by the OS, and techniques have been developed to prevent contention for resources among processes (spin locks, semaphores, CSP and so on).
The open question is whether this technology can be brought down from the OS level to the hardware level, so that many threads are properly distributed over the processors of a multi-core system and an optimized parallel processing environment is realized.
Developing this technology requires solving the various problems of parallel processing systems, which are both old and new.
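As a reminder of what such OS-level techniques look like, here is a minimal spin-lock sketch. It is illustrative only: `SpinLock` is a hypothetical class, and a non-blocking `threading.Lock` stands in for the atomic test-and-set instruction that a real spin lock would use.

```python
import threading
import time


class SpinLock:
    """Busy-waiting lock built on a non-blocking acquire."""

    def __init__(self):
        self._flag = threading.Lock()   # stand-in for an atomic test-and-set bit

    def acquire(self):
        while not self._flag.acquire(blocking=False):
            time.sleep(0)               # keep spinning, but yield to other threads

    def release(self):
        self._flag.release()


shared = {"count": 0}
lock = SpinLock()


def worker(n):
    for _ in range(n):
        lock.acquire()
        shared["count"] += 1            # critical section
        lock.release()


threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared["count"])
```

Every increment happens inside the critical section, so no update is lost. A semaphore generalizes the same idea by letting a fixed number of threads enter at once.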

Slide 22: Many Open Problems in Parallel Programming
From the slides of Y. Lin of Sun Microsystems at Hot Chips 2006, who pointed out the following issues in multi-threaded parallel programming:
How to find, or create, tasks that can run in parallel.
How to map tasks to threads.
How to achieve scalability.
(The English text appears as a photo on the original slide.)
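The first two issues can be made concrete with a small example that splits a summation into independent tasks and maps them onto a thread pool. The function names are hypothetical and the example is not taken from the cited slides. In CPython, threads illustrate the mapping but do not speed up CPU-bound work like this, which is itself a small instance of the scalability issue.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_tasks):
    """Find the parallel tasks: cut the input into at most n_tasks chunks."""
    size = max(1, -(-len(data) // n_tasks))   # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum(data, n_threads):
    """Map each chunk to a worker thread, then combine the partial results."""
    tasks = split(data, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(sum, tasks))


print(parallel_sum(list(range(100)), 4))
```

Summation decomposes cleanly because the partial sums are independent. Finding such a decomposition for a general program is the hard part named on the slide.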