Multi-Core, by Chikara Fukunaga (福永力)

Slide 1: Multi-Core (outline)
- Background of Multi-Core
- CMOS transistor
- Typical structures of Multi-Core processors
- Technical issues
- Cache coherence control
- The difficulty of parallelization

Slide 2: Background of Multi-Core
Limits of transistor miniaturization:
- Leak current, which is inherent in the CMOS structure, has grown too large to ignore. This puts an upper limit on the drive frequency.
- Core 2 is made with CMOS of gate length 45 nm → 22 nm.
Limits of power consumption:
- TDP (Thermal Design Power, the maximum heat dissipation) has reached its limit at present drive frequencies.
- Heat generation >> heat radiation capacity.
- Such a processor can no longer be adopted for mobile devices.
Limits of single-core design:
- Hardware design complexity beyond superscalar/pipeline designs is reaching its limit.
- IPC is unlikely to exceed four (is IPC > 4 impossible?).

Slide 3: CMOS structure and principle
CMOS = Complementary Metal Oxide Semiconductor. Logic circuits are built from pMOS and nMOS transistors together.
[Figure: cross-section of a pMOS and an nMOS transistor. Labels: gate (poly-silicon), oxide, well, source, drain, substrate, metal wiring, insulator, guard.]

Slide 4: Towards Multi-Core
- A smaller design rule allows many cores to be put on one chip.
- Multi-core lets performance keep improving at the same pace as before.
- The performance required of each core is held down to 1/(number of cores).
- A lower supply voltage gives lower power consumption.
- The process gains margin. For example, the gate oxide can be made thicker to reduce leak current.
[Figure, source: Nikkei Electronics, 2004.8.30. Axes: design rule (gate width) versus processor performance. Data points: Pentium 4, 180 nm (2000); Pentium D, 90 nm (2005). Annotations: "speed-up with design rule", "design rule no longer helps speed-up". Curves: single core; multi-core, parallel with frequency 30% lower; multi-core, parallel with frequency 50% lower.]
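The trade-off on this slide (more cores, each at a lower frequency and supply voltage) can be illustrated with the standard CMOS switching-power relation P = C·V²·f. The sketch below is not from the slides: the capacitance, voltage, and frequency values are made-up assumptions, leakage is ignored, and the workload is assumed to parallelize perfectly.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Switching power of CMOS logic, P = C * V^2 * f (leakage ignored)."""
    return capacitance * voltage ** 2 * frequency


# Hypothetical single core: 1 nF switched capacitance, 1.2 V supply, 3 GHz.
single_core_power = dynamic_power(1e-9, 1.2, 3.0e9)

# Hypothetical dual core: each core runs 30% slower, which is assumed
# (not derived) to allow the supply voltage to drop to 1.0 V.
dual_core_power = 2 * dynamic_power(1e-9, 1.0, 0.7 * 3.0e9)

# Ideal throughput relative to the single core, assuming perfect parallelism.
dual_core_throughput = 2 * 0.7

print(f"single core: {single_core_power:.2f} W at 1.00x throughput")
print(f"dual core:   {dual_core_power:.2f} W at {dual_core_throughput:.2f}x throughput")
```

Under these assumed numbers the dual-core configuration draws slightly less power while offering more ideal throughput. Real gains depend on how far the voltage can actually be lowered and on how well the software parallelizes.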

Slide 5: Issues of Multi-Core implementation
- Multi-core brings not only advantages. Many technical issues also need attention.
- Programming for multi-core (parallelization) also has many open problems.
[Figure, original: Nikkei Electronics, 30 August 2004.]

Slide 6: Multi-Core configuration examples
- Shared bus coupling, centralized shared memory type
- Shared bus coupling, distributed memory type
- Interconnection network, for example the TPcore network. TPcore is the flagship processor developed by Fukunaga's lab since 2005.

Slide 7: Shared bus coupling (1), centralized shared memory type
- Programming is relatively easy because data is shared.
- The load on the shared bus grows, which makes scheduling and the arbitration of bus mastership among cores difficult.
- Cache coherency (agreement of data among the cores' caches and the shared memory) is hard to maintain.
- If MPU1…n are all processors of the same kind, this is called a Symmetric Multi-Processor (SMP) configuration, or UMA (Uniform Memory Access).
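The arbitration problem mentioned above can be made concrete with a small sketch. The slides do not say which arbitration policy is used; round-robin is assumed here only because it is simple and fair, and the class name `RoundRobinArbiter` is hypothetical.

```python
class RoundRobinArbiter:
    """Grants the shared bus to one requesting core per cycle, in rotating order."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.last_granted = num_cores - 1  # so that core 0 is checked first

    def grant(self, requests):
        """requests: set of core ids asking for the bus. Returns the winner or None."""
        for offset in range(1, self.num_cores + 1):
            core = (self.last_granted + offset) % self.num_cores
            if core in requests:
                self.last_granted = core
                return core
        return None  # nobody asked for the bus this cycle


arbiter = RoundRobinArbiter(4)
# Cores 0, 2 and 3 keep requesting the bus for four consecutive cycles.
grants = [arbiter.grant({0, 2, 3}) for _ in range(4)]
print(grants)
```

Each requesting core waits at most one full rotation, but every additional core lengthens that rotation, which is one way to see why a single shared bus scales poorly.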

Slide 8: Shared bus coupling (2), distributed memory type
- Each processor has its own memory space, which reduces access conflicts on the shared bus.
- The programming burden increases somewhat. The distributed memories are treated virtually as one unified memory.
- Also called NUMA (Non-Uniform Memory Access).
- Most actual chips are a mixture of the shared memory and distributed memory types.

Slide 9: Examples of Multi-Core bus architecture (1)
Renesas SH-4 (RISC) multi-core: SH7786
- SH-4A core × 2 (configurable as SMP or "Anti-SMP")
- Mixed local memory and shared memory architecture
[Figure labels: 26-bit address and 32-bit data bus; external memories; 533 MHz.]

Slide 10: Examples of Multi-Core bus architecture (2)
CELL chip (IBM, Toshiba, Sony, Sony Computer Entertainment; SCEI)
- PowerPC Processor Element (PPE), main, × 1
- Synergistic Processor Element (SPE), sub, × 8
- Asymmetric Multi-Processor (ASMP) configuration
- EIB (Element Interconnect Bus), 128 bit × 4

Slide 11: CELL chip processor elements
PPE (64-bit PowerPC):
- Executes the OS and the main application.
- Controls the external main memory, I/O, and the SPEs.
- In-order 2-way superscalar, 2-way multi-threaded.
SPE:
- For arithmetic calculation and multimedia.
- 128-bit SIMD-type RISC, in-order 2-way.
[Figure labels: 32 kB, 32 kB, 512 kB, 256 kB; local memory for access to other SPEs' data.]

Slide 12: Cache problems under Multi-Core
Assume m1 and m2 are the cached copies, in their respective processors, of block m in main storage (MS).
(1) MPU1 changes m1 to "a" (store). What should happen to m2?
(2) MPUn wants to read the address of the original m from shared memory into its cache. After (1), which copy should it refer to?
(3) MPUn misses in its cache on a write to the address of the original m and, because the cache is write-through, wants to rewrite the data directly in (shared) memory. What should it do?

Slide 13: Cache coherency
- A processor that makes a read access to any memory (shared or distributed) must always get the newest data.
- This must be guaranteed by the processor hardware.
Cache write control mechanisms:
1. Write-back is normally adopted for multi-core caches because it puts less load on the shared bus.
2. At refill on a cache read/write miss, one of two policies applies:
   - Write update: an update request is sent to all processors that hold the block in their caches.
   - Write invalidate: an invalidate request is sent to all processors that hold the block in their caches.
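The difference between having no coherence mechanism, write update, and write invalidate can be sketched with a toy model of two caches on a bus. This is an illustration only: the `Cache` and `Bus` classes are hypothetical, and the model uses write-through to memory to stay short, whereas the slide notes that real multi-core caches normally use write-back.

```python
class Cache:
    def __init__(self):
        self.lines = {}  # address -> value; a missing key means "invalid"


class Bus:
    """Toy shared bus; policy is 'none', 'update' or 'invalidate'."""

    def __init__(self, caches, memory, policy):
        self.caches, self.memory, self.policy = caches, memory, policy

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache.lines:            # read miss: refill from memory
            cache.lines[addr] = self.memory[addr]
        return cache.lines[addr]

    def write(self, cpu, addr, value):
        self.caches[cpu].lines[addr] = value
        self.memory[addr] = value              # write-through (simplification)
        for other, cache in enumerate(self.caches):
            if other == cpu or addr not in cache.lines:
                continue
            if self.policy == "update":
                cache.lines[addr] = value      # push the new value to sharers
            elif self.policy == "invalidate":
                del cache.lines[addr]          # sharers must refill on next read


def run(policy):
    """MPU1 and MPU2 both cache m, MPU1 stores 'a', then MPU2 reads again."""
    bus = Bus([Cache(), Cache()], {0x10: "m"}, policy)
    bus.read(0, 0x10)
    bus.read(1, 0x10)
    bus.write(0, 0x10, "a")
    return bus.read(1, 0x10)


for policy in ("none", "update", "invalidate"):
    print(policy, "->", run(policy))
```

With the policy set to `"none"`, the second processor keeps reading its stale copy. Both other policies return the new value, the first by pushing data to sharers and the second by forcing them to refill.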

Slide 14: Communicating and confirming cache changes with a directory
Directory method (centralized management):
- Each processor has a table that registers which processors share a copy of its memory block.
- When a processor modifies a block, it can quickly identify the processors that must be notified of the change.
- This method is effective mainly for distributed memory architectures. Shared bus architectures mainly use distributed management by snooping, described next.
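A minimal sketch of the directory idea follows, assuming a single table keyed by block address and an invalidate-on-write policy. Neither detail is specified in the slides, and the `Directory` class and its method names are hypothetical.

```python
from collections import defaultdict


class Directory:
    """Tracks which processors hold a copy of each memory block."""

    def __init__(self):
        self.sharers = defaultdict(set)  # block address -> set of processor ids

    def record_read(self, cpu, block):
        """A processor cached the block: register it as a sharer."""
        self.sharers[block].add(cpu)

    def record_write(self, cpu, block):
        """Return the processors to notify, then make the writer the only holder."""
        targets = self.sharers[block] - {cpu}
        self.sharers[block] = {cpu}
        return targets


directory = Directory()
for cpu in (0, 1, 3):
    directory.record_read(cpu, 0x40)

# Processor 1 modifies the block: only the registered sharers are contacted.
to_invalidate = directory.record_write(1, 0x40)
print(sorted(to_invalidate))
```

The point of the lookup is that notifications go only to processors that actually hold the block, instead of being broadcast to every processor.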

Slide 15: Checking cache state by snooping
Broadcast snoop (snoop = to pry, to nose around):
- On a cache read/write miss, a coherence request is sent to all processors over the shared bus.
- Every processor snoops its own cache to check whether it holds the block to be refilled and, if so, whether the block is clean or dirty.
- If the block is found and is dirty, its data should be returned by write-back. The block moves to the Owner state.
- If the block is found and is clean, it moves to the Invalid or Shared state.
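The snoop steps above can be sketched for the read-miss case. This is a simplified illustration, not a full protocol: each cache is a plain dictionary from address to a one-letter state (M and O for dirty, E and S for clean), clean copies are assumed to move to Shared rather than Invalid, and the function name `broadcast_read_snoop` is hypothetical.

```python
def broadcast_read_snoop(caches, requester, addr):
    """Broadcast a read miss to all other caches and update their states.

    caches: list of dicts mapping address -> 'M', 'O', 'E' or 'S'.
    Returns the id of the cache that supplies dirty data, or None when
    main storage has to supply the block.
    """
    supplier, found = None, False
    for cpu, cache in enumerate(caches):
        if cpu == requester:
            continue
        state = cache.get(addr, "I")
        if state in ("M", "O"):      # dirty copy: this cache supplies the data
            cache[addr] = "O"        # and stays responsible for it as owner
            supplier, found = cpu, True
        elif state in ("E", "S"):    # clean copy: it is now known to be shared
            cache[addr] = "S"
            found = True
    # The requester's copy is shared if anyone else holds the block.
    caches[requester][addr] = "S" if found else "E"
    return supplier


caches = [{0x80: "M"}, {}, {}]
supplier = broadcast_read_snoop(caches, requester=2, addr=0x80)
print(supplier, caches)
```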

Slide 16: State transitions of a store-in (write-back) cache, single core
[Figure: a local cache and main storage. Dark blue cells are dirty, light blue cells are clean. The cache is direct-mapped.]
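Since the figure itself is not reproduced in this transcript, the behaviour of a direct-mapped write-back cache can be sketched in code instead. The model below is an assumption-laden simplification (one word per line, a dictionary as main storage, a hypothetical `WriteBackCache` class) and is not taken from the slide.

```python
class WriteBackCache:
    """Direct-mapped, write-back (store-in) cache with one word per line."""

    def __init__(self, num_lines, memory):
        self.num_lines = num_lines
        self.memory = memory                  # dict: address -> value
        self.lines = [None] * num_lines       # each entry: [tag, value, dirty]
        self.write_backs = 0

    def _lookup(self, addr):
        index, tag = addr % self.num_lines, addr // self.num_lines
        line = self.lines[index]
        if line is None or line[0] != tag:    # miss: evict, then refill
            if line is not None and line[2]:  # a dirty victim goes back to memory
                self.memory[line[0] * self.num_lines + index] = line[1]
                self.write_backs += 1
            line = self.lines[index] = [tag, self.memory.get(addr, 0), False]
        return line

    def read(self, addr):
        return self._lookup(addr)[1]

    def write(self, addr, value):
        line = self._lookup(addr)
        line[1], line[2] = value, True        # update the cache only, mark dirty


memory = {0: 10, 4: 40}
cache = WriteBackCache(num_lines=4, memory=memory)
cache.write(0, 11)                 # hits line 0 after a refill; memory still holds 10
stale_in_memory = memory[0]
value = cache.read(4)              # address 4 also maps to line 0: dirty line is evicted
print(stale_in_memory, value, memory[0], cache.write_backs)
```

The write reaches main storage only when the dirty line is evicted, which is why a second processor reading main storage in the meantime would see old data.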

Slide 17: Managing multi-core caches with a state transition diagram
- M (Modified): data differs between MS (main storage) and the cache (dirty); the block is not found in the caches of other processors.
- S (Shared): data matches between MS and the cache (clean); the block is found in the caches of other processors.
- E (Exclusive): data matches between MS and the cache (clean); the block is not found in the caches of other processors.
- I (Invalid): the state right after reset or after an "invalidate" command (no data available).
- O (Owner or Owned), the newly added state: data differs between MS and the cache (dirty); the block is found in the caches of other processors.
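The five definitions reduce to three yes/no questions about a cache line: is it valid, is it dirty, and do other caches hold it. The helper below encodes exactly that table; the function name `moesi_state` is hypothetical.

```python
def moesi_state(valid, dirty, shared):
    """Classify a cache line using the state definitions on the slide.

    valid:  the line holds usable data
    dirty:  the cached data differs from main storage
    shared: other processors' caches also hold the block
    """
    if not valid:
        return "I"
    if dirty:
        return "O" if shared else "M"
    return "S" if shared else "E"


for dirty in (False, True):
    for shared in (False, True):
        print(f"dirty={dirty}, shared={shared} -> {moesi_state(True, dirty, shared)}")
```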

Slide 18: MSI protocol
- MSI does not distinguish whether a clean memory block is also held in other processors' caches or not.
- A bus snoop is needed on both read and write misses.
- Even when the block is held only by this processor and by no other cache, a bus snoop is still issued, so useless snoops (bus traffic) occur on the shared bus.

Slide 19: MESI protocol
- MESI splits the clean state into two: Shared and Exclusive.
- Most blocks are expected to be Exclusive. For those blocks, the snoop that MSI would still issue (the invalidate broadcast before the first write that follows a read miss) is skipped, so no useless bus traffic is generated.
- Many multi-core processors currently adopt this protocol, for example PowerPC and Intel Core 2.
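The traffic saving can be made concrete by counting the bus transactions that one cache issues for one block under each protocol. The sketch below is a simplified model under stated assumptions: it ignores snoops arriving from other processors and evictions, and the function name `bus_transactions` is hypothetical.

```python
def bus_transactions(protocol, accesses, other_sharers=False):
    """Count bus transactions one cache issues for a single block.

    protocol: 'MSI' or 'MESI'
    accesses: string of 'r' (read) and 'w' (write) operations, in order
    other_sharers: whether another cache holds the block at the first miss
    """
    state, count = "I", 0
    for op in accesses:
        if state == "I":
            count += 1                       # miss: bus read or read-for-ownership
            if op == "w":
                state = "M"
            elif protocol == "MESI" and not other_sharers:
                state = "E"                  # clean and known to be private
            else:
                state = "S"
        elif state == "S" and op == "w":
            count += 1                       # must broadcast an invalidate first
            state = "M"
        elif state == "E" and op == "w":
            state = "M"                      # silent upgrade, no bus traffic
        # all other hits (reads, or writes in M) stay local
    return count


# A private block that is read once and then written.
print("MSI :", bus_transactions("MSI", "rw"))
print("MESI:", bus_transactions("MESI", "rw"))
```

In this model the read-then-write pattern on a private block costs two bus transactions under MSI and one under MESI. When another cache already shares the block, the two protocols behave the same.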

Slide 20: Review of Simultaneous Multi-Threading (SMT)
- In SMT, the OS or dedicated hardware controls the execution of multiple threads within a single core.
- SMT made better use of the superscalar design: resources that would otherwise sit idle run simultaneously and independently, which raised IPC.
- Advances in thread-level parallelism (TLP) are expected to raise the efficiency of SMT processors further.

Slide 21: From SMT to Multi-Core
- Multi-process (multi-task) systems have traditionally been controlled by the OS, and techniques that prevent processes from fighting over resources were developed there (spin lock, semaphore, CSP).
- The challenge is whether these techniques can be moved down from the OS level to the hardware level, so that many threads are distributed appropriately across the processors of a multi-core and an optimized parallel processing environment is realized.
- Developing this technology of course requires solving the various problems of parallel processing systems, which are old yet ever new.
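Of the techniques named on this slide, the spin lock is the simplest to show. The sketch below builds one on top of a non-blocking `threading.Lock.acquire` call; it is a software illustration of the idea only, not the hardware mechanism the slide asks for, and the names `SpinLock` and `count_with_lock` are hypothetical.

```python
import threading


class SpinLock:
    """Busy-waits until the lock is free (illustration only; wasteful in Python)."""

    def __init__(self):
        self._flag = threading.Lock()

    def __enter__(self):
        while not self._flag.acquire(blocking=False):
            pass                      # spin: keep retrying instead of sleeping
        return self

    def __exit__(self, *exc_info):
        self._flag.release()


def count_with_lock(num_threads=4, iterations=1000):
    """Increment one shared counter from several threads under the spin lock."""
    lock, total = SpinLock(), [0]

    def worker():
        for _ in range(iterations):
            with lock:
                total[0] += 1         # critical section: one thread at a time

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total[0]


print(count_with_lock())
```

A spinning thread burns processor time while it waits, which is acceptable only when critical sections are very short. A semaphore, by contrast, lets the waiting thread sleep.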

Slide 22: The many challenges of parallel programming
From the slides of Y. Lin of Sun Microsystems at Hot Chips 2006, who pointed out various issues in writing multi-threaded parallel programs. The issues discussed are:
- how to find, or create, tasks that can run in parallel;
- how to map tasks onto threads;
- how to achieve scalability.
[Photo: the original slide, in English.]
