Virtualizing a SMP on loosely-coupled computers

Virtualizing a SMP on loosely-coupled computers
Kenji Kaneda

漠然とした問題意識 (1/2) 個人でもCPU負荷の高い仕事はそれなりにある利用可能なCPUを駆使するのが理想ではあるけれど…
例）コンパイル，マルチメディア系アプリ利用可能なCPUを駆使するのが理想ではある遊休なPCは多いはずけれど…

漠然とした問題意識 (2/2) 実際に複数PCを駆使するのは非常に困難一般の人に使われるような分散システムは？
Single Software Imageを提供するソフトウェアも，現実にはあまり使われていない例）Distributed OS, Score, … 一般の人に使われるような分散システムは？

Background Virtual Machine ⇒複数の（遊休）CPUを利用することにより高速化例）VMWare 普及してきた
まだまだ遅い ⇒複数の（遊休）CPUを利用することにより高速化

Distributed Virtual Machine Monitor
Emulates virtual multiprocessor machine on physical loosely-coupled computers Memory Processor Memory Processor Memory Processor virtual physical Memory Processor Memory Processor Memory Processor Network

イメージ図

Virtual Machine Monitor
System Overview (1/3) Process mapping between physical and virtual Guest OS Memory Processor virtual Virtual Machine Monitor physical guest OS VMM physical machine Target Processor: IA32 host OS kernel userland Host OS Host OS Processor Processor Memory Memory Network

System Overview (2/3) gcc gcc Scheduling by guest OS Linux Memory Processor 複数CPUの利用により高速化 Real execution virtual Virtual Machine Monitor physical Windows Windows Processor Processor Memory Memory Network

System Overview (3/3) Target CPU Relaxed ◎ Intel® Itanium®
○ Pentium 4, Intel® XeonTM , P6 △ Pentium® , Intel486TM Relaxed Memory Ordering Tight ntel® Itanium® Processor - Manuals

Implementation Issues
Virtualization E.g.) CPU privileged level, I/O, Device … Universal problem with VM implementation Shared memory emulation Special problem with multi-processor emulation

Shared Memory Emulation
仮想計算機からは，物理計算機のメモリが一つのメモリとしてみえる必要がある

… write x to p … read from p Processor Processor Memory virtual Virtual write Virtual read Virtual Machine Monitor physical Physical memory access Host OS Host OS Communication Processor Processor Memory Memory Network

Rest of This Talk 本当にどれくらい可能か，簡単に考察してみる IA-32 Memory Model
VM Memory Model Naïve Implementation Preliminary Experiments Summary

IA32 Memory Model

IA32 Memory Model Processor-ordered Memory Model
Relaxed memory ordering Instructions for synchronization E.g.) CMPXCHG, FENCE, … キャッシュについて言及する必要がある What is memory ordering model? special instructions for multi-processor memory management  sequential ordering

Processor-ordered Memory Model ~ Basics ~
Read Can be carried out speculatively Write Cannot be carried out speculatively

Processor-ordered Memory Model ~ Store-buffer (1/3) ~
Write can be buffered (i.e., delayed) Read may go ahead of buffered write Write Read Write Processor Processor まだ実行されない Read プロセッサはwriteの完了をまたずに先に進める Store buffer Shared Bus Memory

Store-buffer中のwriteは　ローカルプロセッサのみに反映される Write 1 to p Read from p Read from p Write 1 to p Processor Processor Read from p Read from p 1をread intra-processor snooping Store-bufferにはいっているwriteの結果は，ローカルプロセッサのみ読み込み可能リモートプロセッサにはwrite結果は反映されていない write結果が，ローカルプロセスのみ反映されて，リモートプロセスにはまだ反映されていないローカルプロセス内ではself-consistent can forward bufferd write value to read write x to p read x from p local bypassが起こるリモートプロセッサに対してはwriteのorderのみ保存して，反映される Store buffer 0をread Shared Bus アドレスpの初期値は0 Memory

Writeがstore-bufferからはきだされるのは Store-bufferが溢れたとき or 同期命令を実行するとき（後述） …　

Processor-ordered Memory Model ~ Inter-processor coherence ~
Writes by a single processor are observed in the same order by all processors ローカルプロセッサ内での順番が保存されるそれ以外の順番は，保存されない store-bufferingの説明との整合性が取れていないかもしれないプロセッサ間のwriteはWrites form the individual processors on the system bus are NOT ordered w.r.t. each other

Order of writes from individual processors
write to A write to B write to C write to A write to B write to C 異なるプロセッサ間では，orderが異なる可能性あり異なるプロセッサ間ではorderが異なる可能性あり Order of writes observed by processors processor #3 processor #4 write to A write to B write to C write to A write to B write to C 許可される例と，されない例がほしい同一プロセッサ内のwriteのorderは保存

Processor-ordered Memory Model ~ Unexpected Executions (1/2) ~
read-modify-write instruction Race-condition may occur Need atomic memory access read from p read from p Processor Processor read from p increment write to p read from p increment write to p 同じアドレスに対して同時に実行すると，問題 write to p write to p Shared Bus 0を読み込み 1を書き込み 0を読み込み 1を書き込み Memory アドレスpの初期値は0

Processor-ordered Memory Model ~ Unexpected Executions (2/2) ~
Dekker’s Algorithm Two processors enter the critical section simultaneously Need to serialize memory access flag1 = 1; if (flag2 == 0) critical section read from flag2 write to flag1 flag2 = 1; if (flag1 == 0) critical section write to flag2 read from flag1 Processor Processor E.g.) writeが全てのプロセッサに対して反映されてから，次の命令を実行したい Store buffer Shared Bus Memory flag1，flag2の初期値は0

Instructions for Synchronization
Bus Locking Atomic instructions E.g.) LOCK prefix, exchange instruction Serializing instructions E.g.) FENCE instructions

Bus Locking LOCK instruction prefix Exchange instruction
このprefixのついたread-modify-write命令は，atomicに実行される E.g.) LOCK; CMPXCHG Exchange instruction E.g.) XCHG命令 Exchange Register/Memory with Register ※カーネル内部の同期処理やpthreadなどが利用

Serializing Instructions (1/2)
E.g.) MFENCE命令プログラムの文面上でこの命令より前に位置するメモリアクセス命令が，全プロセッサに反映される store-bufferにたまったwriteが発行される最近のCPUだとmemory ordering がゆるいので，それを強めるために用いる Enforce ordering and cache coherency ensures data is written to memory before subsequent stores プログラム Write MFENCE Read Write結果を全プロセッサに反映してからReadを実行

Serializing Instructions (2/2)
実行例 Dekkerのアルゴリズムが正しく動作する flag1 = 1; MFENCE; if (flag2 == 0) critical section write to flag1 flag2 = 1; MFENCE; if (flag1 == 0) critical section Processor Processor write to flag1 Store buffer Shared Bus flag1の値は0 flag2の値は0 1 Memory

All writes are buffered until a serializing instruction is issued
Remark The following satisfies the specification All writes are buffered until a serializing instruction is issued Processor write … MFENCE Processor write write write Shared Bus Memory

VM Memory Model

Requirements for VM Memory Model
Must satisfy specification of IA-32 To run existing OS/applications without modification Can be emulated efficiently

Memory Model Provided by VM
Store-bufferを無限長とする全てのwrite命令は，serializing instructionが実行されるまでbuffering それ以外の点は実機と同じ 効率的にエミュレーション可能一回writeするたびに，通信したりする必要はない FENCE命令を実行時のみ，他のPCにstore結果が反映される Bus Lockingの動作はIA-32と同様atomicity

Naïve Implementation

Naïve Implementation 以上に述べたメモリモデルを提供するVM 通常の命令は，Native実行
共有メモリのエミュレーションに必要な最低限の命令のみ，エミュレーション実行

システムの構成 Preprocessor Virtual Machine 実行コードを（静的に）変換する
Native実行とエミュレーション実行からなる基本はNative実行 Native実行中にシグナルを受信すると，　　　　　　　　エミュレーション実行に移る

Preprocessor 同期命令の直前（直後）に不正な命令を挿入 ※今現在は，アセンブラを手動で編集
命令の実行直前（直後）にシグナルをとばすため ※今現在は，アセンブラを手動で編集 … mfence … .byte 0x8e, 0xc8 mfence 変換

Virtual Machine (1/2) プログラムからは，0~n番地までが共有メモリとしてみえる ptrace & mmapなどを利用
※ページテーブルや特権命令の仮想化などは未実装

Virtual Machine (2/2) 以下をエミュレーション Store-buffer
Instructions for Synchronization Bus locking Serializing instructions

Emulation of Store-buffer (1/3)
Serializing instruction実行時に以下を行う一つ前のserializing instructionから今までの間のwrite結果を，他のPCに反映メモリ状態のバックアップ Memory MFENCE … write Address Value 1 2 … Address Value 1 2 … Address Value 1 2 … mf1からmf2の間でmodifyされたページごとに， modifiyされる前の状態とmodifyされた後の状態を保存するこれを他のPCに反映させる Diff アドレス1の値1が変更

各ページのアクセス保護を変更することにより実現 mprotectを利用 ※オーソドックスなSoftware DSMの実装手法

3.アクセスされたページを保存 2.シグナルが発生・トラップ Backup MFENCE write to page 1 writeを禁止 Page番号 write 禁止 1 2 … Page番号 write 禁止 1 許可 2 … Page番号 write 許可 1 2 … Page番号 write 禁止 1 2 … 5. diffを作成・他のPCに送信 4.writeを許可ページ番号1の状態 6. writeを再禁止

Emulation of Bus Locking
LOCK; CMPXCHG PC1 PC2 1. lock獲得要求 2. 実行停止 3. ack 4. CMPXCHGを実行 5. lock開放要求 6. 実行再開

Emulation of Serializing Instruction
MFENCE PC1 PC2 1. diffの送信 2. diffをメモリに反映 3. ack

Preliminary Experiments

Preliminary Experiments
目的本当に性能が出るか評価する実験内容：以下を測定同期命令エミュレーションのオーバヘッド全実行命令中の同期命令の割合

Overhead of Instruction Emulation
Repeat MFENCE 10,000 times Time (sec.) 比率 Native 1 Emulation Without memory access 0.84 1325 With memory access 2.57 4052 comm.+ ptrace CPU: Pentium4 Xeon 2.4GHz (Dual processor) CPU: Pentium4 Xeon 2.4GHz (Dual processor) comm. + ptrace + mprotect

Ratio of Sync. Instruction
LOCK prefixつき命令の割合 file & console access pthread OS起動中 ls実行 Linux on Bochs (x86 emulator)

Discussion (1/2) Can achieve high-performance?
Heavily depend on OS/applications Depend on overhead of virtualization ※UML上でシステムコールを頻繁に呼ぶと，数千倍遅くなる My VM 4 Speed up 3 2 VMWare １ # of processors １ 2 3 4

Discussion (2/2) Can work correctly? Bus locking may be unfair
E.g.) 一つのプロセッサがロックし続ける Some applications may assume a memory model differing from IA-32 memory model?

Related Work VMWare ESX Server [Carl A. Waldspurger, OSDI’02]
Target physical machine: SMP Target virtual machine: uni-processor Carl A. Waldspurger

Summary Distributed Virtual Machine
Emulates virtual multiprocessor machine on physical loosely-coupled computers

Future Work Performance Improvement Virtualization
E.g.) Privileged level, inter-processor interrupt Utilization of idle PCs Migration of VM Dynamic addition/deletion of PCs Adaptation to heterogeneous environments Grand challenge?

References (1/4) ~ VMWare ~
Virtualization system including a virtual machine monitor for a computer with a segmented architecture Devine Scott W, Bugnion Edouard, 　　　　　　　　　　　　　　and Rosenblum Mendel US Patent , 1999 Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor Jeremy Sugerman, Ganesh Venkitachalam, 　　　　　　　and Beng-Hong Lim USENIX Annual Technical Conference, 2001

References (2/4) ~ Software DSM ~
TreadMarks: Shared Memory Computing on Networks of Workstations C. Amza, A.L. Cox, S.Dwrakadas, P.kelher, H. Lu, R. Rajamonoy W. yu, and W. Zwaenepoel IEEE Computer Vol.29, No.2, pp , 1996

References (3/4) ~ Processor Specification ~
IA-32 Intel® Architecture Software Developer's Manual A Formal Specification of Intel® Itanium® Processor Family Memory Ordering 2002

References (4/4) ~ Misc. ~ 機械語命令の静的な変換によるオペレーティング・システムのユーザ・プロセスとしての実行
榮樂英樹 and 新城靖 SPA 2003 Shared Memory Consistency Models: A Tutorial Adve, Sarita V, and Kourosh Gharachorloo IEEE Computer Vol.29, No.12, pp , 1996

Appendix Sequential memory ordering Uni-processorと同じordering
All memory access occurs one at a time Read returns the value of the “last” write to the same location

Processor-ordered Memory Model ~ Summary ~
Read can be performed in any order Writes are always performed in program order program order=プロ部グラムの Writes and following reads can be reordered ※抜けている点について考える

Bus Lockingのエミュレーション他のプロセッサの動作を止めてから実行を行う
fenceと同様に命令の直前に不正な命令を挿入することによってトラップ

STORE系命令のエミュレーション書き込みの起こったページを把握しておく mprotectを利用[3] twinの作成

FENCE命令のエミュレーション (2/2) 命令をトラップする
図例を入れる ※とりあえず今は手動で挿入以前のfence命令からのメモリ状態の差分を計算し，他の計算機にその情報を送信するプログラムカウンタを増やして次の命令の実行を再開一つ前のfenceから今までの書き込みを　　　　　リモートメモリに反映反映が完了するまで次の命令は実行しない終了したらtwinなどをクリアする

Order of writes from individual processors
write A.1 write B.1 write C.1 write A.2 write B.2 write C.2 Order of actual writes from all processors to memory write A.1 write B.1 write A.2 write B.2 write C.1 write C.2 許可される例と，されない例がほしい

Virtualizing a SMP on loosely-coupled computers

Similar presentations

Presentation on theme: "Virtualizing a SMP on loosely-coupled computers"— Presentation transcript:

Similar presentations

About project

フィードバック

ログインする

Auth with social network:

Virtualizing a SMP on loosely-coupled computers

Similar presentations

Presentation on theme: "Virtualizing a SMP on loosely-coupled computers"— Presentation transcript:

Similar presentations

About project

フィードバック