Predictable Design of Embedded Systems using Networked Architectures
Henk Corporaal
www.ics.ele.tue.nl/~heco
ASCI Winterschool on Embedded Systems
Rockanje, March 2006

Outline
• Trends and design problems
• Unpredictability
• Platforms
• Predictable design
• Proposed design flow
• Open issues
Note: this lecture is not about a solved problem

ASCI Winterschool 2006 Henk Corporaal (2)

Outline
• Trends and design problems
  - Embedded systems everywhere
  - Design practice
  - Design complexity
  - Memory wall
• Unpredictability
• Platforms
• Predictable design
• Design flow
• Open issues

Embedded systems everywhere
• Convergence of the 3 Cs: computers, communications and consumer electronics
• The computer enters the 3rd phase: computing power - networking - intelligent processing
• The world is 1 network: wherever, whenever, all information and communication available
• We get a smart environment

Design practice: Informal system specification
[Figure labels: system people, tasks, paper spec; hardware people (VHDL, Verilog); software people (C, ASM); integration.]

Design practice
• Y-chart (Gajski-Kuhn)
• The design flow is a path in the Y-chart
• Until RT level the flow is largely manual
[Figure: Y-chart with axes behavioral specification, structure description and physical realization; levels system, algorithm, R/T, logic, circuit.]

Design complexity problem
[Figure: complexity (log scale, 10^1 to 10^3) versus year (4 to 16). Process technology: +58%; HW design productivity: +21%; SW productivity: +8%. The differences are marked as the HW gap and the SW gap.]

Hitting the memory wall
[Figure: performance (log scale, 1 to 1000) versus time, 1980 to 2005 [Patterson]. CPU ("Moore's Law"): µProc 55%/year; DRAM: 7%/year. The processor-memory performance gap grows 50%/year.]

Outline
• Trends and design problems
• Unpredictability
• Platforms
• Predictable design
• Proposed design flow
• Open issues

Unpredictability at all levels
• applications
• architectures
• DSM VLSI design
Uncertainty increases at all levels

Application: Two forms of unpredictability
• Applications can be data dependent
• Applications may have different scenarios
[Figure: video processing graph (Video In1, Video In2, Txt gen, NR, HSRC, VSRC, mix, Peak, Matrix, 100Hz, mem) and a plot of resources versus time.]

In addition: dynamically changing set of applications
• Multi-standard modem operation
• Several applications have to be activated simultaneously
• Too many combinations for an analysis at design time (non-deterministic events)
[Figure [Philips EVP]: compute load (25 to 125) versus time. Phases shown include initial acquisition, UMTS connected, UMTS connected / WLAN acquisition, inter-system handover, and WLAN connected / UMTS monitoring. Tasks shown: SCH search, CPICH search, RAKE chip-rate processing, RAKE sym-rate proc., WLAN acquisition, WLAN receiver.]

Architecture unpredictability
• Local schedulers:
  - OS: task switching, interrupts
  - interconnect: busses, bridges, networks
  - memory controllers: external memory
  - cache strategy, cache pollution
  - e.g. RR, TDMA, FCFS, LRU, EDLF, FIFO, priority, …
• What is the global (end-to-end) behavior, composed of interacting local solutions?
[Figure: CPUs with caches ($), groups of IP blocks on IP interconnects, a memory arbiter and external memory.]

DSM VLSI Unpredictability
• Global wiring delay becomes dominant over gate delay (timing closure)
[Figure: gate delay (ps) and wire delay (ps/mm), 0 to 400, for technologies of 0.5, 0.35, 0.25, 0.18, 0.13 and 0.1 micron.]

DSM VLSI Unpredictability
[Figure: length of the isosynchronous zone as a function of frequency.]
Other DSM problems:
• Clock distribution, skew
• VDD and VSS voltage drop
• Signal integrity, cross-talk
• Variance in process parameters increases

Unpredictability: Design Closure problems
• Design closure = a realization meets all requirements, including functionality, speed, power, area, yield, etc., without design iterations
• Closure problem at all levels
[Figure: application, mapping & scheduling, architecture, placement & routing, FPGA realization / VLSI realization.]

Unpredictability: Design Closure problems
[Figure: computational requirements (0% to 1200%) versus time, annotated "orders of magnitude".]
Mapping with performance guarantees looks impossible!

Solution ingredients:
• Higher abstraction levels
• SW and HW IP reuse / PnP principle
• Standards
• Avoid large design iterations
• Design correct by synthesis
• Avoid worst case resource requirements
How do we achieve all of this?

Outline
• Trends and design problems
• Unpredictability
• Platforms
• Predictable design
• Design flow
• Open issues

What is a platform?
Definition: A platform is a generic, but domain specific, information processing (sub-)system
• Generic means that it is flexible, containing programmable component(s).
• Platforms are meant to quickly realize your next system (in a certain domain).
• Single chip?

Platforms, why?
- Reuse
- Short Time-to-Market
- High Quality
• Flexible and programmable
• Large software component
• Standardization
• Optimized for a specific domain
• And you do not have to solve this design closure problem!

Platforms separate the design communities!
• SDT: system design technology
• PDT: platform design technology
[Figure: design technology stack of applications, platform and enabling technologies, with SDT and PDT.]

Platform examples: Digital camera
Sanyo [Okada99]

TI OMAP
• Up to 192 Mbyte off-chip memory
• 192 Kbyte shared SRAM
• 8 Kbyte data cache (2-way, 512 lines of 16 bytes)
• Write buffer (17 elements)
• 16 Kbyte (2-way)
• 16 Kbyte (2-way)
• 8 Kbyte mem (2x 4K)
• 64 Kbyte dual port (8x 4K x 16b)
• 96 Kbyte single port (12x 4K x 16b)
• 32 Kbyte ROM

SpaceCake (Philips research)
• Homogeneous: set of equal tiles
• Per tile, e.g.:
  - n * MIPS
  - m * TriMedia
  - Accelerators
  - k * L2 cache bank
  - Shared memory
  - Cache coherency
  - Big interconnect switch
• Inter tile:
  - Router
  - Message passing
  - Working on inter-tile cache coherence
[Figure: single tile with a switch and L2 cache memory banks.]

IMAGINE Stream Processor (Stanford)
• IMAGINE = SIMD of VLIWs
• It is controlled by a host processor, which sends it stream instructions (load, store, receive, send, VLIW op, load microcode)

Hybrid FPGAs: Xilinx Virtex 4-Pro
• GHz IO: up to 16 serial transceivers
• PowerPCs
• Memory blocks & multipliers
• Reconfigurable logic blocks
[Figure courtesy of Xilinx (Virtex II Pro): PowerPC cores embedded in reconfigurable logic.]

Fundamental platform design decisions
• Homogeneous versus heterogeneous?
• Bus versus network?
• Shared memory versus message passing?
• QoS support, guarantees built in?
• Generic versus application specific?
• What types of parallelism to support? ILP, DLP, TLP
• Focus on performance, power or cost?
• Memory organisation?
• HW or SW reconfigurable?
And further:
• OS support, middleware?
• Mapping support?

Homogeneous or Heterogeneous
Homogeneous:
• replication effect
• memory dominated anyway
• solve realization issues once and for all
• less flexible

Homogeneous or Heterogeneous
Heterogeneous:
• more flexible
• better fit to application domain
• smaller increments
• no tile reuse

Homogeneous or Heterogeneous
Middle of the road approach:
• Flexible tiles
• Fixed tile structure at top level
[Figure: tiles connected through routers.]

HW or SW reconfigurable?
[Figure: reconfiguration time (1 cycle, context, loopbuffer, reset) versus data path granularity (fine to coarse). FPGA: spatial mapping; VLIW: temporal mapping; subword parallelism.]

Outline
• Trends and design problems
• Unpredictability
• Platforms
• Predictable design
  - Current practice
  - Predictability
  - Architecture consequences
  - Design consequences
• Design flow
• Open issues

How should we design?
• Trajectory from idea to realization
• Decisions based on models
  - Abstract from implementation details (not all known yet)
  - Relatively cheap to create, validate and simulate
• During design time:
  - Generate ideas
  - Construct models
  - Evaluate properties
  - Make design decisions
[Figure: path from idea (concepts, requirements, design problem) to realization, with the label "Steers".]

Current practice
Mapping, easy, but...
• Given:
  - reference C code for the application, e.g. MPEG-4 Motion Estimation
  - platform: SUPERDUPER-LX50
• Task: map the application on the architecture
[Figure: an idea turned into C code such as "a=b*5+d; for (...) {.. }".]
But … wait a moment:
    me@work> CC -o2 mpeg4_me mpeg4_me.c
    Thank you for running SUPERDUPER-LX50 compiler.
    Your program uses 257321886 bytes memory, 78 Watt, 428798765291 clock cycles

Current design process
[Figure: application and constraints go into mapping; then "constraints OK?" with a yes exit and a no loop back.]
• Post analysis: check constraints after mapping
• Simulation based
  - Does it still work for other data?
  - Does it still work when other applications are active?
• Too many iterations
• Easy to program, hard to tune
Can this be improved? E.g. constraints = input

Predictable design
What is it?
• Being able to reason at a high level about a design (in terms of functional and non-functional properties), and
• Being able to realize this design without time-consuming iterations in the design flow (design closure)
How:
• Predictable architecture
  - Making resources predictable
  - Proper modeling of less predictable elements
• Predictable design flow
  - Compositionality
  - Composability
  - Design time analysis
  - Run time analysis

Making architectures predictable
Getting rid of all unpredictable elements
• Caches? No problem, but the WCET estimate may be large and unacceptable!
  - Software controlled
  - Locked cache lines
  - Non-cacheable memory
  - Controlled replacement
• Shared memory
• Communication

Making architectures predictable: NoC
Philips AETHEREAL
• The router provides both guaranteed throughput (GT) and best effort (BE) services to communicate with IPs.
• The combination of GT and BE leads to efficient use of bandwidth and a simple programming model.
[Figure: router network (R) connecting IPs through network interfaces.]

Making the NoC predictable: how to support GT traffic?
• Time wheel concept
• Control traffic injection at the network interface
[Figure: time wheel with slots 1 to 8, advancing with time.]

Making the design flow predictable: Compositionality
• High level design: P(x,y) if [P(a,b),...] !
• Low level design: P(x,y) if [P(a,b),...] ?
[Figure: the same structure with elements a, b, x, y, z shown at both levels.]
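
The time wheel concept for GT traffic can be illustrated with a small slot-table model. This is only a sketch under stated assumptions, not the AETHEREAL implementation: the class TimeWheel, the function simulate, the connection name "video" and the rule that a slot left unused by its GT owner falls to BE traffic are all hypothetical choices made for illustration. It shows how reserving slots at the network interface turns into a guaranteed bandwidth share and a bounded waiting time.

```python
from collections import deque

WHEEL_SIZE = 8  # the time wheel on the slide has 8 slots


class TimeWheel:
    """Hypothetical slot table of one network interface (not the AETHEREAL API)."""

    def __init__(self, size=WHEEL_SIZE):
        self.slots = [None] * size  # None: slot not reserved, usable by BE traffic

    def reserve(self, connection, slot_indices):
        """Reserve slots for a GT connection; refuse double booking."""
        if any(self.slots[i] is not None for i in slot_indices):
            raise ValueError("slot already reserved")
        for i in slot_indices:
            self.slots[i] = connection

    def guaranteed_share(self, connection):
        """Fraction of the link bandwidth guaranteed to the connection."""
        return self.slots.count(connection) / len(self.slots)

    def worst_case_wait(self, connection):
        """Largest distance, in slots, between two consecutive reserved slots."""
        owned = [i for i, c in enumerate(self.slots) if c == connection]
        if not owned:
            return None
        n = len(self.slots)
        following = owned[1:] + owned[:1]  # cyclic successor of each owned slot
        return max((nxt - cur - 1) % n + 1 for cur, nxt in zip(owned, following))


def simulate(wheel, gt_queues, be_queue, cycles):
    """Inject at most one flit per slot: the GT owner first, otherwise BE."""
    log = []
    for t in range(cycles):
        owner = wheel.slots[t % len(wheel.slots)]
        if owner is not None and gt_queues.get(owner):
            log.append((t, "GT", gt_queues[owner].popleft()))
        elif be_queue:
            log.append((t, "BE", be_queue.popleft()))
    return log


wheel = TimeWheel()
wheel.reserve("video", [0, 4])            # hypothetical GT connection
share = wheel.guaranteed_share("video")   # 2 of the 8 slots
wait = wheel.worst_case_wait("video")     # slots between reserved slots
trace = simulate(wheel, {"video": deque(["v0", "v1"])},
                 deque(["b0", "b1", "b2"]), 8)
```

In this sketch the guarantee depends only on the slot table, not on what other traffic does, which is the property that makes GT connections analyzable in isolation.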

Making the design flow predictable: Design time
• Determine upper bounds on time and resources: Pareto curves
• Scenario discovery: separate your application into parts for which the upper bounds are not too far from the worst case
[Figure: frequency (Freq) versus load, split into scenarios Sc1, Sc2 and Sc3.]

What do we want? Design time analysis
Single application
• Reasoning about end-to-end timing constraints (for given resources and quality) = predictability
• Which local arbitration mechanisms are needed?
• How to translate this to the global level?
Example:
• Given: comp. resources, bandwidth, buffer size
• Wanted: throughput, as a Pareto curve
[Figure: actors A1 to A5 mapped on processors P1 to P4; Pareto curve of 1/throughput versus cost (resources) with a point (q1,c1).]

Scenarios: MP3

What do we want? Composability
Multiple applications
• If app. 1 and app. 2 each fit individually, what can be said about the combination?
• Concept of virtual platform
[Figure: actors A1 to A4 on Proc1 and Proc2.]

Predictability: Composability
Can we add Pareto points?
[Figure: quality (Q) versus cost (resources) for application 1 with point (q1,c1) and application 2 with point (q2,c2); is the sum (q1+q2, c1+c2)?]

Problem: Predictable resource utilization?
[Figure: applications A and B, each with three actors labeled 50, go through mapping & scheduling onto processors P1, P2 and P3.]

Problem: Predictable resource utilization?
• Add ordering dependences (edges)
• Only 50% processor utilization!
• Scheduling conflict!
[Figure: schedule of A and B on P1, P2 and P3 over t0 to t3.]

Where is the problem?
• Different throughput is obtained for different orders of the actors
• The number of possibilities for the overall graph increases exponentially with the number of actors and individual graphs
• Very difficult to do a complete analysis to obtain an optimal order
• Hard to model and analyze different arbitration strategies realistically

Problem: Too many possibilities!
[Figure: three graphs A, B and C with numeric labels 1, 3 and 5.]

So, what is Composability?
The degree to which we can analyze the applications in isolation: throughput, latency, resource utilization, deadlock, switching / reconfiguration overhead, etc.
• Design time analysis for the complete system is too expensive and often infeasible
• Each job should be executed as if it had access to its own dedicated resources: virtualization
• Consider applications separately and then reason about the behavior of the overall system

Providing a Bound for Resources
• The arbitration strategy plays an important role in determining the resource requirement
• A naive strategy leads to over-estimation of resources
• A worst-case estimate is not always possible
• Need a predictable arbitration mechanism
  - More 'realistic' worst case bounds
  - Handle dynamism in the system
• An overall quality versus resources Pareto curve is needed

Making the design flow predictable: Run-time aspects
• Scalable applications
• QoS management
[Figure: each application n / scenario m has a local manager; the local managers talk through a QoS protocol to a global manager on the platform.]

Match quality with resources
[Figure: Quality-1 versus computational requirements.]

Outline
• Trends and design problems
• Unpredictability
• Platforms
• Predictable design
• Design flow
• Open issues

Design flow
[Figure: from idea (C, requirements spec) through models (Reactive Process Network, POOSL/SystemC, Kahn Process Network (YAPI), BDF, SDF) and correct-by-synthesis steps to the platform.]

RPN (Reactive Process Networks): events and streaming
Event part (Event_in to Event_out):
• Processing of events
• Finite State Machine
• Controlling host-CPU (e.g. ARM)
• RTOS; hard real-time
• 'Classical' SW complexity
Streaming part (Stream_in to Stream_out):
• Soft real-time
• Compute intensive
• Special hardware
[Figure: the event part and the streaming part exchange mode and status.]

POOSL Modeling Language
• Mathematically defined semantics
• Allows formal analysis of model properties
• Can formally describe:
  - concurrency
  - synchronous communication
  - timing (delay statements)
  - functionality
[Figure: processes P1 and P2, with the statement "delay 1;".]

POOSL: Phases of Model Execution
[Figure: state space along model time, alternating synchronous time passage and asynchronous action execution.]

From Model to Realization
    a()();
    sel
      delay d1; b()();
    or
      c()(); delay d2;
    les;
[Figure: state diagram with states S1 to S6, actions a, b, c and delays d1, d2.]
Possible (timed) execution traces:
    (S1, t1), (S2, t1), (S3, t1+d1), (S5, t1+d1)
    (S1, t1), (S2, t1), (S4, t1+d2), (S6, t1+d2)

    (S1, t1), (S2, t1+wcet(a)), (S3, t1+d1), (S5, t1+d1+wcet(b))
    (S1, t1), (S2, t1+wcet(a)), (S4, t1+wcet(a)+wcet(c)), (S6, t1+d2)

ε-Hypothesis: property preservation
If the time deviation between two timed execution traces is less than ε, then, if one trace satisfies a real-time property, that property, weakened up to ε, is preserved in the second one as well.
[Figure: actions a and b at t1 and t2 on the model time axis, separated by d1, and at t'1 + ε1 and t'2 + ε2 on the physical time axis, with ε1, ε2 < ε.]

Extending SDF
SADF: Scenario Aware Data Flow
• Can deal with dynamism
• Still possible to reason about deadlock, resource utilization, latency and throughput
• Currently implemented in POOSL

SADF example: MPEG-2 Decoder
• Pipelined MPEG-2 decoder for I and P frames
  - VLD and IDCT fire per macro-block
  - MC and RC fire per frame
  - FD (frame detector) models the control part of VLD that determines the frame type
• Image size = 176x144
  - I-frame: 99 macro-blocks, no motion vectors
  - Px-frame: x macro-blocks, motion vectors from VLD to MC, previous frame from RC to MC
  - P0-frame (still video): copy previous frame
• FD model based on occurrence probability of frame types
• Execution time distributions of kernels determined with profiling tool
[Figure: SADF graph with processes FD, VLD, IDCT, MC and RC and channel rates a to e.]

    Rate   I    P0   Px
    a      0    0    1
    b      0    0    x
    c      99   1    x
    d      1    0    1
    e      99   0    x
    with x ∈ {30, 40, 50, 60, 70, 80, 99}

Results for MPEG-2 Decoder
Time unit = 1 kCycle
Accuracy results based on confidence levels of 0.95

    Process   Throughput   Rel. error
    VLD       0.063        ≤ 0.036%
    IDCT      0.063        ≤ 0.036%
    MC        0.00106      ≤ 0.190%
    RC        0.00106      ≤ 0.191%

Latency between successive firings:

    Process   Max.   Average (rel. error)   Variance (rel. error)
    VLD       710    15.99 (≤ 0.031%)       75.38 (≤ 0.18%)
    IDCT      698    15.99 (≤ 0.031%)       56.45 (≤ 4.99%)
    MC        3305   940.3 (≤ 0.017%)       2.4·10^5 (≤ 3.46%)
    RC        2216   940.3 (≤ 0.017%)       1.5·10^5 (≤ 4.99%)

Channel memory between processes (occupancy):

    Channel        Maximum   Time-average (rel. error)   Time-variance (rel. error)
    VLD and IDCT   9         1.910 (≤ 0.064%)            0.528 (≤ 1.99%)
    IDCT and RC    154       60.19 (≤ 0.178%)            671.8 (≤ 4.55%)
    VLD and MC     133       34.73 (≤ 0.517%)            698.4 (≤ 4.39%)
    MC and RC      1         0.577 (≤ 0.561%)            0.244 (≤ 3.27%)

Design flow: Run-time
• Combine Pareto points: exploit Pareto algebra
• QoS management / scalable application

Mapping multiple jobs
Multiple jobs can be active simultaneously. When can a second job start?
• Are the requested resources available?
• If not, can the quality level be lowered?
• If not, can other jobs go for a lower quality?
• If yes, independent from other jobs?
• How to give guarantees?
[Figure: resources (up to 100%) versus time for jobs starting at T0, T1 and T2, with reconfiguration.]

Combining Pareto points
• A new thread frame is coming
• A cycle budget of 20 is available
[Figure: cost versus cycle budget curves for applications 1, 2 and 3; budgets 80 and 100 are marked.]

Combining Pareto points
[Figure: the same curves; application 3 at a cycle budget of 20: feasible, but optimal?]

Combining Pareto points
[Figure: the same curves with a cost increase (1) for an application at budget 80 and a cost decrease (2) for application 3, whose budget goes from 20 to 40. Since (2) > (1), this is a better solution.]

Outline
• Trends and design problems
• Unpredictability
• Platforms
• Predictable design
• Design flow
• Open issues

Open issues
• Gap between specification and architecture modeling
• High level modeling: use of a modeling pattern library
• Incorporate multiple Pareto solutions into DSE: Pareto algebra
• Get synthesis correct
  - for control applications
  - including compute intensive tasks
  - mapping to multi-processor
• Managing QoS
  - Scenario detection, merging, prediction and exploitation
  - Runtime resource manager optimizing overall quality
  - Measuring overall quality

Open issues (cont'd)
• Architecture modeling: how to deal with local memory (scratch pad / cache)
• Modeling scheduling and arbitration: make things composable!
• Definition of NAL (run-time services)
• Automatic partitioning: e.g., the SPRINT tool of IMEC is a good start (C to SystemC)
• VLSI tiling
• … and many more …
  e.g. see: Ogras et al.: Key research problems in NoC Design: A holistic perspective, CODES+ISSS 2005
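
The "Combining Pareto points" slides and the Pareto algebra item under open issues can be made concrete with a small sketch. It assumes each application is described by (cycle budget, cost) points where both values are to be minimized, and that budgets and costs of different applications simply add up; the functions pareto_min, combine and best_within and all numbers are hypothetical and chosen only to mimic the situation on the slides (a total budget of 100, one application at 80, a newcomer that would get 20).

```python
from itertools import product


def pareto_min(points):
    """Keep the (budget, cost) points that no other point dominates."""
    kept = []
    for point in sorted(set(points)):      # ascending budget, then ascending cost
        if not kept or point[1] < kept[-1][1]:
            kept.append(point)             # strictly cheaper than all smaller budgets
    return kept


def combine(curve_a, curve_b):
    """Sum every pair of points of two applications, then drop dominated sums."""
    sums = [(ba + bb, ca + cb)
            for (ba, ca), (bb, cb) in product(curve_a, curve_b)]
    return pareto_min(sums)


def best_within(curve, budget):
    """Cheapest combined point that fits in the available cycle budget."""
    feasible = [p for p in curve if p[0] <= budget]
    return min(feasible, key=lambda p: p[1]) if feasible else None


# Hypothetical Pareto points: (cycle budget, cost)
app1 = [(60, 7), (80, 5)]
app3 = [(20, 10), (40, 4)]

greedy_cost = 5 + 10                   # app1 stays at 80, app3 gets the remaining 20
combined = combine(app1, app3)
best = best_within(combined, 100)      # app1 gives up budget so app3 can take 40
```

With these made-up numbers, lowering the budget of the first application costs 2 but saves 6 for the newcomer, so the combined curve finds a cheaper point than the greedy assignment. The sketch does not record which configuration produced each combined point; a real run-time manager would need to keep that information.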