Skip to content

Latest commit

 

History

History
388 lines (358 loc) · 17.2 KB

cs209.md

File metadata and controls

388 lines (358 loc) · 17.2 KB

$$ \def\empty{\mathcal E} $$

CS206

GNU General Public License v3.0 licensed. Source available on github.com/zifeo/EPFL.

Spring 2015: Architecture on a chip

[TOC]

Computer organization

Inputs and outputs

  • IO interface
    • to processor through memory
    • keyboard rate : 0.01 KB/s
    • loudspeaker rate : 0.60 KB/s
    • display rate : 60 000 KB/s
  • IO adressing modes
    • memory mapped : section of memory reserved for IO devices
      • 00000000 to 7FFFFFFF : $2^{31}$ bytes of memory
      • 80000000 to FFFFFFFF : unuse
      • FFFFFF00 to FFFFFFFF : 256 IO ports
      • sw $a0, 123($a1) # $a1 = 0xFFFFFF00
    • IO mapped : special dedicated instructions (x86, pentium)
      • OUT <port>, <reg>
  • Analogic/Digital converter
    • start (START) : input, when active begins a new conversion
    • data valid (DV) : output, when active, D7-D0 are valid
    • data (D7-D0) : output, last conversion result
  • simple bus interface : 8-bit processor
    • address (A23—A0): output, address bus
    • data (D7—D0): input/output, data bus
    • address strobe (A/S): output, signals the presence of a valid address on the Address bus during a memory access cycle
    • read/write (R/W): output, signal the direction of the data flow
    • data acknowledge (/DTACK): input, must be activated at the end of a memory access, when the written data have been latched or the read data are ready
  • programmed IO : modern peripherals own register to send and receive data (and issue some commands and read the status)
  • peripherals attention
    • IO polling : processor keeps scanning all peripherals for attention (very expensive)
    • IO interrupts : peripherals ask for attention
      • need to know who : either pooling after IREQ or peripheral send id
      • priority issues
        • Daisy chain arbitration : highest priority the closest (can hide lower request) but slow and hard priorities
        • interrupt controller : manage IREQ with priority and history, allow nesting and abstract peripherals id
      • impact on current exception
      • example sequence
        • IREQ
        • IACK : processor ready to deal
        • send id
        • processor transfer control to appropriate exception handler
        • processor revert to interrupted task
  • direct memory access (DMA) : peripheral handle memory directly (R/W) for device such as network or disks
    • all IO go through DMA before going into system bus
    • example sequence :
      • processor tells DMA which device, where to read/write, and bytes size to transfer
      • DMA controller becomes bus master by controlling address and control busses
      • DMA controller sends an interrupt to the processor termination (or errors)
  • buses
    • processory <-> memory : extremely fast, short, no standardization needed
    • backplane (between bus adapter) : high-speed (disk), several centimeters, standard
    • IO : various speed, meters, standardization needed
  • communication
    • synchronous : clock signal, each cycle fixed by protocol, fast but short, need every device to operate at same speed
    • asynchronous : handshaking to syncrhonize, slower but more flexible
  • IO performance metrics : latency (delay between request/response), data throughput (KB/s), transaction throughput (request per unit time)

Exceptions

  • exceptions : control flow changes (not program explicit and need special conditions)
    • IO interrupt
    • arithmetic problems
    • memory protection iolation
    • unsupported/undefined instructions
    • hardware malfunctions
    • power failures
    • tracing instructions (breakpoints)
  • naming : exceptions = general name & interrupts = exceptions generated outside processor
  • 5 dimensions
    • synchronous ? (related to a given instruction)
    • user maskable ? (programmer diable interruption)
    • within or between instruction ? (prevents completion of instruction, e.g. synch)
    • resume or terminate ? (after)
    • user requested or coerced (forced) ?
  • exemples
    • overflow depend on instruction : add, addi raise but not addu, addui (or it can be manage like saturation)
    • undefined instruction : real do not need to implement all instructions (float addf, addd or conversion cvti2f, cvti2d can be emulated)
    • memory protection violation : not everyone should see everything
      • privilege : user mode and OS mode (Kernel, supervisor, executive, etc.)
      • not all instruction avaible
      • switch mode : syscall (for software exception handler) and rfe to returns
  • exception handler
    • convention : located at precise point in memory (MIPS 0x80000080) and very special register
    • type of exception
      • single : one handler that dispatch
      • vector of handler adress : jump to mem[Exception Vector Address + (4 x Exception Number)] + rejump after
      • vector of adress : jump to Exception Vector Address + (32 x Exception Number)
    • e.g. undefined instruction
      • read the cause
      • call a subroutine to emulate
      • faulty instruction address and users registers are pass by parameters
      • if case of any other execption, terminate with Fatal_Error
    • processor task
      • save processor status (PC, configuration, modes, etc.)
      • mask further interrupts (priority that are below)
      • modify privilege level (to highest: part of OS)
      • set processor in default state
      • save information on the reason
      • free up some register (copying them to shadow registers)
      • decide what is next instruction
      • (reverted automatically at end with rfe but further exception have to be unmasked with caution by the handler)
    • initially stack cannot be used (e.g. stackoverflow)
    • exception handler cannot be interrupted
    • system cannot allow no serving for long time (buffer limited)

CPU Interrupts support

  • CPU changes
    • on 32 signal (irq[31..0])
    • register EA : (exception address r29)
    • current instruction : never completed
    • registers
      • ctl0 : status (processor interrupt enable, bit 0 = PIE)
      • ctl1 : estatus (save PIE in case of exception during another, bit 0 = EPIE)
        • ctl3: ienable (indidual interrupt enable, bit 0-31 = IENABLE)
      • ctl4 : ipending (a 1 indicate IRQ is asserted, bit 0-31 = IPENDING)
    • instruction
      • load : rdctl rC, ctlN
      • write : wrctl ctlN, rA
      • eret : jump to ea and restores PIE and EPIE
    • controls FSM
      • must obey IRQ
      • must be Mealy : IPENDING dependence
      • need PC change
      • need one new states : TRAP after having saved PC+4 into EA
      • interrupts can awake processor
  • dispatcher
    • responsible for re-enabling the approprate interrupts before calling handler
				.org 0x80 ; place this table at 0x0000’0080
handlers_vector:
	isr_0: 	ret
				.align 5 ; align to a multiple of 25 = 0x20
	isr_1: ret
				.align 5 ; align to a multiple of 25 = 0x20
				
				.org 4 ; place this code at 0x0000’0004
interrupt_handler:
				addi sp, sp, -120 ; increment the stack pointer
				stw ra, 0 (sp) ; store ra
				stw at, 4 (sp) ; store temp assembler
				stw t0, 8 (sp) ; store *ALL* registers
				stw t1, 12 (sp)

				rdctl t0, ienable ; load the current mask in t0
				stw t0, 112 (sp) ; store the current mask
				stw ea, 116 (sp) ; store the exception address
				rdctl s0, ipending ; s0 = ipending vector
				addi s1, zero, 31 ; s1 = id of served irq

; look for the interrupt source
loop: 			blt s0, zero, goto_routine ; if s0(31)=1 goto_routine
				slli s0, s0, 1 ; shift the ipending vector
				addi s1, s1, -1 ; decrement the id
				bne s0, zero, loop
				br end ; we should not end up here

goto_routine:
				addi t1, zero, -2 ; t1 = 0xFFFF'FFFE
				sll t1, t1, s1 ; disable lower priorities
				and t0, t0, t1 ; t0 = current mask & new mask
				wrctl ienable, t0 ; set ienable with the new mask
; now that the mask is set we can safely enable interrupts
				addi t0, zero, 1
				wrctl status, t0
; compute routine address
				slli t0, s0, 5 ; t0 = 32 * N
				addi t0, t0, handlers_vector ; t0 += handlers address
				callr t0 ; call the routine
				
end:
; after interrupt routine, prepare to return from main handler
				wrctl status, zero ; disable interrupts (critical!)
				ldw t0, 112 (sp) ; restore mask
				wrctl ienable, t0
				ldw ea, 116 (sp) ; restore ea
				addi ea, ea, -4 ; correct ea
; restore ra + user registers from stack
				ldw ra, 0 (sp)
				ldw at, 4 (sp)
				ldw t0, 8 (sp)

				addi sp, sp, 120
				eret ; eret will reactivate interrupts

Increasing performance

Performance & basic pipelining

  • processor frequency
    • better INTEL pentium IV at 3.2GHz or AMD opteron at 2.2GHz ?
  • memory speed & cache efficiency
    • 8M of 4-set associative or 16M of direct mapped ?
    • large single-level cache or large slower L2 and small fast L1 cache ?
  • time : unix command
    • 0.79u : user CPU time (processor spent executing instructions of the program)
    • 0.17s : system CPU time (proccesor spent executing instruction on behalf of the program)
    • 1.20s : elapsed time (time to complete the job)
    • 80% : of elapsed time spent on the job (leftover : system IO, others jobs, etc.)
    • execution time = elasped time on an unloaded system
  • speed
    • CPI (cycles per instruction) = ( execution time / clock period ) / ( total instruction count )
    • IPC (instruction per cycle) = 1 / CPI (normally below unit, unless parallel instructions)
    • performance = 1 / execution time = 1 / ( instruction count * CPI ) = $f_{clock}$ * IPC / instruction count
    • instruction count : depends on compiler (need to be used effectively)
    • CPI : depends on cache performance
    • Amdahl's law
  • pipeline
    • critical path delay : limit the clock period
    • reduce it by adding intermediate registers : divide the clock period per rows
    • latency : time between computation starts and ends
      • original circuit : T
      • pipelined circuit : T / N * N = T
      • practical : $\lambda_{pipe}=N·max _{i=0..N-1}(T_i + T _{FF})= N·T _{CLK,pipe}=N / f _{pipe}&gt;\lambda _{orig}$
    • throughput : number of result available in the unit time
      • original circuit : 1 / T = f
      • pipelined circuit = 1 / ( T / N ) = N / T = N * f
      • practical : $\phi_{pipe}=1 / max _{i=0..N-1}(T_i+T _{FF}) = f _{pipe}$
    • ideally
      • $T_i\approx T_{CLK,comb} / N$ : but circuit split may not be even
      • $T_i\approx T_{CLK,pipe}$ : but flip-flops introduce a delay

Pipelining

  • simple 5-stage MIPS pipeline
    • two memory interface : instruction cache during F and data cache during M
    • nop : no operation
  • dependences : R=read, W=write, A=after
    • RAW : true data dependence
    • WAR, WAW : name dependence
  • data hazards : in case of true dependence
    • adding forwarding path : with MUX
      • E -> E
      • M -> E
      • W -> D : register file forwarding allow read/write on same cycle
        • during W registers are written on first half of cycle
        • during D register are read on second half of cycle
    • adding a nop instruction decided by D (stalling the pipeline)
    • limits usability of pipeline
  • structural hazards : compete for same resource (e.g. a stage, but cannot append as nop delays also previous ops otherwise stalls)
  • control hazards : fetch an instruction before we know which one (e.g. branch)
    • invalidate next instruction and stall the pipeline (nop) until checked (lost 2 cycles)
    • delay slots : modifiy architecture definition by executing them in any case (rare in real)
      • best is to put previous instruction with no dependence on branch in delays slots
    • branch prediction : guess and fetch next or branch destination
      • if correct, no cycle lost
      • if wrong, squash what has been done
      • dynamic predictors : learn from previous executions (e.g. loop) and correct up to 95-99%

Dynamic scheduling

  • dynamic scheduled processor : run program out of order (OOO)
    • continue fetching & decoding even if one cannot be executed
    • keep writebrack waiting if structural hazard
    • split the independent task
      • fetch & decode
      • execute
      • writeback
    • what each unit does each cycle is decided at execution time in hardware : large amout of logic & complex
  • hazards
    • structural (new) : require resource avaible
    • RAW : operands ready to start execution
    • WAR (often cannot occur) & WAW (new) : new data overwrite something still needed (solvable by avoiding the use of the same name, either compiler or processor)
  • reservation station
    • check whether operands are available (RAW)
    • check execution unit is free (structural hazard) then start execution
    • tag operation : all execution units tag operation and their result
    • unavailable operands : identified by name of reservation station in charge of originating instruction
    • implicit register renaming : remove WAR & WAW
    • writeback : either in order or out of order
    • exception : need for reordering at commit (otherwise it is not precise, not acceptable in real system)
  • commit unit : reorder buffer

Superscalar & VLIW

  • superscalar
    • fetching more instruction in a cycle
    • cache need to be non-blocking : cache controller
    • branch prediction : static (never-taken, always-taken-backwards, etc.) or dynamic (history)
    • control speculation : right prediction, good otherwise squash it
  • branch history table
  • VLIW (very long instruction word) processor : alternative idea to superscalar
    • scheduling complexity : of order square in the issue rate (strong limit to ILP instruction level parallelism)
    • statically scheduled WLIW : what each unit does each cycle is decided at compile time
    • compiler technology
    • code bloating : occupy memory space
    • binary incompatibility : no fully satisfactory solution (even if dynamic binary translation is emerging)
    • loop unrolling : can repeat code to optimize the space holes
    • information missing at compile time : forwarding may even hide the memory latency
  • code compression
  • summary
    • dynamically scheduled superscalar processor : Pentium, PowerPC, MIMPS
    • VLIW/EPIC processor : Itanium 2 (emerging alternative)
    • emerging ideas
      • simultaneous multithreading : ILP extracted from several threads at same time
      • dynamic binary translation : emulation by binary translator

IA-32 & IA-64

Multiprocessing

Cache coherence

  • Single/Multiple Instruction/Data Streams (programs)
    • SISD : uniprocessors
    • SIMD : single program run on multiple data sets at once (vector architectures)
    • MIMD : each processor executing its own program with its own datas
    • shared memory multiprocessor with uniform access : limited scalability (4-16 processors ?)
      • easier : leader on the market
      • synchronization complex
    • distributed memory multiprocessor : scalable but communication is complex
      • message passing
      • physically distributed but logically shared (through virtual memory system)
  • coherent memory system
    • preservation of program order
    • coherent view
    • write serialization : all processor see the writes in the same order
    • write-through vs write-back
    • invalidate vs update
  • snoopy : cache cohenrence protocols
    • bus provide serialization
    • each cache snoops all bus transaction : relevant transactions affect data invalidation, update and supply value
    • often have duplicate cache tags
    • simultaneous operation of independ controllers
    • 2-state protocol : hard on bandwidth (every write goes on bus)
    • fast but unnecessary traffic
  • MSI : modified/shared/invalid
    • must invalidate all other copies before entering modifed state
    • requires bus transaction (order and invalidate)
    • 3-state protocol
    • bus read (BusRd) : read without intent to modify, data come from memory or another cache
    • bus read exclusive (BusRdX) : read with intent to modify, invalidate all other caches copies
    • writeback (BusWB) : cache controller puts contents on bus and memory is updated
    • cache to cache transfer : another cache satisfies BusRd or BusRdX
    • single writer, multiple reader protocol
    • if not in another cache : read/write produces 2 bus transactions
  • MESI : modified/exclusive (clean unshared)/shared/invalid
    • require shared signal to detect if other caches have a copy of block
    • cache flush for cache-to-cache transfer
    • 4-state protocol
  • directory-based cache coherence
    • needed in distributed memory architecture
    • keep unique & central track of existing cache copies and state
    • latency issues
    • granularity issues
  • multilevel cache
    • can repliacte snooping at all levels
    • inclusion property
      • content of L(n-1) cache is always a subset of L(n) cache (bus transaction from L(n-1) is also relevant for L(n), snoop at L(n) is enough)
      • if cache line is marked as owned/modified in L(n-1), then it should also be marked in L(n)
      • not always naturally maintained : L2 evict that is also in L1 (violation)
      • essentially : propagte invalidations & evictions up the hierarchy to keep L1 informed

Memory consistency