GNU General Public License v3.0 licensed. Source available on github.com/zifeo/EPFL.
Spring 2015: Architecture on a chip
[TOC]
-
IO interface
- to processor through memory
- keyboard rate : 0.01 KB/s
- loudspeaker rate : 0.60 KB/s
- display rate : 60 000 KB/s
-
IO adressing modes
- memory mapped : section of memory reserved for IO devices
- 00000000 to 7FFFFFFF :
$2^{31}$ bytes of memory - 80000000 to FFFFFFFF : unuse
- FFFFFF00 to FFFFFFFF : 256 IO ports
sw $a0, 123($a1) # $a1 = 0xFFFFFF00
- 00000000 to 7FFFFFFF :
- IO mapped : special dedicated instructions (x86, pentium)
OUT <port>, <reg>
- memory mapped : section of memory reserved for IO devices
-
Analogic/Digital converter
- start (START) : input, when active begins a new conversion
- data valid (DV) : output, when active, D7-D0 are valid
- data (D7-D0) : output, last conversion result
-
simple bus interface : 8-bit processor
- address (A23—A0): output, address bus
- data (D7—D0): input/output, data bus
- address strobe (A/S): output, signals the presence of a valid address on the Address bus during a memory access cycle
- read/write (R/W): output, signal the direction of the data flow
- data acknowledge (/DTACK): input, must be activated at the end of a memory access, when the written data have been latched or the read data are ready
- programmed IO : modern peripherals own register to send and receive data (and issue some commands and read the status)
-
peripherals attention
- IO polling : processor keeps scanning all peripherals for attention (very expensive)
-
IO interrupts : peripherals ask for attention
- need to know who : either pooling after IREQ or peripheral send id
- priority issues
- Daisy chain arbitration : highest priority the closest (can hide lower request) but slow and hard priorities
- interrupt controller : manage IREQ with priority and history, allow nesting and abstract peripherals id
- impact on current exception
- example sequence
- IREQ
- IACK : processor ready to deal
- send id
- processor transfer control to appropriate exception handler
- processor revert to interrupted task
-
direct memory access (DMA) : peripheral handle memory directly (R/W) for device such as network or disks
- all IO go through DMA before going into system bus
- example sequence :
- processor tells DMA which device, where to read/write, and bytes size to transfer
- DMA controller becomes bus master by controlling address and control busses
- DMA controller sends an interrupt to the processor termination (or errors)
-
buses
- processory <-> memory : extremely fast, short, no standardization needed
- backplane (between bus adapter) : high-speed (disk), several centimeters, standard
- IO : various speed, meters, standardization needed
-
communication
- synchronous : clock signal, each cycle fixed by protocol, fast but short, need every device to operate at same speed
- asynchronous : handshaking to syncrhonize, slower but more flexible
- IO performance metrics : latency (delay between request/response), data throughput (KB/s), transaction throughput (request per unit time)
- exceptions : control flow changes (not program explicit and need special conditions)
- IO interrupt
- arithmetic problems
- memory protection iolation
- unsupported/undefined instructions
- hardware malfunctions
- power failures
- tracing instructions (breakpoints)
- naming : exceptions = general name & interrupts = exceptions generated outside processor
- 5 dimensions
- exemples
- overflow depend on instruction : add, addi raise but not addu, addui (or it can be manage like saturation)
- undefined instruction : real do not need to implement all instructions (float addf, addd or conversion cvti2f, cvti2d can be emulated)
- memory protection violation : not everyone should see everything
- privilege : user mode and OS mode (Kernel, supervisor, executive, etc.)
- not all instruction avaible
- switch mode : syscall (for software exception handler) and rfe to returns
- exception handler
- convention : located at precise point in memory (MIPS 0x80000080) and very special register
- type of exception
- single : one handler that dispatch
- vector of handler adress : jump to mem[Exception Vector Address + (4 x Exception Number)] + rejump after
- vector of adress : jump to Exception Vector Address + (32 x Exception Number)
- e.g. undefined instruction
- read the cause
- call a subroutine to emulate
- faulty instruction address and users registers are pass by parameters
- if case of any other execption, terminate with Fatal_Error
- processor task
- save processor status (PC, configuration, modes, etc.)
- mask further interrupts (priority that are below)
- modify privilege level (to highest: part of OS)
- set processor in default state
- save information on the reason
- free up some register (copying them to shadow registers)
- decide what is next instruction
- (reverted automatically at end with rfe but further exception have to be unmasked with caution by the handler)
- initially stack cannot be used (e.g. stackoverflow)
- exception handler cannot be interrupted
- system cannot allow no serving for long time (buffer limited)
- CPU changes
- on 32 signal (irq[31..0])
- register EA : (exception address r29)
- current instruction : never completed
- registers
- ctl0 : status (processor interrupt enable, bit 0 = PIE)
- ctl1 : estatus (save PIE in case of exception during another, bit 0 = EPIE)
- ctl3: ienable (indidual interrupt enable, bit 0-31 = IENABLE)
- ctl4 : ipending (a 1 indicate IRQ is asserted, bit 0-31 = IPENDING)
- instruction
- load : rdctl rC, ctlN
- write : wrctl ctlN, rA
- eret : jump to ea and restores PIE and EPIE
- controls FSM
- dispatcher
- responsible for re-enabling the approprate interrupts before calling handler
.org 0x80 ; place this table at 0x0000’0080
handlers_vector:
isr_0: ret
.align 5 ; align to a multiple of 25 = 0x20
isr_1: ret
.align 5 ; align to a multiple of 25 = 0x20
.org 4 ; place this code at 0x0000’0004
interrupt_handler:
addi sp, sp, -120 ; increment the stack pointer
stw ra, 0 (sp) ; store ra
stw at, 4 (sp) ; store temp assembler
stw t0, 8 (sp) ; store *ALL* registers
stw t1, 12 (sp)
…
rdctl t0, ienable ; load the current mask in t0
stw t0, 112 (sp) ; store the current mask
stw ea, 116 (sp) ; store the exception address
rdctl s0, ipending ; s0 = ipending vector
addi s1, zero, 31 ; s1 = id of served irq
; look for the interrupt source
loop: blt s0, zero, goto_routine ; if s0(31)=1 goto_routine
slli s0, s0, 1 ; shift the ipending vector
addi s1, s1, -1 ; decrement the id
bne s0, zero, loop
br end ; we should not end up here
goto_routine:
addi t1, zero, -2 ; t1 = 0xFFFF'FFFE
sll t1, t1, s1 ; disable lower priorities
and t0, t0, t1 ; t0 = current mask & new mask
wrctl ienable, t0 ; set ienable with the new mask
; now that the mask is set we can safely enable interrupts
addi t0, zero, 1
wrctl status, t0
; compute routine address
slli t0, s0, 5 ; t0 = 32 * N
addi t0, t0, handlers_vector ; t0 += handlers address
callr t0 ; call the routine
end:
; after interrupt routine, prepare to return from main handler
wrctl status, zero ; disable interrupts (critical!)
ldw t0, 112 (sp) ; restore mask
wrctl ienable, t0
ldw ea, 116 (sp) ; restore ea
addi ea, ea, -4 ; correct ea
; restore ra + user registers from stack
ldw ra, 0 (sp)
ldw at, 4 (sp)
ldw t0, 8 (sp)
…
addi sp, sp, 120
eret ; eret will reactivate interrupts
- processor frequency
- better INTEL pentium IV at 3.2GHz or AMD opteron at 2.2GHz ?
- memory speed & cache efficiency
- 8M of 4-set associative or 16M of direct mapped ?
- large single-level cache or large slower L2 and small fast L1 cache ?
- time : unix command
- 0.79u : user CPU time (processor spent executing instructions of the program)
- 0.17s : system CPU time (proccesor spent executing instruction on behalf of the program)
- 1.20s : elapsed time (time to complete the job)
- 80% : of elapsed time spent on the job (leftover : system IO, others jobs, etc.)
- execution time = elasped time on an unloaded system
-
speed
- CPI (cycles per instruction) = ( execution time / clock period ) / ( total instruction count )
- IPC (instruction per cycle) = 1 / CPI (normally below unit, unless parallel instructions)
- performance = 1 / execution time = 1 / ( instruction count * CPI ) =
$f_{clock}$ * IPC / instruction count - instruction count : depends on compiler (need to be used effectively)
- CPI : depends on cache performance
- Amdahl's law
-
pipeline
- critical path delay : limit the clock period
- reduce it by adding intermediate registers : divide the clock period per rows
- latency : time between computation starts and ends
- original circuit : T
- pipelined circuit : T / N * N = T
- practical :
$\lambda_{pipe}=N·max _{i=0..N-1}(T_i + T _{FF})= N·T _{CLK,pipe}=N / f _{pipe}>\lambda _{orig}$
- throughput : number of result available in the unit time
- original circuit : 1 / T = f
- pipelined circuit = 1 / ( T / N ) = N / T = N * f
- practical :
$\phi_{pipe}=1 / max _{i=0..N-1}(T_i+T _{FF}) = f _{pipe}$
- ideally
-
$T_i\approx T_{CLK,comb} / N$ : but circuit split may not be even -
$T_i\approx T_{CLK,pipe}$ : but flip-flops introduce a delay
-
- simple 5-stage MIPS pipeline
- dependences : R=read, W=write, A=after
- RAW : true data dependence
- WAR, WAW : name dependence
- data hazards : in case of true dependence
- adding forwarding path : with MUX
- E -> E
- M -> E
- W -> D : register file forwarding allow read/write on same cycle
- during W registers are written on first half of cycle
- during D register are read on second half of cycle
- adding a nop instruction decided by D (stalling the pipeline)
- limits usability of pipeline
- adding forwarding path : with MUX
- structural hazards : compete for same resource (e.g. a stage, but cannot append as nop delays also previous ops otherwise stalls)
- control hazards : fetch an instruction before we know which one (e.g. branch)
- invalidate next instruction and stall the pipeline (nop) until checked (lost 2 cycles)
- delay slots : modifiy architecture definition by executing them in any case (rare in real)
- best is to put previous instruction with no dependence on branch in delays slots
- branch prediction : guess and fetch next or branch destination
- if correct, no cycle lost
- if wrong, squash what has been done
- dynamic predictors : learn from previous executions (e.g. loop) and correct up to 95-99%
- dynamic scheduled processor : run program out of order (OOO)
- hazards
- structural (new) : require resource avaible
- RAW : operands ready to start execution
- WAR (often cannot occur) & WAW (new) : new data overwrite something still needed (solvable by avoiding the use of the same name, either compiler or processor)
- reservation station
- check whether operands are available (RAW)
- check execution unit is free (structural hazard) then start execution
- tag operation : all execution units tag operation and their result
- unavailable operands : identified by name of reservation station in charge of originating instruction
- implicit register renaming : remove WAR & WAW
- writeback : either in order or out of order
- exception : need for reordering at commit (otherwise it is not precise, not acceptable in real system)
- commit unit : reorder buffer
- superscalar
- branch history table
- VLIW (very long instruction word) processor : alternative idea to superscalar
- scheduling complexity : of order square in the issue rate (strong limit to ILP instruction level parallelism)
- statically scheduled WLIW : what each unit does each cycle is decided at compile time
- compiler technology
- code bloating : occupy memory space
- binary incompatibility : no fully satisfactory solution (even if dynamic binary translation is emerging)
- loop unrolling : can repeat code to optimize the space holes
- information missing at compile time : forwarding may even hide the memory latency
- code compression
- summary
- Single/Multiple Instruction/Data Streams (programs)
- SISD : uniprocessors
- SIMD : single program run on multiple data sets at once (vector architectures)
- MIMD : each processor executing its own program with its own datas
- shared memory multiprocessor with uniform access : limited scalability (4-16 processors ?)
- easier : leader on the market
- synchronization complex
- distributed memory multiprocessor : scalable but communication is complex
- message passing
- physically distributed but logically shared (through virtual memory system)
- coherent memory system
- preservation of program order
- coherent view
- write serialization : all processor see the writes in the same order
- write-through vs write-back
- invalidate vs update
- snoopy : cache cohenrence protocols
- bus provide serialization
- each cache snoops all bus transaction : relevant transactions affect data invalidation, update and supply value
- often have duplicate cache tags
- simultaneous operation of independ controllers
- 2-state protocol : hard on bandwidth (every write goes on bus)
- fast but unnecessary traffic
- MSI : modified/shared/invalid
- must invalidate all other copies before entering modifed state
- requires bus transaction (order and invalidate)
- 3-state protocol
- bus read (BusRd) : read without intent to modify, data come from memory or another cache
- bus read exclusive (BusRdX) : read with intent to modify, invalidate all other caches copies
- writeback (BusWB) : cache controller puts contents on bus and memory is updated
- cache to cache transfer : another cache satisfies BusRd or BusRdX
- single writer, multiple reader protocol
- if not in another cache : read/write produces 2 bus transactions
- MESI : modified/exclusive (clean unshared)/shared/invalid
- require shared signal to detect if other caches have a copy of block
- cache flush for cache-to-cache transfer
- 4-state protocol
- directory-based cache coherence
- needed in distributed memory architecture
- keep unique & central track of existing cache copies and state
- latency issues
- granularity issues
- multilevel cache
- can repliacte snooping at all levels
- inclusion property
- content of L(n-1) cache is always a subset of L(n) cache (bus transaction from L(n-1) is also relevant for L(n), snoop at L(n) is enough)
- if cache line is marked as owned/modified in L(n-1), then it should also be marked in L(n)
- not always naturally maintained : L2 evict that is also in L1 (violation)
- essentially : propagte invalidations & evictions up the hierarchy to keep L1 informed