TODO current landscape
TODO merge in notes from 6.824
TODO beep
TODO csfq core stateless fair queueing
TODO mrd multi radio diversity
TODO srm scalable reliable multicast: rand delay $(C_1 + C_2 r)\,d$, $r$ uniform random, $d$ = est. delay to src
TODO igmp tree-building; slotting/damping; routers can prune its subtree out of the group
TODO crossbar
- LPM
- sched: all are maximal matching
- wavefront
- PIM: rand
- iSLIP: round-robin
TODO wireless mac
- hidden terminal problem: A -> B <- C
- exposed terminal problem: A <- B C -> D
- approaches
- 802.11 CSMA: rand in [0, cw], ack
- reserv: MACA: RTS/CTS; MACAW: adds more features (DS, etc)
- TDMA: when high traffic
TODO i3, doa
TODO etx
- expected # transmissions is $\frac{1}{d_f d_r}$, where $d_f, d_r$ are forward/reverse delivery probs; better than hop count (quick numeric check below)
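a quick numeric check (hypothetical link probabilities) of why ETX beats hop count:

```python
def etx(d_f: float, d_r: float) -> float:
    """Expected transmissions on one link: the data frame must get
    across (prob d_f) and the link-layer ACK must return (prob d_r)."""
    return 1.0 / (d_f * d_r)

# one lossy hop vs. two clean hops: hop count prefers the lossy route
print(etx(0.5, 0.5))                  # 4.0
print(etx(0.9, 0.9) + etx(0.9, 0.9))  # ~2.47
```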
routing
- DSDV: seq no
- DSR: best; "plz tell me route to x"; store in mem; src-route
- AODV: ad-hoc on-demand distance vector; route discovery; uses DV
people
- Bob Metcalfe: inventor of Ethernet
- Van Jacobson: primary contributor to TCP/IP (congestion avoidance/control, header compression, traceroute)
basic performance facts
- latency
- typical backbone takes 60 ms from SF to NYC
- EDGE: 150 ms best, 500 ms real-world avg
- ref: an engineer's guide to bandwidth on ydn
general terms
- anycast: send to 1 of many; useful for HA/load-bal/geoloc; implemented via BGP; used by DNS root servers
subnet notations
- /0 = mask 0.0.0.0 (any address)
- /24 = mask 255.255.255.0
- /32 = mask 255.255.255.255 (specific host)
- 0.0.0.0/0 or 123.123.123.123/0: any address
- 192.168.0.1/24: typical LAN
- 4.3.2.1/32: specific host
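a quick stdlib sanity check of these notations:

```python
import ipaddress

net = ipaddress.ip_network("192.168.0.1/24", strict=False)  # host bits ok
print(net)                                        # 192.168.0.0/24
print(net.netmask)                                # 255.255.255.0
print(ipaddress.ip_address("192.168.0.42") in net)        # True
print(ipaddress.ip_network("0.0.0.0/0").num_addresses)    # 2**32: any address
```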
MACs
- time division multiple access (TDMA)
- frequency division multiple access (FDMA)
- carrier sense multiple access (CSMA): verify channel clear before xmit
- with collision detection (CSMA/CD): terminate xmission as soon as collision is detected, reducing prob of collision on retry
- with collision avoidance (CSMA/CA): eg RTS/CTS
dns
- 13 root NSs (heavily anycast); root zone maintained by ICANN/IANA
- either recursive or non-recursive queries
- protocol: stateless
- udp port 53, or tcp if response > 512B or if doing zone xfer
- sections: header, question, answer, authority, add'l
- question: the only section in queries
- name (no wildcards), type (incl. specials), class
- answers: matching RRs; if any CNAMEs, then target RRs should also be included
- auth: NS RRs pointing to more authoritative NSs; clients should cache this
- add'l: RRs that may be useful
- zone xfers: commonly done by secondary NS updating its DB
- glue records: specify the IP, not just the hostname, of another NS, to avoid circular deps (if under same domain as the one being resolved)
- caching: TTL 0 to 68 yrs; also have negative caching (non-existence)
- some apps (like browsers) maintain own DNS cache, to reduce resolves
- zones: subtrees in domain name tree; don't necessarily extend to leaves
- may delegate authority to another NS for a zone
- an NS may be authority for multiple zones
- other uses
- many hostnames -> 1 IP: virtual hosting
- 1 hostname -> many IPs: fault tolerance, load distribution, geolocation
- MX records: where MTAs should send email for this domain
- mailers try sending to the MX-indicated servers in the order that they're listed
- backup MXs do store-and-fwd (tmp intermediate hold)
- internet-scale cached storage, eg email blacklists, software updates (latest version checks), sender policy framework/domainkeys (use TXT records)
- reverse lookups: maintain a reverse lookup hierarchy under in-addr.arpa using pointer (PTR) records
- eg to find name of 1.2.3.4, lookup 4.3.2.1.in-addr.arpa (note topological addr reversal)
- populated along with forward entries
- impractical since introduction of CIDR; see RFC 2317 (uses CNAMEs)
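a minimal sketch of constructing the reverse-lookup name (stdlib has it built in):

```python
import ipaddress

def reverse_name(ip: str) -> str:
    # octets reversed because DNS names read least- to most-specific
    return ".".join(reversed(ip.split("."))) + ".in-addr.arpa"

print(reverse_name("1.2.3.4"))                           # 4.3.2.1.in-addr.arpa
print(ipaddress.ip_address("1.2.3.4").reverse_pointer)   # same, via stdlib
```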
- resource record (RR): the basic data elt
- contains: NAME (FQDN), TYPE, TTL, CLASS, type-specific data (RDLENGTH, RDATA)
- NAME: may have wildcard
- CLASS: IN for internet, CH for chaos, HS for hesiod
- TYPE
- A (address): translate name to IP
- NS: list FQDNs of (more) authoritative NSs for zones
- MX: specify the mail server
- CNAME: canonical name; specifies a FQDN; no other RRs allowed if this is present; hence root domains (blah.com) can't be CNAMEs since they must also have at least NS, SOA (problem for amazon ELB)
- PTR (pointer): like CNAMEs, but semantically means just a pointer; used for reverse lookups
- SOA (start of authority): start of DNS zone; the first record in a NS for that domain
- RRs of same type define a resource record set; this is what's returned
- RRS order undefined, but servers implement round-robin for load bal
- DNSSEC: works on complete RRSs in canonical order
- domain names have no char restrictions, but hostnames must be LDH (letters, digits, hyphen; ASCII subset)
- internationalized domain names in applications (IDNA): unicode in LDH
- based on punycode
- dynamic DNS: use UPDATE DNS op to add/remove RRs dynamically from a zone DB on an authoritative NS
- 240+ country code TLDs (ccTLDs)
- registration
- domain name registrars: accredited by ICANN; they feed registries
- TLDs maintained by different orgs, operating a registry
- registries, registrars charge annual fees
- WHOIS DB held by many domain registries; provides contact info
- WHOIS: tcp port 43; query/response
- no std for determining authoritative WHOIS server for a domain
- commonly used for TLDs
- whois-servers.net provides DNS CNAMEs for TLD WHOIS servers of the form <TLD>.whois-servers.net
- some TLDs publish a server referral (SRV record) for the WHOIS protocol in their zone, which identifies their WHOIS server, in the format _nicname._tcp.TLD
- COM and NET: thin registry used; domain registry holds basic WHOIS
- verisign's sitefinder directed unregistered domains to verisign site, breaking many non-web apps like email (which relies on non-existence to mark msgs undeliverable)
- general advice
- should not mix CNAMEs and MX, or mail servers will just use the CNAME; http://www.ferris.com/2008/09/08/why-you-shouldnt-mix-cname-and-mx/
SMTP
- cmds
- HELO: greeting
- EHLO: greeting + list server capabilities/extensions
- STARTTLS: upgrade to TLS
- MAIL FROM: set return addr
- RCPT: set dest addr
- DATA: actual hdrs + msg
- notes
- BCCs are just RCPT TOs that aren't in the data
- mail routing (retries, MX fallback) deals w transient failures
- sender policy framework (SPF): use DNS to list IPs allowed to send from a domain (a "senders' MX")
- thus, gmail (which can send from other domains) requires they be included, if this is set
- implemented by spamassassin, courier, exchange, patches for postfix/sendmail/exim/qmail
- may set Received-SPF header
- domainkeys identified mail (DKIM): pub key
- context
- used by gmail, many others
- DomainKeys is different; pushed by yahoo
- comparable to sender id, which is a MS technology
- each SMTP relay signs whole mail and specifies a selector name
- PKs stored in DNS TXT record in SELECTOR._domainkey.example.com
- may set DomainKey-Status header
- DKIM-Signature: h= lists headers included in hash; ordered, so can't reorder headers, despite RFC2822
- c=: canonicalization (simple is strict, relaxed tolerates whitespace changes, header rewrapping, etc)
- author domain signing practices (ADSP)
- optional extension to DKIM
- author address is From/Sender
- author domain signature: DKIM sig by domain that is author's domain
- practices: published in _adsp._domainkey.workping.com TXT
- unknown: same as not defining any record
- all: all mail is signed with author domain sig
- discardable: all, and drop any mail that isn't
- registered with Authentication-Results as dkim-adsp
- results: none, pass, unknown, fail, discard, nxdomain, temperror, permerror
- context
- internet message format RFC5322
- comments are parenthesized strings
- security
- SASL: the usual authentication mechanism
- SMTP port 25: no TLS enforcement
- SMTPS port 465: unofficial & obsolete
- Submission port 587: require STARTTLS
- Authentication-Results RFC5451
- subsumes Received-SPF and DomainKey-Status
- always prepended
- Authentication-Results: AUTHSERV-ID; METHOD=RESULT PTYPE.PROPERTY=PVALUE ...; ...
- DKIM/DomainKeys
- results: none, pass, fail, policy, neutral, temperror, permerror
- DKIM properties: header.d, header.i
- DomainKeys properties: header.d, header.from, header.sender
- SPF/Sender-ID
- results: none, neutral, pass, policy, hardfail, softfail, temperror, permerror
- SPF properties: smtp.mailfrom, smtp.helo
- SMTP AUTH
- results: none, pass, fail, temperror, permerror
- properties: smtp.auth
- examples:
- no auth done:
  Authentication-Results: example.org; none
- could've been combined:
  Authentication-Results: example.com; auth=pass (cram-md5) smtp.auth=[email protected]; spf=pass smtp.mailfrom=example.com
  Authentication-Results: example.com; sender-id=pass header.from=example.com
  Received: ...
- diff MTAs:
  Authentication-Results: example.com; sender-id=hardfail header.from=example.com; dkim=pass (good signature) header.i=[email protected]
  Received: from mail-router.example.com (mail-router.example.com [192.0.2.1]) by auth-checker.example.com (8.11.6/8.11.6) with ESMTP id i7PK0sH7021929; Fri, Feb 15 2002 17:19:22 -0800
  Authentication-Results: example.com; auth=pass (cram-md5) smtp.auth=[email protected]; spf=hardfail smtp.mailfrom=example.com
  Received: from dialup-1-2-3-4.example.net (dialup-1-2-3-4.example.net [192.0.2.200]) by mail-router.example.com (8.11.6/8.11.6) with ESMTP id g1G0r1kA003489; Fri, Feb 15 2002 17:19:07 -0800
SNMP
- agents running on managed devices comm w network mgmt system (NMS) on UDP
- packet types: get requests, set requests, get next requests, get bulk requests, responses, traps (asynchronous notifications), inform requests (trap acks)
zeroconf
- techniques for creating IP networks automatically
- displaces DHCP and DNS services
- link-local addr assignment: standard in IP; addr autoconfig; 169.254.0.0/16
- uses multicast DNS (mDNS) for name resolution, service discovery (printers, shares)
- sent to mcast IP 224.0.0.251 on UDP port 5353
- apple, DNS-SD: TODO
- making IETF std
- look up SERVICEID.SERVICENAME.domain.com
- impls
- bonjour: apple's
- avahi: posix
CIDR TODO
ethernet
- segment: single physical layer
- bridge: connects segments into multi-hop network (LAN), forming a broadcast domain
- maintain forwarding table mapping MAC addrs to ports
- for unknown MAC addrs, flood all outgoing ports; cache incoming packet addrs
- each host has 48-bit MAC address (MAC-48; also used in other PHYs)
- first 3 bytes: organizationally unique identifier (OUI)
- latter 3 bytes: NIC specific
- multiple bridges are arranged into a spanning tree to avoid loops, since ethernet pkts have no TTL
- a single root bridge is elected (lowest bridge ID; admins bias the election via priority)
- ARP (bcasting) is used to resolve IPs to MACs
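a minimal sketch of the learn-and-flood behavior (ports and MACs hypothetical):

```python
class Bridge:
    def __init__(self, ports):
        self.ports = set(ports)
        self.table = {}  # MAC addr -> port

    def receive(self, src_mac, dst_mac, in_port):
        self.table[src_mac] = in_port            # learn where src lives
        if dst_mac in self.table:
            return [self.table[dst_mac]]         # known: forward out one port
        return sorted(self.ports - {in_port})    # unknown: flood all other ports

br = Bridge(ports=[1, 2, 3])
print(br.receive("aa:aa", "bb:bb", in_port=1))   # flood: [2, 3]
print(br.receive("bb:bb", "aa:aa", in_port=2))   # learned: [1]
```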
tcp
- tcp reno: most widely used today
- tcp vegas: newer than reno/newreno
- interacts poorly with reno (gets unfair slice of total bandwidth, detecting more congestion than exists)
- inherits reno's problems dealing with wifi: packet loss isn't necessarily due to congestion, but may be due to channel error (interference, echoes, etc.)
- tcp westwood: measures bw st can maintain high rates when channel err causes pkt loss, but handles congestion-based losses like reno does
- http://blog.case.edu/bes7/2008/07/15/congestion_control_algorithms_for_a_wireless_router
- http://technet.microsoft.com/en-us/library/bb878127.aspx
more tcp
- nagle's: wait for acks, in order to batch sends
- oob: send urgent data out of stream; unused; unreliable; poorly supported
- delayed ACKs for batching; by default (on linux), wait up to 40ms
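both knobs are per-socket; a minimal sketch of turning them off for a latency-sensitive connection:

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# disable Nagle: push small writes out immediately instead of
# batching them while waiting for outstanding ACKs
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# linux-only: TCP_QUICKACK disables delayed ACKs (for the near term)
if hasattr(socket, "TCP_QUICKACK"):
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
```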
IP
- explicit congestion notification (ECN): RFC 3168; set bit in header instead of dropping packet
- only meaningful with active queue mgmt (AQM), of which RED was an early impl
misc
- flow table
- uses: ACL in firewalls, NAT, QoS, ...
nat traversal
- nat types
- direct mapping btwn internal host:port and external host:port / no dependency on remote host/port:
- full cone: once internal addr (iaddr:iport) is mapped to external addr (eaddr:eport), pkts from iaddr:iport go through eaddr:eport, and pkts to eaddr:eport go to iaddr:iport
- (address) restricted cone: accept pkts on any port from a host if already sent pkts to that host on any port
- port-restricted cone: accept pkts on any port if from a host and port if already sent pkts there
- symmetric: sending packets creates distinct external ports for each distinct host:port; does depend on remote host:port, not just internal host:port; impossible to relay this info
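a toy model (hypothetical addresses) of the key difference: cone NATs key the mapping on the internal endpoint only, symmetric NATs also key on the remote endpoint, so a third party (eg a STUN server) can't learn the port some other remote will see:

```python
import itertools

ports = itertools.count(40000)   # external ports handed out in order
cone, symmetric = {}, {}

def cone_map(iaddr, iport):
    # full/restricted cone: one external port per internal endpoint,
    # reused for every remote host
    key = (iaddr, iport)
    if key not in cone:
        cone[key] = next(ports)
    return cone[key]

def symmetric_map(iaddr, iport, raddr, rport):
    # symmetric: a fresh external port per (internal, remote) pair
    key = (iaddr, iport, raddr, rport)
    if key not in symmetric:
        symmetric[key] = next(ports)
    return symmetric[key]

print(cone_map("10.0.0.2", 5000), cone_map("10.0.0.2", 5000))  # 40000 40000
print(symmetric_map("10.0.0.2", 5000, "1.1.1.1", 80),
      symmetric_map("10.0.0.2", 5000, "2.2.2.2", 80))          # 40001 40002
```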
- simple traversal of udp through nat (STUN): relays just the port info but doesn't work for symmetric NAT
- traversal using relay nat/traversal using relays around nat (TURN)
- traverses symmetric NAT but uses intermediate server for all data/bandwidth; written for SIP
- ICE: direct connection with another peer
- written for SIP; tries STUN first, then falls back on TURN
tun/tap
- virtual network devices backed by software; mainly for VPNs
- tun is IP (L3); tap is ethernet (L2); usu want tun, more efficient
- but tap will let remote hosts coexist on same subnet; no routing table changes
vpn
- L2TP: the main use of ipsec (L2TP:ipsec::HTTP:TLS)
- ipsec expects key distribution to have been solved (public keys to have been exchanged already); this is reasonable for vpn's
- vpn creates a new interface backed by the vpn software daemon (which does encryption); the routing table is updated to route through this interface
ipsec
- authenticated header (AH) vs. encapsulating security payload (ESP)
- authentication vs. authentication and/or encryption
- these are both protos above IP alongside TCP/UDP/ICMP/IGMP
- can't nat vs. can nat: AH includes src/dst IP address in HMAC vs. ESP doesn't authenticate src/dst IP
- encapsulation: precede vs. surround
- transport vs. tunneling
- encapsulation: packet body vs. whole IP packet (IP proto = IP(-in-IP))
- endpoints: hosts vs. gateways
- internet key exchange (IKE): establishes shared session secret; uses PKI or pre-shared key; builds upon the ISAKMP/Oakley protocol; basically d-h
- main mode vs. aggressive mode
- 6 packets back and forth to init vs. 3 but with slightly less security
- security parameters index (SPI): an identifying number in each packet; think of it like a randomly chosen port number: it indexes into state about session details, called a security association (SA), stored in a SADB
- nat-t: encapsulates IPsec IP packets; runs on UDP port 4500
- read somewhere (perhaps in NIST doc), unverified: ipsec doesn't use pki/certs, but this was planned
- refs
- implementations
- freeswan: orig ipsec for linux; dead
- openswan: freeswan fork; targets more than linux
- strongswan: freeswan fork; focus on good support for certs, smartcards
- netkey: native linux ipsec; in linux 2.6+; see ipsec-tools aka kame-tools
- windows server
- refs: http://www.jacco2.dds.nl/networking/freeswan-l2tp.html
PPP
- login protocol for dial-up access
- authentication: PAP (cleartext passwords) and CHAP (challenge-response) algos
RADIUS
- runs over UDP; used by VPN servers, AP, NAS, etc.
Wi-Fi
- 802.11-legacy
- 802.11a: 54Mbps, 5GHz band, OFDM (orthogonal frequency division multiplexing), 35m
- 802.11b: 11Mbps, 2.4GHz band, DSSS, 38m
- CSMA/CA (carrier sense multiple access/collision avoidance)
- 802.11g: 54Mbps, same 2.4GHz band as/back-compat w 802.11b, OFDM, 38m
- back-compat w 802.11b but presence of b lowers bandwidth to 11Mbps
- only 3 channels that don't overlap
- 10 beacons per second
- implementations
- ath5k is the newest atheros driver
Cellular
- CDMA vs GSM: two main competing techs
- general packet radio service (GPRS): packet data service for GSM; TODO
- generic terms for $n$-th generation networks:
- 1G: all-analog
- 2G: digital, with small provision for data networking at slower-than-dial-up speeds
- 2.5G: stopgap of 100-200-Kbps data networking
- 3G: data and voice managed separately; 2+-Mbps downstream data
- 4G: all IP w low latency; targeting voice & video
- http://arstechnica.com/telecom/news/2010/03/faster-mobile-broadband-driven-by-congestion-not-speed.ars
- CDMA: a 2G network; launched in 1995
- pioneered/still very controlled by qualcomm
- leader in USA until recent surge by competitor GSM
- used by Verizon, Sprint (eg Virgin Mobile), MetroPCS
- users register their phones; no SIM cards
- CDMA2000: succeeds CDMA
- comprises 1xRTT (2.5G), 1xEVDO (3G), 1xEVDV (3G)
- Enhanced Data-rates for GSM Evolution (EDGE): 2.5G replacement for GSM
- theoretical top speed of 200Kbps
- Evolution Data-Only (EVDO): subset of CDMA2000
- peaks at 2.4Mbps; averages at 300-600Kbps
- used by Sprint, Verizon
- Global System for Mobile Communications (GSM): 2G (9.6Kbps) network
- popular because it's an international standard (eg europe, asia)
- used by ATT, T-Mobile
- has security vulns even when using KASUMI
- users use portable SIM cards; can switch phones wo hassle
- High-Speed (Downlink) Packet Access (HSDPA/HSPA): enhancement for 3G UMTS networks
- used by ATT
- Universal Mobile Telephone Service (UMTS): 3G network that GSM carriers use
- developed by GSM community; one of ITU's family of 3G systems
- peaks at 2Mbps; averages at 300-400Kbps
- W-CDMA
- used by ATT
- WiMax: high-perf Wi-Fi 802.16e open standard; 12Mbps; OFDMA; lost to LTE
- once potential cornerstone of 4G networks
- once backed by sprint/clearwire/intel
- long term evolution (LTE): 4G tech winner; 100/50Mbps; proprietary standard
- successor to HSPA; OFDMA
- http://images.intomobile.com/wp-content/uploads/2010/05/wimax-lte-4g-mobile-broadband-shootout.jpg
- Advanced Wireless Service (AWS): 3G
- used by T-Mobile
- cell tower area: ~10 sq mi
wireless
- WiFi
- 802.11a: 54 Mbps, 20MHz-wide channels
- 802.11b: 11 Mbps, 20MHz-wide channels
- 802.11g: 54 Mbps (20-30 actual), 20MHz-wide channels
- 802.11n: 150 Mbps (75-100 actual) per radio stream (MIMO)
- 2.4GHz or 5GHz, but 2.4GHz is crowded & can't take adv of future improvements
- 20MHz- or 40MHz-wide channels (2x bandwidth)
- 802.11ac: 1 Gbps @5GHz; due 2012
- up to 160MHz-wide channels
- multi user MIMO (MU-MIMO): xmit simultaneous streams to diff users on same channels
- WiGig: 60GHz + WiFi; 7 Gbps
- beyond 10 m, switch to 600-Mbps WiFi, which has 100-m range
DNS
- start of authority (SOA) record: info about a zone
- mname: FQDN of primary NS
- rname: responsible person's email with @ replaced by .
- serial: an incrementing version number (uint32_t) for this record; usually an encoding of timestamp
- refresh: seconds until secondary NS tries to refresh; 1 hr
- retry: seconds beween secondary NS retries if refreshes failed; 15 min
- expire: seconds until secondary NS stops answering queries for this zone; long time
- minimum: seconds that records of this zone are valid; 1 day
- NS record
- A record: main record type
- MX record: mail exchange record
- CNAME record: alias
- ref: http://rscott.org/dns/soa.html
TCP
- segment: a packet (as in max seg size, MSS)
- push: send immediately
- Nagle's algorithm: delay packet sends for batching
- rfc793: TCP
- options: nop, mss
- rfc1323: TCP Extensions for High Performance
- introduces 2 tcp options
- wscale: scaled windows (larger than 2^16)
- timestamps: for estimating RTT
- rfc2018: TCP Selective Acknowledgment Options
- sacks: ack particular bytes; rexmit what's needed instead of entire tail
- rfc2581: TCP Congestion Control
- slow start, congestion avoidance, fast retransmit, fast recovery
- rfc2861: TCP Congestion Window Validation
- problem: after long periods of silence, window may not reflect current network congestion
- decay window cwnd after transition from a sufficiently long application-limited period
- http://www.tcpipguide.com/
- http://www.iana.org/assignments/tcp-parameters/
Congestion Control in Linux TCP
- TODO
network topologies
- clos network
- insight: the number of crosspoints required can be much smaller than if the entire switching system were implemented as one large crossbar switch
- crosspoint count is a reasonable approximate indication of the total cost of the switching system
- 3 stages: ingress, middle, egress; each stage has some crossbar switches
- very-large-scale integration (VLSI): creating integrated circuits by combining thousands of transistor-based circuits into a single chip
- fat tree (charles leiserson): upper links are fatter
- made nonblocking by making upper links fat enough to aggregate lower links
- this is basic def; "A Scalable, Commodity Data Center Network Architecture" uses hypertree version of this
Why can't I innovate in my wiring closet? (Nick McKeown's talk, 4/17)
- worthless talk
- current
- software approach: XORP (and Zebra; or Linux)
- flexible, easy
- limited perf
- point solution, limited port count
- hardware approach: NetFPGA (http://www.netfpga.org/)
- small "router", OSS hardware/software
- line-rate perf
- point solution, limited port count
- first two are aimed more at the backbone
- network researchers tend to focus more on the core
- OpenWRT
- embedded Linux for wireless APs and routers
- aimed more at our more local networks
- most of us interact at the edges
- idea: change the substrate
- IP is the current narrow waist
- virtualization layer would allow us to introduce alternatives to IP
- one way to characterize the GENI approach
- "innovation from the top down"
- see: WUSTL SPP, VINI, ...
- proposed approach: OpenFlow, "innovation from the bottom up"
- complementary, not a replacement for that approach
- flow layer
- deployment
- work with switch and AP vendors to add OpenFlow to products
- no immediate motivation for them, besides better relations with researchers
- deploy on university campuses
- stand back and watch innovations
- desirable features
- isolation: regular production traffic untouched by experimental stuff
- virtualized and programmable: different flows processed in diff ways
- can't add more hardware to the closet, "stuff will catch on fire!"
- equipment we can trust in our wiring closet
- open development env for all researchers (e.g. Linux, Verilog, etc.)
- flexible defs of a flow
- the router talks to a controller PC (when it sees a new flow?)
- flow table entry, type 0
- rule: match on header fields
- action: fwd to port, encap and fwd to controller, drop, send to normal processing pipeline
- stats: # packets, bytes
- type 1 (future)
- more flexible header/matching
- add'l actions: rewrite headers, map to queue/class, encryption
- support multiple controllers: load-balancing and robustness
- status
- Stanford CS building will have this
GENI
Floodless in SEATTLE (princeton)
- status quo
- ethernet
- simple to manage/reconfig: MAC IDs are of flat namespace
- not scalable: relies on flooding/bcasting to learn network info
- fwding table grows linearly with network size
- flooding for basic operation
- flood all ports for unknown MAC addrs
- ARP (bcasting) is used to resolve IPs to MACs
- multiple bridges arranged into spanning trees; 1 path btwn any 2 nodes
- packets "routed" through tree don't necessarily follow shortest path
- also can't choose alt paths for scalability/reliability
- IP
- shortest-path routing; smaller routing tables based on prefixes/subnets
- config overhead due to hierarchical addressing
- must specify subnets on router interfaces
- DHCP servers must be consistent with router subnets
- limited mobile host support (migratable VMs)
- VLANs
- group hosts into single broadcast domain
- defined logically, not based on locations
- hosts can retain IP addrs while moving btwn bridges
- extending across multiple bridges requires trunking (provisioning) at each bridge; deciding which bridges should be in a given VLAN must take into acct traffic & mobility patterns for efficient operation, hence done manually
- limited control-plane scalability: must maintain fwd tables on every host in every VLAN
- insufficient data-plane efficiency: single spanning tree in each VLAN
- SEATTLE
- switch topology is replicated to each switch using link-state protocol
- enables shortest path routing
- sensible: switch topology changes much less frequently than indiv hosts
- 1-hop network-level DHT: IP -> MAC -> physical loc
- modified ARP to query DHT
- routers cache DHT lookups; argues that reactive caching results in smaller tables than proactive distribution
- define "groups" (on DHT) instead of doing network-wide bcasts
- http://everythingisdata.wordpress.com/2009/09/17/floodless-in-seattle/
A Scalable, Commodity Data Center Network Architecture (amin vahdat, ucsd, sigcomm08)
- use an instance of a clos network they call a "fat tree" [not sure why they call it a fat tree, since that already has another meaning]
- straightfwd application of clos network
PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric (amin vahdat, ucsd, sigcomm09)
- problems DCs have: poor scalability, mgmt difficulty, inflexible comm, limited support for VM migration
- assumptions: modern DCs are "fat trees" or multi-rooted hierarchies
- core, agg, and edge switches
- location discovery protocol (LDP): used by switches to determine position in hierarchy and communicate this to the other switches
- propose pseudo MAC (PMAC) addrs [add layer of indirection]
- PMACs are hierarchical and specify locations; reduce table sizes
- translated back to MACs by switches; routing is done by MACs
- VMs keep stable MACs (and IPs)
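a sketch of the hierarchical encoding idea; the 16/8/8/16-bit pod.position.port.vmid layout below follows my reading of the paper, so treat it as illustrative:

```python
def pmac(pod: int, position: int, port: int, vmid: int) -> str:
    """Pack location fields into a 48-bit pseudo MAC:
    16-bit pod, 8-bit position, 8-bit port, 16-bit vmid."""
    val = (pod << 32) | (position << 24) | (port << 16) | vmid
    return ":".join(f"{b:02x}" for b in val.to_bytes(6, "big"))

# hypothetical: VM 1 behind port 2 of the edge switch at position 3 in pod 5
print(pmac(pod=5, position=3, port=2, vmid=1))  # 00:05:03:02:00:01
```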
- centralized fabric mgr to resolve ARP queries, and to simplify multicast/fault tolerance
- switches fwd ARP reqs to FM
- usually can resolve bc PMAC-IP mappings are eagerly sent to FM as PMACs are assigned and as VMs are migrated (new PMAC)
- rarely, can't resolve, so bcast down fat tree and cache result
- fault tolerance simplified
- switches send keepalive (LDP) msg to neighbors every 10 ms
- if no keepalive for 50 ms, assume switch failed, contact FM
- FM updates state, informs affected switches
- switches recompute tables based on new topology
- implemented using OpenFlow
- vs VL2: added LDP for easier mgmt
- [Ethan also uses centralized routing implemented in OpenFlow]
- http://everythingisdata.wordpress.com/2009/09/19/portland-a-scalable-fault-tolerant-layer-2-data-center-network-fabric/
VL2 (msr, sigcomm09)
- focused on high bisection bandwidth and good fairness among flows
- VL2 (virtual layer 2) is an overlay that virtualizes IP addrs
- app-level IP addrs (AAs) allow for flexibility and migration; single flat layer 2 addr space
- loc-specific IP addrs (LAs) are actual locations; standard IP (layer 3) addrs
- central directory translates AAs to LAs
- modify end-host OSs to be clients of directory; no need to change routers though
- OS module also intercepts ARP reqs for LA addrs and converts ARP bcasts into unicast lookups into a centralized directory service (like PortLand/Ethane)
- use clos network topology
- each top-of-rack switch connects to two agg switches
- each agg switch connects to every core switch (called "intermediate" switch)
- use valiant load balancing (VLB) and equal-cost multi-path (ECMP) routing
- valiant load balancing: work from '80
- dst-indep (eg random) traffic spreading across multiple intermediate nodes
- VL2 spreads flows rather than packets; avoids OOO delivery
- VLB+ECMP spreads traffic uniformly across network paths
- VLB uses random intermediate waypoint (switch) to route traffic btwn 2 endpoints (using IP-over-IP)
- ECMP selects amongst multiple equal-cost paths; provides load balancing among the many VLB paths
- [but does this apply to intra-rack flows too? seems silly.]
- all together
- first AA -> LA then encapsulate with an intermediate switch
- found high variance among flows in MS DC traffic properties, so static BW alloc is not promising
- implemented entirely in existing protocols: packet encaps, ECMP, OSPF, etc
- ran on cluster of 3 racks, 80 servers, 10 switches
PortLand vs VL2
- both presented in same session
- both propose locator/ID split
- modify switch SW vs modify end host; both don't modify switch HW
- both leverage directory service for efficient routing w/o bcast
- PortLand is L2 only soln; VL2 is L2/L3
- http://sns.cs.princeton.edu/2009/10/new-datacenter-networks/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:+PrincetonSNS+(Princeton+SNS)&utm_content=Google+Reader
BCube/DCell (MSRA)
- more complex networks than fat trees
- like hypercube networks; orig in Thinking Machines' CM-5 supercomputer
Monsoon TODO
your data center is a router: the case for reconfigurable optical circuit switched paths (rice, cmu, intel pa, hotnets09)
- use reconfigurable optically switched network for bulk xfers
- keep electrically switched network for small flows
flyways to de-congest data center networks (MSR, hotnets09)
- propose hybrid wired/wireless network; wifi is "flyway"
- use wifi to handle transient overages in wire
Secure Internet Routing (Sharon Goldberg, Princeton, MIT special seminar, 2008-02-26)
- work on both the Control Plane (routing protocol layer) and the Data Plane (normal traffic)
- SBGP: announce paths signed with public keys
- on the Control Plane, showed that Secure BGP does not dissuade rational players under the "attraction" incentive model
- key idea (game theory-ish): a player's utility function depends not just on their willingness to send packets to a place, but also how much they want others to send packets through them (how much they want to receive packets; this is the "attraction")
- SBGP does dissuade rational players from lying under the previous "no attraction" incentive model
- how do players know how to manipulate people they're receiving from? they can learn about others' valuation functions out-of-band
- next-hop routing: you can lie in this by generating cycles, taking advantage of players' refusal to route using cycles
- but SBGP prevents cycles
- combine SBGP and next-hop routing: then you can't lie!
- next-hop routing not practical by itself, but useful when combined with filtering; then, however, you can no longer prove anything (simple NHR is necessary)
- on the Data Plane, you can actually detect a lying router
- pair shares a secret key into a hash function $h$
- maintain a sketch: an array of counters $A$
- for each packet $x$, increment $A[h(x)]$
- compare sketches every once in a while; if drop or change, then mismatch (code sketch below)
- ongoing work: need to have keys for each pair of endpoint nodes in a network
- to localize, you need to have key pairs with every node in the system
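a minimal sketch (not from the talk) of that counter-array check; blake2b keying stands in for whatever keyed hash/PRF the real scheme uses:

```python
import hashlib

BUCKETS = 64
SHARED_KEY = b"pairwise-secret"   # assumed pre-shared between the two routers

def bucket(packet: bytes) -> int:
    # keyed hash so a lying router can't predictably craft colliding packets
    h = hashlib.blake2b(packet, key=SHARED_KEY).digest()
    return int.from_bytes(h[:4], "big") % BUCKETS

def sketch(packets):
    A = [0] * BUCKETS
    for x in packets:
        A[bucket(x)] += 1
    return A

sent = [b"pkt%d" % i for i in range(100)]
received = sent[:-3]                      # simulate 3 dropped packets
print(sketch(sent) == sketch(received))   # False: mismatch exposes the drop
```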
DCell: Data Center Networking (reading group reading 2008-07-21)
- properties
- scalability
- fault-tolerance: better than trees
- high network capacity
- related work
- Fat Tree (by CEL, for supercomputers): "scalability of Fat Tree is not comparable to that of DCell"
- Fat Tree: total bandwidth at every level of the tree is the same
Compact Routing
Speeding up Networking (Van Jacobson, Linux.conf.au 2006)
- TODO
- http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2006/01/27#vj_channels
- http://evanjones.ca/researchpapers/network-channels.html
iplane nano (ucsd nsdi09)
- motivation
- eg p2p CDNs are geo dist; clients must pair with best replica
- eg bt, skype, etc
- internet perf neither const nor queriable; currently each app measures independently
- desired sol: predict perf and have shared infrastructure for all apps
- prior work
- network coords: limited to latency, but lightweight
- iplane: rich metrics, but 2GB atlas (large mem footprint)
- iplane nano: 7mb atlas
- overview
- traceroutes from vantage points
- iplane: store all paths; leads to combinatorial explosion
- key idea: replace atlas of paths with atlas of links
- O(n^2) -> O(n)
- clients download atlas and query locally against prediction engine
- routing policy: link atlas throws away too much info
- iplane: 81% correct predicted AS path using 2GB atlas
- strawman link-based: 30%, 5MB; too big a hit
- iplane nano: 70%, 6.6MB
- so far just predicting routes; also predict properties (latency, loss rate, bw)
- predict route, then compute prop based on props of links
- ongoing challenge: measuring link props
- evaluated how much it improves 3 apps (refer to paper for latter 2)
- p2p cdn: choose replica with best perf
- voip: choose detour node to bridge hosts behind nats
- detour routing for reliability: route around failures
- p2p cdn: 199 plab nodes, 10 random akamai nodes per client, 1MB downloaded from 'best' replica
- best to worst: iplane nano, measured latency, vivaldi, OASIS, random
Analysis and Simulation of a Fair Queueing Algorithm
- traditionally, endpoints manage congestion control (eg TCP), while routers use simple FCFS/tail-drop policy
- problems
- malicious endpoints can do whatever they want
- heterogeneous CC techniques may interact in unpredictable/undesirable ways
- high latency for interactive low-BW conns sharing same network as high-BW conns
- FQ: bit-by-bit round robin, but 1 packet at a time
- each user has separate queue
- route packets in order defined by per-packet finishing times
- if packet arrives and full, drop last packet from largest queue
- give slight pref to packets arriving into empty queues
- even without this, FQ has much better promptness
- game theory: encourages good behavior since bad is punished (dropped)
- adds much more state to routers
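a simplified sketch of the finish-time bookkeeping above (real FQ tracks a virtual clock whose rate varies with the number of active flows; here the clock just advances to each served packet's finish time):

```python
import heapq

class FairQueue:
    """Approximate bit-by-bit round robin: serve the packet whose
    flow would finish first under per-bit interleaving."""
    def __init__(self):
        self.vtime = 0.0    # simplified virtual clock
        self.finish = {}    # flow -> finish time of its last queued packet
        self.heap = []      # (finish_time, seq, flow, size)
        self.seq = 0

    def enqueue(self, flow, size):
        start = max(self.vtime, self.finish.get(flow, 0.0))
        f = start + size    # smaller packets finish earlier in the round robin
        self.finish[flow] = f
        heapq.heappush(self.heap, (f, self.seq, flow, size))
        self.seq += 1

    def dequeue(self):
        f, _, flow, size = heapq.heappop(self.heap)
        self.vtime = f      # crude: advance clock to the served finish time
        return flow, size

q = FairQueue()
for _ in range(3):
    q.enqueue("bulk", 1500)   # a greedy high-BW flow
q.enqueue("telnet", 64)       # small interactive packet
print([q.dequeue()[0] for _ in range(4)])  # telnet jumps ahead of bulk backlog
```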
random early detection gateways for congestion avoidance
- 2 schools arguing that the router must play an important role in CC:
- random early detection (RED): take action as soon as approaching congestion is detected
- fair queuing (FQ): take action once queue buffers are full
- detection: calculate avg queue size using low-pass filter with EWMA
- min size and max size separate no/moderate/severe congestion states
- on congestion, mark packet to inform senders to reduce rate
- marking: either dropping packet or setting bit in header
- no congestion: no packets need to be marked
- severe congestion: all packets marked
- moderate: mark with probability that's scaling with queue size
- properties
- responds more promptly than "tail drop" policy, avoiding catastrophic congestion collapse
- fairness: high-BW senders more likely to have packets dropped than low-BW senders
- no bias against bursty traffic, unlike tail drop or random drop
- [this work led to ECN]
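a minimal sketch of RED's marking decision under hypothetical parameters (the real algorithm also spaces marks out using a count since the last mark):

```python
import random

W = 0.002                 # EWMA weight (low-pass filter)
MIN_TH, MAX_TH = 5, 15    # avg queue-size thresholds (packets)
MAX_P = 0.1               # marking probability at MAX_TH

avg = 0.0

def on_packet_arrival(queue_len: int) -> bool:
    """Returns True if the arriving packet should be marked/dropped."""
    global avg
    avg = (1 - W) * avg + W * queue_len     # EWMA of instantaneous queue size
    if avg < MIN_TH:
        return False                        # no congestion: mark nothing
    if avg >= MAX_TH:
        return True                         # severe: mark everything
    # moderate: probability scales with how far avg sits between thresholds
    p = MAX_P * (avg - MIN_TH) / (MAX_TH - MIN_TH)
    return random.random() < p
```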
Congestion Control for High Bandwidth-Delay Product Networks
- XCP: CC from scratch
- TCP: AIMD for both efficiency and fairness
- XCP: separate; MIMD for efficiency, AIMD for fairness
- why explicit congestion signals
- clients can't distinguish losses due to errors vs congestion
- detection requires RTO timeout, duplicate acks, or similar
- packet loss is binary, but notification can carry more info
- comments
- ECN for IP has been around for a long time
- DECNet did something similar long before XCP
- hard to deploy
Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication (CMU)
- TCP Incast problem: catastrophic drop in TCP goodput when:
- client issues req in parallel to multiple servers and waits for all resps
- servers respond at once, overflowing switch buffers on bottleneck link, causing pkt loss
- default lower bound on TCP RTO is typically 200ms, much more than typical RTT on DC nets (<1ms)
- servers that suffered pkt loss must wait for a relatively long time before retransmitting dropped packets
- client is waiting for all to respond, so can't make progress/further utilize network
- causes
- commodity switches have small buffers
- RTO timers on all servers will retry at about same time; RTO grows exponentially
- proposes lowering the TCP retransmission timeout (RTO) to <= 1ms [and not delaying ACKs?]
- simulated next-gen network (10Gbps, 20µs RTT); showed problem still persists at scale (>1024 servers)
- adding randomization to RTO avoids phenomenon, but might be unnecessary outside of sim
- fine-grained RTO safe to use in WANs? yes, little to no impact
- how does fine-grained RTO interact with unmodified TCP stacks that use delayed ACKs?
- doesn't actually wait up to 40ms; no delays when seeing a retransmission
- interacts poorly, but not as bad as incast
- latency more important than ACK traffic in DCs
- http://everythingisdata.wordpress.com/2009/09/25/fine-grained-tcp-retransmissions/
Understanding TCP Incast Throughput Collapse in Datacenter Networks (yanpei chen, berkeley, 2009)
- differences from CMU paper
- while varying number of servers, CMU kept the size of data received by the client fixed, whereas this paper keeps the size of data sent by each server fixed
- delayed acks hurt both workloads bc they overdrive the congestion window (widen it too much) and bc of congestion; no explanation why overdrive
- 200us RTO lower bound led to poor perf for both workloads bc network RTT was 2ms, whereas CMU's was 100us
- should really be using jacobson RTO estimation with the 200us floor: RTO = max(200us, mean RTT + 4 * RTT linear deviation) (sketch below)
- http://everythingisdata.wordpress.com/2009/09/25/understanding-tcp-incast-throughput-collapse/
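a minimal sketch of that estimator (per RFC 6298, gains 1/8 and 1/4; the 200us floor stands in for the suggested minimum; units are seconds):

```python
def make_rto_estimator(min_rto=200e-6, alpha=1/8, beta=1/4):
    srtt, rttvar = None, None
    def update(sample):
        nonlocal srtt, rttvar
        if srtt is None:
            srtt, rttvar = sample, sample / 2       # first measurement
        else:
            # rttvar updated before srtt, per the RFC's ordering
            rttvar = (1 - beta) * rttvar + beta * abs(srtt - sample)
            srtt = (1 - alpha) * srtt + alpha * sample
        return max(min_rto, srtt + 4 * rttvar)
    return update

rto = make_rto_estimator()
for r in (100e-6, 120e-6, 90e-6):   # hypothetical datacenter RTT samples
    print(rto(r))
```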
exor TODO
- greedy algo: node closest to dest fwds packets
- src routing; src knows where all nodes are & sets up batch map ordering
MACAW: a media access protocol for wireless lan's
- media access protocol: link-layer protocol for allocating access to a shared medium (wireless channel) among multiple transmitters
- MACAW: media access protocol for mobile pads that communicate wirelessly with base stations
- each base station can hear transmission from pads within a certain range of its location, called a cell
- challenges
- mobility, but location not communicated explicitly; transmitter simply moves
- connectivity is not transitive: A-B comm, and B-C comm, but A-C don't
- must prevent B-C from disrupting A, even though A doesn't know C
- noise: asymmetric comm (A->B, but not B->A), interfering xmitters (A out of B's range, but A's xmissions might still disrupt B's xmissions)
- MACAW ignores asymmetry and interference; assumes simplified model
- xmission successfully received iff there's only one active xmitter in range of receiver
- no 2 base stations are within range of one another
- goals: high media utilization and fairness
- MACA: earlier protocol to improve on
- before xmitting, send "ready to send" (RTS) containing len of upcoming data
- if receiver hears RTS and is not deferring, reply with "clear to send" (CTS) containing len
- any station that hears a CTS defers xmissions for long enough to allow len reception
- any station that hears a RTS defers xmissions for long enough to allow CTS reception (not len reception, since collisions are at receiver)
- backoff: if no CTS, wait [1..BO] seconds, re-xmit RTS; BO = exponentially growing backoff counter, reduced to 1 on successful RTS-CTS
- MACAW
- less aggressive backoff: orig has large oscillations
- increase BO by 1.5 after timeout, decrease by 1 after success
- communicate BO btwn clients, to prevent client with lower BO starving other clients
- maintain BO per-destination, rather than single counter
- receivers should ACK bc min TCP RTO is long (.5s at the time)
- allow xmitters to avoid contention
- send DS after successful RTS-CTS, before data
- so if a pad hears RTS but not CTS, don't try to xmit during data xfer period
- say A,B in diff cells, want to receive rapidly, but close to ea other [TODO confusing]
- B will be unlikely to send CTS bc it would need to hear RTS exactly right after A finishes receiving a data msg
- fix: if receiver hears RTS while deferring xmissions, after deferral period reply with "ready for RTS" (RRTS)
- sender then resends RTS
- solution is imperfect: if A is receiving data constantly, B won't hear RTS, so won't know to send RRTS
- best-case perf of MACAW < MACA bc of add'l ACK/DS
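a toy comparison (my sketch, not the paper's code) of MACAW's MILD backoff vs. MACA's binary exponential backoff:

```python
BO_MIN, BO_MAX = 2, 64

def mild(bo: int, success: bool) -> int:
    # MACAW: multiplicative increase, linear decrease -> gentler oscillations
    return max(BO_MIN, bo - 1) if success else min(BO_MAX, int(bo * 1.5))

def bexp(bo: int, success: bool) -> int:
    # MACA: double on timeout, reset to minimum on success
    return BO_MIN if success else min(BO_MAX, bo * 2)

bo = BO_MIN
for outcome in (False, False, False, True, True):
    bo = mild(bo, outcome)
    print(bo)   # 3, 4, 6, 5, 4: backs off and recovers smoothly
```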
NetMedic: Detailed Diagnosis in Enterprise Networks (sigcomm 2009)
- http://everythingisdata.wordpress.com/2009/09/17/detailed-diagnose-in-enterprise-networks/
- pinpointing network problems
- model network as set of components (machines, processes, network paths, etc.)
- each component has set of vars defining current state
- and a set of unidirectional dependencies on other components
- auto construct dependency graph
- record state over time; infer correlations between states of 2 components
- eg when C1_x abnormal, C2_y abnormal also
- label edges with correlation strength, estimating likelihood that one component influences the state of another component
- infer which component is causing the undesirable behavior/failure
Network Protocols and Infrastructures to Hide Latency for Cloud-Based Applications (MSR talk 2009)
- xbox live primetime (XBL) service needs reliable and low-latency comm btwn xbox and DC
- all comm is based on TCP
- 1-way msg latency (ms):
- RTT 200: avg 144, 90% 203, 99% 703
- RTT 300: avg 208, 90% 305, 99% 860, 99.9% 1360, 99.99% 2321
- forward error correction (FEC) widely used as E2E tech
- real-time comm: degradable; video, voip
- bulk data xfer: throughput; reliable multicast, digital fountains, network coding
- short xfer: not degradable, latency; balakrishnan '07 '08; pangolin addresses this
- propose pangolin protocol: FEC with rexmits
- 1 msg -> k packets -> n packets (FEC); ACK received pkts; RTO
- key for rest of talk: how to choose the rate?
- TCP vs pangolin: 300ms RTT, 2% bidir loss, n=5 k=4
- TCP, pangolin same up to 90th percentile; TCP latency degrades considerably past that
- pangolin: 90% 313 99% 313 99.9% 625
- TCP: 90% 313 99% 1734 99.9% 3656
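a toy version (my sketch, not pangolin itself) of the n=5, k=4 idea: one XOR parity packet recovers any single loss without waiting for a retransmission:

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

data = [b"pkt0", b"pkt1", b"pkt2", b"pkt3"]   # k = 4 data packets
parity = reduce(xor, data)                    # the n-k = 1 repair packet

# lose packet 2 in transit; XOR of everything that did arrive recovers it
received = [data[0], data[1], data[3], parity]
print(reduce(xor, received))                  # b'pkt2'
```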
VanGo
- linear chain of filters (nesc modules)
- configuration says where to partition
- built on SNACK (Sensor Network Application Construction Kit)
Lewis Girod, Sevan, Ryan
- using wavescope to do audio processing
- specifically, separate signal from noise, probably using clustering of spectra (using euclidean distance of their frequencies)