wiki:NetworkCore/TrafficLeaks

Version 1 (modified by chaos@bbn.com, 11 years ago)

This page is a view into service checks running at the GPO lab which are designed to detect traffic leaks and broadcast storms in the GENI OpenFlow core.

What we detect

The core consists of two VLANs, 3715 and 3716, which traverse the NLR and Internet2 backbones as shown in NetworkCore. Traffic flow for these VLANs is controlled by OpenFlow, and traffic should never flow freely between the two VLANs. However, it is possible for a misconfiguration in a switch, in a flowvisor, or in an opted-in experiment, to allow traffic to flow between the VLANs, typically by bridging the VLANs entirely.

A single uncontrolled cross-connect between the two VLANs may be harmful to experimenters, since it makes the topology differ from the topology around which they may have planned their experiments. If two (or more) cross-connects occur at the same time, broadcast floods can arise in the core.

At GPO lab, we try to detect both situations using the checks shown on this page: any leaking traffic between the VLANs means the topology should be investigated, while excessive broadcasts are a symptom that other operators and users in the core may also be seeing broadcast storms.

Detection of traffic leaks

Two IP subnets, 10.37.15.0/24 and 10.37.16.0/24, are reserved for monitoring overall reachability between core hosts on vlan 3715 and vlan 3716 respectively, using fixed hosts at several campuses, including GPO's. The detection mechanism is simple: an interface on vlan 3715 should never see IP traffic sourced from a 10.37.16.0/24 address, and vice versa.

The graphs in this section show such "stray" packets, as seen by GPO's two connectivity test hosts. We believe that any non-zero value in these graphs indicates a probable traffic leak or other misconfiguration which should be investigated.
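
The rule above can be sketched as a pair of shell functions. This is only an illustration under stated assumptions, not the check that actually feeds the graphs: the function names `stray_filter` and `count_strays` are hypothetical, and the sketch assumes a Linux host with `tcpdump` and the coreutils `timeout` command, run with capture privileges.

```shell
# Print the tcpdump filter that matches "stray" sources for a core VLAN.
# Assumption: 10.37.15.0/24 belongs to vlan 3715 and 10.37.16.0/24 to vlan 3716.
stray_filter() {
    case "$1" in
        3715) echo "src net 10.37.16.0 mask 255.255.255.0" ;;
        3716) echo "src net 10.37.15.0 mask 255.255.255.0" ;;
        *)    echo "unknown core VLAN: $1" >&2; return 1 ;;
    esac
}

# Count stray packets seen on an interface during a fixed window.
# Arguments: interface name, VLAN id, window length in seconds.
# (Hypothetical helper; needs root or equivalent capture privileges.)
count_strays() {
    filter=$(stray_filter "$2") || return 1
    # -n: no name resolution; -l: line-buffered, one output line per packet.
    # tcpdump's banner and summary go to stderr, so only packets are counted.
    timeout "$3" tcpdump -i "$1" -n -l "$filter" 2>/dev/null | wc -l
}
```

For example, `count_strays eth1.3715 3715 60` would report how many packets from 10.37.16.0/24 arrived on the (assumed) interface `eth1.3715` in one minute; anything other than 0 corresponds to a non-zero point on the graphs.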

Metric | Past Hour | Past Week
argos 3715 (NLR CHIC) (Nagios) | http://monitor.gpolab.bbn.com/ganglia/graph.php | http://monitor.gpolab.bbn.com/ganglia/graph.php
argos 3716 (NLR CHIC) (Nagios) | http://monitor.gpolab.bbn.com/ganglia/graph.php | http://monitor.gpolab.bbn.com/ganglia/graph.php
iolkos 3715 (I2 NEWY) (Nagios) | http://monitor.gpolab.bbn.com/ganglia/graph.php | http://monitor.gpolab.bbn.com/ganglia/graph.php
iolkos 3716 (I2 NEWY) (Nagios) | http://monitor.gpolab.bbn.com/ganglia/graph.php | http://monitor.gpolab.bbn.com/ganglia/graph.php

Detection of broadcast storms

We look for broadcast storms by counting broadcast packets received on uplink interfaces of our lab switches, and comparing them to an expected rate. In general, we see a very low rate of broadcast traffic on these interfaces: fewer than 10 broadcasts/sec seems to be normal.
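
As a rough illustration of the comparison described above, the sketch below turns two samples of a cumulative broadcast counter into a rate and applies a threshold. How the GPO checks actually collect the counter is not stated on this page; one plausible source would be the IF-MIB `ifInBroadcastPkts` counter polled over SNMP, but that is an assumption. The function names are hypothetical, and the default threshold of 10/sec is simply the "normal" ceiling quoted above.

```shell
# Broadcast rate in packets/sec from two samples of a cumulative counter.
# Arguments: previous count, current count, seconds between the samples.
# Integer division truncates. Counter wrap or reset is not handled:
# a negative delta is reported as an error.
bcast_rate() {
    delta=$(( $2 - $1 ))
    if [ "$3" -le 0 ] || [ "$delta" -lt 0 ]; then
        echo "invalid samples" >&2
        return 1
    fi
    echo $(( delta / $3 ))
}

# Nagios-style result: exit 0 (OK) below the threshold, 2 (CRITICAL) at or
# above it, 3 (UNKNOWN) if the samples are unusable.
# Optional 4th argument overrides the assumed default threshold of 10/sec.
check_bcast_rate() {
    rate=$(bcast_rate "$1" "$2" "$3") || return 3
    threshold=${4:-10}
    if [ "$rate" -ge "$threshold" ]; then
        echo "CRITICAL: $rate broadcasts/sec (threshold $threshold)"
        return 2
    fi
    echo "OK: $rate broadcasts/sec (threshold $threshold)"
}
```

With samples of 1000 and 1300 taken 60 seconds apart, `check_bcast_rate 1000 1300 60` reports 5 broadcasts/sec and exits 0.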

Metric | Past Hour | Past Week
poblano[gi0/1] (dataplane: to NLR and I2 via NoX) (Nagios) | http://monitor.gpolab.bbn.com/ganglia/graph.php | http://monitor.gpolab.bbn.com/ganglia/graph.php
habanero[gi47] (dataplane: to poblano) (Nagios) | http://monitor.gpolab.bbn.com/ganglia/graph.php | http://monitor.gpolab.bbn.com/ganglia/graph.php
jalapeno[gi0/48] (control: to BBN external) (Nagios) | http://monitor.gpolab.bbn.com/ganglia/graph.php | http://monitor.gpolab.bbn.com/ganglia/graph.php

What we do when we see stray packets

In order to find a traffic leak, we need to know:

  1. Is the problem a misconfigured host, or a network which is forwarding packets it should not be forwarding?
  2. At what campus or network location is the problem originating?

For simplicity, this example assumes that core_vlan_3715_stray_packets is alerting: this means that interfaces on vlan 3715 are erroneously seeing traffic sourced from 10.37.16.0/24. If only vlan 3716 is alerting, switch the vlan and subnet in use for testing. (In the common case in which the vlans are fully bridged, both vlans will alert, so you can test using either one.)

Step 1: look at the stray traffic which is causing the alarm

The stray packet counters only indicate that there is a problem; they don't provide any source or location information. An easy first step is to run the same kind of tcpdump capture by hand and actually look at the packets. On a monitoring host with an interface on the appropriate VLAN:

iface=eth1.3715       # interface on vlan 3715 (example name; use your host's)
straynet=10.37.16.0   # the unexpected IP subnet
sudo tcpdump -i "$iface" -n src net "$straynet" mask 255.255.255.0

This output can tell us whether the problem is more likely to be a misconfigured host or a bridge between the networks. Ping testing is done from a handful of IPs, including (as of July 2011) 10.37.X.3, 10.37.X.100, 10.37.X.102 (for each of X=15 and X=16). If the only detected traffic is ARP requests and replies involving these hosts, e.g.:

16:48:03.920495 arp who-has 10.37.16.100 tell 10.37.16.3
16:48:03.920511 arp reply 10.37.16.100 is-at 00:26:b9:7e:6c:c8
16:50:02.335429 arp who-has 10.37.16.145 tell 10.37.16.102
16:50:02.785954 arp who-has 10.37.16.100 tell 10.37.16.102
16:50:02.785973 arp reply 10.37.16.100 is-at 00:26:b9:7e:6c:c8
16:50:04.599972 arp who-has 10.37.16.90 tell 10.37.16.102
16:50:05.693130 arp who-has 10.37.16.90 tell 10.37.16.102

then the networks are probably being bridged. If non-ARP traffic is being sent on the crossed subnet, then something else is going on. If all of the stray traffic is being sent by a small number of IPs, especially if these IPs are unknown to the DNS server at ns.gpolab.bbn.com (which contains PTR records for known IPs in shared 10.0.0.0/8 space), then the problem may be a misconfigured host interface rather than a misconfigured network.
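
To apply the reasoning above to a longer capture, the output can be summarized instead of read line by line. The sketch below is one possible way to do that; `summarize_strays` is a hypothetical helper, and it assumes default tcpdump text output (no `-e`), in either the older `arp who-has ...` form shown above or the newer `ARP, Request who-has ...` form.

```shell
# Read tcpdump text output on stdin and print:
#   a summary line "arp=<count> non_arp=<count>"
#   then one line "<source IP> <packet count>" per sender (unordered)
summarize_strays() {
    awk '
        {
            proto = tolower($2); sub(/,$/, "", proto)
            src = ""
            if (proto == "arp") {
                arp++
                # Sender follows "tell" in a request and "reply" in a reply.
                for (i = 3; i < NF; i++) {
                    word = tolower($i)
                    if (word == "tell" || word == "reply") { src = $(i + 1); break }
                }
                sub(/,$/, "", src)
            } else if (proto == "ip") {
                other++
                # Keep the four address octets, dropping any trailing port.
                split($3, p, ".")
                src = p[1] "." p[2] "." p[3] "." p[4]
            } else {
                other++
            }
            if (src != "") seen[src]++
        }
        END {
            printf "arp=%d non_arp=%d\n", arp, other
            for (s in seen) printf "%s %d\n", s, seen[s]
        }
    '
}
```

A result with `non_arp=0` and only the known pingtest addresses as senders is consistent with bridged networks. An unfamiliar sender can be checked against the PTR records mentioned above with, for example, `dig +short -x 10.37.16.90 @ns.gpolab.bbn.com`.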

Step 2: generate a consistent stream of test packets

The traffic which sets off the stray packet alert is typically intermittent (e.g. the test pings are run only once a minute, so leaking ping traffic can only be seen once a minute). Also, since many of the pingtest interfaces use VLAN subinterfacing, the MACs are not unique per-VLAN, so it can be confusing to try to locate a leak within the core by looking for pingtest interface MACs.

When debugging a problem, it is useful to start up a stream of broadcast packets with a unique source MAC that should be found on only one VLAN. Find or create a host interface which is on only vlan 3716 (the network from which the strays are originating). From that host, do:

iface=eth1            # interface on vlan 3716 only (example name; use your host's)
sudo arping -I "$iface" 10.37.16.1

There should be no host at 10.37.16.1, so the ARP requests will get no response, and the host will continue to send broadcast ARP packets indefinitely.

Using the test host from step 1, verify that these new packets are erroneously visible on that host's vlan 3715 interface. If so, you now have an example of a MAC address whose packets are being leaked from 3716 to 3715 at some point in the core.
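
One way to make this verification less error-prone is to filter on the sending interface's MAC rather than scanning all stray traffic. The following is a sketch under assumptions: the helper names are hypothetical, the hosts are Linux (for the `/sys/class/net` path), and `timeout` from coreutils is available.

```shell
# Build a tcpdump filter matching ARP frames from one source MAC.
mac_filter() {
    echo "arp and ether src $1"
}

# On the sending host (Linux): print the MAC of the arping interface.
iface_mac() {
    cat "/sys/class/net/$1/address"
}

# On the step 1 test host: succeed if at least one matching frame arrives
# on the given interface within the timeout.
# Arguments: interface on vlan 3715, source MAC, timeout in seconds.
# (Needs root or equivalent capture privileges.)
mac_is_leaking() {
    # -e prints link-level headers; -c 1 stops after the first match.
    seen=$(timeout "$3" tcpdump -i "$1" -n -e -l -c 1 "$(mac_filter "$2")" 2>/dev/null | wc -l)
    [ "$seen" -ge 1 ]
}
```

For instance, after reading the MAC with `iface_mac eth1` on the sender, running `mac_is_leaking eth1.3715 <that MAC> 30` on the test host would succeed only if the arping traffic crossed from 3716 to 3715 within 30 seconds. The interface names here are examples.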

Step 3: track down the source campus in the core

Now get help from GMOC to find the point at which the traffic is being leaked. A GMOC engineer can look at the MAC address tables of the core switches and determine at which interface the packets enter the core on vlan 3715. That interface should correspond to the campus or regional network at which the leak is occurring. This does not explain what the problem is, but it may narrow the search enough to determine what changed or to make an informed guess about the likely issue.