[[PageOutline(1-2)]]

= OPS-003-A: GENI Dataplane Debugging =

This procedure introduces a methodology for detecting and debugging dataplane issues in GENI. A GENI dataplane failure may be reported by the [http://genimon.uky.edu/ GENI Monitoring System], by [http://tamassos.gpolab.bbn.com/nagios3/ GPO Nagios], or by a GENI experimenter.

Regardless of the source of the reported event, a ticket must be opened to handle the investigation and resolution of the problem. The ticket must copy the issue reporter and the GENI experimenters at geni-users@googlegroups.com.

= 1. Issue Reported =

GMOC gathers technical details for failures, including:
 - Requester Organization
 - Requester Name
 - Requester email
 - Requester GENI site-name
 - Slice name and any available site sliver details
 - Problem symptoms and impact

== 1.1 GENI Event Type Prioritization ==

GMOC should classify a dataplane failure as `High` priority. The issue may be deemed `Critical` if the person reporting it identifies it as such, for example when it impacts a demo, a training session, or a conference.

== 1.2 Create Ticket ==

The GMOC ticketing system is used to capture issue information; GMOC may follow up to request additional information as the problem is investigated. Ticket creation generates an email notification to the reporter, and subsequent updates and interactions between GMOC and the reporter also generate notifications.

= 2. Investigate and Identify Response =

== 2.1 Investigate the Problem ==

=== 2.1.1 Nagios ===
 * Nagios is currently used as an alerting system for monitoring data collected by the [http://genimon.uky.edu/ GENI Monitoring System]. To access the relevant alerts, click on "Services" in the [http://tamassos.gpolab.bbn.com/nagios3/ Nagios interface]. Each service name corresponds to an active stream of pings between two GENI endpoints.
 * Grab the service name and use the GENI Monitoring System instructions below to obtain detailed statistics (a command-line alternative is sketched below).
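
If you prefer the command line, the same service list can be pulled from the Nagios status CGI. This is a minimal sketch, assuming the default nagios3 CGI layout and an account with read access; `user:pass` and `<service-name>` are placeholders.
{{{
# Fetch the full service status page from the Nagios status CGI
# (default nagios3 layout assumed; credentials are placeholders)
curl -s -u user:pass 'http://tamassos.gpolab.bbn.com/nagios3/cgi-bin/status.cgi?host=all' | grep -i '<service-name>'
}}}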

=== 2.1.2 GENI Monitoring System ===
 * From the GENI Monitoring System dashboard, select "Checks" (which, for our dataplane detection and debugging purposes, is equivalent to Nagios Services).
 * From the list of "Checks" shown, select or search for the check of interest that you obtained from Nagios.
 * Scroll to the "Information" section to obtain detailed statistics for your particular check. This information serves as a guide for the debugging methodology in "2.2 Identify Potential Response" below.

== 2.2 Identify Potential Response ==

This section outlines the debugging methodology used to determine the source of, and the potential response to, a dataplane problem:

=== 2.2.1 Start ===

 * Verify that both endpoints in the test are up and working as expected (a sketch of basic checks follows this list).
 * Replicate the test setup as closely as possible in another slice. This can help identify whether the problem is slice-specific or a general substrate problem. You can skip this step if you know for sure that the problem is not slice-specific.
 * If you believe the problem is substrate-specific:
  * You will need to identify where the problem is and which aggregates are affected. Once you know this, contact any aggregate or network operators (this might include a rack team and a site operational contact) as well as the GMOC. You will want the operators to help fix the problem, and you will want the GMOC to track the problem in a ticket and announce it to the operator and user community. Continue reading for information on how to track down where the problem is.
  * If the failing connectivity is an !OpenFlow connection, go [#OpenFlow here].
  * If the failing connectivity is a non-!OpenFlow (stitched or static) connection, go [#Non-OpenFlow here].
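
As a starting point, the endpoint checks above might look like the following on Linux hosts. This is a minimal sketch: `eth1` and the `10.10.1.0/24` addresses are hypothetical dataplane values; substitute the interface and addresses from your slice's manifest.
{{{
# On each endpoint: confirm the dataplane interface is up and addressed
# (eth1 and 10.10.1.2 are hypothetical slice values)
ip addr show eth1

# From endpoint A, probe endpoint B across the dataplane
ping -c 5 -I eth1 10.10.1.2

# On endpoint B, watch for the probes arriving
sudo tcpdump -i eth1 -nn icmp
}}}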

=== 2.2.2 !OpenFlow issue ===

==== Control plane ====
 * Use netstat at the controller to see if the number of connecting switches is as expected (see the sketch after this list).
 * If the connection is in place, take a packet capture on the controller's !OpenFlow control plane interface and check that !OpenFlow traffic from the switch is getting through and matches your expectations.
 * If the connection is not in place:
  * Verify that your reservation still exists at each aggregate manager in the path.
  * Ensure that all switches in the path are up. Ops-monitoring and the aggregate manager advertisement may be able to help; if you cannot find the information there, you may need to ask site admins to verify this.
  * Ensure that no switches are seeing consistently high CPU usage. If they are, common causes include:
   * Sending a lot of traffic through a software table on the switch.
   * Sending a lot of traffic through the control plane.
  * Ensure all proxies in the path (e.g. slicers) are up and connected to both the switch and your controller. Ops-monitoring and the aggregate manager advertisement may be able to help; if you cannot find the information there, you may need to ask site admins to verify this.
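
The control plane checks above might look like the following. This is a sketch, assuming a Linux controller host, the classic !OpenFlow listen port 6633 (newer deployments often use 6653), and the omni command-line tool for the reservation check; `eth0`, `<AM-URL>`, and `<slice>` are placeholders.
{{{
# On the controller host: count established switch connections
netstat -tn | grep ':6633' | grep -c ESTABLISHED

# Capture OpenFlow control traffic to confirm the switch's messages arrive
sudo tcpdump -i eth0 -nn 'tcp port 6633'

# Verify the reservation still exists at each aggregate manager in the path
omni.py -a <AM-URL> sliverstatus <slice>
}}}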

==== Data plane ====
 * At this point, you can use a combination of the techniques below in conjunction with those mentioned in the non-!OpenFlow debugging section.
 * The goal is to isolate where the problem occurs in the data plane, so use a combination of searching from the middle (like a binary search) and educated guesses based on the information you have.
  * You can query for flow stats as you generate data plane probes (see the sketch after this list). Trace through the flow stats and see if you can figure out where a switch is not being hit.
  * If your controller is reactive, follow packet-ins and packet-outs as well, using packet captures.
  * Remember that in some rare cases you will encounter errors due to software or hardware issues in an !OpenFlow switch. In those cases, you will likely need to use the above techniques.
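
For example, you might generate probes from one endpoint while watching flow counters on a switch. The sketch below uses Open vSwitch's `ovs-ofctl` purely as an illustration (hardware-switch equivalents appear under Debugging Tools); `br0`, `eth1`, `10.10.1.2`, and VLAN 1750 are hypothetical values.
{{{
# From one endpoint, generate steady probe traffic across the dataplane
ping -I eth1 10.10.1.2

# On a software switch, watch whether the packet counters for the
# slice's VLAN advance as the probes flow
watch -n 2 "ovs-ofctl dump-flows br0 | grep dl_vlan=1750"
}}}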

=== 2.2.3 Non-!OpenFlow issue ===

 * Start by trying to spot the traffic from the middle and work your way toward the problem. Get as much information as you can on your own using tools or devices to which you have access. Tools include:
  * [http://routerproxy.grnoc.iu.edu/al2s/ AL2S Router Proxy] (more on using this tool [#AL2SRouterProxy below])
  * [http://genimon.uky.edu/login GENI operational monitoring]
 * Isolate the breakage(s) to segments that you don't have visibility into. From there, you will need to work with the operators of aggregates and intermediary networks. You will want them to:
  * Verify that all switches in the path of your slice are up.
  * Trace MAC addresses through the path by checking the MAC learning tables in the switches (representative commands follow this list). Note that on some VLANs, MAC address learning has been disabled throughout the path, and the operators will need to temporarily re-enable it on those VLANs.
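
The exact commands for checking MAC learning tables are vendor-specific. The following are representative forms an operator might use; the MAC address and VLAN values are placeholders.
{{{
show mac address-table address 0011.2233.4455    # Cisco-style
show mac-address vlan 1750                       # Brocade-style
show ethernet-switching table vlan <vlan-name>   # Juniper EX-style
}}}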

=== 2.2.4 Debugging Tools ===

==== AL2S Router Proxy ====

The AL2S Router Proxy can show you information that a given switch knows. The queries you use will be vendor-specific. Below are some useful queries at the time of this writing, organized by vendor.

==== Brocade ====
Show flows on a device that match an ingress port of <portnum>:
{{{
show openflow flows ethernet <portnum>
}}}

After running this command, search the output for the flows corresponding to the VLAN ID associated with the slice at that switch. You should be able to look up flow statistics using this method.
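
For example, assuming (as placeholder values) that the slice enters the switch on port 1/1 and uses VLAN 1750, you would run the query below and then search its output for "1750":
{{{
show openflow flows ethernet 1/1
}}}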

==== Juniper ====
Find the flow ID associated with a slice:
{{{
show openflow flows detail
}}}

After running this command, search the output for the flows corresponding to the VLAN ID associated with the slice at that switch. When you find the flows you care about, save the corresponding flow ID.

Once you have the flow IDs you care about, run the following to look up statistics for those flows:
{{{
show openflow statistics flows <flow ID>
}}}
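
Putting the two steps together, a session might look like the following, assuming (as placeholder values) that the slice's VLAN ID is 1750 and the matching entry turns out to have flow ID 42; `| match` is the Junos output filter:
{{{
show openflow flows detail | match 1750
show openflow statistics flows 42
}}}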

= 3. GMOC Response =

The GMOC implements the actions outlined to identify the source of the issue and updates the ticket to capture the actions taken. In some scenarios the GMOC may dispatch a problem to other organizations; the following table lists the organizations that provide support, by area of responsibility:

|| ''' Team '''        || ''' Area of Responsibility/Tools ''' ||
|| GPO Dev Team        || GENI Tools (gcf, omni, stitcher), GENI Portal, GENI Clearinghouse ||
|| RENCI Dev Team      || ExoGENI Racks, ExoGENI Stitching ||
|| GENI Operations     || InstaGENI Racks ||
|| UKY Operations Team || GENI Monitoring System, Stitching Computation System ||
|| Utah Dev Team       || Jack Tool, !CloudLab, Emulab, Apt ||

== 3.1 Implement Response ==

The GMOC executes the steps outlined. The response implementation may take a few iterations, as some attempts may not yield the expected results. GMOC may have to go back and try further actions in cases where new symptoms occur or where the procedure is found to be lacking. In both cases an update to the procedures may be required, and action should be taken to get the procedures updated.

== 3.2 Procedure Updates ==

When the instructions in a procedure are found to be missing symptoms, required actions, or potential impact, the GMOC must provide feedback so the procedure can be enhanced for future use.

= 4. Resolution =

GMOC verifies that the problem is no longer occurring by coordinating with the problem reporter or by checking the tool/log that originally signaled the problem.

== 4.1 Document Resolution and Close Ticket ==

GMOC captures how the problem was resolved in the ticket and closes the ticket. If the solution does not fully resolve the problem, a new ticket may be created to track the remaining issue.

Whether the problem is fully resolved (ticket closed) or partially resolved (new ticket opened), both outcomes should result in notification back to the problem reporter.
     141