Version 3 (modified by lnevers@bbn.com, 9 years ago)

OPS-003-B GENI Network Stitching Procedure

This procedure describes how to investigate and resolve a GENI Network Stitching problem. GENI Network Stitching issues may be reported by an experimenter or by the GENI Monitoring System external checks. Regardless of the source of the reported event, a ticket must be written to handle the investigation and resolution of the problem. The ticket must copy the issue reporter and the GENI experimenters list at geni-users@googlegroups.com.

1. Issue Reported

GMOC gathers technical details for the ticket about the GENI Network Stitching failure including:

  • Requester Organization
  • Requester Name
  • Requester email
  • Requester GENI site-name
  • Slice name and sliver details for each site
  • Problem Description - MUST include:
    • Endpoint host sites
    • Stitcher error/failure messages
    • If possible, collect the request RSpec
    • If possible, collect the stitcher output file (stitcher.log file) generated by the failed request
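
As an illustration only, the details listed above could be held in a small record that flags which MUST-have items are still missing before the ticket is filed. The class and field names below are hypothetical and do not come from the GMOC ticketing system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StitchingTicket:
    """Hypothetical record of the details gathered for a stitching failure."""
    requester_org: str
    requester_name: str
    requester_email: str
    requester_site: str
    slice_name: str
    endpoint_sites: List[str] = field(default_factory=list)
    stitcher_errors: List[str] = field(default_factory=list)
    request_rspec_path: Optional[str] = None  # collected only if possible
    stitcher_log_path: Optional[str] = None   # collected only if possible

    def missing_required(self) -> List[str]:
        """Return the MUST-have problem details that are still empty."""
        missing = []
        if not self.endpoint_sites:
            missing.append("endpoint_sites")
        if not self.stitcher_errors:
            missing.append("stitcher_errors")
        return missing
```

A ticket whose `missing_required()` list is not empty would be a prompt to follow up with the reporter.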

1.1 GENI Event Type Prioritization

GENI Network Stitching events fall under three categories:

  • Stitching Computation Service (SCS) failure - A Critical issue; when the SCS is not available, no stitching will work.[1]
  • Stitching to specific site failure - A High priority issue that prevents one or more sites from setting up layer two GENI stitching connections.[1]
  • ExoGENI to ExoGENI only failure - A High priority issue that prevents EG to EG connections using ExoGENI stitching.[2]

[1] This type of issue is dispatched to GMOC and uses this procedure.
[2] This type of issue is immediately dispatched to the ExoGENI team.
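
The categories above amount to a small lookup from event type to priority and dispatch target. The sketch below shows one way to encode it; the enum and function names are hypothetical and chosen only for illustration.

```python
from enum import Enum
from typing import Tuple


class StitchingEvent(Enum):
    SCS_FAILURE = "scs_failure"
    SITE_FAILURE = "site_failure"
    EXOGENI_ONLY_FAILURE = "exogeni_only_failure"


# (priority, dispatch target) per event type, as listed in this section.
PRIORITY_AND_DISPATCH = {
    StitchingEvent.SCS_FAILURE: ("Critical", "GMOC"),
    StitchingEvent.SITE_FAILURE: ("High", "GMOC"),
    StitchingEvent.EXOGENI_ONLY_FAILURE: ("High", "ExoGENI team"),
}


def triage(event: StitchingEvent) -> Tuple[str, str]:
    """Return the priority and the team that first handles the event."""
    return PRIORITY_AND_DISPATCH[event]
```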

1.2 Create Ticket

The GMOC ticketing system is used to capture the information above. GMOC may follow up to request additional information as the problem is investigated. This operation results in the problem reporter getting a ticket email for the reported issue.

For GENI Network Stitching issues it is crucial to capture the three main clues to resolve the problem:

  • Does the SCS respond to the request?
  • What are the endpoint sites where the compute resources are to be allocated?
  • What is the error/failure message returned by the stitcher createsliver command?

The Stitching Computation Service (SCS) tracks the path information that is used to establish a Layer 2 connection between the host endpoints at GENI aggregates. When a createsliver is issued, the stitcher queries the SCS for path information for the endpoint hosts in the experimenter's request RSpec, and then uses that information to create the slivers at the compute aggregates and the VLAN connections at the network aggregates. Stitcher is the GENI tool that executes the createsliver at each of the GENI aggregates.

2. Investigate and Identify Response

GENI Network Stitching failures can be seen at either of these two stages:

  1. When stitcher is contacting the SCS to get path information. If there is no SCS response (e.g. a connection timeout or SSL failure), then contact the GMOC staff responsible for the service to restore it. If a response is given, it is not possible to tell whether it is correct until the sliver creation.
  2. When stitcher is creating a sliver at each aggregate, including the network aggregates along the path (e.g. AL2S, MAX, Utah-Stitch). Any of the aggregates can fail. Depending on where the failure occurs, different actions may be required.

2.1 Investigate the Problem

2.1.1 SCS failure for path information

If the following error code is returned immediately by stitcher and no aggregate is contacted to create a sliver:

StitchingServiceFailedError: Error from Stitching Service: code 3: MxTCE ComputeWorker return error message 
'Action_ProcessRequestTopology_MP2P::Finish() Cannot find the set of paths for the RequestTopology. '.''

Then there are four possible error sources:

a.1. One or more of the end-point sites are not known to the Stitching Computation Service.
a.2. The wrong SCS service is being used.
a.3. An incorrect or mis-matched request syntax is being made.
a.4. A request for unreasonable bandwidth along the network path is being submitted.

If these are the symptoms reported, proceed to section 2.2.1.

Please note that there is an additional error scenario that is reported as a path error. It is caused by a request for multi-point VLANs, which are not supported:

StitchingServiceFailedError: Error from Stitching Service: code 3: MxTCE ComputeWorker return error message 
'Action::Run caught Exception: TCEException [Action_BridgeTerminal_MPVB::Process Empty bridgeNodes list! ] '.

Only point to point VLANs are supported at this time.
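
The two SCS error messages shown above can be told apart by a simple substring check on the stitcher output. The following is a minimal sketch under that assumption; the function name and the returned labels are hypothetical, and the matched substrings are taken from the messages quoted in this section.

```python
def classify_scs_error(stitcher_output: str) -> str:
    """Label a stitcher failure using the SCS messages quoted in this procedure."""
    if "StitchingServiceFailedError" not in stitcher_output:
        return "not_scs_error"
    if "Empty bridgeNodes list" in stitcher_output:
        # Multi-point VLAN request: unsupported, only point-to-point works.
        return "multipoint_unsupported"
    if "Cannot find the set of paths" in stitcher_output:
        # Path error: continue with the checks in section 2.2.1.
        return "path_not_found"
    return "other_scs_error"
```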

2.1.2 Failure while creating a sliver

Sliver creation may fail for requested GENI stitching resources. Below is a list of the geni_code error codes that can be reported during sliver creation, along with explanations:

Error Code | Error Text | Explanation and Potential Resolution
'geni_code': 2, 'am_code': 2 | $link_name: no edge hop | This failure occurs when there is a mismatch in the interface associated with the compute aggregate. Either the RSpec has an incorrect interface, or the SCS path is misconfigured.
'geni_code': 2, 'am_code': 24 | Not enough VLANs available at $agg_nick_name | All VLANs delegated for stitching have been used. There is no work-around or solution. Wait for resources to be released by other experimenters.
'geni_code': 14, 'am_code': 14 | Resource is busy; try again later | Stitcher is trying to delete a sliver for which the createsliver operation is not yet completed. This happens frequently on InstaGENI racks and clears in a few minutes. In rare cases it does not clear up; if so, contact the InstaGENI support team to release the resources.
'geni_code': 24 | Requested VLAN not available | Normal operation; stitcher requested a VLAN that is in use. Stitcher will keep requesting VLANs until a VLAN is found from the list of available VLANs at the site.
'geni_code': 25 | Requested capacity for link not available | Modify the capacity parameter in the RSpec or specify the command line option --defaultCapacity=capacity (in bits/sec).

If any of the above symptoms are reported, proceed to section 2.2.2.
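
When scanning a stitcher.log for these failures, it can help to pull the codes out of the text and look up the matching error. The sketch below assumes the codes appear in the output in the `'geni_code': N, 'am_code': M` form shown in this section; the function names and the lookup table layout are hypothetical.

```python
import re
from typing import Optional, Tuple

_CODE_RE = re.compile(r"'(geni_code|am_code)':\s*(\d+)")

# (geni_code, am_code) -> error text; None means the am_code is not listed.
ERROR_TEXT = {
    (2, 2): "no edge hop",
    (2, 24): "Not enough VLANs available",
    (14, 14): "Resource is busy; try again later",
    (24, None): "Requested VLAN not available",
    (25, None): "Requested capacity for link not available",
}


def extract_codes(output: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (geni_code, am_code) found in the text; a missing code is None."""
    found = {name: int(value) for name, value in _CODE_RE.findall(output)}
    return found.get("geni_code"), found.get("am_code")


def lookup_error(output: str) -> str:
    """Map the codes found in the text to the error text listed in this section."""
    geni_code, am_code = extract_codes(output)
    exact = ERROR_TEXT.get((geni_code, am_code))
    return exact or ERROR_TEXT.get((geni_code, None), "unlisted error")
```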

2.2 Identify Potential Response

2.2.1 SCS failure for path information

Here are the possible responses for SCS failures that are due to the path information. Each potential source of the error is listed below with the actions that can be taken to address it.

  1. Endpoint host is part of an aggregate that is not known to the SCS. Verify that each of the endpoint aggregates in the experimenter's request is known to the SCS. This can be done with the tool scs.py, which is part of the GCF software package:
    $ python ~/gcf/src/gcf/omnilib/stitch/scs.py --listaggregates --scs_url https://geni-scs.net.internet2.edu:8443/geni/xmlrpc --key /home/lnevers/.ssl/geni_cert_portal.pem --cert /home/lnevers/.ssl/geni_cert_portal.pem
    

The above returns the list of compute and network aggregates that are known to the SCS. If an aggregate is part of the user request but is not in the list of known aggregates, then the unknown path error is returned. A site that is not known to the SCS has most likely not yet been deemed a production stitching site; check with the user to make sure they intended to use the site.

  2. Verify that the experimenter is using the expected SCS server. When the experimenter does not ask for a specific SCS, the stitcher uses the I2 production SCS (https://geni-scs.net.internet2.edu:8443/geni/xmlrpc) and does not output the SCS URL in the createsliver command output. If the experimenter is using another SCS server, such as the Test SCS (https://nutshell.maxgigapop.net:8443/geni/xmlrpc), then they should be directed back to the Production I2 SCS.

  3. If the default Production SCS is used and each of the aggregates is known, then there may be something wrong with the request RSpec. Verify that the aggregate URN in the RSpec matches the URN known to the SCS (scs.py output).
  4. Check the requested link capacity in the RSpec to make sure that it is reasonable. The SCS does some capacity checks, but the error handling for unreasonable capacity requests returns the same error as all other path issues. Make sure that the experimenter is not requesting more bandwidth than is available for the site. This capacity check is not officially supported, but it is still part of the SCS code.
  5. If none of the above apply, then contact the SCS development team <insert-email here>.
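
The first checks in this list can be partly automated. The sketch below collects the aggregate URNs from the node elements of a request RSpec, compares them with a set of URNs known to the SCS, and flags a non-production SCS URL. It assumes a GENI v3 request RSpec in which each node carries a `component_manager_id` attribute, and it assumes the known URNs were copied by hand from the scs.py output, whose exact format is not covered here. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Set

PRODUCTION_SCS_URL = "https://geni-scs.net.internet2.edu:8443/geni/xmlrpc"


def aggregates_in_rspec(rspec_xml: str) -> Set[str]:
    """Collect the component_manager_id of every node in the request RSpec."""
    urns = set()
    for elem in ET.fromstring(rspec_xml).iter():
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        urn = elem.get("component_manager_id")
        if tag == "node" and urn:
            urns.add(urn)
    return urns


def path_error_findings(rspec_xml: str, known_urns: Iterable[str],
                        scs_url: str = PRODUCTION_SCS_URL) -> List[str]:
    """Return human-readable findings for the first path-error checks."""
    findings = []
    unknown = sorted(aggregates_in_rspec(rspec_xml) - set(known_urns))
    if unknown:
        findings.append("aggregates not known to the SCS: " + ", ".join(unknown))
    if scs_url != PRODUCTION_SCS_URL:
        findings.append("non-production SCS in use: " + scs_url)
    return findings
```

An empty result means neither check found a problem, and the RSpec syntax and requested capacity are the next things to review.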

2.2.2 Failure while creating a sliver

The response for each geni_code listed in section 2.1.2 is described below.

  1. The createsliver returns 'geni_code': 2, 'am_code': 2 $link_name: no edge hop. This failure is returned by IG aggregates when there is a mismatch between the interface and the compute node in the request. Most likely the RSpec has incorrect interfaces, but there is also a low possibility that the SCS is suggesting an incorrect path for the site. First make sure that the node and link resource URNs match in the request. If the request is correct, then the SCS must update its topology definition for the site, which requires running the update scripts on the SCS server.
  2. The createsliver returns 'geni_code': 2, 'am_code': 24 Not enough VLANs available at $agg_nick_name. All VLANs delegated for stitching have been used. There is no work-around or solution. Wait for resources to be released by other experimenters. If this persists, the rack team can be consulted to determine who is using the VLANs, so that they can be asked about their use of the VLANs.
  3. The createsliver returns 'geni_code': 14, 'am_code': 14 Resource is busy; try again later. Stitcher is trying to delete a sliver for which the createsliver operation is not yet completed. This happens frequently on InstaGENI racks and clears in a few minutes. In rare cases it does not clear up; if so, contact the InstaGENI rack team to release the resources.
  4. The createsliver returns 'geni_code': 24 Requested VLAN not available. Normal operation; stitcher requested a VLAN that is in use. Stitcher will keep requesting VLANs until a VLAN is found from the list of available VLANs at the site. Tell the experimenter to wait for the stitcher retry, which should eventually succeed.
  5. The createsliver returns 'geni_code': 25 Requested capacity for link not available. Modify the capacity parameter in the RSpec or specify the command line option --defaultCapacity=capacity (bits/sec) with a value that is possible for the site reporting the failure.
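
For the busy-resource case, the guidance above is to wait a few minutes and escalate only if the condition does not clear. The sketch below expresses that as a bounded wait. The `check` callable is a hypothetical stand-in for whatever the operator uses to re-test the aggregate and obtain the latest geni_code; the attempt count and delay are arbitrary illustration values, not recommended settings.

```python
import time
from typing import Callable

BUSY_GENI_CODE = 14  # "Resource is busy; try again later"


def wait_for_busy_to_clear(check: Callable[[], int],
                           attempts: int = 5,
                           delay_seconds: float = 60.0,
                           sleep: Callable[[float], None] = time.sleep) -> str:
    """Re-check until the busy code stops, or give up and signal escalation."""
    for attempt in range(1, attempts + 1):
        if check() != BUSY_GENI_CODE:
            return "cleared"
        if attempt < attempts:
            sleep(delay_seconds)  # give the aggregate time to finish
    return "escalate"  # contact the InstaGENI rack team
```

Passing `sleep` as a parameter keeps the helper testable without real delays.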

Remember that failures can also occur for ExoGENI to ExoGENI site connections. For those ExoGENI connection failures, contact the ExoGENI team.

3. GMOC Response

The GMOC implements the actions identified in this procedure and updates the ticket to capture the actions taken. In some scenarios the GMOC may dispatch a problem to other organizations. The following table lists the organizations that provide support, by area of responsibility:

Team | Area of Responsibility/Tools
GMOC Support | Stitching Computation Service
AL2S/OESS Team | AL2S/OESS Network Aggregate issues
GPO Dev Team | GENI stitcher tool or GENI Portal
RENCI Dev Team | ExoGENI Rack, ExoGENI Stitching
GENI Operations | InstaGENI Racks
UKY Operations Team | GENI Monitoring System
Utah Dev Team | CloudLab, Emulab, Apt

3.2 Procedure Updates

If instructions in a procedure are found to miss symptoms, required actions, or potential impact, then action must be taken by the GMOC to provide feedback to enhance the procedure for future use.

4. Resolution

GMOC verifies that the problem is no longer happening by coordinating with the problem reporter or by checking the tool/log that originally signaled the problem. For a scheduled event, the GMOC coordinates with the person who originally scheduled the event to make sure that it was completed successfully. Scheduled event tickets may also be postponed and remain open until the next scheduled time.

4.1 Document Resolution and Close Ticket

GMOC captures how the problem was resolved in the ticket and closes the ticket. If the solution does not fully resolve the problem, a new ticket may be created to track the remaining issue.

Whether the problem is fully resolved (ticket closed) or partially resolved (new ticket open), both should result in notification back to the problem reporter.

For a scheduled event, the ticket may be closed or rescheduled when it cannot be completed in the scheduled time.