Changes between Initial Version and Version 1 of GENIOperationsTrial/GENINetworkStitching


Timestamp:
06/18/15 08:09:40
Author:
lnevers@bbn.com
  • GENIOperationsTrial/GENINetworkStitching

    v1 v1  
     1[[PageOutline(1-2)]]
     2
     3= OPS-003-B GENI Network Stitching Procedure =
     4
     5This procedure describes how to investigate and resolve a GENI Network Stitching problem. GENI Network Stitching issues may be reported by an experimenter or by the [http://genimon.uky.edu/ GENI Monitoring System] external checks. Regardless of the source of the reported event, a ticket must be written to handle the investigation and resolution of the problem. The ticket must copy the issue reporter and the GENI Experimenters at geni-users@googlegroups.com.
     6
     7= 1. Issue Reported =
     8
     9GMOC gathers technical details for the ticket about the GENI Network Stitching failure including:
     10 - Requester Organization
     11 - Requester Name
     12 - Requester email
     13 - Requester GENI site-name
     14 - Slice name and per-site sliver details
     15 - Problem Description - MUST include:
     16    - Endpoint host sites
     17    - Stitcher error/failure messages
     18    - If possible, collect the request RSpec
     19    - If possible, collect the stitcher output file (stitcher.log file) generated by the failed request
     20
     21== 1.1 GENI Event Type Prioritization ==
     22
     23GENI Network Stitching events fall under three categories:
     24  - Stitching Computation Service (SCS) failure - A `Critical` issue; when the SCS is not available, no stitching request can succeed. [1]
     25  - Stitching to specific site failure - A `High` priority issue that prevents one or more sites from setting up Layer 2 GENI stitching connections. [1]
     26  - ExoGENI to ExoGENI only failure - A `High` priority issue that prevents EG to EG connections using ExoGENI stitching. [2]
     27
     28 [1] This type of issue is dispatched to GMOC and uses this procedure.
     29 [2] This type of issue is immediately dispatched to the ExoGENI team.
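
The prioritization and dispatch rules above can be written down as a small lookup. This is only an illustrative sketch: the category keys (`scs_failure` and so on) and the function name are hypothetical labels, not identifiers used by the GMOC ticketing system; the priorities and dispatch targets are the ones listed above.

```python
# Hypothetical category labels; priorities and dispatch targets are taken
# from the category list and footnotes [1] and [2] above.
ROUTING = {
    "scs_failure": ("Critical", "GMOC"),
    "site_stitching_failure": ("High", "GMOC"),
    "exogeni_only_failure": ("High", "ExoGENI team"),
}


def route_event(category):
    """Return (priority, dispatch target) for a stitching event category."""
    try:
        return ROUTING[category]
    except KeyError:
        raise ValueError("unknown stitching event category: %r" % category)
```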
     30
     31== 1.2 Create Ticket ==
     32
     33The GMOC ticketing system is used to capture the information above. GMOC may follow up to request additional information as the problem is investigated. This operation results in the problem reporter receiving a ticket email for the reported issue.
     34
     35For GENI Network Stitching issues it is crucial to capture the three main clues needed to resolve the problem:
     36  - Does the SCS respond to the request?
     37  - What are the endpoint sites where the compute resources are to be allocated?
     38  - What is the error/failure message returned by the stitcher `createsliver` command?
     39
     40The Stitching Computation Service (SCS) tracks the path information used to establish Layer 2 connections between the host endpoints at the GENI aggregates. When a `createsliver` is issued, the stitcher queries the SCS for path information for the endpoint hosts in the experimenter's request RSpec, and then uses that information to create the slivers at the compute aggregates and the VLAN connections at the network aggregates. Stitcher is the GENI tool that executes the `createsliver` at each of the GENI aggregates.
     41
     42= 2. Investigate and Identify Response =
     43
     44GENI Network Stitching failures can be seen at either of these two stages:
     45
     46 a. When stitcher is contacting the SCS to get path information. If there is no SCS response (e.g. a connection timeout or an SSL failure), contact the GMOC group responsible for the service to have it restored. If a response is given, it is not possible to tell whether it is correct until sliver creation.
     47 b. When stitcher is creating a sliver at each aggregate, including the network aggregates along the path (e.g. AL2S, MAX, Utah-Stitch). Any of the aggregates can fail, and different actions may be required depending on where the failure occurs.
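
For stage (a), a quick way to tell a connection timeout from an SSL failure is to probe the SCS host and port directly. The following is a minimal sketch using only the Python standard library; the function name and its `connect` parameter are hypothetical. It does not present the experimenter's client certificate and does not issue an XML-RPC call, so an SSL failure here may also reflect certificate requirements rather than a service outage.

```python
import socket
import ssl


def probe_scs(host, port, timeout=10.0, connect=socket.create_connection):
    """Classify a basic TCP + TLS connection attempt to an SCS host.

    `connect` is injectable so the logic can be exercised without a network.
    """
    try:
        raw = connect((host, port), timeout)
    except socket.timeout:
        return "connection timed out"
    except OSError as exc:
        return "connection failed: %s" % exc

    context = ssl.create_default_context()
    try:
        # The TLS handshake happens inside wrap_socket().
        with context.wrap_socket(raw, server_hostname=host):
            return "ok"
    except ssl.SSLError as exc:
        return "SSL failure: %s" % exc
    except OSError as exc:
        return "connection failed: %s" % exc
    finally:
        raw.close()
```

For example, `probe_scs("geni-scs.net.internet2.edu", 8443)` would probe the host and port taken from the production SCS URL used in this procedure.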
     48
     49== 2.1 Investigate the Problem ==
     50
     51=== 2.1.1 SCS failure for path information ===
     52
     53If the following error code is returned immediately by stitcher and no aggregate is contacted to create a sliver:
     54{{{
     55StitchingServiceFailedError: Error from Stitching Service: code 3: MxTCE ComputeWorker return error message
     56' Action_ProcessRequestTopology_MP2P::Finish() Cannot find the set of paths for the RequestTopology. '.''
     57}}}
     58
     59Then there are four possible error sources:
     60
     61 a.1. One or more of the end-point sites are not known to the Stitching Computation Service. [[BR]]
     62 a.2. The wrong SCS service is being used.[[BR]]
     63 a.3. An incorrect or mis-matched request syntax is being made.[[BR]]
     64 a.4. A request for unreasonable bandwidth along the network path is being submitted. [[BR]]
     65
     66If these are the symptoms reported, proceed to section 2.2.1.
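
When triaging a stitcher.log file, this symptom can be spotted mechanically. The sketch below searches stitcher output for the error text quoted above; the function name and the returned dictionary layout are hypothetical, and the pattern assumes the message format matches the example shown in this section.

```python
import re

# Pattern based on the error text quoted in this section.
SCS_ERROR = re.compile(
    r"StitchingServiceFailedError: Error from Stitching Service: "
    r"code (\d+): (.*)",
    re.DOTALL,
)
PATH_NOT_FOUND = "Cannot find the set of paths for the RequestTopology"


def classify_scs_error(stitcher_output):
    """Return None if no SCS error is present, else its code and type."""
    match = SCS_ERROR.search(stitcher_output)
    if match is None:
        return None
    return {
        "code": int(match.group(1)),
        "path_not_found": PATH_NOT_FOUND in match.group(2),
    }
```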
     67
     68=== 2.1.2 Failure while creating a sliver ===
     69
     70Sliver creation may fail for the requested GENI stitching resources. Below is a list of `geni_code` error codes that can be reported during sliver creation, along with explanations:
     71
     72||'''Error Code '''              || ''' Error Text '''               || ''' Explanation and Potential Resolution'''  ||
     73
     74||'geni_code': 2, 'am_code': 2 ||$link_name: no edge hop || This failure occurs when there is a mismatch in the interface associated with the compute aggregate. Either the RSpec has an incorrect interface, or the SCS path is misconfigured. ||
     75||'geni_code': 2, 'am_code': 24 ||Not enough VLANs available at $agg_nick_name || All VLANs delegated for stitching have been used. No work-around or solution. Wait for resources to be released by other experimenters. ||
     76||'geni_code': 14, 'am_code': 14 ||Resource is busy; try again later || Stitcher is trying to delete a sliver for which the createsliver operation is not yet completed. Happens frequently on InstaGENI racks and clears in a few minutes. In rare cases it does not clear up; if so, contact the InstaGENI support team to release resources. ||
     77||'geni_code': 24 ||Requested VLAN not available || Normal operation; stitcher requested a VLAN that is in use. Stitcher will keep requesting VLANs until a VLAN is found from the list of available VLANs at the site. ||
     78||'geni_code': 25 ||Requested capacity for link not available || Modify the `capacity` parameter in the RSpec or specify the command line option `--defaultCapacity=capacity` (value in bits/sec). ||
     79
     80If any of the above symptoms are reported, proceed to section 2.2.2.
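
The table above can also be kept as a lookup keyed by the reported codes, which is convenient when scanning many log lines. This is a sketch only: it assumes the error text contains the `'geni_code': N` and `'am_code': N` fragments in the form shown in the table, and the summaries are abbreviated from the explanations above.

```python
import re

# (geni_code, am_code) -> short summary; am_code is None where the table
# above lists only a geni_code.
KNOWN_ERRORS = {
    (2, 2): "no edge hop: interface mismatch in RSpec or SCS path",
    (2, 24): "Not enough VLANs available: wait for resources to be released",
    (14, 14): "Resource is busy: usually clears in a few minutes",
    (24, None): "Requested VLAN not available: normal, stitcher retries",
    (25, None): "Requested capacity not available: lower the capacity",
}

GENI_CODE = re.compile(r"'geni_code':\s*(\d+)")
AM_CODE = re.compile(r"'am_code':\s*(\d+)")


def explain(message):
    """Return the summary for the codes found in `message`, or None."""
    geni = GENI_CODE.search(message)
    if geni is None:
        return None
    am = AM_CODE.search(message)
    key = (int(geni.group(1)), int(am.group(1)) if am else None)
    if key in KNOWN_ERRORS:
        return KNOWN_ERRORS[key]
    # Fall back to an entry that is listed by geni_code alone.
    return KNOWN_ERRORS.get((key[0], None))
```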
     81
     82== 2.2 Identify Potential Response ==
     83
     84=== 2.2.1  SCS failure for path information ===
     85
     86Here are the possible responses for SCS failures that are due to the path information. Each potential source of the error is listed below, together with the actions that can be taken to address it.
     87
     88 1. Endpoint host is part of an aggregate that is not known to the SCS. Verify that each of the endpoint aggregates in the experimenter request is known to the SCS. This can be done with the tool `scs.py`, which is part of the GCF software package:
     89{{{
     90$ python ~/gcf/src/gcf/omnilib/stitch/scs.py --listaggregates --scs_url https://geni-scs.net.internet2.edu:8443/geni/xmlrpc --key /home/lnevers/.ssl/geni_cert_portal.pem --cert /home/lnevers/.ssl/geni_cert_portal.pem
     91}}}
     92The above will return the list of the compute and network aggregates that are known to the SCS. If an aggregate is part of the user request but is not in the list of known aggregates, then the `unknown path` error is returned. If a site is not known to the SCS, it has most likely not yet been designated a `stitching production site`; check with the user to make sure they intended to use the site.
     93
     94 2. Verify that the experimenter is using the expected SCS server. When the experimenter does not ask for a specific SCS, the stitcher uses the I2 production SCS (`!https://geni-scs.net.internet2.edu:8443/geni/xmlrpc`) and does not output the SCS URL in the `createsliver` command output. If the experimenter is using another SCS server, such as the Test SCS (`!https://nutshell.maxgigapop.net:8443/geni/xmlrpc`), then they should be directed back to the Production I2 SCS.
     95 
     96 3. If the default Production SCS is used and each of the aggregates is known, then there may be something wrong with the request RSpec. Verify that the aggregate URN in the RSpec matches the URN known to the SCS (scs.py output).
     97
     98 4. Check the requested link capacity in the RSpec to make sure that it is reasonable. The SCS does some capacity checks, but the error handling for unreasonable capacity requests returns the same error as all other path issues. Make sure that the experimenter is not requesting more bandwidth than is available for the site. This feature is not officially supported, but is still part of the SCS code.
     99
     100 5. If none of the above apply, then contact the SCS development team <insert-email here>.
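
Checks 1, 3 and 4 above compare the request RSpec against what the SCS knows, and can be partly scripted. The sketch below makes several assumptions: the request carries aggregate URNs in `component_manager_id` attributes on `node` elements and in `name` attributes on `component_manager` elements, link capacity appears as a `capacity` attribute on `property` elements inside `link`, and the list of URNs known to the SCS has already been copied by hand from the `scs.py --listaggregates` output (its format is not parsed here). All function names and the `max_capacity` threshold are hypothetical.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip any XML namespace from an element tag."""
    return tag.rsplit("}", 1)[-1]


def aggregate_urns(rspec_xml):
    """Collect the aggregate URNs referenced by a request RSpec."""
    urns = set()
    for element in ET.fromstring(rspec_xml).iter():
        name = _local(element.tag)
        if name == "node" and element.get("component_manager_id"):
            urns.add(element.get("component_manager_id"))
        elif name == "component_manager" and element.get("name"):
            urns.add(element.get("name"))
    return sorted(urns)


def unknown_aggregates(requested_urns, scs_known_urns):
    """Return requested URNs that do not exactly match any known URN."""
    known = {urn.strip() for urn in scs_known_urns}
    return sorted(u for u in set(requested_urns) if u.strip() not in known)


def oversized_links(rspec_xml, max_capacity):
    """Return (link client_id, capacity) pairs above `max_capacity`.

    Capacity is compared in whatever unit the request uses; a non-integer
    capacity value raises ValueError.
    """
    findings = []
    for link in ET.fromstring(rspec_xml).iter():
        if _local(link.tag) != "link":
            continue
        for child in link:
            if _local(child.tag) == "property" and child.get("capacity"):
                capacity = int(child.get("capacity"))
                if capacity > max_capacity:
                    findings.append((link.get("client_id"), capacity))
    return findings
```

The comparison in `unknown_aggregates` is deliberately exact (apart from surrounding whitespace), since a URN that differs only slightly from the one known to the SCS is precisely the mismatch that check 3 is looking for.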
     101
     102
     103=== 2.2.2 Failure while creating a sliver ===
     104
     105The response for each `geni_code` is listed above in the section [/LuisaSandbox/GENIOperationsTrial/GENINetworkStitching#a2.1.2Failurewhilecreatingasliver above].
     106
     1071. The createsliver returns '' 'geni_code': 2, 'am_code': 2 $link_name: no edge hop ''.  This failure is returned by IG aggregates when there is a mismatch between the interface and the compute node in the request. Most likely the RSpec has an incorrect interface, but there is also a small possibility that the SCS is suggesting an incorrect path for the site. First make sure that the node and link resource URNs match in the request. If they do and the failure persists, then the SCS must update its topology definition for the site, which requires running update scripts on the SCS server.
     108
     1092. The createsliver returns '' 'geni_code': 2, 'am_code': 24 Not enough VLANs available at $agg_nick_name ''.  All VLANs delegated for stitching have been used. No work-around or solution. Wait for resources to be released by other experimenters.  If this persists, the rack team can be consulted to determine who is using the VLANs, so that they can be asked about their use of the VLANs.
     110
     1113. The createsliver returns '' 'geni_code': 14, 'am_code': 14 Resource is busy; try again later ''. Stitcher is trying to delete a sliver for which the createsliver operation is not yet completed. Happens frequently on InstaGENI racks and clears in a few minutes.  In rare cases it does not clear up; if so, contact the InstaGENI rack team to release resources.
     112
     1134. The createsliver returns '' 'geni_code': 24 Requested VLAN not available ''.  Normal operation; stitcher requested a VLAN that is in use. Stitcher will keep requesting VLANs until a VLAN is found from the list of available VLANs at the site. Tell the experimenter to wait for the stitcher retry, which should eventually succeed.
     114
     1155. The createsliver returns '' 'geni_code': 25 Requested capacity for link not available ''.  Modify the `capacity` parameter in the RSpec or specify the command line option `--defaultCapacity=capacity` (bits/sec), using a value that is possible for the site reporting the failure.
     116
     117Remember that failures can also occur for ExoGENI to ExoGENI site connections. For those ExoGENI connection failures, contact the ExoGENI team.
     118
     119
     120= 3. GMOC Response =
     121
     122The GMOC implements the actions identified in this procedure and updates the ticket to capture the actions taken. In some scenarios the GMOC may dispatch a problem to other organizations. The following table lists the organizations that provide support, by area of responsibility:
     123
     124|| ''' Team '''        || ''' Area of !Responsibility/Tools''' ||
     125|| GMOC Support        || Stitching Computation Service ||
     126|| AL2S/OESS Team      || AL2S/OESS Network Aggregate issues||
     127|| GPO Dev Team        || GENI stitcher tool or GENI Portal ||
     128|| RENCI Dev Team      || ExoGENI Rack, ExoGENI Stitching ||
     129|| GENI Operations     || InstaGENI Racks ||
     130|| UKY Operations Team || GENI Monitoring System ||
     131|| Utah Dev Team       || !CloudLab, Emulab, Apt||
     132
     133== 3.1 Procedure Updates ==
     134
     135If instructions in a procedure are found to miss symptoms, required actions, or potential impact, then action must be taken by the GMOC to provide feedback to enhance the procedure for future use.
     136
     137
     138= 4. Resolution =
     139
     140GMOC verifies that the problem is no longer happening by coordinating with the problem reporter or by checking the tool/log that originally signaled the problem. For a scheduled event, the GMOC coordinates with the person who originally scheduled the event to make sure that it was completed successfully.  Scheduled event tickets may also be postponed and remain open until the next scheduled time.
     141
     142== 4.1 Document Resolution and Close Ticket ==
     143
     144GMOC captures how the problem is resolved in the ticket and closes the ticket. If the solution does not fully resolve the problem, a new ticket may be created to track the remaining issue.
     145
     146Whether the problem is fully resolved (ticket closed) or partially resolved (new ticket open), both should result in notification back to the problem reporter.
     147
     148For a scheduled event, the ticket may be closed or rescheduled when it cannot be completed in the scheduled time.
     149