
Version 7 (modified by hdempsey@bbn.com, 8 years ago)


GENI Internet2 Switch Consolidation Procedure

This page defines the steps required to update stitching to handle PoP device consolidation that is taking place in Internet2 AL2S. This consolidation effort will replace existing AL2S Brocade devices with Juniper devices, and will converge the two distinct devices that currently provide L2 and L3 services into a single converged Juniper device in locations where AL2S services exist. These steps outline the actions required at the GENI rack, AL2S AM, and at the SCS servers to incorporate URN changes (due to port changes) resulting from the consolidation.

The steps include examples based on details from a previous switch consolidation and its effect on the GENI stitching sites connected to that switch.

1. Generate Tickets and check for conflicts with upcoming GENI events

Create GMOC tickets

Open tickets with GMOC for the scheduled maintenance events, listing all affected GENI resources, as soon as we receive notice of the scheduled days (this comes in an email from Eric Boyd no later than 1 week before the outage, and usually sooner). Open tickets and send emails as early as possible; it is easier to manage changes in the events with early notification than to try to recover 1 week before the event. Confirm that GMOC generates corresponding requests to Internet2 Engineering (GRNOC). GMOC tickets should notify the operators and experimenters lists. Although the outage is only scheduled for 1 day on the Internet2 schedule, the GENI ticket may require a longer outage. The ticket should include a warning that the outage may be extended at the end of the day if there are issues with any updates (include this in your initial email to GMOC). Adam Williamson will coordinate efforts for GMOC, but initial requests should go to the usual gmoc@grnoc.iu.edu address.

Note that Internet2 schedules both an IP and an AL2S outage (usually on different days) for each PoP consolidation. The IP event has no related GENI URN work needed, and will simply result in the GENI resources being unreachable (because the entire device is disconnected). The GMOC should create tickets for both events, since they both have GENI impact, and the rack admins should see the tickets if they read their GENI operators email.

Note that GMOC should check the GENI calendar for any conflicting events that are scheduled to overlap, and send followup email if they find any. Rather than wait for this to happen, check the existing GENI tickets or calendar yourself for conflicts and notify any affected event coordinator directly via email.

Check the test SCS for affected sites and generate warning emails

The GMOC may not have records of GENI connections for some nodes that are only supported on the test SCS (e.g. CloudLab). If the scheduled maintenance will affect a test SCS node (rack, switch etc.), email the contact for that node directly and cc: gpo-infra, informing them of the scheduled outage, and asking them to be available to test connectivity after the update. Add the test SCS node owner contact to any status update emails you send during the outage. (Sometimes test SCS nodes are no longer in use, so the owner may indicate they can be retired instead of revised. Include retiring resources as part of the outage work.)

Changing the Schedule and Escalation

Internet2 won't change their schedule, but we can work with affected sites and event contacts to try to prioritize the work to minimize the outage impact for priority sites if needed.

If the consolidation event goes longer than the scheduled outage ticket lasts, be sure to email updates to the GMOC and to anyone who was contacted via email (from the test SCS or event lists) as soon as you know an extension is needed. Update the same list if you need to extend more than once. If the event will continue to the next day, indicate in your update when work will start again. You should send updates to the GMOC ticket for any significant events or changes that happen during the maintenance (version updates, bad hardware etc.). Do not send these types of updates only to an ops email list, because that info won't get out to resource owners.

If there are any significant problems during the event, escalate to Heidi Dempsey (hdempsey@bbn.com) while you work on them (in addition to noting them in the ticket).

2. Identify Affected Stitching Endpoints

The GENI aggregate advertisement includes a stitching section which defines how VLANs are to be connected and which VLANs are associated with that stitching site. To determine the impact of a consolidation on stitching, you must start by collecting the AL2S aggregate advertisement and reviewing its stitching definitions using the omni tool:

   omni -a al2s listresources -o

Review the content of the stitching section in the output file rspec-al2s-internet2-edu.xml and see if there are any sites affected for the switch being consolidated.

For example there were several stitching endpoints for "sdn-sw.newy32aoa.net.internet2.edu" in the AL2S Advertisement:

 <stitch:node id="urn:publicid:IDN+al2s.internet2.edu+node+sdn-sw.newy32aoa.net.internet2.edu">
 <stitch:port id="urn:publicid:IDN+al2s.internet2.edu+stitchport+sdn-sw.newy32aoa.net.internet2.edu:eth1/1">
 <stitch:link id="urn:publicid:IDN+al2s.internet2.edu+interface+sdn-sw.newy32aoa.net.internet2.edu:eth1/1:iminds">
 <stitch:port id="urn:publicid:IDN+al2s.internet2.edu+stitchport+sdn-sw.newy32aoa.net.internet2.edu:eth5/2">
 <stitch:link id="urn:publicid:IDN+al2s.internet2.edu+interface+sdn-sw.newy32aoa.net.internet2.edu:eth5/2:gpo-og">
 <stitch:link id="urn:publicid:IDN+al2s.internet2.edu+interface+sdn-sw.newy32aoa.net.internet2.edu:eth5/2:gpo-eg">
 <stitch:link id="urn:publicid:IDN+al2s.internet2.edu+interface+sdn-sw.newy32aoa.net.internet2.edu:eth5/2:gpo-ig">
 <stitch:link id="urn:publicid:IDN+al2s.internet2.edu+interface+sdn-sw.newy32aoa.net.internet2.edu:eth5/2:host-gpolab">
 <stitch:link id="urn:publicid:IDN+al2s.internet2.edu+interface+sdn-sw.newy32aoa.net.internet2.edu:eth5/2:umass-eg">
 <stitch:port id="urn:publicid:IDN+al2s.internet2.edu+stitchport+sdn-sw.newy32aoa.net.internet2.edu:eth7/2">
 <stitch:link id="urn:publicid:IDN+al2s.internet2.edu+interface+sdn-sw.newy32aoa.net.internet2.edu:eth7/2:nysernet-ig">

From the above list we will request the "stitch:link id" to be updated. The "stitch:port id" transitions are implicit. In this example, there are 7 stitching endpoints requiring updates: 2 InstaGENI, 2 ExoGENI, 1 OpenGENI, 1 network aggregate (iMinds) and 1 fixed endpoint (host-gpolab).
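
The link IDs for one switch can also be pulled out of the saved advertisement with a short script rather than by eye. The following is a minimal sketch, assuming the link IDs follow the "+interface+<switch>:<port>:<site>" pattern shown above; the function name find_link_ids is hypothetical and is not part of omni.

```python
import re


def find_link_ids(advert_text, switch):
    """Return the sorted, de-duplicated AL2S interface URNs for one switch."""
    # Match URNs shaped like the stitch:link ids shown above:
    #   urn:publicid:IDN+al2s.internet2.edu+interface+<switch>:<port>:<site>
    # stitchport URNs are skipped because they use "+stitchport+" instead.
    pattern = re.compile(
        r"urn:publicid:IDN\+al2s\.internet2\.edu\+interface\+"
        + re.escape(switch)
        + r":[^\s\"'<>]+"
    )
    return sorted(set(pattern.findall(advert_text)))


# Two lines copied from the example advertisement above.
sample = (
    '<stitch:port id="urn:publicid:IDN+al2s.internet2.edu+stitchport+'
    'sdn-sw.newy32aoa.net.internet2.edu:eth1/1">\n'
    '<stitch:link id="urn:publicid:IDN+al2s.internet2.edu+interface+'
    'sdn-sw.newy32aoa.net.internet2.edu:eth1/1:iminds">\n'
)
for urn in find_link_ids(sample, "sdn-sw.newy32aoa.net.internet2.edu"):
    print(urn)
```

To use it on a real advertisement, read the contents of rspec-al2s-internet2-edu.xml into a string and pass that string in place of the sample text.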

In GENI Network Stitching, a fixed endpoint is a resource that is not a GENI aggregate but still supports stitching. Fixed endpoints are statically configured in the SCS servers to capture stitching information and are generally set up for specific demonstrations or peering points. Fixed endpoints require no special SCS update or configuration: simply update the AL2S Advertisement and the fixed endpoint change will take effect.

3. Define Stitching Configuration Changes

Review Internet2 announced changes for switch names and ports. Based on the information, identify the changes to be made to stitching definitions for the stitching endpoints identified in the previous step.

For example, using details from the consolidation email from Internet2 for the New York Switch:

 Old Hostname: sdn-sw.newy32aoa.net.internet2.edu
 New Hostname: rtsw.newy32aoa.net.internet2.edu
        'Old Interface'                       'New Interface'
 100GigabitEthernet1/1   100GE                   et-3/1/0.0
 100GigabitEthernet1/2   100GE                   et-3/3/0.0
 100GigabitEthernet3/1   100GE                   et-7/1/0.0
 100GigabitEthernet5/2   100GE                   et-7/3/0.0 
 100GigabitEthernet7/1   100GE                   et-4/1/0.0
 100GigabitEthernet7/2   100GE                   et-4/3/0.0
 10GigabitEthernet15/1   10GE                    xe-3/0/0.0
 10GigabitEthernet15/4   10GE                    xe-3/0/1.0
 10GigabitEthernet15/5   10GE                    xe-3/0/2.0
 10GigabitEthernet15/7   10GE                    xe-3/0/3.0

From the check of the AL2S stitching Advertisement, we know that there are seven stitching sites impacted by this URN transition. Define a list of each of the expected changes. The table below highlights each of the transitions:

Old URN                                                New URN
sdn-sw.newy32aoa.net.internet2.edu:eth1/1:iminds       rtsw.newy32aoa.net.internet2.edu:et-3/1/0.0:iminds
sdn-sw.newy32aoa.net.internet2.edu:eth5/2:gpo-og       rtsw.newy32aoa.net.internet2.edu:et-7/3/0.0:gpo-og
sdn-sw.newy32aoa.net.internet2.edu:eth5/2:gpo-eg       rtsw.newy32aoa.net.internet2.edu:et-7/3/0.0:gpo-eg
sdn-sw.newy32aoa.net.internet2.edu:eth5/2:gpo-ig       rtsw.newy32aoa.net.internet2.edu:et-7/3/0.0:gpo-ig
sdn-sw.newy32aoa.net.internet2.edu:eth5/2:host-gpolab  rtsw.newy32aoa.net.internet2.edu:et-7/3/0.0:host-gpolab
sdn-sw.newy32aoa.net.internet2.edu:eth5/2:umass-eg     rtsw.newy32aoa.net.internet2.edu:et-7/3/0.0:umass-eg
sdn-sw.newy32aoa.net.internet2.edu:eth7/2:nysernet-ig  rtsw.newy32aoa.net.internet2.edu:et-4/3/0.0:nysernet-ig
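
The transition list above can be derived mechanically from the Internet2 port table, which helps avoid copy errors when many endpoints share a switch. This is a minimal sketch, assuming that the advertisement's short port names (for example eth5/2) correspond to the Internet2 names (for example 100GigabitEthernet5/2) as the example suggests; PORT_MAP and translate are hypothetical names used only for illustration.

```python
OLD_HOST = "sdn-sw.newy32aoa.net.internet2.edu"
NEW_HOST = "rtsw.newy32aoa.net.internet2.edu"

# Old advertisement port name -> new Juniper interface, copied from the
# Internet2 table above (only the ports that carry GENI stitching links).
PORT_MAP = {
    "eth1/1": "et-3/1/0.0",
    "eth5/2": "et-7/3/0.0",
    "eth7/2": "et-4/3/0.0",
}


def translate(old_id):
    """Map '<old host>:<old port>:<site>' to '<new host>:<new port>:<site>'."""
    host, port, site = old_id.split(":")
    if host != OLD_HOST:
        raise ValueError("unexpected switch: " + host)
    # A KeyError here means the port is missing from PORT_MAP and the
    # Internet2 announcement should be rechecked.
    return ":".join([NEW_HOST, PORT_MAP[port], site])


for old in (
    OLD_HOST + ":eth1/1:iminds",
    OLD_HOST + ":eth5/2:gpo-ig",
    OLD_HOST + ":eth7/2:nysernet-ig",
):
    print(old, "->", translate(old))
```

If Internet2 changes a port assignment during their hardware work, only the PORT_MAP entry needs to change before regenerating the list.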

Note that Internet2 may change the port assignments or port names in the course of their work with the hardware, which happens before the GENI scheduled maintenance begins. Internet2 engineering ops team must notify the respective GENI operations teams as soon as possible via the GMOC ticket when such a change occurs. GENI operations teams will then update the various configurations to reflect this new change. The GMOC is responsible for coordinating with Internet2's engineering ops team for changes such as this.

4. Request Stitching Changes from GENI Aggregates Operations Teams

The URN transition requires coordination with various teams. Get positive confirmation in email before the scheduled outage that at least one person from each affected ops team will be available from the time the scheduled outage begins through the end of the scheduled outage. Remember to get confirmations from the contacts for any resources affected in the test SCS as well, because they may not be part of any of the usual ops teams. Request confirmations no later than 1 week before the scheduled outage, so there is time to get alternate coverage or extend the scheduled outage window if needed. The team that handles the transition depends on the type of aggregate; in the New York example below, requests went to the IG, EG, OG, iMinds and Internet2 AL2S teams.

Note: All aggregates' advertisements must be updated before the SCS servers. The SCS discovers the new stitching path information from the aggregates' stitching advertisements. The SCS is statically configured for fixed endpoints.

4a. Change Request Details

Based on the existing stitching information and the announced changes, generate a list of new link IDs to be used at each site. Following is an example from the New York transition, where GPO IG and NYSERNet URN changes were requested from the InstaGENI Team:

Link ID:          urn:publicid:IDN+instageni.gpolab.bbn.com+interface+procurve2:5.24.al2s
Remote Link ID:   urn:publicid:IDN+al2s.internet2.edu+interface+rtsw.newy32aoa.net.internet2.edu:et-7/3/0.0:gpo-ig
VLAN Range:       3596-3600,3706-3732,3746-3749

Link ID:          urn:publicid:IDN+instageni.nysernet.org+interface+procurve2:1.19.al2s
Remote Link ID:   urn:publicid:IDN+al2s.internet2.edu+interface+rtsw.newy32aoa.net.internet2.edu:et-4/3/0.0:nysernet-ig
VLAN Range:       1700-1719

GPO EG URN changes were requested from the ExoGENI Team:

Link ID:          urn:publicid:IDN+exogeni.net:bbnNet+interface+BbnNet:IBM:G8052:GigabitEthernet:1:2:ethernet
Remote Link ID:   urn:publicid:IDN+al2s.internet2.edu+interface+rtsw.newy32aoa.net.internet2.edu:et-7/3/0.0:gpo-eg
VLAN Range:       3741,3736-3739

GPO OG URN changes were requested from the OpenGENI Team:

Link ID:          urn:publicid:IDN+bbn-cam-ctrl-1.gpolab.bbn.com+interface+force10:3:al2s
Remote Link ID:   urn:publicid:IDN+al2s.internet2.edu+interface+rtsw.newy32aoa.net.internet2.edu:et-7/3/0.0:gpo-og
VLAN Range:       2611-2630

Wall2 iMinds URN changes were requested from the iMinds Team:

Link ID:          urn:publicid:IDN+wall2.ilabt.iminds.be+interface+c300b:0.12
Remote Link ID:   urn:publicid:IDN+al2s.internet2.edu+interface+rtsw.newy32aoa.net.internet2.edu:et-3/1/0.0:iminds
VLAN Range:       1125-1164

AL2S Aggregate URN changes were requested from Internet2 via the GMOC:

Link ID:          urn:publicid:IDN+al2s.internet2.edu+interface+rtsw.newy32aoa.net.internet2.edu:et-3/1/0.0:iminds
Remote Link ID:   urn:publicid:IDN+wall2.ilabt.iminds.be+interface+c300b:0.12
VLAN Range:       1125-1164

Link ID:          urn:publicid:IDN+al2s.internet2.edu+interface+rtsw.newy32aoa.net.internet2.edu:et-7/3/0.0:gpo-og
Remote Link ID:   urn:publicid:IDN+bbn-cam-ctrl-1.gpolab.bbn.com+interface+force10:3:al2s
VLAN Range:       2611-2630

Link ID:          urn:publicid:IDN+al2s.internet2.edu+interface+rtsw.newy32aoa.net.internet2.edu:et-7/3/0.0:gpo-eg
Remote Link ID:   urn:publicid:IDN+exogeni.net:bbnNet+interface+BbnNet:IBM:G8052:GigabitEthernet:1:2:ethernet
VLAN Range:       3741,3736-3739

Link ID:          urn:publicid:IDN+al2s.internet2.edu+interface+rtsw.newy32aoa.net.internet2.edu:et-7/3/0.0:gpo-ig
Remote Link ID:   urn:publicid:IDN+instageni.gpolab.bbn.com+interface+procurve2:5.24.al2s
VLAN Range:       3596-3600,3706-3732,3746-3749

Link ID:          urn:publicid:IDN+al2s.internet2.edu+interface+rtsw.newy32aoa.net.internet2.edu:et-4/3/0.0:nysernet-ig
Remote Link ID:   urn:publicid:IDN+instageni.nysernet.org+interface+procurve2:1.19.al2s
VLAN Range:       1700-1719

Link ID:          urn:publicid:IDN+al2s.internet2.edu+interface+rtsw.newy32aoa.net.internet2.edu:et-7/3/0.0:host-gpolab
Remote Link ID:   urn:publicid:IDN+gpolab.bbn.com+interface+switch:port:al2s
VLAN Range:       2646

Link ID:          urn:publicid:IDN+al2s.internet2.edu+interface+rtsw.newy32aoa.net.internet2.edu:et-7/3/0.0:umass-eg
Remote Link ID:   urn:publicid:IDN+exogeni.net:umassNet+interface+umassNet:IBM:G8264:TenGigabitEthernet:1:1:ethernet
VLAN Range:       3581-3595
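
In the examples above, each AL2S-side request is the mirror image of the corresponding aggregate-side request: Link ID and Remote Link ID are swapped and the VLAN range is unchanged. The following is a minimal sketch of generating both sides from a single record so they cannot drift apart; ChangeRequest, mirror and render are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class ChangeRequest(NamedTuple):
    link_id: str
    remote_link_id: str
    vlan_range: str


def mirror(req):
    """Return the request for the other end of the same stitching link."""
    return ChangeRequest(req.remote_link_id, req.link_id, req.vlan_range)


def render(req):
    """Format a request in the three-line layout used in the examples above."""
    return (
        f"Link ID:          {req.link_id}\n"
        f"Remote Link ID:   {req.remote_link_id}\n"
        f"VLAN Range:       {req.vlan_range}"
    )


# Values copied from the NYSERNet InstaGENI example above.
nysernet = ChangeRequest(
    "urn:publicid:IDN+instageni.nysernet.org+interface+procurve2:1.19.al2s",
    "urn:publicid:IDN+al2s.internet2.edu+interface+"
    "rtsw.newy32aoa.net.internet2.edu:et-4/3/0.0:nysernet-ig",
    "1700-1719",
)
print(render(nysernet))          # request for the InstaGENI team
print()
print(render(mirror(nysernet)))  # matching request for Internet2 via the GMOC
```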

4b. Submit Change Requests to Teams

Send email to each of the teams to request the above changes. For example, for the New York switch change request, emails were sent to these aggregate teams: IG, EG, OG, iMinds and Internet2 AL2S.

As a courtesy, copy the rack admin contact(s) or email list from the Operators page on these requests. They don't have to take any action, but they may want to know that their racks will be potentially unable to stitch for a period of time during the scheduled outage.

Also copy the GENI Monitoring team (kathryn.wong1@uky.edu, cody@uky.edu and caylin@uky.edu). With the exception of the ATLA consolidation, this work should not require any immediate action for monitoring, but the folks at UKY may want to note the "retired" URNs in their database, and to pay extra attention to their monitoring site during these transitions.

Once the requested changes are completed, verify that they appear in each GENI aggregate's stitching advertisement.

$ for i in gpo-ig gpo-og gpo-eg nysernet-ig al2s wall2 ; do stitcher listresources -a $i -o; done

Review all listresources output files to verify that the correct URN is in place for each advertisement.
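
The review of the output files can be partly automated by checking each advertisement for leftover references to the old switch and for the expected new URNs. This is a minimal sketch, assuming each output file has been read into a string; check_advertisement is a hypothetical helper, and a plain substring check like this does not replace a manual look at the stitching section.

```python
def check_advertisement(text, old_host, new_ids):
    """Return a list of problems found in one advertisement's text."""
    problems = []
    # Any remaining mention of the old switch name suggests a missed update.
    if old_host in text:
        problems.append("old switch name still present: " + old_host)
    # Every URN expected in this advertisement must appear at least once.
    for new_id in new_ids:
        if new_id not in text:
            problems.append("new URN missing: " + new_id)
    return problems


# Illustrative advertisement fragment using a URN from the example above.
advert = (
    "urn:publicid:IDN+al2s.internet2.edu+interface+"
    "rtsw.newy32aoa.net.internet2.edu:et-7/3/0.0:gpo-ig"
)
expected = ["rtsw.newy32aoa.net.internet2.edu:et-7/3/0.0:gpo-ig"]
for problem in check_advertisement(
    advert, "sdn-sw.newy32aoa.net.internet2.edu", expected
):
    print(problem)
```

Pass only the URNs that are expected in that particular aggregate's advertisement; an empty result means neither check found a problem.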

InstaGENI Update Details

InstaGENI updates follow this approach:

  1. Ask geni-ops@googlegroups.com, which maps to Hussam (Hussamuddin Nasir), to run the commands below on the rack boss node.
  2. Or, a site contact may be asked to log into the boss node and run these commands (Hussam does not maintain a few dev racks, or racks that are being provisioned by GPO and are not yet completed).
  3. Or, the engineer coordinating the scheduled maintenance can request an admin account on the boss node for this work from the rack owner (via the web UI for the site, e.g. http://instageni.gpolab.bbn.com/ for gpo-ig). Once that account is approved, you can run the necessary commands for the update remotely.

Note: Options 2 and 3 are not likely to happen, as option 1 has always taken place as expected.

The InstaGENI changes are made to the external network definition for AL2S, which is used for the stitching configuration. Below is an older example of the commands used to modify the URN for the uwashington-ig external network. These commands are executed on the InstaGENI boss node:

 mysql tbdb -e 'update external_networks set external_interface="urn:publicid:IDN+al2s.internet2.edu+interface+rtsw.seat.net.internet2.edu:et-4/3/0.0:uwashington-ig" where network_id="ion"'

 mysql tbdb -e 'update external_networks set external_wire="urn:publicid:IDN+al2s.internet2.edu+link+rtsw.seat.net.internet2.edu:et-4/3/0.0:uwashington-ig" where network_id="ion"'

Note: Be aware of potential line wrapping pitfalls.

In the above example, these commands assume that in the "external_networks" table on the boss node, there exists an entry named "ion" in the "network_id" column and an associated URN ending in "<InstaGENI Sitename>-ig". On most racks, these values will be either "ion" (for racks existing during the ION age) or "al2s" (for racks existing during the AL2S age). To determine the value for "network_id" for a given InstaGENI rack, use the following:

mysql tbdb -e 'select * from external_networks'
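
Because the two update commands are long and differ only in the column name and in "interface" versus "link" inside the URN, generating them from the rack's values can help avoid the line-wrapping and copy errors noted above. This is a minimal sketch that only builds the command strings for review and does not run them; external_network_updates is a hypothetical helper, and it assumes the inputs are trusted values typed by the operator.

```python
import shlex


def external_network_updates(network_id, al2s_suffix):
    """Build the two boss-node commands for one rack.

    al2s_suffix is '<switch>:<port>:<site>', for example
    'rtsw.seat.net.internet2.edu:et-4/3/0.0:uwashington-ig'.
    """
    base = "urn:publicid:IDN+al2s.internet2.edu+"
    columns = {
        "external_interface": base + "interface+" + al2s_suffix,
        "external_wire": base + "link+" + al2s_suffix,
    }
    commands = []
    for column, urn in columns.items():
        sql = (
            f'update external_networks set {column}="{urn}" '
            f'where network_id="{network_id}"'
        )
        # shlex.quote wraps the SQL in single quotes for the shell.
        commands.append("mysql tbdb -e " + shlex.quote(sql))
    return commands


for command in external_network_updates(
    "ion", "rtsw.seat.net.internet2.edu:et-4/3/0.0:uwashington-ig"
):
    print(command)
```

Use the network_id value reported by the select query above ("ion" or "al2s" on most racks).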

5. Request SCS Servers Update

In order for GENI Network Stitching to pick up these path configuration changes, an SCS update must be run. There are two SCS systems, Production and Test.

The Production and Test SCS include stitching information for different sets of aggregates. To find out which SCS knows about which aggregates, issue the following GENI tools commands:

For the Production SCS:

 python ~/gcf/src/gcf/omnilib/stitch/scs.py --listaggregates --scs_url http://geni-scs.net.internet2.edu:8081/geni/xmlrpc >scs-prod

Look for the aggregates identified in the earlier steps. For example, for the New York switch consolidation effort, the 'listaggregates' function shows that the GPO IG, GPO EG, and NYSERNet IG sites are known to the Production SCS.

For the Test SCS:

python gcf/gcf-current/src/gcf/omnilib/stitch/scs.py --listaggregates --scs_url http://nutshell.maxgigapop.net:8081/geni/xmlrpc > scs-test

Look for the aggregates identified in the earlier steps. For example, for the New York switch consolidation effort, the 'listaggregates' function shows that the GPO IG, GPO EG, GPO OG, NYSERNet IG, iMinds, and UMass sites are known to the Test SCS.

Send a request to:

  • the GMOC to update the Production SCS
  • Xi to update the Test SCS
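
Deciding which SCS needs an update request comes down to checking which of the saved listings mentions each affected aggregate. The following is a minimal sketch, assuming the scs-prod and scs-test files saved above have been read into strings and that each aggregate can be recognized by a string such as its URN authority; the exact output format of --listaggregates is not assumed, and scs_coverage is a hypothetical helper.

```python
def scs_coverage(listings, authorities):
    """Map each authority string to the SCS names whose listing mentions it.

    listings: {"prod": <text of scs-prod>, "test": <text of scs-test>}
    """
    coverage = {}
    for authority in authorities:
        # Case-insensitive substring match against each saved listing.
        coverage[authority] = [
            name
            for name, text in listings.items()
            if authority.lower() in text.lower()
        ]
    return coverage


# Illustrative listing text only; real listings come from scs.py output.
listings = {
    "prod": "urn:publicid:IDN+instageni.gpolab.bbn.com+authority+cm",
    "test": "urn:publicid:IDN+instageni.gpolab.bbn.com+authority+cm\n"
            "urn:publicid:IDN+wall2.ilabt.iminds.be+authority+cm",
}
affected = ["instageni.gpolab.bbn.com", "wall2.ilabt.iminds.be"]
for authority, names in scs_coverage(listings, affected).items():
    print(authority, "->", ", ".join(names) or "not found in any SCS")
```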

6. Validate Updated Stitching

When the updates are completed for all Aggregates and for the SCS servers, testing takes place to verify the URN changes. Validation includes:

  • Verify the advertisements for AL2S and for each GENI aggregate that was updated. If the new URN is missing from the stitching section, contact the appropriate aggregate team.
  • Create stitched slivers with the production SCS that use each of the rack aggregates that were updated and connect them to a remote stitching site. Log in to one node for each sliver and leave some ping traffic running. DO NOT delete these slivers; they are used later in monitoring verification. If the Production SCS reports an unknown path, contact Luke or AJ about updating the production SCS.
  • Create stitched slivers with the test SCS that use each of the rack aggregates that were updated and connect them to a remote stitching site. This can be done by using the omni/stitcher option "--scsURL https://nutshell.maxgigapop.net:8443/geni/xmlrpc". Log in to one node for each sliver and leave some ping traffic running. DO NOT delete these slivers; they are used later in monitoring verification. If the Test SCS reports an unknown path, contact Xi about updating the Test SCS.
  • Review the Operators page to replace any instances of old URNs or old switch/port names. Check the network drawings as well as the text. It is OK to add notes to the network drawing section, because revising the drawings usually requires getting a new drawing from the site, which takes longer than the scheduled outage.

For example, an Operators page entry with the old switch details:

Router                   Interface   Site          VLAN Range
salt.net.internet2.edu   eth7/1      utah-stitch   2100-3499

With the new switch details:

Router                   Interface   Site          VLAN Range
salt.net.internet2.edu   et-4/3/0    utah-stitch   2100-3499

  • GENI Monitoring URN Validation. Log in to https://genimondev.uky.edu and use the search feature to find all data relating to the new AL2S switch, for example "rtsw.salt.net.internet2.edu". Make sure the following are returned:
    • a switch is listed with the new name "rtsw.salt.net.internet2.edu",
    • interface statistics are available for the new switch,
    • VLANs are being reported for the new switch.

  • Report back about test findings or any outstanding/unresolved issues.

7. Update and Close Tickets

Assuming all tests are successful, update and close all tickets by emailing the GMOC and any individual resource owners who were contacted but not included in the tickets. If there are outstanding issues that are significant, leave the ticket open until they are resolved. If there are smaller outstanding issues, close the maintenance tickets, and open new tickets with the appropriate owners to track and resolve, ideally before the next maintenance.

If this process needs revision to account for events that occurred during the maintenance, email the ops teams and follow up with discussion or revision as appropriate.