[[PageOutline]]

= GENI Operations Trial =

This page captures procedures and documentation required for the GENI Operations Trials.

= GENI Operations Responsibilities =

GENI Operations groups need to track activities and hand them off to each other, based on the type of GENI event that occurs. The following table lists the organizations that provide support, by area of operations responsibility:

|| '''Team''' || '''Mail list''' || '''Area of !Responsibility/Tools''' ||
|| GMOC Ops || gmoc@grnoc.iu.edu || Tracking and Escalation, Emergency Stop, Legal, Law Enforcement and Regulatory Requests, New Site Tracking, GENI AL2S services (OESS, OESS-AM, GENI SCS, FSFW, GRNOC coordination) ||
|| GPO Dev || gpo-sw-dev@geni.net || GENI Tools (gcf, omni, stitcher), GENI Clearinghouse and Portal ||
|| GPO Ops || geni-ops@googlegroups.com || Stitching, Escalation Response, !OpenFlow ||
|| MAX Dev || maxyang@umd.edu and tlehman@umd.edu || SCS Escalation ||
|| Nick Bastin Dev* || nick.bastin@gmail.com || FOAM and !FlowVisor ||
|| RENCI !Ops/Dev || exogeni-ops@renci.org || ExoGENI Racks, ExoGENI Stitching ||
|| UKY !Ops/Dev || geni-ops@googlegroups.com || GENI Monitoring System, Stitching Computation System ||
|| U of Utah Ops || geni-ops@googlegroups.com || InstaGENI Racks ||
|| Utah Dev || geni-dev-utah@flux.utah.edu || Jacks Tool, Emulab, Apt ||

* FOAM and !FlowVisor (FV) events are not included in the current Operations Trial.

Other operations groups may support GENI for particular aggregates or particular services.

|| '''Team''' || '''Mail list''' || '''Area of !Responsibility/Tools''' ||
|| !CloudLab Operations || cloudlab-ops@cloudlab.us || !CloudLab ||
|| ESNET Operations || || ||
|| iMINDS Operations || vwall-ops@atlantis.ugent.be || iMinds Aggregates, iMinds SCS ||
|| Starlight Operations || || ||

== GENI Event Type Prioritization and Dispatch ==

As events are reported, the operations team must determine their priority in order to respond appropriately.
The GMOC must also set appropriate priorities for related GENI tickets. `High` priority events may be escalated to `Critical` if the person reporting the issue identifies it as `Critical`. For example, if an issue will impact a demo, training session, or conference, the reporter may ask for it to be treated as `Critical`.

The following guidelines apply to prioritizing issues that are reported by a person or by a tool (such as a Nagios alert). This prioritization does not cover scheduled events, which fall under the `Normal` priority classification, below all events listed here. `Dispatch` indicates the group with primary responsibility for beginning and tracking the operations response in an area.

|| '''Priority''' || '''#''' || '''Event type''' || '''Dispatch''' || '''Procedure''' ||
|| Critical || 1 || Emergency Stop and LLR || GMOC || [http://gmoc.grnoc.iu.edu/uploads/7e/39/7e39c5ec9577a5badab80ea15419ece8/GENI-Emergency-Stop-Procedure-and-Workflow.pdf GMOC Emergency Stop procedure] and [https://gmoc.grnoc.iu.edu/uploads/79/f2/79f227a094bd55b11838b994498d525a/GENI-LLR-Procedure-Workflow.pdf GMOC LLR procedure] ||
|| Critical || 2 || Security Event ''[*]'' || GMOC || [wiki:LuisaSandbox/GENIOperationsTrial/SecurityEvent OPS-002-A: Security: System Security Events] ||
|| Critical || 3 || GENI !Clearinghouse/Portal Event || GMOC || <> ||
|| Critical || 4 || Stitching Computation Service || GMOC || GMOC I2 SCS Procedure exists ||
|| Critical || 5 || AL2S Aggregate Event || GMOC || GMOC OESS Procedures exist ||
|| Critical || 6 || GENI Stitching Event || GMOC || [wiki:LuisaSandbox/GENIOperationsTrial/GENINetworkStitching OPS-003-B: Network Stitching Experiment Debugging Procedure] ||
|| High || 7 || WiMAX Multipoint VLAN Events || GMOC || [wiki:LuisaSandbox/GENIOperationsTrial/WimaxDataplaneebugging OPS-006-B: GENI WiMAX Dataplane Debugging] ||
|| High || 8 || Regional (AM+Switches+links) || Rack or Site Group || <> ||
|| High || 9 || Site Events reported by site contact || GMOC || [wiki:LuisaSandbox/GENIOperationsTrial/SitePowerOutage OPS-004-A: Site Power Outage], [wiki:LuisaSandbox/GENIOperationsTrial/ScheduledMaintenance OPS-005-A: Scheduled Site Maintenance] ||
|| High || 10 || Site Events (AM+Switches+links) reported by experimenters or tools ''[**]'' || Rack or Site Group || [wiki:LuisaSandbox/GENIOperationsTrial/SitePowerOutage OPS-004-A: Site Power Outage], [wiki:LuisaSandbox/GENIOperationsTrial/ScheduledMaintenance OPS-005-A: Scheduled Site Maintenance], [wiki:LuisaSandbox/GENIOperationsTrial/WimaxDebugging OPS-006-A: GENI WiMAX Base Station Debugging] ||
|| High || 11 || Experimenter Tools Events (Portal, Jacks, omni, etc.) || Tools Contacts || ||
|| High || 12 || Monitoring Infrastructure Events || UKY Monitoring team || [http://groups.geni.net/geni/wiki/GENIMonitoring/Alerts OPS-001-A: Monitoring: Creating Monitoring Event Alerts], [wiki:LuisaSandbox/GENIOperationsTrial/MonitoringSitesProcedure OPS-001-B: Monitoring: Adding Monitoring Sites], [wiki:LuisaSandbox/GENIOperationsTrial/MonitoringSystemOutage OPS-001-C: Monitoring: Monitoring System Outage] ||

''[*] Security Events start as Critical and may be re-prioritized upon investigation. [[BR]]''
''[**] Some `Site Events` may affect multiple sites (ExoSM) or non-GENI functions (!CloudLab, Emulab, Apt). These events require no special GMOC action and should be assigned to the team that owns the resources.''

= GENI Monitoring Tools and Documentation =

The [http://genimon.uky.edu/ GENI Monitoring System] is available to report status for resources, services and experimenters in GENI. See the [http://groups.geni.net/geni/wiki/GENIMonitoring GENI Monitoring] wiki page for information on features, reporting and alerts for GENI infrastructure. Some pages that may be helpful initially:
 - [http://groups.geni.net/geni/wiki/GENIMonitoring/Account Getting an Account] - An overview of account types and instructions on how to get an account.
 - [http://groups.geni.net/geni/wiki/GENIMonitoring/Overview GENI Monitoring Overview] - An overview of the functionality, with highlights of potential usage and available functions, and notes on restrictions or potential issues for experimenters.
 - [http://groups.geni.net/geni/wiki/GENIMonitoring/GENIMonCheck GENI Monitoring Resources Check] - An example of how to monitor a specific experiment's resources, how to access slice, sliver, VLAN, and interface information, and how to determine the mapping between these objects.

Other tools are available to determine resource status in GENI:
 - A [http://tamassos.gpolab.bbn.com/nagios3/ GENI Nagios] system provides reports and alerts based on [http://genimon.uky.edu/ GENI Monitoring] events, as well as host status.
 - The [http://groups.geni.net/geni/report GENI TRAC Ticket System] captures known issues.
 - The [https://github.com/GENI-NSF GENI NSF GitHub] sites also track issues for:
   * `geni-portal` - the user interface for a GENI clearinghouse
   * `geni-ch` - the GENI clearinghouse services
   * `geni-tools` - GENI tools including omni, stitcher, sample aggregate manager, etc. Formerly "gcf".
 - The [https://tick.globalnoc.iu.edu/fp_tools/public_ticket_viewer/index.cgi?proj=126 GMOC public ticket system]

The GMOC, rack teams, and (possibly) Nick Bastin use other operations tools that are not covered in this wiki. Tools that are required for operations should be mentioned in the relevant procedures, along with a link to documentation and (when available) source code.

= GENI Operations Procedures =

Procedures for handling common operational events define how the GENI Operations team should respond to, track, and resolve events. A [wiki:LuisaSandbox/GENIOperationsTrial/Template GENI Operations Template] is available for defining new procedures.
|| '''Procedure Name''' || '''Procedure Description''' ||
|| [http://groups.geni.net/geni/wiki/GENIMonitoring/Alerts OPS-001-A: Monitoring: Creating Monitoring Event Alerts] || Instructions on how to define `alerts` for operations events. ||
|| [wiki:LuisaSandbox/GENIOperationsTrial/MonitoringAddingSites OPS-001-B: Monitoring: Adding Monitoring Sites] || Defines the information required to request the addition, modification, or removal of a GENI site in monitoring. ||
|| [wiki:LuisaSandbox/GENIOperationsTrial/MonitoringSystemOutage OPS-001-C: Monitoring: Monitoring System Outage] || Defines how to report and respond to monitoring system events, such as a collector or GUI outage, non-critical bugs, or incorrect measurements. ||
|| [wiki:LuisaSandbox/GENIOperationsTrial/SecurityEvent OPS-002-A: Security: System Security Events] || Describes how to report a security event to the GENI Monitoring team, and how to respond to that event. This type of event can also include Emergency Stop events. ||
|| [wiki:LuisaSandbox/GENIOperationsTrial/DataPlaneDebugging OPS-003-A: GENI Network: Dataplane Debugging] || Defines the processes for detecting and debugging data plane and control plane issues in GENI experiments. [1] ||
|| [wiki:LuisaSandbox/GENIOperationsTrial/GENINetworkStitching OPS-003-B: GENI Network: Stitching Experiment Debugging] || Defines how to detect and debug issues with the stitching of Layer 2 VLANs for GENI experiments. ||
|| [wiki:LuisaSandbox/GENIOperationsTrial/SitePowerOutage OPS-004-A: Site Power Outage] || Defines how to handle a site power failure and how to verify that the site's aggregate is fully restored upon recovery. [1] ||
|| [wiki:LuisaSandbox/GENIOperationsTrial/ScheduledMaintenance OPS-005-A: Scheduled Site Maintenance] || Defines how to schedule one-time or periodic maintenance activities. [2] ||
|| [wiki:LuisaSandbox/GENIOperationsTrial/WimaxDebugging OPS-006-A: GENI WiMAX Base Station Debugging] || Defines how to identify and remedy GENI WiMAX base station configuration issues at a WiMAX site. [3] ||
|| [wiki:LuisaSandbox/GENIOperationsTrial/WimaxDataplaneebugging OPS-006-B: GENI WiMAX Dataplane Debugging] || Defines how to identify and remedy issues with GENI WiMAX multipoint VLAN(s) between sites. ||

[1] Before investigating this type of issue, first verify that it is not due to a [wiki:LuisaSandbox/GENIOperationsTrial/ScheduledMaintenance Scheduled Maintenance]. [[BR]]
[2] Site, service, or resource outages may be caused by a [wiki:LuisaSandbox/GENIOperationsTrial/ScheduledMaintenance Scheduled Maintenance]. [[BR]]
[3] A WiMAX issue arises when the frequency or transmit power of a campus WiMAX base station changes. These variables are fixed by site administrators and negotiated via a GENI-Sprint agreement. An event of this type should be very rare, but if monitoring reports a change, handling the event is high priority.
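
Notes [1] and [2] above ask responders to rule out Scheduled Maintenance before investigating an outage. The check could be sketched roughly as follows. This is only an illustration: the `MaintenanceWindow` structure, the function names, and the site names are hypothetical and do not correspond to any existing GENI or GMOC tool or data format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MaintenanceWindow:
    """One scheduled maintenance entry (hypothetical structure)."""
    site: str
    start: datetime
    end: datetime


def find_active_maintenance(site, event_time, windows):
    """Return the first window covering event_time at the given site, or None."""
    for window in windows:
        if window.site == site and window.start <= event_time <= window.end:
            return window
    return None


def should_investigate(site, event_time, windows):
    """Per note [1]: investigate only when no scheduled maintenance explains the event."""
    return find_active_maintenance(site, event_time, windows) is None


if __name__ == "__main__":
    # Illustrative data only; "example-site-a" is not a real GENI site name.
    schedule = [
        MaintenanceWindow(
            site="example-site-a",
            start=datetime(2015, 6, 1, 8, 0),
            end=datetime(2015, 6, 1, 12, 0),
        ),
    ]
    # An outage reported during the window is explained by maintenance.
    print(should_investigate("example-site-a", datetime(2015, 6, 1, 9, 30), schedule))
    # An outage reported after the window still needs investigation.
    print(should_investigate("example-site-a", datetime(2015, 6, 1, 13, 0), schedule))
```

In practice the list of windows would come from wherever scheduled maintenance is recorded for the site, and the comparison would need to use a consistent time zone.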