[[PageOutline]]

= GENI Operations Trial =

This page captures procedures and documentation required for the GENI Operations Trials.

= GENI Operations Responsibilities =

GENI Operations groups track activities and hand them off to each other, based on the type of GENI event that occurs. The GMOC provides a public support line/email, tracking, and a public ticket view for all GENI experimenters and operators. The following table lists the organizations that provide support, by area of operations responsibility:

|| '''Team''' || '''Mail list''' || '''Area of !Responsibility/Tools''' ||
|| GMOC Ops || gmoc@grnoc.iu.edu || Tracking and Escalation, Emergency Stop, Legal, Law Enforcement and Regulatory Requests, New Site Tracking, GENI AL2S services (OESS, OESS-AM, GENI SCS, FSFW, GRNOC coordination) ||
|| GPO Dev || gpo-sw-dev@geni.net || GENI Tools (gcf, omni, stitcher), GENI Clearinghouse and Portal ||
|| GPO Ops || geni-ops@googlegroups.com || Stitching, Escalation Response, !OpenFlow ||
|| MAX Dev || maxyang@umd.edu and tlehman@umd.edu || SCS Escalation ||
|| Nick Bastin Dev* || nick.bastin@gmail.com || FOAM and !FlowVisor ||
|| RENCI !Ops/Dev || exogeni-ops@renci.org || ExoGENI Racks, ExoGENI Stitching ||
|| UKY !Ops/Dev || geni-ops@googlegroups.com || GENI Monitoring System, Stitching Computation System (IG rack endpoints) (Note: UKY is coming up to speed on IG racks) ||
|| U of Utah Ops || geni-ops@googlegroups.com || InstaGENI Racks (Note: Utah is gradually transitioning some IG activities to UKY) ||
|| OpenGENI Ops || TBD || OpenGENI Racks ||
|| Utah Dev || geni-dev-utah@flux.utah.edu || Jacks Tool, Emulab, Apt ||

* FOAM and FV events are not included in the current Operations Trial.

Other operations groups may support GENI for particular aggregates or particular services.

|| '''Team''' || '''Mail list''' || '''Area of !Responsibility/Tools''' ||
|| !CloudLab Operations || cloudlab-ops@cloudlab.us || !CloudLab ||
|| ESNET Operations || || ||
|| iMINDS Operations || vwall-ops@atlantis.ugent.be || iMinds Aggregates, iMinds SCS ||
|| Starlight Operations || || ||

[[BR]]

== GENI Event Type Prioritization and Dispatch ==

As events are reported, the operations team must determine their priority in order to respond appropriately. `Critical` events require immediate attention and should be resolved within 1 business day. `High` priority events should be investigated within 1 business day and resolved within 2 business days. `Normal` priority events should be investigated and closed within 1 business week, except for scheduled events (including scheduled maintenance events), which should be verified and closed by the end of their scheduled window. The GMOC must also set appropriate priorities for related GENI tickets.

`High` priority events may be escalated to `Critical` if the person reporting the issue identifies it as `Critical` based on the reported circumstances. For example, if an issue will impact a demo, training session, or conference, the reporter may ask for it to be treated as `Critical`.

The following guidelines apply to prioritizing issues that are reported by a person or by a tool (such as a Nagios alert). This prioritization does not cover scheduled events or routine requests, which fall under the `Normal` priority classification, below all events listed here. `Dispatch` in the table below indicates the group with primary responsibility to begin and track the operations response in an area.

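The response targets and the escalation rule described above can be sketched in Python. This is an illustrative sketch only, not a GMOC tool: the names `TARGETS`, `effective_priority`, `add_business_days`, and `resolve_by` are hypothetical, a business week is assumed to be five business days, business days are assumed to be Monday through Friday with holidays ignored, and scheduled events (which close at the end of their scheduled window) are not modeled.

```python
from datetime import date, timedelta

# Business-day targets per priority: (investigate within, resolve within).
# Values are taken from the text above; "1 business week" is ASSUMED to mean
# 5 business days, and "immediate attention" is represented as 0.
TARGETS = {
    "Critical": (0, 1),
    "High": (1, 2),
    "Normal": (5, 5),
}


def effective_priority(base: str, reporter_says_critical: bool = False) -> str:
    """Raise High to Critical when the reporter identifies the issue as Critical."""
    if base == "High" and reporter_says_critical:
        return "Critical"
    return base


def add_business_days(start: date, days: int) -> date:
    """Advance by whole business days, skipping Saturday and Sunday.

    Holidays are ignored (an assumption of this sketch).
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current


def resolve_by(reported: date, base: str, reporter_says_critical: bool = False) -> date:
    """Return the date by which an event should be resolved."""
    priority = effective_priority(base, reporter_says_critical)
    _, resolve_days = TARGETS[priority]
    return add_business_days(reported, resolve_days)


if __name__ == "__main__":
    friday = date(2015, 3, 6)
    print(resolve_by(friday, "High"))        # resolved within 2 business days
    print(resolve_by(friday, "High", True))  # treated as Critical: 1 business day
```

For a `High` event reported on a Friday, the sketch skips the weekend when computing the resolution date. If the reporter flags the same event as `Critical`, the target shortens to one business day.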
|| '''Priority''' || '''#''' || '''Event type''' || '''Dispatch''' || '''Procedure''' ||
|| Critical || 1 || Emergency Stop and LLR || GMOC || [http://gmoc.grnoc.iu.edu/uploads/7e/39/7e39c5ec9577a5badab80ea15419ece8/GENI-Emergency-Stop-Procedure-and-Workflow.pdf GMOC Emergency Stop procedure] and [https://gmoc.grnoc.iu.edu/uploads/79/f2/79f227a094bd55b11838b994498d525a/GENI-LLR-Procedure-Workflow.pdf GMOC LLR procedure] ||
|| Critical || 2 || Security Event ''[*]'' || GMOC || [wiki:GENIOperationsTrial/SecurityEvent OPS-002-A: Security: System Security Events] ||
|| Critical || 3 || GENI !Clearinghouse/Portal Event || GMOC || <> ||
|| Critical || 4 || Stitching Computation Service || GMOC || GMOC I2 SCS Procedure exists ||
|| Critical || 5 || AL2S Aggregate Event || GMOC || GMOC OESS Procedures exist ||
|| Critical || 6 || GENI Stitching Event || GMOC || [wiki:GENIOperationsTrial/GENINetworkStitching OPS-003-B: Network Stitching Experiment Debugging Procedure] ||
|| High || 7 || WiMAX Multipoint VLANs Events || GMOC || [wiki:GENIOperationsTrial/WimaxDataplaneebugging OPS-006-B: GENI WiMAX Dataplane Debugging] ||
|| High || 8 || Regional (AM+Switches+links) || Rack or Site Group || <> ||
|| High || 9 || Site Events reported by site contact || GMOC || [wiki:GENIOperationsTrial/SitePowerOutage OPS-004-A: Site Power Outage], [wiki:GENIOperationsTrial/ScheduledMaintenance OPS-005-A: Scheduled Site Maintenance] ||
|| High || 10 || Site Events (AM+Switches+links reported by experimenters or tools) ''[**]'' || Rack or Site Group || [wiki:GENIOperationsTrial/SitePowerOutage OPS-004-A: Site Power Outage], [wiki:GENIOperationsTrial/ScheduledMaintenance OPS-005-A: Scheduled Site Maintenance], [wiki:GENIOperationsTrial/WimaxDebugging OPS-006-A: GENI WiMAX Base Station Debugging] ||
|| High || 11 || Experimenter Tools Events (Portal, Jacks, omni, ...) || Tools Contacts || ||
|| High || 12 || Monitoring Infrastructure Events || UKY Monitoring team || [wiki:GENIOperationsTrial/MonitoringSystemOutage OPS-001-C: Monitoring: Monitoring System Outage] ||
|| Normal || 13 || Monitoring Infrastructure Setup || UKY Monitoring team / GPO || [http://groups.geni.net/geni/wiki/GENIMonitoring/Alerts OPS-001-A: Monitoring: Creating Monitoring Event Alerts], [wiki:GENIOperationsTrial/MonitoringAddingSites OPS-001-B: Monitoring: Adding Monitoring Sites] ||

''[*] Security Events start as Critical and may be re-prioritized upon investigation.'' [[BR]]
''[**] Some `Site Events` may affect multiple sites (ExoSM) or non-GENI functions (!CloudLab, Emulab, Apt). These events require no special GMOC action and should be assigned to the team that owns the resources.''

[[BR]]

= GENI Monitoring Tools and Documentation =

The [http://genimon.uky.edu/ GENI Monitoring System] is available to report status for resources, services, and experimenters in GENI. See the [http://groups.geni.net/geni/wiki/GENIMonitoring GENI Monitoring] wiki page for information on features, reporting, and alerts for GENI infrastructure. Some pages that may be helpful initially:

 - [http://groups.geni.net/geni/wiki/GENIMonitoring/Account Getting an Account] - An overview of account types and instructions on how to get an account.
 - [http://groups.geni.net/geni/wiki/GENIMonitoring/Overview GENI Monitoring Overview] - An overview of the functionality that highlights potential usage and available functions, and notes restrictions or potential issues for experimenters.
 - [http://groups.geni.net/geni/wiki/GENIMonitoring/GENIMonCheck GENI Monitoring Resources Check] - An example of how to use monitoring for a specific experiment's resources, how to access slice, sliver, VLAN, and interface information, and how to determine the mapping between these objects.
 - [http://groups.geni.net/geni/wiki/GENIMonitoring/GENIMonInvestigate GENI Monitoring Site Issue Investigation] - An example of how to investigate a site-specific issue, using monitoring to determine resource allocation and the source of the issue.

Other tools are available to determine resource status in GENI:

 - A [http://tamassos.gpolab.bbn.com/nagios3/ GENI Nagios] system provides host status and reports on alerts based on [http://genimon.uky.edu/ GENI Monitoring] events.
 - The [http://groups.geni.net/geni/report GENI TRAC Ticket System] captures known issues.
 - The [https://github.com/GENI-NSF GENI NSF GitHub] sites also track issues for:
   * `geni-portal` - the user interface for a GENI clearinghouse
   * `geni-ch` - the GENI clearinghouse services
   * `geni-tools` - GENI tools including omni, stitcher, sample aggregate manager, etc. Formerly "gcf."
 - The [https://tick.globalnoc.iu.edu/fp_tools/public_ticket_viewer/index.cgi?proj=126 GMOC public ticket system]

The GMOC, rack teams, and (possibly) Nick Bastin use other operations tools that are not covered in this wiki. Tools that are required for operations should be mentioned in the relevant procedures, along with a link to documentation and (when available) source code.

[[BR]]

= Proposed GENI Operations Procedures =

Procedures to handle common operational events are available to define how the GENI Operations team should respond to, track, and resolve events. A '' '''[wiki:GENIOperationsTrial/Template GENI Operations Template]''' '' is available for defining new procedures.

|| '''Procedure Name''' || '''Procedure Description''' ||
|| [http://groups.geni.net/geni/wiki/GENIMonitoring/Alerts OPS-001-A: Monitoring: Creating Monitoring Event Alerts] || Instructions on how to define `alerts` for operations events. ||
|| [wiki:GENIOperationsTrial/MonitoringAddingSites OPS-001-B: Monitoring: Adding Monitoring Sites] || Defines the information required to request the addition, modification, or removal of a GENI site in monitoring. ||
|| [wiki:GENIOperationsTrial/MonitoringSystemOutage OPS-001-C: Monitoring: Monitoring System Outage] || Defines how to report and respond to monitoring system events such as a collector or GUI being down, non-critical bugs, or incorrect measurements. ||
|| [wiki:GENIOperationsTrial/SecurityEvent OPS-002-A: Security: System Security Events] || Describes how to report a security event to the GENI Monitoring team, and how to respond to that event. This type of event can also include Emergency Stop events. ||
|| [wiki:GENIOperationsTrial/DataPlaneDebugging OPS-003-A: GENI Network: Dataplane Debugging] || Defines the processes for detecting and debugging data plane and control plane issues in GENI experiments. [1] ||
|| [wiki:GENIOperationsTrial/GENINetworkStitching OPS-003-B: GENI Network: Stitching Experiment Debugging] || Defines how to detect and debug issues with the stitching of Layer 2 VLANs for GENI experiments. ||
|| [wiki:GENIOperationsTrial/SitePowerOutage OPS-004-A: Site Power Outage] || Defines how to handle a site power failure and how to verify that the site's aggregate is fully restored upon recovery. [1] ||
|| [wiki:GENIOperationsTrial/ScheduledMaintenance OPS-005-A: Scheduled Site Maintenance] || Defines the handling of scheduled one-time or periodic maintenance activities. [2] ||
|| [wiki:GENIOperationsTrial/WimaxDebugging OPS-006-A: GENI WiMAX Base Station Debugging] || Defines how to identify and remedy GENI WiMAX base station configuration issues at a WiMAX site. [3] ||
|| [wiki:GENIOperationsTrial/WimaxDataplaneebugging OPS-006-B: GENI WiMAX Dataplane Debugging] || Defines how to identify and remedy issues with GENI WiMAX multipoint VLAN(s) between sites. ||

[1] Before investigating this type of issue, first verify that it is not due to a [wiki:GENIOperationsTrial/ScheduledMaintenance Scheduled Maintenance]. [[BR]]
[2] Site, service, or resource outages may be caused by a [wiki:GENIOperationsTrial/ScheduledMaintenance Scheduled Maintenance]. [[BR]]
[3] A WiMAX issue arises when the frequency or transmit power of a campus WiMAX base station changes. These variables are fixed by site administrators and negotiated via a GENI-Sprint agreement. An event of this type should be very rare, but if monitoring reports a change, handling the event is high priority.

[[BR]]

= GENI Operations Daily Checks =

Procedures for Daily Checks are available to define how the GENI Operations team should verify the state of GENI resources and escalate as needed. A '' '''[wiki:GENIOperationsTrial/TemplateDaily GENI Daily Check Template]''' '' is available for defining new procedures.

|| '''Daily Checks''' || '''Daily Checks Description''' ||
|| [wiki:GENIOperationsTrial/GENISecurityCheckRack CHK-001-A: GENI Racks Security Checks] || Defines daily security checks for InstaGENI rack logs ||
|| [wiki:GENIOperationsTrial/GENISecurityCheckStitch CHK-001-B: GENI Stitching Computation Service Security Checks] || Defines daily security checks for the Stitching Computation Service (SCS) ||
|| [wiki:GENIOperationsTrial/GENISecurityCheckAL2S CHK-001-C: GENI AL2S AM Security Checks] || Defines daily security checks for the GENI AL2S Aggregate Manager ||
|| [wiki:GENIOperationsTrial/GENISecurityCheckClearinghouse CHK-001-D: GENI Clearinghouse Security Checks] || Defines daily security checks for the GENI Clearinghouse ||
|| [wiki:GENIOperationsTrial/GENISecurityCheckPortal CHK-001-E: GENI Portal Security Checks] || Defines daily security checks for the GENI Portal ||
|| [wiki:GENIOperationsTrial/GENISecurityWiMAX CHK-001-F: GENI WiMAX Checks] || Defines daily security checks for WiMAX sites ||
|| [wiki:GENIOperationsTrial/GENIExpResourcesCheck CHK-002: GENI Experiment Resources Checks] || Defines a daily check for all compute and network resources in GENI ||
|| [wiki:GENIOperationsTrial/GENIStitchingCheck CHK-003: GENI Network Stitching Checks] || Defines a daily check of GENI Network stitching resources ||
|| [wiki:GENIOperationsTrial/GENIExpSliceCheck CHK-004: GENI Long Running Slice Checks] || Defines a daily check for a long-running experiment named `triangle`, a 3-node topology continuously exchanging traffic ||
|| [wiki:GENIOperationsTrial/GENIOpenFlowCheck CHK-005: GENI Network Connectivity OpenFlow Checks] || Defines a daily check of the GENI network through connectivity tests using !OpenFlow resources ||

[[BR]]

= GMOC Approved GENI Operations Procedures =

For a list of procedures that have been accepted for operations, see the [wiki:OperationsProcedures GENI Network Operations Procedures] page.