[[PageOutline]]

= Monitoring Requirements =

This page includes the monitoring requirements presented at [wiki:GEC12MesoScaleMonitoring GEC12]. The purpose of this page is to allow the addition and clarification of these requirements over time.

== Introduction ==

 * Why are we here?
   * Would like to ensure:
     * We don’t ignore any important pieces
     * Architecture decisions reflect the needs of monitoring so that the GENI Clearinghouse, I&M, etc. serve our needs
     * Where possible, we build tools which can be adapted to new software when it becomes available
   * Therefore we need to answer the following questions:
     * Are there any holes in our understanding of our requirements?
     * Do we have agreement on what we need in a way that can be communicated to other groups working on topics of interest to ops?
     * Do we know how to build tools in an adaptable way?
   * And also…
     * What do we work on next?
 * GENI Requires Monitoring
   * 10.2-3 Visible operational status
     * The GENI system shall make sufficient data available that researchers and maintainers will be able to evaluate the availability and operational status of the system.
   * 10.2-5 Federated event escalation
     * The GENI system shall provide operations and management support for event management and escalation, including security events, within GENI and with those organizations that interconnect with GENI.
   * 10.2-6 Federated operations data exchange
     * The GENI system shall support operational and management data exchange according to [TBS] GENI O&M Policy between GENI and operators/owners of federated components, aggregates, and networks.
 * Monitoring Players in GENI
   * Meta-operations (a.k.a. GMOC)
   * Aggregates
     * Examples: ProtoGENI, !PlanetLab, Orca
   * Campuses
     * Which host & run aggregates
     * Which only host aggregates
   * Backbone & Regional Networks
     * Networks which are GENI participants: I2, NLR
     * Networks which carry GENI traffic: some regionals (e.g. NoX)
   * Experimenters
 * Aggregates, Racks and the GENI CH
   * Each GENI rack is a SINGLE aggregate
     * Therefore requirements are the same as for aggregates
   * Aggregates can outsource (some of) their responsibilities to the GENI Clearinghouse
 * Definition
   * Monitoring
     * Act of collecting data and measuring what is happening
   * Management
     * Act of fixing problems and responding to requests
 * What does monitoring & management involve?
   * Observe unexpected events
     * THEN fix what’s wrong
   * Observe expected events
     * THEN develop policy for fixing what’s wrong
     * THEN fix what’s wrong (by responding to monitoring)
   * Plan for the future
     * Monitor long-term trends in resource usage
     * THEN provision resources to meet forecasted needs
 * What makes GENI different?
   * Federated entities managed by different institutions with different policies
   * People and information needed to troubleshoot and resolve problems are spread across several physical locations
   * Users (end users and experimenters), managers and hosts of a given piece of equipment may all be different.
   * Interactions between groups are governed by GENI federation agreements (e.g. aggregate provider agreement) and mutual understanding.
 * Motherhood & Apple Pie
   * We are not covering:
     * Monitoring and management which fits entirely within the purview of aggregates, campuses, etc.
   * For example, we will do (but not discuss here)
     * Keeping logs
     * Obeying local laws and policy
     * Answering the phone when someone has a concern
     * … and tie your shoes and everything else.
   * These things do NOT make GENI different

== Top-level aspects of GENI Monitoring and Management ==

These are the top-level aspects of GENI monitoring and management...

 * Top-level Aspects of GENI Monitoring
   1. Information must be shareable
   1. Information must be collected
   1. Information must be available when needed
   1. Cross-GENI operational statistics collected and synthesized to indicate GENI as a whole is working
   1. Preserve privacy of users (opt-in, experimenters, other users of resources)
 * Top-level Aspects of GENI Management
   1. For both debugging and security problems:
      1. Must be possible to escalate events
      1. Meta-operations and aggregate operators must work together to resolve problems in a timely manner
   1. Must be possible to do an emergency stop in case of a problem
   1. Orgs must manage GENI resources consistent with local policy and best practices
      * e.g. security procedures, logging, backups, etc.
   1. Develop policies for monitoring
   1. All parties should implement agreed upon policies
   1. Security of GENI as a whole and its pieces

== Breakdown of top level requirements ==

Each of the top-level aspects of GENI monitoring and management is broken down in more detail below. The current implementation of the requirement is listed after the -->. The color indicates status: [[Color(green,implemented according to the plan)]], [[Color(orange,partially implemented according to some plan or there exists an unimplemented plan)]], or [[Color(red,not implemented and no plan)]].

 * Requirement: Cross-GENI Monitoring
   * ''GENI monitoring is more than the sum of the monitoring at GENI’s parts. In order to know if GENI is working properly, additional monitoring is required beyond that done by each of its constituent pieces.''
   * Collect and synthesize additional operational statistics which indicate whether GENI is working
     * e.g. meso-scale ping tests, topology
   * Collect cross-GENI stats
   * Make cross-GENI stats available when needed
 * Requirement: Privacy
   * Preserve privacy of users (opt-in, experimenters, other users of resources)
     * [[Color(red,TBD – This is an area needing major discussion)]]
 * Requirement: Troubleshooting & Event Escalation
   * For both debugging and security problems:
     * Meta-operations and aggregate operators must work together to resolve problems
       * Aggregates must advertise resources accurately
         * (threshold) statically --> [[Color(green,Fill out aggregate page)]]
         * (objective) dynamically --> [[Color(green,Advertise resources via AM API)]]
       * Aggregates notify meta-operations when resources are unavailable --> [[Color(orange,via e-mail (doing SOME of the time))]]
       * Aggregates cooperate with meta-operations on the resolution of security events
       * Aggregates cooperate with LLR on the resolution of security events
     * Must be possible to escalate events
 * Requirement: Emergency Stop
   * Must be possible to do an emergency stop in case of a problem
     * Must maintain POC information at meta-operations
       * Aggregate --> [[Color(green,send contact info to GMOC)]]
       * Campus --> [[Color(green,send contact info to GMOC)]]
       * Experimenter --> [[Color(orange,slice e-mail)]]
       * Other infrastructure --> [[Color(green,contacted by relevant campus)]]
     * Aggregates & Meta-operations must each have policies and procedures in place to support an emergency stop
       * --> [[Color(orange,Has been dry run)]]
 * Requirement: Policy
   * Orgs must manage GENI resources consistent with local policy and best practices (e.g. security procedures, logging, backups, etc.)
     * In general, follow local policy and procedures
     * Follow best practices which, if not followed, would affect other members of the GENI community
   * Develop policies for monitoring
   * All parties should implement agreed upon policies
     * Follow Aggregate Provider Agreement
     * Follow LLR
     * Follow other GENI policies as they come into effect
 * Requirement: Security
   * Security of GENI as a whole and its pieces
     * Two things we want to prevent:
       * Compromise of GENI resources
       * Use of GENI resources to compromise other entities
     * Two things we can do about this:
       * Follow best practices to hinder compromise
       * Detect and respond to compromise
     * Allow interesting research for which experimenters and operations have to coordinate for security and management reasons
       * Security experimentation BOF tonight and session at GEC13!!!
     * --> [[Color(red,TBD – This is an area needing major discussion)]]
 * Requirement: Info must be shareable/collected/available
   * Information must be shareable
     * Consistent definitions of data
     * Consistent data exchange format
     * Consistent data collection mechanisms
     * Data sharing mechanisms
     * The following benefit from shared common processes:
       * Accessing data, finding data, visualizing data
   * Information must be collected
     * Verify continued successful data collection
     * Debug collection and reliability outages
   * Information must be available when needed
     * Privacy of data must be maintained

== Data Requirements ==

 * Data Definitions
   * Consistent definition of data
     * Relational data
       * Resources (incl. connectivity)
       * List of aggregates
       * List of slices
       * List of users
       * Aggregate contact information
     * Timeseries data
       * Examples: Host and network statistics
     * Events
       * Examples: SNMP Traps
 * Data Storage & Collection Methods
   * Data collection methods
     * Relational data --> store in relational DB
       * Resources --> [[Color(green,Rspecs available via AM API)]]
       * List of aggregates --> [[Color(orange,ctrl framework clearinghouse & GENI wiki)]]
       * List of slices --> [[Color(orange,control framework slice authority)]]
       * List of users --> [[Color(red,TBD)]]
       * Aggregate contact information --> [[Color(orange,aggregate page and GMOC)]]
     * Timeseries data --> [[Color(orange,store in RRD)]]
       * --> [[Color(green,collect via SNMP (i.e. host and network stats))]]
       * --> [[Color(green,by asking the aggregate (i.e. custom !OpenFlow API))]]
     * Events --> [[Color(red,store in relational DB (?))]]
       * --> [[Color(red,TBD)]]
 * General: Using Data
   * Sharing Data
     * --> [[Color(green,publish to central DB at GMOC)]]
     * --> [[Color(green,publish locally via webpage or local API)]]
     * --> [[Color(red,TBD: publish via a distributed mechanism)]]
   * Accessing, Finding and Visualizing Data
     * --> [[Color(green,GMOC Portals)]]
     * --> [[Color(green,GMOC SNAPP Interface (with search))]]
     * --> [[Color(green,GMOC data available to interested consumers via API)]]
     * --> [[Color(red,TBD: More to do here)]]
   * Other people who need data
     * Troubleshooting info from aggregates, campuses, meta-operations
     * Accountability report: How do I prove that this is not my fault?
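
As an illustration of what "consistent definitions of data" and a "consistent data exchange format" might look like for the three data categories listed under Data Definitions (relational data, timeseries data, events), here is a minimal Python sketch. Every class name, field name, and the JSON envelope layout is hypothetical and chosen only for this example; none of it is taken from GMOC, the AM API, or any existing GENI schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class AggregateContact:
    """Relational data: one row of aggregate contact information (hypothetical fields)."""
    aggregate: str
    contact_email: str


@dataclass
class TimeseriesSample:
    """Timeseries data: one host or network statistic at one point in time."""
    source: str
    metric: str
    timestamp: int  # seconds since the Unix epoch
    value: float


@dataclass
class Event:
    """Event data: a discrete occurrence, for example an SNMP trap."""
    source: str
    timestamp: int
    kind: str
    detail: str


# Registry used to map the "type" tag in an envelope back to a record class.
_TYPES = {cls.__name__: cls for cls in (AggregateContact, TimeseriesSample, Event)}


def to_exchange(record) -> str:
    """Serialize a record into a JSON envelope tagged with its type name."""
    envelope = {"type": type(record).__name__, "data": asdict(record)}
    return json.dumps(envelope, sort_keys=True)


def from_exchange(text: str):
    """Rebuild a record from a JSON envelope produced by to_exchange."""
    envelope = json.loads(text)
    return _TYPES[envelope["type"]](**envelope["data"])


if __name__ == "__main__":
    sample = TimeseriesSample("switch1.example.net", "ifInOctets", 1319990400, 1234.0)
    wire = to_exchange(sample)
    print(wire)
    print(from_exchange(wire) == sample)
```

Tagging each record with its type keeps one exchange path for all three categories, so a consumer can decide where to store a record (relational DB, RRD, or an event store) after decoding it. Whether a single envelope is the right choice for GENI is an open design question that this sketch does not settle.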