Changes between Initial Version and Version 1 of GENI-Infrastructure-Portal/MonitoringReqts


Timestamp:
12/12/11 19:25:49 (12 years ago)
Author:
sedwards@bbn.com
Comment:

Original text from GEC12 slides

     1[[PageOutline]]
     2
     3= Monitoring Requirements =
     4
     5This page includes the monitoring requirements presented at [wiki:GEC12MesoScaleMonitoring GEC12]. 
     6
     7The purpose of this page is to allow the addition and clarification of these requirements over time.
     8
     9== Introduction ==
     10 * Why are we here?
     11   * Would like to ensure:
     12     * We don’t ignore any important pieces
     13     * Architecture decisions reflect the needs of monitoring so that the GENI Clearinghouse, I&M, etc. serve our needs
     14     * Where possible, we build tools which can be adapted to new software when it becomes available
     15 * Therefore we need to answer the following questions:
     16   * Are there any holes in our understanding of our requirements?
     17   * Do we have agreement on what we need in a way that can be communicated to other groups working on topics of interest to ops?
     18   * Do we know how to build tools in an adaptable way?
     19 * And also…
     20   * What do we work on next?
     21
     22 * GENI Requires Monitoring
     23   * 10.2-3 Visible operational status
     24     * The GENI system shall make sufficient data available that researchers and maintainers will be able to evaluate the availability and operational status of the system.
     25   * 10.2-5 Federated event escalation
     26     * The GENI system shall provide operations and management support for event management and escalation, including security events, within GENI and with those organizations that interconnect with GENI.
     27   * 10.2-6 Federated operations data exchange
     28     * The GENI system shall support operational and management data exchange according to [TBS] GENI O&M Policy between GENI and operators/owners of federated components, aggregates, and networks. 
     29
     30 * Monitoring Players in GENI
     31   * Meta-operations (a.k.a. GMOC)
     32   * Aggregates
     33     * Examples: ProtoGENI, !PlanetLab, Orca
     34   * Campuses
     35     * Which host & run aggregates
     36     * Which only host aggregates
     37   * Backbone & Regional Networks
     38     * Networks which are GENI participants: I2, NLR
     39     * Networks which carry GENI traffic: some regionals (e.g. NoX)
     40   * Experimenters
     41
     42 * Aggregates, Racks and the GENI CH
     43   * Each GENI rack is a SINGLE aggregate
     44     * Therefore requirements are the same as for aggregates
     45   * Aggregates can outsource (some of) their responsibilities to the GENI Clearinghouse
     46
     47 * Definition
     48   * Monitoring
     49     * Act of collecting data and measuring what is happening
     50   * Management
     51     * Act of fixing problems and responding to requests
     52   * What does monitoring & management involve?
     53    * Observe unexpected events
     54      * THEN fix what’s wrong
     55    * Observe expected events
     56      * THEN develop policy for fixing what’s wrong
     57      * THEN fix what’s wrong (by responding to monitoring)
     58    * Plan for the future
     59      * Monitor long-term trends in resource usage
     60      * THEN provision resources to meet forecasted needs
     61
     62 * What makes GENI different?
     63   * Federated entities managed by different institutions with different policies
     64   * People and information needed to troubleshoot and resolve problems are spread across several physical locations
     65   * Users (end users and experimenters), managers, and hosts of a given piece of equipment may all be different.
     66   * Interactions between groups are governed by GENI federation agreements (e.g. aggregate provider agreement) and mutual understanding.
     67
     68 * Motherhood & Apple Pie
     69   * We are not covering:
     70     * Monitoring and management which fits entirely within the purview of aggregates, campuses, etc
     71   * For example, we will do (but not discuss here)
     72     * Keeping logs
     73     * Obeying local laws and policy
     74     * Answering the phone when someone has a concern
     75     * … and tie your shoes and everything else.
     76   * These things do NOT make GENI different
     77== Top-level aspects of GENI Monitoring and Management ==
     78
     79These are the top-level aspects of GENI monitoring and management...
     80
     81 * Top-level Aspects of GENI Monitoring
     82    1. Information must be shareable
     83    1. Information must be collected
     84    1. Information must be available when needed
     85    1. Cross-GENI operational statistics collected and synthesized to indicate GENI as a whole is working 
     86    1. Preserve privacy of users (opt-in, experimenters, other users of resources)
     88
     89 * Top-level Aspects of GENI Management
     90   1. For both debugging and security problems:
     91   1. Must be possible to escalate events
     92   1. Meta-operations and aggregate operators must work together to resolve problems in a timely manner
     93   1. Must be possible to do an emergency stop in case of a problem
     94   1. Orgs must manage GENI resources consistent with local policy and best practices
     95       * e.g. security procedures, logging, backups, etc.
     96   1. Develop policies for monitoring
     97   1. All parties should implement agreed upon policies
     98   1. Security of GENI as a whole and its pieces
     99
     100== Breakdown of top level requirements ==
     101Each of the top-level aspects of GENI monitoring and management
     102is broken down in more detail below.
     103
     104The current implementation of the requirement is listed after the -->.
     105The color means [[Color(green,implemented according to the plan)]],
     106[[Color(orange,partially implemented according to some plan or there exists an unimplemented plan)]], or [[Color(red,not implemented and no plan)]].
     107
     108 * Requirement: Cross-GENI Monitoring
     109    * ''GENI monitoring is more than the sum of the monitoring at GENI’s parts. In order to know if GENI is working properly, additional monitoring is required beyond that done by each of its constituent pieces.''
     110    * Collect and synthesize additional operational statistics which indicate whether GENI is working
     111       * e.g. meso-scale ping tests, topology
     112    * Collect cross-GENI stats
     113    * Make cross-GENI stats available when needed
     114
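As a rough illustration of "collect and synthesize" for something like the meso-scale ping tests, the sketch below rolls per-pair reachability results up into one cross-GENI figure. The `PingResult` record, the `summarize` helper, and the site names are all hypothetical; this is not the GMOC's actual format or tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PingResult:
    """One hypothetical reachability probe between two sites."""
    src: str
    dst: str
    reachable: bool


def summarize(results):
    """Return (fraction of successful probes, sorted failing (src, dst) pairs)."""
    if not results:
        # No data is reported as 0.0 rather than as "everything is fine".
        return 0.0, []
    failures = sorted((r.src, r.dst) for r in results if not r.reachable)
    ok_fraction = (len(results) - len(failures)) / len(results)
    return ok_fraction, failures


# Made-up probe results, for illustration only.
results = [
    PingResult("site-a", "site-b", True),
    PingResult("site-a", "site-c", False),
    PingResult("site-b", "site-c", True),
    PingResult("site-c", "site-a", True),
]
fraction, failing = summarize(results)
print(f"{fraction:.0%} of probes succeeded; failing pairs: {failing}")
```

The point of the roll-up is that no single aggregate can compute it alone: it needs results from every pair of sites.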
     115 * Requirement: Privacy
     116   * Preserve privacy of users (opt-in, experimenters, other users of resources)
     117      * [[Color(red,TBD – This is an area needing major discussion)]]
     118
     119 * Requirement: Troubleshooting & Event Escalation
     120   * For both debugging and security problems:
     121      * Meta-operations and aggregate operators must work together to resolve problems
     122        * Aggregates must advertise resources accurately
     123           * (threshold) statically --> [[Color(green,Fill out aggregate page)]]
     124           * (objective) dynamically --> [[Color(green,Advertise resources via AM API)]]
     125        * Aggregates notify meta-operations when resources are unavailable --> [[Color(orange,via e-mail (doing SOME of the time))]]
     126        * Aggregates cooperate with meta-operations on the resolution of security events
     127        * Aggregates cooperate with LLR on the resolution of security events
     128      * Must be possible to escalate events
     129
     130
     131 * Requirement: Emergency Stop
     132   * Must be possible to do an emergency stop in case of a problem
     133   * Must maintain POC information at meta-operations
     134      * Aggregate --> [[Color(green,send contact info to GMOC)]]
     135      * Campus --> [[Color(green,send contact info to GMOC)]]
     136      * Experimenter --> [[Color(orange,slice e-mail)]]
     137      * Other infrastructure --> [[Color(green,contacted by relevant campus)]]
     138   * Aggregates & Meta-operations must each have policies and procedures in place to support an emergency stop
     139      *  -->  [[Color(orange,Has been dry run)]]
     140
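One way to check that POC information is complete before an emergency stop is needed is to compare the contacts on file against the four roles listed above. The sketch below assumes a simple in-memory list; the `Contact` record, the role strings, and the example addresses are hypothetical.

```python
from dataclasses import dataclass

# The four kinds of contact named in the requirement above.
REQUIRED_ROLES = ("aggregate", "campus", "experimenter", "other infrastructure")


@dataclass(frozen=True)
class Contact:
    role: str
    name: str
    email: str


def missing_roles(contacts):
    """Return the required roles with no contact on file, in REQUIRED_ROLES order."""
    present = {c.role for c in contacts}
    return [role for role in REQUIRED_ROLES if role not in present]


# Made-up contacts, for illustration only.
contacts = [
    Contact("aggregate", "Example Aggregate Ops", "agg-ops@example.org"),
    Contact("campus", "Example Campus IT", "it@example.edu"),
]
print(missing_roles(contacts))
```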
     141 * Requirement: Policy
     142   * Orgs must manage GENI resources consistent with local policy and best practices (e.g. security procedures, logging, backups, etc.)
     143   * In general, follow local policy and procedures
     144     * Follow best practices whose neglect would affect other members of the GENI community
     145     * Develop policies for monitoring
     146   * All parties should implement agreed upon policies
     147     * Follow Aggregate Provider Agreement
     148     * Follow LLR
     149     * Follow other GENI policies as they come into effect
     150
     151 * Requirement: Security
     152   * Security of GENI as a whole and its pieces
     153   * Two things we want to prevent:
     154      * Compromise of GENI resources
     155      * Use of GENI resources to compromise other entities
     156   * Two things we can do about this:
     157      * Follow best practices to hinder compromise
     158      * Detect and respond to compromise
     159   * Allow interesting research for which experimenters and operations have to coordinate for security and management reasons
     160      * Security experimentation BOF tonight and session at GEC13!!!
     161   * --> [[Color(red,TBD – This is an area needing major discussion)]]
     162
     163 * Requirement: Info must be shareable/collected/available
     164   * Information must be shareable
     165     * Consistent definitions of data
     166     * Consistent data exchange format
     167     * Consistent data collection mechanisms
     168     * Data sharing mechanisms
     169     * The following benefit from shared common processes:
     170       * Accessing data, finding data, visualizing data
     171   * Information must be collected
     172     * Verify continued successful data collection
     173     * Debug collection and reliability outages
     174   * Information must be available when needed
     175     * Privacy of data must be maintained
     176
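"Verify continued successful data collection" can be approximated with a staleness check: record when each source last delivered data and flag any source that has been silent too long. The sketch below assumes a mapping from source name to last-seen time; the source names and the 15-minute threshold are illustrative assumptions, not agreed values.

```python
from datetime import datetime, timedelta, timezone


def stale_sources(last_seen, now, max_age):
    """Return names of sources whose newest sample is older than max_age, sorted."""
    return sorted(name for name, seen in last_seen.items() if now - seen > max_age)


# Made-up last-seen times, for illustration only.
now = datetime(2011, 12, 12, 19, 30, tzinfo=timezone.utc)
last_seen = {
    "aggregate-a": now - timedelta(minutes=2),
    "aggregate-b": now - timedelta(minutes=45),
    "campus-c": now - timedelta(hours=3),
}
print(stale_sources(last_seen, now, max_age=timedelta(minutes=15)))
```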
     177== Data Requirements ==
     178 * Data Definitions
     179   * Consistent definition of data
     180      * Relational data
     181        * Resources (incl. connectivity)
     182        * List of aggregates
     183        * List of slices
     184        * List of users
     185        * Aggregate contact information
     186      * Timeseries data
     187        * Examples: Host and network statistics
     188      * Events
     189        * Examples: SNMP Traps
     190
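To make "consistent definition of data" concrete, the sketch below gives one possible shape for each of the three kinds of data listed above, plus a single JSON envelope for exchanging them. All class names, field names, and the envelope layout are assumptions for discussion, not an agreed GENI format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AggregateRecord:
    """Relational data: one row in a list of aggregates."""
    name: str
    contact_email: str


@dataclass
class TimeseriesSample:
    """Timeseries data: one host or network statistic at one instant."""
    source: str
    metric: str
    timestamp: int  # seconds since the Unix epoch
    value: float


@dataclass
class Event:
    """Event data: something that happened once, such as an SNMP trap."""
    source: str
    timestamp: int
    kind: str
    detail: str


def to_wire(record):
    """Serialize any of the records above into one common JSON envelope."""
    return json.dumps({"type": type(record).__name__, "data": asdict(record)}, sort_keys=True)


sample = TimeseriesSample("host1", "cpu_load", 1323717949, 0.5)
print(to_wire(sample))
```

Tagging each message with its record type lets a consumer handle all three kinds of data through one exchange path.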
     191 * Data Storage & Collection Methods
     192    * Data collection methods
     193      * Relational data --> store in relational DB
     194      * Resources --> [[Color(green,Rspecs available via AM API)]]
     195      * List of aggregates --> [[Color(orange,ctrl framework clearinghouse & GENI wiki)]]
     196      * List of slices --> [[Color(orange,control framework slice authority)]]
     197      * List of users --> [[Color(red,TBD)]]
     198      * Aggregate contact information --> [[Color(orange,aggregate page and GMOC)]]
     199    * Timeseries data --> [[Color(orange,store in RRD)]]
     200      * --> [[Color(green,collect via SNMP (i.e. host and network stats))]]
     201      * --> [[Color(green,by asking the aggregate (i.e. custom !OpenFlow API))]]
     202    * Events --> [[Color(red,store in relational DB (?))]]
     203      * --> [[Color(red,TBD)]]
     204
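As a minimal sketch of "store in relational DB", the example below keeps the aggregate list with contact information in one table and events in another, using Python's built-in `sqlite3` module with an in-memory database. The table and column names are hypothetical, and storing events this way is still an open question above, so treat the `event` table as one option rather than a decision.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE aggregate (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE,
        contact_email TEXT NOT NULL
    );
    CREATE TABLE event (
        id INTEGER PRIMARY KEY,
        aggregate_id INTEGER NOT NULL REFERENCES aggregate(id),
        occurred_at TEXT NOT NULL,   -- ISO 8601 timestamp
        description TEXT NOT NULL
    );
    """
)

# Made-up rows, for illustration only.
conn.execute(
    "INSERT INTO aggregate (name, contact_email) VALUES (?, ?)",
    ("example-aggregate", "ops@example.org"),
)
agg_id = conn.execute(
    "SELECT id FROM aggregate WHERE name = ?", ("example-aggregate",)
).fetchone()[0]
conn.execute(
    "INSERT INTO event (aggregate_id, occurred_at, description) VALUES (?, ?, ?)",
    (agg_id, "2011-12-12T19:25:49Z", "link down"),
)

# Joining events to aggregates answers "whom do we contact about this event?"
rows = conn.execute(
    "SELECT a.name, a.contact_email, e.description "
    "FROM event e JOIN aggregate a ON a.id = e.aggregate_id"
).fetchall()
print(rows)
```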
     205 * General: Using Data
     206    * Sharing Data
     207       * --> [[Color(green,publish to central DB at GMOC)]]
     208       * --> [[Color(green,publish locally via webpage or local API)]]
     209       * --> [[Color(red,TBD: publish via a distributed mechanism)]]
     210    * Accessing, Finding and Visualizing Data
     211       * --> [[Color(green,GMOC Portals)]]
     212       * --> [[Color(green,GMOC SNAPP Interface (with search))]]
     213       * --> [[Color(green,GMOC data available to interested consumers via API)]]
     214       * --> [[Color(red,TBD: More to do here)]]
     215
     216 * Other people who need data
     217    * Troubleshooting info from aggregates, campuses, meta-operations
     218    * Accountability report: How can a party show that a problem is not its fault?
     219
     220