Changes between Version 2 and Version 3 of GeniMgmtMonitor


Timestamp:
03/28/12 10:56:00 (8 years ago)
Author:
sedwards@bbn.com
Comment:

--

  • GeniMgmtMonitor

    v2 v3  
     1[[PageOutline]]
     2
     3= Get involved in monitoring in GENI =
     4If you are interested in the topic of monitoring, you are welcome to participate in a bi-weekly phone call (typically held on Fridays at 2PM Eastern).  Meeting information and minutes are posted to the [http://lists.geni.net/pipermail/monitoring/ monitoring mailing list]. 
     5
     6If you are interested, [http://lists.geni.net/mailman/listinfo/monitoring sign up] for the mailing list or e-mail Sarah Edwards (sedwards at bbn dot com).
     7
    18= Monitoring & Management =
    29
    3 '''Monitoring''' focuses on areas of shared interest across GENI:
     10'''Monitoring''' is the act of collecting data and measuring what is happening.
     11
     12'''Management''' is the act of fixing problems and responding to requests.
     13
     14Monitoring focuses on areas of shared interest across GENI:
    415 1. Tools to aid people with responsibilities (e.g. campuses who have signed the aggregate agreement)
    516 2. Lower the burden of monitoring done in multiple places (e.g. monitoring MyPLC happens at many campuses)
     
    819 5. Monitoring of GENI racks
    920
    10 '''Management''' focuses on the ability to control GENI resources (in particular GENI racks).
     21Management focuses on the ability to control GENI resources (in particular GENI racks).
     22
     23== Schedule ==
    1124
    1225As GENI scales during Spiral 4, monitoring and management will mature in a manner that scales to include new campuses and new racks.
    13 
    14 [wiki:GEC11MonitoringMiniWorkshop GEC 11 Monitoring Mini-Workshop] focused exclusively on what is required in the short term.
    15 
    16 Starting with the lead up to GEC12, discussion is expanding to management and developing a longer term software monitoring plan.
    1726
    1827Spiral 4 schedule leverages existing software to define long-term requirements for monitoring/mgmt software:
    1928    * At GEC12: Agree on long-term requirements for monitoring/mgmt software
    2029    * At GEC13: Agree on APIs & Architecture
    21     * GEC14: Working software
    22     * GEC15: deploy mature monitoring at many campuses/racks
     30    * GEC14:
     31       * Design and implement relational data API
     32       * Rack monitoring is operational (threshold: collecting time series data; objective: collecting relational data).
     33       * There exist public, documented web interfaces suitable for operators and experimenters to locate and use monitoring data
     34       * Flesh out issues such as: event interface design (design only, not implementation), privacy of monitoring data, and consistent naming
     35    * GEC15:
     36       * Rack monitoring is operational (threshold: collecting time series and relational data; objective: collecting event data)
     37       * Have been collecting rack data since GEC 14
    2338
     39== Links ==
    2440
    25 Links:
    26  * [wiki:MonitoringCurrentStatus Summary of data being collected by GMOC]
    27  * [http://monitor.gpolab.bbn.com/tango/ Meso-scale GENI monitoring graphs]
     41Related Efforts:
     42 * [http://gmoc.grnoc.iu.edu/gmoc/index.html GMOC (webpage includes tickets, maps, etc)] - GENI Meta-operations Center
     43 * Service Desk
     44 * [http://groups.geni.net/geni/wiki/PlasticSlices "Plastic slices" monitoring experiment]
     45
     46Related Policies:
     47 * [wiki:GpoDoc#EmergencyStopProcedure Emergency Stop Plan]
     48 * [wiki:GpoDoc#GENIAggregateProvidersAgreement Aggregate Provider Agreement] 
     49 * [wiki:GpoDoc#LegalLawEnforcementandRegulatoryPlanLLR Legal, Law Enforcement and Regulatory Plan (LLR)]
     50
     51Monitoring data:
    2852 * [http://gmoc-db.grnoc.iu.edu/api-demo Raw data on SNAPP]
    2953 * [http://gmoc-db.grnoc.iu.edu/web-services/gen_api.pl Last 10 minutes of data in XML format (at GMOC)]
    3054
     55Graphs and visualizations of monitoring data:
     56 * [http://monitor.gpolab.bbn.com/tango/ Meso-scale GENI monitoring graphs]
     57 * [https://gmoc-db.grnoc.iu.edu/protected/ Experimenter web interface to GMOC data] (under development)
     58 * [http://gmoc-db.grnoc.iu.edu/measurement/portal.cgi Operator web interface to GMOC data]
    3159
    32 == GEC11 Summary ==
    33  The first half of the monitoring mini-workshop included presentations on the current status of what tools are used and what data is collected as part of monitoring today.  The second half of the workshop was a discussion about issues of interest to the community including: what !OpenFlow stats are available and the difficulty of current !OpenFlow Opt-In, the importance of privacy and incorporating privacy into the design now, and concerns about how campuses could share data without necessarily giving others admin access to their resources.  This community will be working to address these issues in the coming months.
    3460
    35  Detailed notes are on the [wiki:GEC11MonitoringMiniWorkshop GEC11 wiki].
     61== Monitoring Requirements ==
     62
     63[http://groups.geni.net/geni/wiki/GENI-Infrastructure-Portal/MonitoringReqts Monitoring requirements] were discussed at [wiki:#GEC12Summary GEC12]. 
     64
     65GENI monitoring and management requirements include:
     66
     67 A. GENI Monitoring Requirements
     68    1. Information must be shareable
     69    2. Information must be collected
     70    3. Information must be available when needed
     71    4. Cross-GENI operational statistics must be collected and synthesized to indicate GENI as a whole is working 
     72    5. Preserve privacy of users (opt-in, experimenters, other users
     73    of resources)
     74 B. GENI Management Requirements
     75   1. For both debugging and security problems:
     76       1. Must be possible to escalate events
     77       2. Meta-operations and aggregate operators must work together to resolve problems in a timely manner
     78   2. Must be possible to do an emergency stop in case of a problem
     79   3. Organizations must manage GENI resources consistent with local policy and best practices
     80       * e.g. security procedures, logging, backups, etc.
     81   4. Develop policies for monitoring
     82   5. All parties should implement agreed upon policies
     83   6. Secure GENI as a whole and secure the pieces of GENI
     84
     85== Monitoring Architecture Overview ==
     86
     87The monitoring architecture was discussed at [wiki:#GEC13Summary GEC13].
     88
     89The monitoring architecture covers the major actors, the interfaces between them, use cases, and the data to be collected. (See [attachment:"GEC13_Monitoring_FINAL.pdf" slides] 4-22.)
     90
     91The major actors are:
     92 * ''Meta-operations'', which collects and makes available operational data and also generates cross-GENI monitoring data
     93 * The future ''GENI Clearinghouse'', which is the authoritative repository for project, user, and slice information
     94 * ''Aggregates'', which contain resources
     95 * ''Campuses'', which host resources
     96 * ''Experimenters'', which under limited circumstances might provide some monitoring information to meta-operations
     97 * ''Regional Networks'' and ''Backbone Networks'' (I2, NLR)
     98
     99The major monitoring interfaces allow:
     100 * Aggregates/Campuses/Experimenters/Meta-operations to submit data to Meta-operations
     101 * Anyone to query data from Meta-operations
     102 * Out-of-band communication to experimenters, aggregates and campuses
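The submit and query interfaces listed above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class name `MetaOperationsStore`, the role strings, and the record layout are hypothetical and are not part of any GMOC API. The out-of-band communication interface is not modeled.

```python
from collections import defaultdict


class MetaOperationsStore:
    """Hypothetical in-memory stand-in for the Meta-operations data store."""

    # Roles that the architecture says may submit data (names are illustrative).
    ALLOWED_SUBMITTERS = {"aggregate", "campus", "experimenter", "meta-operations"}

    def __init__(self):
        # Maps a source name (e.g. an aggregate) to the records submitted for it.
        self._records = defaultdict(list)

    def submit(self, submitter_role, source, record):
        """Accept a record from one of the allowed submitter roles."""
        if submitter_role not in self.ALLOWED_SUBMITTERS:
            raise ValueError(f"unknown submitter role: {submitter_role}")
        self._records[source].append(record)

    def query(self, source):
        """Anyone may query; return a copy so callers cannot alter stored data."""
        return list(self._records.get(source, []))


store = MetaOperationsStore()
store.submit("aggregate", "am.example.edu", {"metric": "cpu_load", "value": 0.5})
print(store.query("am.example.edu"))
```

The sketch keeps submission restricted to the listed actors while leaving queries open, mirroring the asymmetry between the first two interfaces.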
     103
     104In addition, GENI monitoring relies on other interfaces which allow:
     105 * Access to definitive data about slices, projects, and users from the future GENI Clearinghouse (via interfaces defined by the GENI Architecture group)
     106 * Resolution of problems via out-of-band access between Campuses and their Regional Networks
     107 * Resolution of problems via out-of-band access between GMOC and GENI backbone networks (I2, NLR)
     108
     109The structure of the monitoring data falls in one of three categories:
     110 * ''Time series'' data is a particular piece of data measured over time.
     111 * ''Relational'' data conveys entities and the relationship between them as observed at time T.
     112 * ''Event'' data denotes something happening or being noticed at a particular moment in time.
     113
     114The precision of the data can be described as either:
     115 * ''Definitive'' data is authoritative because it is reported by the source that created it. The future GENI clearinghouse will contain a lot of definitive data.
     116 * ''Sampled'' data is collected by measuring the state of the world periodically. The Meta-operations database contains a lot of sampled data.
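As a rough illustration of the three data structures and two precision levels described above, the following sketch models each as a small Python type. All class and field names (`TimeSeriesPoint`, `Relation`, `Event`, `Precision`, `series_for`) are hypothetical and chosen only for this example; they do not reflect the actual GMOC schema.

```python
from dataclasses import dataclass
from enum import Enum


class Precision(Enum):
    DEFINITIVE = "definitive"  # reported by the source that created the data
    SAMPLED = "sampled"        # obtained by measuring state periodically


@dataclass(frozen=True)
class TimeSeriesPoint:
    """One measurement of a particular piece of data at a point in time."""
    metric: str
    timestamp: float
    value: float
    precision: Precision = Precision.SAMPLED


@dataclass(frozen=True)
class Relation:
    """Two entities and the relationship between them, as observed at time T."""
    subject: str
    relationship: str
    obj: str
    observed_at: float
    precision: Precision = Precision.SAMPLED


@dataclass(frozen=True)
class Event:
    """Something happening or being noticed at a particular moment."""
    description: str
    timestamp: float
    precision: Precision = Precision.SAMPLED


def series_for(points, metric):
    """Return (timestamp, value) pairs for one metric, ordered by time."""
    return sorted((p.timestamp, p.value) for p in points if p.metric == metric)


points = [
    TimeSeriesPoint("cpu_load", 2.0, 0.7),
    TimeSeriesPoint("cpu_load", 1.0, 0.4),
    TimeSeriesPoint("rx_bytes", 1.0, 1024.0),
]
print(series_for(points, "cpu_load"))
```

Keeping precision as a separate field on each record, rather than as a separate record type, lets the same structure hold either definitive or sampled data.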
     117
     118Monitoring use cases include:
     119 * An operator responds to a request from an experimenter who is unable to create a sliver at an aggregate.  The operator needs to see the current and recent status of that aggregate (e.g. is the AM responding to queries?)
     120 * Assess the availability/utilization of GENI resources over a four month period.
     121 * A core !OpenFlow switch misbehaves leading to intermittent failures. Various operators need information about whether L2 network resources are usable and the utilization of relevant network resources in order to debug the failure.
     122 * Assess the historic usage of network resources in order to plan for future expansion of those resources.
     123 * An experimenter wants to access basic information about their sliver to assess the general health and activity of their experiment.
     124 * An aggregate operator receives a complaint from campus IT about unusual traffic on the campus network.  The aggregate operator needs to look at the current state of resources at their campus in order to demonstrate the problem is not caused by their aggregate.
     125
     126The architecture must therefore collect the data required to support the above use cases.
     127
     128In addition, several other topics, such as [wiki:GpoDoc#EmergencyStopProcedure emergency stop], various policies (such as the [wiki:GpoDoc#GENIAggregateProvidersAgreement Aggregate provider agreement] and [wiki:GpoDoc#LegalLawEnforcementandRegulatoryPlanLLR LLR]), and privacy of monitoring data, are also important to monitoring in GENI.
     129
     130== GEC Session Summaries ==
     131=== GEC13 Summary ===
     132At GEC 13 we had a productive discussion during the Monitoring session.  The notes and slides are posted on the [wiki:GEC13Agenda/Monitoring session wiki page].
     133
     134We discussed four major items:
     135 1. '''A proposed monitoring architecture which built on the monitoring requirements discussed at GEC12.'''  Major components include the interfaces, use cases, and data to be collected.  We reviewed these items and how they relate to the rest of GENI. No major objections to these items were raised.
     136
     137 2. '''A proposal for privacy of GENI monitoring data.'''  We proposed that some items, such as slice name, should always be public; some items should only be shared with GENI operators; and some items should not be shared at all. During the discussion in this session, it was proposed that we create an operator's "code of ethics" document and an experimenter's "privacy recommendations" document to inform operators and experimenters of what they should know in order to keep information private.
     138
     139 3. '''The status of the GMOC monitoring interfaces and web portals.'''  In addition to improvements to the API for submitting time series monitoring data implemented during Spiral 3, work continues on an API for relational data as well as a web portal to make operational and health data easier for members of the GENI community to locate and visualize. 
     140
     141 4. '''We reviewed the ongoing list of "monitoring pains".'''  In addition to the items discussed above, progress has been made since GEC12 on the packaging of monitoring software and on the release and installation of the latest !OpenFlow Aggregate Manager (FOAM), which facilitates improved monitoring.
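
The three-tier privacy proposal in item 2 above (always public, operators only, never shared) could be sketched as a simple field filter. Only `slice_name` being public comes from the session summary; the other field names, the `Visibility` enum, and the choice to treat unlisted fields as private are assumptions made for this example.

```python
from enum import Enum


class Visibility(Enum):
    PUBLIC = 1          # always public
    OPERATORS_ONLY = 2  # shared only with GENI operators
    PRIVATE = 3         # not shared at all


# Per-field policy. Only "slice_name" is taken from the proposal;
# "host_load" and "user_email" are hypothetical placeholders.
POLICY = {
    "slice_name": Visibility.PUBLIC,
    "host_load": Visibility.OPERATORS_ONLY,
    "user_email": Visibility.PRIVATE,
}


def visible_fields(record, viewer_is_operator):
    """Return only the fields the viewer may see.

    Fields missing from POLICY are treated as PRIVATE (an assumption).
    """
    allowed = {Visibility.PUBLIC}
    if viewer_is_operator:
        allowed.add(Visibility.OPERATORS_ONLY)
    return {
        key: value
        for key, value in record.items()
        if POLICY.get(key, Visibility.PRIVATE) in allowed
    }


record = {"slice_name": "demo", "host_load": 0.9, "user_email": "user@example.edu"}
print(visible_fields(record, viewer_is_operator=False))
print(visible_fields(record, viewer_is_operator=True))
```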
     142
     143
     144=== GEC12 Summary ===
     145
     146The session included a discussion of long-term GENI monitoring requirements and provided updates on several near-term topics including:
     147* how the new !OpenFlow aggregate manager, FOAM, makes !OpenFlow management and monitoring easier;
     148* a demonstration of the GMOC DB's SNAPP user interface and new Portal;
     149* and a discussion of the importance of monitoring topology. 
     150
     151There is a summary and a link to detailed notes on the [wiki:GEC12MesoScaleMonitoring session page].
     152
     153=== GEC11 Summary ===
     154
     155The first half of the monitoring mini-workshop included presentations on the current status of what tools are used and what data is collected as part of monitoring today.  The second half of the workshop was a discussion about issues of interest to the community including: what !OpenFlow stats are available and the difficulty of current !OpenFlow Opt-In, the importance of privacy and incorporating privacy into the design now, and concerns about how campuses could share data without necessarily giving others admin access to their resources.  This community will be working to address these issues in the coming months.
     156
     157Detailed notes are on the [wiki:GEC11MonitoringMiniWorkshop GEC11 wiki].