Version 3 (modified by sedwards@bbn.com, 8 years ago)

Monitoring Requirements

This page includes the monitoring requirements presented at GEC12.

The purpose of this page is to allow the addition and clarification of these requirements over time.

Introduction

First, some introduction to explain terms and provide a common understanding of what monitoring and management in GENI are.

  • Why are we here?
    • Would like to ensure:
      • We don’t ignore any important pieces
      • Architecture decisions reflect the needs of monitoring so that GENI Clearinghouse, Instrumentation & Measurement, etc. serve our needs
      • Where possible, we build tools which can be adapted to new software when it becomes available
  • Therefore we need to answer the following questions:
    • Are there any holes in our understanding of our requirements?
    • Do we have agreement on what we need in a way that can be communicated to other groups working on topics of interest to ops?
    • Do we know how to build tools in an adaptable way?
  • And also…
    • What do we work on next?
  • GENI Requires Monitoring

The following are some relevant requirements from the GENI System Requirements Document (July 7, 2009)

  • 10.2-3 Visible operational status
    • The GENI system shall make sufficient data available that researchers and maintainers will be able to evaluate the availability and operational status of the system.
  • 10.2-5 Federated event escalation
    • The GENI system shall provide operations and management support for event management and escalation, including security events, within GENI and with those organizations that interconnect with GENI.
  • 10.2-6 Federated operations data exchange
    • The GENI system shall support operational and management data exchange according to [TBS] GENI O&M Policy between GENI and operators/owners of federated components, aggregates, and networks.
  • Monitoring Players in GENI
    • Meta-operations (a.k.a. GMOC)
    • Aggregates
      • Examples: ProtoGENI, PlanetLab, Orca
    • Campuses
      • Which host & operate aggregates
      • Which only host aggregates
    • Backbone & Regional Networks
      • Networks which are GENI participants: I2, NLR
      • Networks which carry GENI traffic: some regionals (e.g. NoX)
    • Experimenters
  • Aggregates, Racks and the GENI Clearinghouse
    • Each GENI rack is a single aggregate
      • Therefore requirements are the same as for aggregates
    • Aggregates can outsource (some of) their responsibilities to the GENI Clearinghouse
  • Definition
    • Monitoring
      • Act of collecting data and measuring what is happening
    • Management
      • Act of fixing problems and responding to requests
  • What does monitoring & management involve?
    • Observe unexpected events
      • THEN fix what’s wrong
    • Observe expected events
      • THEN develop policy for fixing what’s wrong
      • THEN fix what’s wrong (by responding to monitoring)
    • Plan for the future
      • Monitor long-term trends in resource usage
      • THEN provision resources to meet forecasted needs
  • What makes GENI different?
    • Federated entities managed by different institutions with different policies
    • People and information needed to troubleshoot and resolve problems are spread across several physical locations
    • Users (end users and experimenters), managers and hosts of a given piece of equipment may all be different.
    • Interactions between groups are governed by GENI federation agreements (e.g. aggregate provider agreement) and mutual understanding.
  • The requirements below do not cover:
    • Monitoring and management which fits entirely within the purview of aggregates, campuses, etc.
    • For example, we will do (but not discuss in these requirements) the following items
      • Keeping logs
      • Obeying local laws and policy
      • Answering the phone when someone has a concern
      • … and everything else.
    • The above items are not included in the requirements below because they do NOT make GENI different

GENI Monitoring and Management Requirements Summary

These are the summary GENI monitoring and management requirements...

  1. GENI Monitoring Requirements
    1. Information must be shareable
    2. Information must be collected
    3. Information must be available when needed
    4. Cross-GENI operational statistics must be collected and synthesized to indicate whether GENI as a whole is working
    5. Preserve privacy of users (opt-in, experimenters, other users of resources)
  2. GENI Management Requirements
    1. For both debugging and security problems:
      1. Must be possible to escalate events
      2. Meta-operations and aggregate operators must work together to resolve problems in a timely manner
    2. Must be possible to do an emergency stop in case of a problem
    3. Organizations must manage GENI resources consistent with local policy and best practices
      • e.g. security procedures, logging, backups, etc.
    4. Develop policies for monitoring
    5. All parties should implement agreed upon policies
    6. Secure GENI as a whole and secure the pieces of GENI
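
Monitoring requirement 4 above asks for cross-GENI statistics to be synthesized into an indication of whether GENI as a whole is working. The sketch below shows one possible way to do that, assuming the input is a table of pairwise reachability results such as those a ping test might produce. The function name, the input layout, the status labels and the 0.9 threshold are all illustrative assumptions and are not specified by this page.

```python
from typing import Dict


def geni_wide_status(ping_ok: Dict[str, Dict[str, bool]], threshold: float = 0.9) -> str:
    """Roll pairwise reachability results up into one overall status.

    ping_ok maps a source site to {target site: reachable?}.
    The labels and the threshold are hypothetical, not GENI policy.
    """
    # Flatten every source -> target result into one list of booleans.
    results = [ok for targets in ping_ok.values() for ok in targets.values()]
    if not results:
        return "unknown"  # no data collected, so no claim can be made
    fraction = sum(results) / len(results)
    if fraction == 1.0:
        return "up"
    return "degraded" if fraction >= threshold else "down"
```

For example, `geni_wide_status({"site-a": {"site-b": True}, "site-b": {"site-a": False}})` reports "down" because only half of the tests succeeded, and an empty table reports "unknown" rather than implying that everything is fine.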

Monitoring & Management Requirements Details

Each of the above GENI monitoring and management requirements is broken down in more detail below.

The current implementation status of each requirement is listed after the -->. Status is color-coded: green means implemented according to the plan; orange means partially implemented according to some plan, or that an unimplemented plan exists; red means not implemented and no plan.

  • A.4 Requirement: Cross-GENI Monitoring
    • GENI monitoring is more than the sum of the monitoring at GENI’s parts. In order to know if GENI is working properly, additional monitoring is required beyond that done by each of its constituent pieces.
    • Must collect and synthesize additional operational statistics which indicate whether GENI is working
      • e.g. meso-scale ping tests, topology
    • Must collect cross-GENI stats
    • Must make cross-GENI stats available when needed
  • B.1 Requirement: Troubleshooting & Event Escalation
  • Requirement: Policy
    • B.3 Organizations must manage GENI resources consistent with local policy and best practices (e.g. security procedures, logging, backups, etc.)
      • In general, follow local policy and procedures
      • Follow best practices which, if not followed, would affect other members of the GENI community
    • B.4 Develop policies for monitoring
    • B.5 All parties should implement agreed upon policies
      • Follow Aggregate Provider Agreement
      • Follow LLR
      • Follow other GENI policies as they come into effect
  • B.6 Requirement: Security
    • Secure GENI as a whole and secure the pieces of GENI
    • Two things we want to prevent:
      • Compromise of GENI resources
      • Use of GENI resources to compromise other entities
    • Two things we can do about this:
      • Follow best practices to hinder compromise
      • Detect and respond to compromise
    • Allow interesting research for which experimenters and operations have to coordinate for security and management reasons
      • Security experimentation BOF tonight and session at GEC13!!!
    • --> (red) TBD: this is an area needing major discussion
  • Requirement: Info must be shareable/collected/available
    • A.1 Information must be shareable
      • Consistent definitions of data
      • Consistent data exchange format
      • Consistent data collection mechanisms
      • Data sharing mechanisms
      • The following benefit from shared common processes:
        • Accessing data, finding data, visualizing data
    • A.2 Information must be collected
      • Verify continued successful data collection
      • Debug collection and reliability outages
    • A.3 Information must be available when needed
      • Privacy of data must be maintained
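
Requirement A.2 above calls for verifying that data collection keeps succeeding. A minimal sketch of such a check follows, assuming each data source reports at a known regular interval and that the time of its most recent sample is recorded. The function name, the parameters and the "three missed intervals" default are hypothetical choices made for illustration only.

```python
from typing import Dict, List


def stale_sources(last_seen: Dict[str, float], now: float,
                  interval: float, missed: int = 3) -> List[str]:
    """Return the sources whose newest sample is older than `missed` intervals.

    last_seen maps a source name to the timestamp (in seconds) of its
    latest sample; interval is the expected collection period in seconds.
    """
    cutoff = now - missed * interval
    return sorted(name for name, ts in last_seen.items() if ts < cutoff)
```

With `last_seen = {"agg-a": 1000.0, "agg-b": 400.0}`, `now=1000.0` and `interval=60.0`, only "agg-b" is reported, since its last sample is more than three intervals old. A list like this could feed the "debug collection and reliability outages" step.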

Data Requirements

  • Data Definitions
    • Consistent definition of data
      • Relational data -- data which explains the relationship between entities and resources
        • Resources (incl. connectivity)
        • List of aggregates
        • List of slices
        • List of users
        • Aggregate contact information
      • Timeseries data -- data collected repeatedly at a regular interval
        • Examples: Host and network statistics
      • Event data -- data with information about a unique event occurring at a single point in time
        • Examples: SNMP Traps
  • Other people who need data
    • Troubleshooting info from aggregates, campuses, meta-operations
    • Accountability report: How can I prove that this is not my fault?
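
The three categories under Data Definitions above (relational, timeseries and event data) can be pictured as simple record types. The sketch below is one illustrative way to model them. All class and field names are assumptions made here and are not a GENI schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Aggregate:
    """Relational data: an aggregate, its contact, and the slices using it."""
    name: str
    contact: str
    slices: List[str] = field(default_factory=list)


@dataclass
class TimeseriesSample:
    """Timeseries data: one value of a metric, collected at a regular interval."""
    source: str
    metric: str
    timestamp: float  # seconds since the epoch
    value: float


@dataclass
class Event:
    """Event data: a unique occurrence at a single point in time."""
    source: str
    timestamp: float
    kind: str
    detail: str


def is_regular(samples: List[TimeseriesSample], interval: float,
               tolerance: float = 1.0) -> bool:
    """Return True if consecutive samples are `interval` seconds apart, within tolerance."""
    times = sorted(s.timestamp for s in samples)
    return all(abs((b - a) - interval) <= tolerance for a, b in zip(times, times[1:]))
```

The `is_regular` helper only illustrates what separates the last two categories: timeseries samples are expected at evenly spaced timestamps, while events may arrive at any time.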