Changes between Initial Version and Version 1 of GEC14Agenda/IMMonitoring/DetailedNotes


Timestamp:
08/06/12 10:48:39 (12 years ago)
Author:
sedwards@bbn.com

     1= GEC 14 I&M and Monitoring Session Detailed Notes =
     2=== Part I: The state of the world ===
     3
     4'''Sarah Edwards''' introduced the purpose of the session.  This is the first combined I&M and monitoring session.  We are here because there are many tools using various techniques to collect various types of data.  We believe folks in this room will benefit from and will want to share tools and data.
     5
     6The session is broken into two parts.  In the first part, a series of speakers will discuss some monitoring and I&M tools and data.  In the second part, Jeanne Ohren will lead a discussion about common issues.
     7
     8'''Kevin Bohan''' of GMOC demonstrated the new monitoring user interface. 
     9
     10GENI Meta-operations Center (GMOC) supports a cross-cutting meta-operations framework. 
     11
     12The user interface has two goals. 
     13
     14  1. For experimenters, provide confidence that their resources will work as they expect. 
     15  2. For operations, provide information about their infrastructure.
     16
     17GMOC collects data into a database and makes that information available via a web interface.  There are two APIs for submitting data to the database: a relational API and a time series API.  We are getting relational data from slice authorities, aggregates, and resources as well as getting time series data about resources.
     18
     19Many entities are reporting data to GMOC (complete list in the slides) including:
     20 * the GPO slice authority (SA)
     21 * ExoGENI and InstaGENI racks
     22 * Aggregates
     23 * some Health Checks
     24
     25The following information is provided by these items:
     26 * All entities report ''name'', ''type'', ''physical location'', and ''operating organization''.
     27 * Slices and slivers provide ''creator'', ''creation/expiration time''.
     28 * SAs provide a list of ''slices'' (identified by URN+UUID).
     29 * Aggregates include ''slivers'' (including ''state'', corresponding ''slice'', and containing ''resources'')
     30 * Resources include ''interfaces'' as well as time series data such as ''CPU & Disk Utilization'', ''Number of active VMs'', ''Interface traffic counters'', and ''OpenFlow control stats''
     31 * Health Checks tell us whether the ''AM is responding''.
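The reporting model above pairs relational fields with time series samples. The following is a minimal Python sketch of how one resource record might hold both, for illustration only. The `Resource` class, its method names, the metric name `cpu_utilization`, and the sample values are all hypothetical; this is not the GMOC API or schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Resource:
    # Fields every entity reports, per the notes.
    name: str
    type: str
    physical_location: str
    operating_organization: str
    # Relational data: the interfaces this resource includes.
    interfaces: List[str] = field(default_factory=list)
    # Time series data: metric name -> list of (unix_timestamp, value).
    timeseries: Dict[str, List[Tuple[int, float]]] = field(default_factory=dict)

    def record(self, metric: str, timestamp: int, value: float) -> None:
        """Append one sample for a metric."""
        self.timeseries.setdefault(metric, []).append((timestamp, value))

    def latest(self, metric: str) -> float:
        """Return the value of the sample with the newest timestamp."""
        return max(self.timeseries[metric], key=lambda sample: sample[0])[1]


# Hypothetical resource and samples, made up for this sketch.
vm_host = Resource("vm-host-1", "compute", "example site", "example org")
vm_host.record("cpu_utilization", 1344250000, 0.42)
vm_host.record("cpu_utilization", 1344250060, 0.57)
```

Keeping the slowly changing relational fields apart from the append-only samples mirrors the split between the two submission APIs described earlier.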
     32
     33The demonstration will highlight two use cases:
     34 * For experimenters, what's happening on my slice?
     35 * For operations, what's happening at my location?
     36
     37Kevin then did a live demonstration of the [https://gmoc-db.grnoc.iu.edu/protected/ GMOC DB user interface].
     38
     39First, from the perspective of an experimenter who is interested in their slice:
     40 1. Log in to GMOC DB.
     41 1. Slice page shows current state as of a few minutes ago including list of slices by URN, contact, last update, # slivers, # resources.
     42 1. Search for a particular slice (in this case `tuptymon`) in the search box at the top of the page.
     43 1. Click on detail link for that slice.
     44 1. Slivers tab shows slivers in that slice: aggregate, urn, expiration, status, last update
     45 1. Resources tab shows some OpenFlow datapaths and a VM.
     46 1. Click on a resource.
     47 1. Resource measurements tab gives us a set of metrics.  Seven metrics are currently collected; three of them are shown here.  See VM stats and click on VM count.
     48 1. Changing the resolution to 1 week gives a sense of the variation in this stat over time.
     49
     50Second, from the perspective of operations, who is interested in the status of the resources they are running...
     51 1. Go to "aggregates".
     52 1. See list of all aggregates.
     53 1. Search for `ExoGENI` aggregates using the search box at the top of the page.
     54 1. Get three aggregates: 2 FOAMs and 1 Orca aggregate
     55 1. Click on rc-bn.exogeni.net.  Look at tabs: Slivers, Resources, and Aggregate Measurements.
     56 1. This is a different list of measurements than we saw above.
     57 1. Click on OpenFlow statistics and get OpenFlow control statistics.
     58
     59If you are interested in using the "protected" interface, please contact the [mailto:gmoc@grnoc.iu.edu GMOC service desk] or the [mailto:help@geni.net GENI Help Desk].
     60
     61If you are interested in monitoring, [http://lists.geni.net/mailman/listinfo/monitoring join the monitoring@geni.net] mailing list.
     62
     63
     64
     65'''Anirban Mandal''' described GENI client authentication and authorization for an XMPP messaging service.
     66
     67Overall vision is to have an XMPP server which acts as a conduit for messages flowing through measurement points.  Added features for authenticating clients. 
     68
     69There are different types of clients:
     70 1. client inside a slice
     71 2. client outside a slice (like a control framework entity)
     72 3. pubsub server subscribers
     73
     74''Authentication'': "Can a client authenticate with the XMPP server using authentication mechanisms advertised by the XMPP server using GENI certificates?"
     75
     76Did this by:
     77 * Added code to gcf.
     78 * Added SASL external authentication on XMPP server.  This is mostly one-time configuration of XMPP server (clearinghouse certificate needs to be inserted in server's client truststore.)
     79
     80Think of this as a jabber server and various measurement points act as clients.
     81
     82This part isn't pub sub.  This is just sending messages authenticated through a GENI structure.
     83
     84''Authorization'': "Does an already authenticated client have credentials (rights) to publish and subscribe to a pubsub node?"
     85
     86To subscribe to the pubsub server, a client needs the appropriate credentials.
     87 * How does a client get credentials?
     88 * How are credentials verified on the XMPP server?
     89
     90Did this by:
     91 * Added `xmppcred` to `gcf` tool.
     92 * Takes certificate of client, certificate of clearinghouse, XMPP server certificate key-pair, and rights namespace (which is a set of pubsub node namespaces which say which part of the pubsub space that client has rights to).
     93 * Extended XMPP server code to enable credential verification.
     94 * Openfire pubsub policy code is augmented with GENI credential verification.
     95 * Verify credentials. If it works, then pubsub action is approved.
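The authorization decision described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Openfire or gcf code: the notes do not say how a rights namespace is matched against a pubsub node, so this sketch assumes slash-separated node paths and simple prefix matching. The function name `is_authorized` and the example node names are hypothetical.

```python
def is_authorized(node: str, rights_namespaces: set) -> bool:
    """Return True if `node` equals, or sits below, any granted namespace.

    Assumption: pubsub nodes are slash-separated paths and a namespace
    grants rights to everything underneath it.
    """
    for namespace in rights_namespaces:
        prefix = namespace.rstrip("/")
        if node == prefix or node.startswith(prefix + "/"):
            return True
    return False


# Hypothetical rights carried in a client's credential.
granted = {"/orca/sm"}
```

Comparing against `prefix + "/"` rather than the bare prefix keeps a grant on `/orca/sm` from also covering an unrelated node such as `/orca/smx`.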
     96
     97Two example use cases that are using this...
     98
     99First example, Orca service manager publishes slice manifest on XMPP server.
     100
     101 * Publishes the manifest once when it is created, then publishes an update as nodes are created.
     102 * A manifest subscriber client subscribes to slices of interest.  It is then notified when the manifest changes over time.
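The publish and subscribe pattern in this first example can be sketched with an in-memory stand-in. The sketch uses no XMPP library and performs no authentication; the `ManifestBus` class, its methods, and the slice names are hypothetical and only show the flow of notifications.

```python
from collections import defaultdict


class ManifestBus:
    """In-memory stand-in for a pubsub service keyed by slice."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, slice_id, callback):
        """Register a callback for manifest changes on one slice."""
        self._subscribers[slice_id].append(callback)

    def publish(self, slice_id, manifest):
        """Deliver a new manifest to every subscriber of that slice."""
        for callback in self._subscribers.get(slice_id, []):
            callback(slice_id, manifest)


received = []
bus = ManifestBus()
bus.subscribe("slice-a", lambda sid, manifest: received.append((sid, manifest)))

bus.publish("slice-a", "manifest v1: slice created")
bus.publish("slice-a", "manifest v2: node added")
bus.publish("slice-b", "manifest v1: slice created")  # no subscriber, dropped
```

The subscriber sees both versions of the `slice-a` manifest and nothing for `slice-b`, which matches the "subscribe to slices of interest" behavior in the notes.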
     103
     104Second example, OMF components are communicating via an XMPP messaging service.
     105
     106
     107'''Martin Swany''' discussed active network monitoring with GEMINI.
     108
     109GEMINI is an I&M system based on:
     110 * perfSONAR/LAMP which is a modification to general perfSONAR.  Modified to understand GENI creds, etc.
     111 * INSTOOLS
     112 * Periscope (an update to the perfSONAR services that makes them more modern web services using JSON and REST interfaces)
     113
     114GEMINI targets the complete I&M scenario.
     115
     116 * Active measurements (from perspective of GEMINI) aren't special.
     117 * Active measurements need to have accurate timestamps.  Tying in the measurements is more tricky.
     118 * All measurements may perturb a user's experiment.
     119 * All measurements may perturb other measurements.
     120 * Active measurements affect infrastructure, perhaps over multiple hops.
     121 * (Consistent with passive measurements, but certainly some issues that need to be addressed.)
     122
     123Active Measurement Tools
     124 * OWAMP
     125   - like ping but can handle one way delay (and parse apart two directions)
     126   - depends on clock synchronization
     127 * BWCTL
     128   - wraps bulk transfer tools to provide mutual exclusion for ongoing tests
     129 * ping
     130   - simple and ubiquitous
     131 * 802.1ag
     132   - defines an L2 ping (to be added soon)
     133 * traceroute
     134   - is in perfSONAR, but not GEMINI yet.
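The note that OWAMP depends on clock synchronization can be illustrated with a small calculation. This is a sketch of the arithmetic only, not the OWAMP protocol: the function name and all numbers are made up, and timestamps are integer microseconds to keep the arithmetic exact.

```python
def one_way_delays(send_a, recv_b, send_b, recv_a):
    """Compute measured delays from four timestamps.

    send_a and recv_a are read from host A's clock;
    recv_b and send_b are read from host B's clock.
    """
    forward = recv_b - send_a   # A -> B, as measured
    reverse = recv_a - send_b   # B -> A, as measured
    return forward, reverse, forward + reverse


# Made-up scenario: A's clock is correct, B's clock runs 5 ms ahead.
TRUE_FORWARD_US = 30_000
TRUE_REVERSE_US = 10_000
B_CLOCK_OFFSET_US = 5_000

send_a = 1_000_000                                        # A's clock
recv_b = send_a + TRUE_FORWARD_US + B_CLOCK_OFFSET_US     # B's clock
send_b = recv_b                                           # B replies at once
recv_a = (send_b - B_CLOCK_OFFSET_US) + TRUE_REVERSE_US   # A's clock

forward, reverse, round_trip = one_way_delays(send_a, recv_b, send_b, recv_a)
```

In this scenario the measured forward delay is too large by the clock offset and the measured reverse delay is too small by the same amount, while their sum still equals the true round-trip time. That is why a ping-style round trip tolerates unsynchronized clocks and a one-way measurement does not.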
     135
     136GEMINI at GEC14
     137
     138To use these...
     139 * A user marks which nodes are active measurement points in request RSpecs.  Runs on the user's nodes (should they be distinct?).  Tells instrumentize process to install active measurement tools.
     140 * The need to select nodes before start is due to code limitations.
     141 * WebUI provides central configuration and administration for active measurements.
     142 * Nodes run a local service which updates config, etc. using UNIS service.
     143
     144Performing Active Measurements
     145 * Two classes:
     146    * on demand -- for debugging an issue
     147    * regular testing - make sure things remain unchanged.  Or measure changes over time. 
     148 * perfSONAR includes metadata about measurements.
     149
     150Future Issues
     151
     152 * Work towards a single framework for measurements.  Make it easy to extend with new tools.  perfSONAR protocols were intended to be extensible, but code hasn't supported that over time.
     153 * Interaction between intra-slice and infrastructure measurements.
     154 * We'd like to get high rate, very frequent measurements.  So lots of measurement activity.  Lots in substrate, lots in slices. 
     155Dedicated nodes might provide better info, but users need to be able to request that.
     156 * Coordination and sharing of active measurements is something to discuss.
     157
     158'''Prasad Calyam''' described doing measurements on Layer 2 and !OpenFlow paths.
     159
     160Prasad provided an experimenter perspective on layer 2 and !OpenFlow slices.
     161
     162 * Running I&M related experiments.
     163 * Running some !OpenFlow slices.
     164 * Allocating thin clients across a long path.
     165 * Experiment slice active since GEC13.
     166 * I&M slice since April.
     167
     168Use Case:
     169 * Run active measurements to check connectivity and performance as part of layer 2/OpenFlow slice monitoring.
     170 * Schedule experiment and active measurement traffic in a conflict-free manner, and use measurement intelligence for adaptation.
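One simple way to keep measurement traffic from overlapping experiment traffic is to place each measurement in the earliest gap that is long enough. The sketch below illustrates that idea only; it is not the scheduler used in the talk, and the function name, the interval representation, and the example times are assumptions.

```python
def earliest_free_slot(busy, duration, start=0):
    """Return the earliest time >= start at which a measurement of
    `duration` fits without overlapping any busy interval.

    `busy` is a list of (begin, end) pairs for experiment traffic,
    in the same time units as `duration`.
    """
    candidate = start
    for begin, end in sorted(busy):
        if begin - candidate >= duration:
            return candidate           # the gap before this interval fits
        candidate = max(candidate, end)
    return candidate                   # after the last busy interval


# Hypothetical experiment traffic windows, in seconds.
experiment_traffic = [(0, 10), (15, 30)]
```

With these windows, a 5 second test fits in the gap starting at 10, while a 6 second test has to wait until 30.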
     171
     172Has had two slices running for some months, exercising various test scenarios (see slides for details).
     173
     174Brief Results
     175 * traceroute varies between IP and !OpenFlow
     176 * Showed impact of competing traffic in different slices on same network.
     177
     178Conclusion
     179 * I&M has a separate platform requirement compared to others.
     180 * L2 connectivity troubleshooting is required.
     181 * Impact of I&M can be seen on the experiment if measurement conflict occurs.
     182 * Sample !OpenFlow slice RSpecs from GPO are helpful.
     183
     184Next steps
     185 * orchestrate measurements
     186 * easier methods to integrate app metrics
     187 * provide OpenFlow slice traffic visibility -- would help experimenters a lot.
     188
     189=== Part II: Discussion ===
     190
     191'''Jeanne Ohren''' of the GPO introduced the discussion with some overarching questions for us to discuss.
     192
     193Some issues...
     194 1. Consistent naming of resources and devices
     195    - example - consistent names
     196      * two aggregates share a link.  Endpoint names need to be consistent.
     197    - example - globally unique
     198      * 3 ways of identifying same slice: URN, UUID, and slice name
     199      * consumer of this data might need to determine if two slivers belong to the same slice.
     200      * Growing consensus to identify slices by a combination of slice URN and slice UUID because the combination is unique over time and space.
     201      * GENI AM API v3 adopted this.  Monitoring and one I&M group adopting this as well.
     202    - Question: How does this affect other projects?  What other types of naming examples do we need to worry about?
     203
     204 2. Data transport example
     205    - Example:
     206      * An aggregate collects data about slivers, the resources in slivers, etc. and reports it to GMOC.
     207      * Experimenter interested in resources available at aggregates.
     208      * Operator is interested in statistics on the slivers that have been created/deleted over a period of time.
     209    - How does each of these parties access the data?
     210      * Aggregates push data to GMOC (using GMOC API)
     211      * The future GENI CH will provide an API to pull data on slices, users, and projects.
     212      * IMF and others provide a pub/sub interface
     213      * I&M tools provide the ability for users to push data to an archive with metadata.
     214      * iRODS account holders can control who has access to this data.
     215    - Currently we're transporting data.
     216      * Consider: Access control.  How do we make sure the right people can access the data?  How do we keep the wrong people from accessing the data?
     217      * Reliability.  How do we ensure data is recorded properly?
     218    - Question: Can we work together to get access to good, reliable data?
     219
     220 3. Some data sources, quickly... (full list in slides)
     221    - Relational data collected by GMOC
     222    - Time-series data collected by GMOC
     223    - Active network measurement data collected by I&M tools
     224    - Passive host measurement data collected by I&M tools
     225    - Measurement Data Object Descriptor
     226    - Other independent monitoring tools
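Returning to the naming issue in item 1, the case for identifying a slice by URN plus UUID can be shown with a short Python sketch. The record layout, field names, URNs, and UUIDs below are invented for illustration and do not come from any GENI API.

```python
from collections import defaultdict


def group_by_slice(slivers):
    """Group sliver URNs under the (slice URN, slice UUID) pair they report."""
    groups = defaultdict(list)
    for sliver in slivers:
        key = (sliver["slice_urn"], sliver["slice_uuid"])
        groups[key].append(sliver["sliver_urn"])
    return dict(groups)


# Made-up records: the same slice URN is reused later by a new slice,
# which gets a different UUID.
slivers = [
    {"sliver_urn": "sliver-1", "slice_urn": "urn:example:slice+demo", "slice_uuid": "uuid-1"},
    {"sliver_urn": "sliver-2", "slice_urn": "urn:example:slice+demo", "slice_uuid": "uuid-1"},
    {"sliver_urn": "sliver-3", "slice_urn": "urn:example:slice+demo", "slice_uuid": "uuid-2"},
]

groups = group_by_slice(slivers)
```

Grouping by URN alone would merge all three slivers into one slice. The combined key keeps the later slice separate, which is the "unique over time and space" property mentioned above.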
     227
     228= Discussion =
     229
     230Data naming
     231 * How has the lack of globally unique and consistent naming affected other projects?
     232 * What are some other data naming examples?
     233
     234Prasad: Experiment itself has a lot of data it is generating.  Lots of app specific measurements that are really critical. Processed measurement in addition to active measurements.
     235
     236Justin Cappos: Seattle nodes don't have consistent IPs.  Use public keys to identify nodes.  Do lookups based on keys, but nodes will put in information that helps build a hierarchy of data so we can find things.
     237
     238Data transport
     239 * What are you using that others might find useful?
     240 * How can we all walk away from the table with access to good, reliable data?
     241
     242Sarah Edwards: Control framework folks have lots of experience with authentication and authorization.
     243
     244Chaos: Could GENI credentials be used to access data?
     245How much more could we be doing with GENI credentials we already have?
     246Are there APIs that would make it easier to plug in GENI credentials?
     247
     248Justin Cappos: (Stole from Amazon) API key interface that is easily regeneratable.  Don't want this to be a private key or a password for a user.  Could use a less privileged mechanism which allows you to regenerate keys.
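A regenerable API key of the kind Justin describes could look roughly like the sketch below. It is a guess at the general pattern, not Seattle's or Amazon's implementation: the `ApiKeyStore` class and its methods are hypothetical, and only Python standard library calls are used.

```python
import hashlib
import hmac
import secrets


class ApiKeyStore:
    """Keeps one active API key per user and stores only its hash."""

    def __init__(self):
        self._key_hashes = {}

    def regenerate(self, user: str) -> str:
        """Issue a fresh random key; any previous key stops working."""
        key = secrets.token_urlsafe(32)
        self._key_hashes[user] = hashlib.sha256(key.encode()).hexdigest()
        return key

    def check(self, user: str, key: str) -> bool:
        """Return True only if `key` is the user's current key."""
        expected = self._key_hashes.get(user)
        if expected is None:
            return False
        presented = hashlib.sha256(key.encode()).hexdigest()
        return hmac.compare_digest(expected, presented)


store = ApiKeyStore()
old_key = store.regenerate("alice")
new_key = store.regenerate("alice")  # replaces old_key
```

Because the key is separate from the user's password or private key, a leaked key can be replaced without touching the more privileged credential.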
     249
     250Sarah: prototype CH has some ideas with !InCommon
     251
     252Chaos: machines authenticating in a privileged way.
     253
     254Martin Swany: in GEMINI using proxy certificates to allow shorter-lived sub-identities
     255
     256General point: There are two authentication issues: user login (well understood) and machines sharing data.
     257
     258Machine-to-machine transactions may involve a smaller set of users.
     259
     260What other issues have you encountered?
     261
     262Clock synchronization?
     263
     264Justin: Lots of issues especially related to crypto.  In places we need to do this.  We use NTP data. Lots of places have NTP blocked.  We run our own NTP nodes and tunnel out with public IP addresses.
     265
     266Chaos: Are you able to easily detect machines that are off?
     267
     268Justin: We control the nodes.  We refresh NTP every day.  We have 20 sec expiry time on crypto ops.
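The link between a short expiry time and clock synchronization can be shown with a freshness check. This is a sketch under assumptions: the notes give only the 20 second figure, so the function name, the use of timestamps in seconds, and the rejection of timestamps from the future are all illustrative choices rather than Seattle's actual behavior.

```python
EXPIRY_SECONDS = 20  # figure quoted in the notes


def is_fresh(signed_at: float, now: float, expiry: float = EXPIRY_SECONDS) -> bool:
    """Accept an operation only if it was signed within the last
    `expiry` seconds according to the verifier's clock."""
    age = now - signed_at
    return 0 <= age <= expiry


# Made-up timestamps, in seconds on the verifier's clock.
verifier_now = 1_000.0
```

A signer whose clock is more than 20 seconds behind the verifier looks expired, and one whose clock is ahead looks like it signed in the future, so either kind of drift causes valid operations to be rejected.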