
Version 1 (modified by lnevers@bbn.com, 12 years ago)


ExoGENI Preliminary Acceptance Test Report

This page captures the acceptance test findings for the ExoGENI rack project as of August 15, 2012, as described on the ExoGENI Acceptance Test Plan page. For the status of individual tests, see the ExoGENI Acceptance Test Status page.

Experimenter Test Case Status

This section provides a summary of major features that have been validated as well as the current list of outstanding issues for the Experimenter test cases.

Functionality verified

Significant progress has been made on the Experimenter test cases; the following features have been validated:

  • Support for at least one bare metal compute resource in each rack.
  • Support for VM and bare metal compute resources simultaneously in a single rack.
  • Support for OpenFlow, both intra-rack and inter-rack.
  • Support for external OpenFlow (OF) and non-OF VLAN connections through the rack.
  • Shared and dynamic VLAN support (intra-rack and between racks).
  • Meso-scale interoperability.
  • Federation with the GPO ProtoGENI (PG) aggregate.
  • GENI v3 RSpec support.

Outstanding issues

Various issues were found while running the Experimenter test cases. Most have been addressed; those that remain are categorized below by impact:

High impact

These issues have a high impact on the ability to support experimenters in the GENI environment.

  • Neither sliver counts nor resource counts are available to experimenters for ExoGENI slices. (All EG-EXP)

Medium impact

These issues significantly affect functions requested by GENI experimenters, but are not needed for Spiral 4-scale deployment.

  • Microsoft Windows is not supported on bare metal nodes. Windows support on VMs has been discussed but is also not available. (EG-EXP-1)
  • There is no custom OS image support for Experimenters. (EG-EXP-2)
  • Bare metal nodes only support one Operating System version at this time, CentOS release 6.3, which is provided by the ExoGENI Team. (EG-EXP-1)
  • Provisioning of a bare metal node sometimes fails partway, leaving the node not fully configured. (EG-EXP-1)
  • No documented procedure exists for changing a rack's compute resource allocation to add bare metal nodes, and none is planned at this time. (EG-EXP-1)
  • Requests for large numbers of VMs do not complete successfully. Various failures have occurred, most commonly during the resource provisioning phase for both compute resources and VLANs. (EG-EXP-3)

Low impact

These issues should be resolved and documented, but racks can begin operations without them.

  • OS image availability is not included in the advertisement RSpec. (All EG-EXP)
  • The default expectation, that each client_id in the Manifest RSpec matches the client_id in the Request RSpec, is not met. The AM API acceptance test must be modified to expect unbound resources; with this change in place, the rack is compliant.
  • Experimenters have no monitoring support for bare metal nodes. This should be documented for its impact on both operational and I&M monitoring. (All EG-EXP)
  • An image Playpen system is planned but not yet available.

Administration and Monitoring Test Cases

This section summarizes status and outstanding issues for the ExoGENI rack administration and monitoring tests. In particular, this report covers the EG-ADM and EG-MON tests, as outlined in the acceptance test plan.

Functionality verified

  • LDAP credentials allow site admin access to the head node, worker nodes, switches, Nagios, and wiki (EG-ADM-1)
  • Nagios provides rack status information and alerting (EG-ADM-1)
  • Interim change/outage notification via mailing lists works (EG-ADM-1)
  • An inventory of rack parts and connectivity was possible using our standard site processes, though it was laborious (EG-ADM-1)
  • Rack control network remote access security seems reasonable provided the head node and SSG5 implement network separation well (EG-ADM-2)
  • Site admins can get OpenFlow state information from the switch, FlowVisor, and FOAM (EG-MON-2)
  • Site admins can inspect ORCA aggregate state and resources on the rack (EG-MON-2)

Outstanding issues

This section describes problems found in testing for which fixes are still in progress.

High impact

These issues risk security and stability problems for host sites.

  • The VLAN-based separation of control networks, enforced by bbn-hn and bbn-ssg, still needs to be fully documented, implemented, and verified. This is of particular concern because the control network's topology is complex. (EG-ADM-2)
  • Both public and private documentation have been reviewed. Some documentation was found, but most is still unavailable or incomplete. For details, see the EG-ADM-7 status page.

Moderate impact

These issues cause substantial annoyance for site administrators and make day-to-day troubleshooting more difficult.

  • DNS is not fully defined for all public and private IPs. Private DNS in particular has been brittle: it breaks frequently, and includes only a fraction of addresses in use. (EG-ADM-1)
  • ExoGENI offered SVN/rancid switch configuration polling as an alternative to privileged site admin access to switches. This functionality would be acceptable if it worked, but has been partially broken for some time. (EG-ADM-1)
  • Site admin access policies (e.g. what level of access site admins should expect to a particular service or host) are not well-defined, and bitrot of site admin access is common (i.e. when we need to access something which worked at one time, we often find that it no longer works). (EG-ADM-1)

Infrequent impact

These issues should be resolved so that site administrators can get information they need, especially since information may be needed in a hurry to investigate a security or stability problem. However, asking RENCI is an acceptable short-term alternative to self-service capability, provided RENCI is responsive.

  • We have not yet been able to locate sources for all software running on bbn-hn. More iteration is needed on both sides on this. (EG-MON-1)

Test Case Status

Test cases in progress

As of 2012-08-14, the following tests are in progress:

  • Verification that VLANs and MAC addresses on the control switch are as expected (EG-MON-1)
  • Verification that site admins can get information about active and recently-terminated experiments on the rack (EG-MON-3)
  • Verification of a multi-site, multi-experiment, multi-experimenter, multi-subnet, multi-controller, mesoscale-interoperability OpenFlow scenario (EG-EXP-6)

Tests which have not been run

Tests not run due to stability concerns: moderate impact

GPO did not run some tests because of concerns that they would put the rack into a bad state and be time-consuming to debug. These are tests of the rack's response to likely outage and failure modes, and we would like to verify that the rack is stable enough to handle these conditions:

  • Rack reboot test (EG-ADM-3)
  • Control network disconnection test (EG-ADM-6)

Tests not run due to higher-priority testing: low impact

GPO did not run some tests because the benefit during this spiral is low, or because prerequisites are not met. We will revisit this functionality in Spiral 5.

  • Software update test (EG-ADM-5)
  • Infrastructure device performance test (EG-MON-4)

Tests to be run before GEC15

These tests have not yet been run because their prerequisites were not met. They have been discussed and will be run in the GEC14-GEC15 period.

  • Emergency Stop test (EG-ADM-4)
  • GMOC data collection test (EG-MON-5)

Test Case Descriptions and Status

The full descriptions of the tests covered by this report, along with their status where available, are linked here for convenience: