
GRAM Administration Guide

Introduction

This document describes procedures and provides context for administrators of GRAM-controlled racks. We provide details on:

  • Software maintenance: Upgrading GRAM software to new releases
  • Troubleshooting: Diagnosing and investigating unexpected GRAM behavior

Software Maintenance

Software installation should come in the form of a new gram ".deb" file provided directly by the GENI Program Office or via a public URL. Different .deb files are provided for control and compute nodes (e.g. gram_control.deb and gram_compute.deb).

These can be installed using the dpkg utility (e.g. dpkg -i <deb file>) or using gdebi (e.g. gdebi <deb file>).
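For example (a sketch; the file names follow the convention above and root privileges are assumed):

    # On the control node:
    $ sudo dpkg -i gram_control.deb      # or: sudo gdebi gram_control.deb
    # On each compute node:
    $ sudo dpkg -i gram_compute.deb      # or: sudo gdebi gram_compute.deb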

Software Versions

The version of the installed gram package can be determined as follows:

  • dpkg --info gram_control.deb (or gram_compute.deb) will provide the version of that deb release
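For example, to see just the version field (dpkg --info prints the package control information, which includes a Version: line):

    $ dpkg --info gram_control.deb | grep Version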

Troubleshooting

Connectivity Problems

  • Make sure you can ping on both the external network and the control network
  • Trunking: make sure the ports on the router and switch allow VLAN-tagged traffic to flow
  • tcpdump and wireshark are very useful tools for seeing where traffic is being generated and where it is getting stuck (see the sketch after this list). Be careful about interfaces: data plane traffic is what the VM's speak (except for SSH traffic into them) and goes through the OpenFlow-controlled switch. Control traffic for OpenStack occurs on the control plane interfaces, while SSH traffic into the VM's occurs on the management plane interfaces; both flow through the non-OpenFlow-controlled switch.
  • OpenFlow control applies only to traffic flowing through the switch; VM's collocated on the same compute node can't be OpenFlow controlled. We try to place VM's on different nodes but do not make guarantees.
  • It often helps to reboot a VM (sudo nova reboot <instance_id>) to see the traffic going from the VM back to the control node while tracing with tcpdump or wireshark.
  • Make sure the controller is up and running, and connected to the switch
  • Check that the controller registered for the slice matches the controller that is actually running. The VMOC management interface (port 7001) can be asked to dump its current state:
    • echo 'dump' | nc localhost 7001
  • The data plane interface may not be up (even if the quantum interfaces attached to it are up). If ifconfig eth1 shows the interface is not up (assuming eth1 is the data plane interface), then run sudo ip link set eth1 up.
  • Check that the OVS configuration is appropriate on the Control and Compute nodes:
  • Verify OVS configuration on the Controller node:
    • There should be a qg port on br-ex for each external network
$ sudo ovs-vsctl show
107352c3-a0bb-4598-a3a3-776c5da0b62b
    Bridge "br-eth1"
        Port "phy-br-eth1"
            Interface "phy-br-eth1"
        Port "eth1"
            Interface "eth1"
        Port "br-eth1"
            Interface "br-eth1"
                type: internal
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
        Port "eth2"
            Interface "eth2"
        Port "qg-9816149f-9c"
            Interface "qg-9816149f-9c"
                type: internal
    Bridge br-int
        Port "int-br-eth1"
            Interface "int-br-eth1"
        Port br-int
            Interface br-int
                type: internal
    ovs_version: "1.4.0+build0"
  • Verify OVS configuration on the Compute nodes:
    • Assuming no VMs on the compute node
      $ sudo ovs-vsctl show
      4ec3588c-5c8f-4d7f-8626-49909e0e4e02
          Bridge br-int
              Port br-int
                  Interface br-int
                      type: internal
              Port "int-br-eth1"
                  Interface "int-br-eth1"
          Bridge "br-eth1"
              Port "phy-br-eth1"
                  Interface "phy-br-eth1"
              Port "br-eth1"
                  Interface "br-eth1"
                      type: internal
              Port "eth1"
                  Interface "eth1"
          ovs_version: "1.4.0+build0"
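
As noted above, tracing with tcpdump while rebooting a VM is often the quickest way to see where traffic stops. The commands below are a sketch only; eth1, br-ex, and the instance id are assumptions that vary per rack:

    # On the compute node hosting the VM: watch VLAN-tagged traffic on the data plane interface
    $ sudo tcpdump -i eth1 -e -n vlan
    # On the control node: watch traffic on the external bridge
    $ sudo tcpdump -i br-ex -n
    # From another window, reboot the VM and watch for its traffic
    $ sudo nova reboot <instance_id>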
      

Service Problems

  • Make sure your OpenStack environment is set. Any command you run by hand or in a script needs to have the variables established in /etc/novarc set. Do a 'source /etc/novarc' to be sure.
  • Verify all expected services registered with Nova
    • Expect to see nova-cert, nova-consoleauth, and nova-scheduler on the controller node and nova-compute on each compute node. All should have State = :-) (not XXX)
         $ sudo nova-manage service list
         Binary           Host                                 Zone             Status     State Updated_At
         nova-cert        pridevcontrol                        nova             enabled    :-)   2013-02-07 20:47:38
         nova-consoleauth pridevcontrol                        nova             enabled    :-)   2013-02-07 20:47:37
         nova-scheduler   pridevcontrol                        nova             enabled    :-)   2013-02-07 20:47:38
         nova-compute     pridevcompute1                       nova             enabled    :-)   2013-02-07 20:47:33
         nova-compute     pridevcompute2                       nova             enabled    :-)   2013-02-07 20:47:35
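
To quickly flag any service that has gone down, the same listing can be filtered for the XXX state (a sketch; no output means all registered services are reporting healthy):

    $ sudo nova-manage service list | grep XXX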
      

More generally, make sure that all OpenStack and GRAM services are up and running by running service <service_name> status. Control Node:

  • Nova (Control)
    • nova-api
    • nova-cert
    • nova-consoleauth
    • nova-novncproxy
    • nova-scheduler
  • Glance (Control)
    • glance-api
    • glance-registry
  • Keystone (Control)
    • keystone
  • Quantum (Control)
    • quantum-server
    • quantum-plugin-openvswitch-agent
    • quantum-dhcp-agent
    • quantum-l3-agent
    • openvswitch-switch
    • rabbitmq-server
  • MySQL (Control)
    • mysql
  • GRAM (Control)
    • gram-am
    • gram-amv2
    • gram-ch
    • gram-cni
    • gram-ctrl
    • gram-vmoc

Compute Node:

  • Nova (Compute)
    • nova-api-metadata
    • nova-compute
  • Quantum (Compute)
    • quantum-plugin-openvswitch-agent
    • openvswitch-switch
  • KVM
    • qemu-kvm
    • libvirt-bin
  • On the control node, make sure that network servers are listening on the following ports (that is, do a netstat -na | grep <port> and look for a line that says "LISTEN"). A scripted check is sketched after this list.
    • 8000: GRAM Clearinghouse (unless you are using a different clearinghouse)
    • 8001: GRAM Aggregate Manager
    • 8002: GRAM Aggregate Manager V2
    • 9000: VMOC Default Controller
    • 7001: VMOC Management
    • 6633: VMOC
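
Both the service and port checks can be scripted. The following is a sketch only; the service and port lists are taken from the lists above and should be adjusted to match the node being checked:

    # Sketch: check control-node services and listening ports (names/ports from the lists above).
    for svc in nova-api nova-cert nova-consoleauth nova-novncproxy nova-scheduler \
               glance-api glance-registry keystone quantum-server \
               quantum-plugin-openvswitch-agent quantum-dhcp-agent quantum-l3-agent \
               openvswitch-switch rabbitmq-server mysql \
               gram-am gram-amv2 gram-ch gram-cni gram-ctrl gram-vmoc; do
        sudo service "$svc" status
    done
    for port in 8000 8001 8002 9000 7001 6633; do
        netstat -na | grep -q ":${port} .*LISTEN" && echo "port ${port}: LISTEN" || echo "port ${port}: NOT listening"
    done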

Starting Services on Different Ports

For debugging it is often helpful to start a service on a different port in a command window.

  • VMOC
    • Stop the VMOC service: sudo service gram-vmoc stop
    • Start VMOC: /opt/pox/pox.py -- log.level --DEBUG vmoc.VMOC --management_port=7001 --default_controller_url=https://<default_controller_host>:9000
    • These URLs and ports can be changed on the command line as needed.
  • Default Controller
    • Stop the VMOC Default Controller: sudo service gram-ctrl stop
    • Start the Default Controller: /opt/pox/pox.py -- log.level --DEBUG openflow.of_01 --port=9000 vmoc.l2_simple_learning
    • This port # can be changed as needed (to match the VMOC configuration above)
  • GRAM Aggregate Manager
    • Stop the GRAM Aggregate Manager: sudo service gram-am stop
    • Start the GRAM Aggregate Manager: python /home/gram/gram/src/gram-am.py -V3 -p 8001
    • The port can be modified as needed but should match the [aggregate_manager] entry in ~gram/.gcf/gcf_config.
    • The GRAM Aggregate Manager V2 can be run (and port modified) by this command: python /home/gram/gram/src/gram-am.py -V2 -p 8002
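
When debugging is complete, stop the foreground process (Ctrl-C) and restart the normal services so the rack returns to its standard configuration (a sketch, using the service names above):

    $ sudo service gram-vmoc start
    $ sudo service gram-ctrl start
    $ sudo service gram-am start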

KVM virtualization

  • Verify KVM is installed and able to use hardware virtualization:
    • NOTE: kvm-ok is part of the cpu-checker package
         $ kvm -version
         QEMU emulator version 1.0 (qemu-kvm-1.0), Copyright (c) 2003-2008 Fabrice Bellard
         $ kvm-ok
         INFO: /dev/kvm exists
         KVM acceleration can be used
         $ sudo service libvirt-bin status
         libvirt-bin start/running, process 2537
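
If kvm-ok reports that acceleration cannot be used, a common cause is that the KVM kernel modules are not loaded (a sketch; the module name depends on the CPU vendor):

    $ lsmod | grep kvm          # expect kvm plus kvm_intel or kvm_amd
    $ sudo modprobe kvm_intel   # or kvm_amd on AMD hardware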
      

Metadata service requirements

  • Nova should have set up a NAT rule for metadata services
    $ sudo iptables -t nat -L
    ...
    Chain quantum-l3-agent-PREROUTING (1 references)
    target     prot opt source               destination         
    DNAT       tcp  --  anywhere             169.254.169.254      tcp dpt:http to:10.10.8.71:8775
    ...
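
To confirm that the rule actually forwards traffic, the metadata service can be queried from inside a running VM (a sketch; 169.254.169.254 is the standard metadata address targeted by the DNAT rule above):

    $ curl http://169.254.169.254/latest/meta-data/
    # Should return a list of metadata keys; a hang suggests the NAT rule or nova-api-metadata is broken.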
    
    

Resource Allocation Problems

  • Duplicate slice names: One may not have two slices with the same URN, which is composed of the project and slice name. If a given user is trying to create a slice and gets a 'duplicate slice name' error, they should change the name of the slice they are trying to create or delete the old version. If the old version is not evident in OpenStack, restart the aggregate manager. Otherwise, see the 'Cleanup Procedures' below.
  • SSH proxy doesn't work.
    • Make sure the NAT rule is in place on the control node: sudo iptables -L -t nat
    • Make sure gram_ssh_proxy is installed on the control node in /usr/local/bin with privileges -rwsr-xr-x
  • Out of resources: The rack has a limited set of CPU and memory resources and thus can only allocate a given number of VM's of particular flavors. If this problem occurs, the rack may be saturated. It may be that all slices are genuinely in use, or it may be that there are many old resources that can be harvested and reused. Look at the 'Cleanup Procedures' below.
  • Isolation: GRAM provides no guarantees on network or CPU isolation (everything is shared, subject to whatever isolation the KVM, Quantum, and OVS layers provide)
  • "VM Build Error".
    • The logs for the VM's that Nova/KVM tries to build are in /var/lib/nova/instances/<instance_name>/console.log. You can look at these to see what errors occurred while trying to boot the VM. If the log is empty, look in the nova-compute logs in /var/log/upstart.
    • To tell what instance a VM is and where it is running, do a 'nova list --all-tenants' to find the instance id, and then do a 'nova show <instance_id>' to find the compute node and instance name:
gram@boscontroller:/usr/local/bin$ nova list --all-tenants
+--------------------------------------+------+--------+---------------------------------------------------------------------+
| ID                                   | Name | Status | Networks                                                            |
+--------------------------------------+------+--------+---------------------------------------------------------------------+
| 01782225-00b1-4ab8-bba3-c7452833b8c2 | VM-1 | ACTIVE | cntrlNet-marilac:SPOON+slice+SPORK=10.10.108.100; lan0=10.0.109.100 |
| 1aa8ba40-63a2-4a58-b533-faf18c674b77 | VM-2 | ACTIVE | cntrlNet-marilac:SPOON+slice+SPORK=10.10.108.101; lan0=10.0.109.101 |
+--------------------------------------+------+--------+---------------------------------------------------------------------+
gram@boscontroller:/usr/local/bin$ nova show 01782225-00b1-4ab8-bba3-c7452833b8c2
+--------------------------------------------+----------------------------------------------------------+
| Property                                   | Value                                                    |
+--------------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig                          | MANUAL                                                   |
| OS-EXT-SRV-ATTR:host                       | boscompute4                                              | 
| OS-EXT-SRV-ATTR:hypervisor_hostname        | boscompute4         # This is the VM host                                     | 
| OS-EXT-SRV-ATTR:instance_name              | instance-0000000f     # This is the VM name                                   | 
| OS-EXT-STS:power_state                     | 1                                                        |
| OS-EXT-STS:task_state                      | None                                                     |
| OS-EXT-STS:vm_state                        | active                                                   |
| accessIPv4                                 |                                                          |
| accessIPv6                                 |                                                          |
| cntrlNet-marilac:SPOON+slice+SPORK network | 10.10.108.100                                            |
| config_drive                               |                                                          |
| created                                    | 2013-04-19T13:29:56Z                                     |
| flavor                                     | m1.small (2)                                             |
| hostId                                     | b590c0756658c24a3aea56372b5c71d2649f16fabad174ee796f40d0 |
| id                                         | 01782225-00b1-4ab8-bba3-c7452833b8c2                     |
| image                                      | ubuntu-12.04 (93779c42-a5d7-4144-ac78-4a597c74a92a)      |
| key_name                                   | None                                                     |
| lan0 network                               | 10.0.109.100                                             |
| metadata                                   | {}                                                       |
| name                                       | VM-1                                                     |
| progress                                   | 0                                                        |
| security_groups                            | [{u'name': u'marilac:SPOON+slice+SPORK_secgrp'}]         |
| status                                     | ACTIVE                                                   |
| tenant_id                                  | 88e8b222da0349528e5864ba60220cfa                         |
| updated                                    | 2013-04-19T13:30:11Z                                     |
| user_id                                    | 77c695dfaf2640a38db0352f4a771828                         |
+--------------------------------------------+----------------------------------------------------------+
gram@boscontroller:/usr/local/bin$ ssh boscompute4
gram@boscompute4:~$ ls -l /var/lib/nova/instances/instance-0000000f/console.log
-rw-rw---- 1 libvirt-qemu kvm 0 Apr 19 09:30 /var/lib/nova/instances/instance-0000000f/console.log
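
If the console log is empty (size 0, as in the listing above), the next place to look is the nova-compute log under /var/log/upstart. The file name below is an assumption and may differ on your rack:

    gram@boscompute4:~$ tail -n 50 /var/lib/nova/instances/instance-0000000f/console.log
    gram@boscompute4:~$ grep -i error /var/log/upstart/nova-compute.log | tail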

Cleanup Procedures

  • Cleanup.py. This is a script that cleans up all OpenStack resources associated with a given slice (by URN or tenant ID).
  • Manual Cleanup. If there is a slice or resource that needs to be deleted from OpenStack, here's how:
    • keystone user-list
    • keystone user-delete <user_id from above>
    • keystone tenant-list
    • keystone tenant-delete <tenant_id from above>
    • nova list --all-tenants
    • nova delete <instance_id from above>
    • quantum net-list
    • quantum net-delete <net_id from above> NOTE: Be careful not to delete the public network
    • Then restart the gram servers:
           sudo service gram-am restart
           sudo service gram-amv2 restart
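
Putting the manual steps together, a typical cleanup pass looks like the following sketch (IDs are placeholders obtained from the corresponding list commands; be careful not to delete the public network):

    $ source /etc/novarc
    $ keystone user-list
    $ keystone tenant-list
    $ nova list --all-tenants
    $ nova delete <instance_id>
    $ quantum net-list
    $ quantum net-delete <net_id>     # do NOT delete the public network
    $ keystone user-delete <user_id>
    $ keystone tenant-delete <tenant_id>
    $ sudo service gram-am restart
    $ sudo service gram-amv2 restart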