OpenGENI GRAM Administration Guide
Introduction
This document describes procedures and provides context for administrators of GRAM-controlled racks. We provide details on:
- Software maintenance: Upgrading GRAM software to new releases
- Troubleshooting: Diagnosing and investigating unexpected GRAM behavior
Software Maintenance
Software installation should come in the form of a new gram ".deb" file provided directly by the GENI Program Office or via a public URL. Different .deb files are provided for control and compute nodes (e.g. gram_control.deb and gram_compute.deb).
These can be installed using the dpkg utility (e.g. dpkg -i <deb file>) or using gdebi (e.g. gdebi <deb file>).
Software Versions
The version of the gram install can be derived as follows:
- dpkg --info gram_control.deb (or gram_compute.deb) will provide the version of that deb release
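When several nodes need to be checked, the version lookup above can be scripted. The following is a minimal Python sketch, assuming the usual dpkg --info layout in which the control fields include a line beginning with "Version:". The helper names parse_deb_version and deb_version, and the sample text (including its version number), are hypothetical and not part of GRAM.

```python
import re
import subprocess


def parse_deb_version(info_text):
    """Return the Version field from dpkg --info output, or None if absent."""
    match = re.search(r"^[ \t]*Version:[ \t]*(\S+)", info_text,
                      flags=re.MULTILINE)
    return match.group(1) if match else None


def deb_version(deb_path):
    """Run dpkg --info on a .deb file and extract its version."""
    output = subprocess.check_output(["dpkg", "--info", deb_path],
                                     universal_newlines=True)
    return parse_deb_version(output)


# Hypothetical sample output; the field values are made up for illustration.
SAMPLE_INFO = """\
 new debian package, version 2.0.
 Package: gram
 Version: 1.2.3
 Architecture: amd64
"""

if __name__ == "__main__":
    print(parse_deb_version(SAMPLE_INFO))
```

On a rack, deb_version("gram_control.deb") would run dpkg itself; the parsing is kept separate so it can be tried on saved output.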
Troubleshooting
Connectivity Problems
- Make sure you can ping on both the external network and the control network
- Trunking - make sure the ports on router and switch allow for VLAN-tagged traffic to flow
- tcpdump and wireshark are very useful tools for seeing where traffic is being generated and where it is getting stuck. Be careful about which interfaces you trace. Data plane traffic is what the VMs speak (except for SSH traffic into them), and it goes through the OpenFlow-controlled switch. Control traffic for OpenStack occurs on the control plane interfaces, while SSH traffic into the VMs occurs on the management plane interfaces. Both of these flow through the non-OpenFlow-controlled switch.
- OpenFlow control applies only to traffic flowing through the switch. Collocated VMs (VMs on the same compute node) can't be OpenFlow controlled. We try to place VMs on different nodes but do not make guarantees.
- It often helps to reboot a VM (sudo nova reboot <instance_id>) to see the traffic going from the VM back to the control node, while tracing in tcpdump or wireshark.
- Make sure the controller is up and running, and connected to the switch
- Check that the controller for the slice matches the controller running
- The data plane interface may not be up (even if the quantum interfaces to that interface are up). If ifconfig eth1 shows the interface is not up (assuming eth1 is the data plane interface), then run sudo ip link set eth1 up.
- The VMOC management interface (port 7001) can be queried for its current state: echo 'dump' | nc localhost 7001
- Check that the OVS configuration is appropriate on the Control and Compute nodes:
- Verify OVS configuration on the Controller node:
- There should be a qg port on br-ex for each external network
$ sudo ovs-vsctl show
107352c3-a0bb-4598-a3a3-776c5da0b62b
    Bridge "br-eth1"
        Port "phy-br-eth1"
            Interface "phy-br-eth1"
        Port "eth1"
            Interface "eth1"
        Port "br-eth1"
            Interface "br-eth1"
                type: internal
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
        Port "eth2"
            Interface "eth2"
        Port "qg-9816149f-9c"
            Interface "qg-9816149f-9c"
                type: internal
    Bridge br-int
        Port "int-br-eth1"
            Interface "int-br-eth1"
        Port br-int
            Interface br-int
                type: internal
    ovs_version: "1.4.0+build0"
- Verify OVS configuration on the Compute nodes:
- Assuming no VMs on the compute node
$ sudo ovs-vsctl show
4ec3588c-5c8f-4d7f-8626-49909e0e4e02
    Bridge br-int
        Port br-int
            Interface br-int
                type: internal
        Port "int-br-eth1"
            Interface "int-br-eth1"
    Bridge "br-eth1"
        Port "phy-br-eth1"
            Interface "phy-br-eth1"
        Port "br-eth1"
            Interface "br-eth1"
                type: internal
        Port "eth1"
            Interface "eth1"
    ovs_version: "1.4.0+build0"
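The OVS checks above can be automated by parsing the text that ovs-vsctl show prints. The following Python sketch is one possible approach, assuming the output layout shown in this guide (Bridge and Port lines, names optionally quoted). The function names and the trimmed sample are illustrative only.

```python
import re


def parse_ovs_show(text):
    """Map each bridge name to the list of port names listed under it."""
    bridges = {}
    current = None
    for line in text.splitlines():
        match = re.match(r'\s*(Bridge|Port)\s+"?([^"\s]+)"?\s*$', line)
        if not match:
            continue  # Interface, type and version lines are ignored
        kind, name = match.groups()
        if kind == "Bridge":
            current = name
            bridges[current] = []
        elif current is not None:
            bridges[current].append(name)
    return bridges


def has_external_gateway_port(bridges):
    """True if br-ex carries at least one qg- port."""
    return any(port.startswith("qg-") for port in bridges.get("br-ex", []))


# Trimmed from the control node output shown in this guide.
SAMPLE = '''
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
        Port "eth2"
            Interface "eth2"
        Port "qg-9816149f-9c"
            Interface "qg-9816149f-9c"
                type: internal
    Bridge br-int
        Port "int-br-eth1"
            Interface "int-br-eth1"
'''

if __name__ == "__main__":
    parsed = parse_ovs_show(SAMPLE)
    print(parsed)
    print(has_external_gateway_port(parsed))
```

The sketch only checks that some qg port exists on br-ex; comparing the count against the number of external networks is left to the administrator.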
Service Problems
- Make sure your OpenStack environment is set. Any command you run by hand or in a script needs to have the variables established in /etc/novarc set. Do a 'source /etc/novarc' to be sure.
- Verify all expected services registered with Nova
- Expect to see nova-cert, nova-consoleauth, and nova-scheduler on the controller node and nova-compute on each compute node. All should have State = :-) (not XXX)
$ sudo nova-manage service list
Binary           Host                                 Zone             Status     State Updated_At
nova-cert        pridevcontrol                        nova             enabled    :-)   2013-02-07 20:47:38
nova-consoleauth pridevcontrol                        nova             enabled    :-)   2013-02-07 20:47:37
nova-scheduler   pridevcontrol                        nova             enabled    :-)   2013-02-07 20:47:38
nova-compute     pridevcompute1                       nova             enabled    :-)   2013-02-07 20:47:33
nova-compute     pridevcompute2                       nova             enabled    :-)   2013-02-07 20:47:35
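To spot a failed service without reading the listing by eye, the State column can be checked programmatically. This is a small Python sketch, assuming the whitespace-separated column layout shown above; the function name is hypothetical, and the sample has one row deliberately changed to XXX for illustration.

```python
def find_down_services(listing):
    """Return (binary, host) pairs whose State column is not ':-)'."""
    down = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows: binary, host, zone, status, state, date, time
        if len(fields) < 5 or fields[0] == "Binary":
            continue
        binary, host, state = fields[0], fields[1], fields[4]
        if state != ":-)":
            down.append((binary, host))
    return down


# Sample based on the listing in this guide; the last row is altered to XXX.
SAMPLE = """\
Binary           Host                                 Zone             Status     State Updated_At
nova-cert        pridevcontrol                        nova             enabled    :-)   2013-02-07 20:47:38
nova-scheduler   pridevcontrol                        nova             enabled    :-)   2013-02-07 20:47:38
nova-compute     pridevcompute1                       nova             enabled    :-)   2013-02-07 20:47:33
nova-compute     pridevcompute2                       nova             enabled    XXX   2013-02-07 20:47:35
"""

if __name__ == "__main__":
    for binary, host in find_down_services(SAMPLE):
        print("%s on %s is not reporting in" % (binary, host))
```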
More generally, make sure that all OpenStack and GRAM services are up and running by running "service <service_name> status" for each of the services listed below.
Control Node:
- Nova (Control)
- nova-api
- nova-cert
- nova-consoleauth
- nova-novncproxy
- nova-scheduler
- Glance (Control)
- glance-api
- glance-registry
- Keystone (Control)
- keystone
- Quantum (Control)
- quantum-server
- quantum-plugin-openvswitch-agent
- quantum-dhcp-agent
- quantum-l3-agent
- openvswitch-switch
- rabbitmq-server
- MySQL (Control)
- mysql
- GRAM (Control)
- gram-am
- gram-amv2
- gram-ch
- gram-cni
- gram-ctrl
- gram-vmoc
Compute Node:
- Nova (Compute)
- nova-api-metadata
- nova-compute
- Quantum (Compute)
- quantum-plugin-openvswitch-agent
- openvswitch-switch
- KVM
- qemu-kvm
- libvirt-bin
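Checking each service by hand is tedious, so the lists above can be looped over. The sketch below is a Python illustration under stated assumptions: it treats a status line containing "start/running" (the format shown for libvirt-bin later in this guide) or "is running" as healthy, and "stop/waiting" or "not running" as down. Those string heuristics, and all function names, are assumptions rather than GRAM behavior.

```python
import subprocess

# Service names copied from the lists in this guide.
CONTROL_SERVICES = [
    "nova-api", "nova-cert", "nova-consoleauth", "nova-novncproxy",
    "nova-scheduler", "glance-api", "glance-registry", "keystone",
    "quantum-server", "quantum-plugin-openvswitch-agent",
    "quantum-dhcp-agent", "quantum-l3-agent", "openvswitch-switch",
    "rabbitmq-server", "mysql", "gram-am", "gram-amv2", "gram-ch",
    "gram-cni", "gram-ctrl", "gram-vmoc",
]
COMPUTE_SERVICES = [
    "nova-api-metadata", "nova-compute", "quantum-plugin-openvswitch-agent",
    "openvswitch-switch", "qemu-kvm", "libvirt-bin",
]


def looks_running(status_output):
    """Heuristic: decide from 'service <name> status' text if it is up."""
    text = status_output.lower()
    if "not running" in text or "stop/waiting" in text:
        return False
    return "start/running" in text or "is running" in text


def service_status_output(name):
    """Run 'service <name> status' and return its combined output."""
    proc = subprocess.Popen(["service", name, "status"],
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                            universal_newlines=True)
    return proc.communicate()[0]


def report(services):
    """Print one line per service; CHECK marks anything not clearly up."""
    for name in services:
        up = looks_running(service_status_output(name))
        print("%-36s %s" % (name, "up" if up else "CHECK"))
```

Calling report(CONTROL_SERVICES) on the control node, or report(COMPUTE_SERVICES) on a compute node, prints a one-line summary per service. Anything marked CHECK should be inspected by hand, since the heuristic may not match every init script's wording.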
- On the control node, make sure that network servers are listening on the following ports (that is, do a netstat -na | grep <port> and see a line that says "LISTEN"):
- 8000: GRAM Clearinghouse (unless you are using a different clearinghouse)
- 8001: GRAM Aggregate Manager
- 8002: GRAM Aggregate Manager V2
- 9000: VMOC Default Controller
- 7001: VMOC Management
- 6633: VMOC
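As an alternative to grepping netstat output for each port, a short script can try a TCP connection to every port in the list. This Python sketch assumes the services accept connections on the loopback address; a service bound only to a specific external address would be reported as not listening. The names GRAM_PORTS and is_listening are illustrative.

```python
import socket

# Port assignments copied from the list in this guide.
GRAM_PORTS = {
    8000: "GRAM Clearinghouse",
    8001: "GRAM Aggregate Manager",
    8002: "GRAM Aggregate Manager V2",
    9000: "VMOC Default Controller",
    7001: "VMOC Management",
    6633: "VMOC",
}


def is_listening(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        return sock.connect_ex((host, port)) == 0
    finally:
        sock.close()


if __name__ == "__main__":
    for port in sorted(GRAM_PORTS):
        state = "LISTEN" if is_listening(port) else "not listening"
        print("%5d  %-28s %s" % (port, GRAM_PORTS[port], state))
```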
Starting Services on Different Ports
For debugging it is often helpful to start a service on a different port in a command window.
- VMOC
- Stop the VMOC service: sudo service gram-vmoc stop
- Start VMOC: /opt/pox/pox.py log.level --DEBUG vmoc.VMOC --management_port=7001 --default_controller_url=https://<default_controller_host>:9000
- These URLs and ports can be changed on the command line as needed.
- Default Controller
- Stop the VMOC Default Controller: sudo service gram-ctrl stop
- Start the Default Controller: /opt/pox/pox.py log.level --DEBUG openflow.of_01 --port=9000 vmoc.l2_simple_learning
- This port # can be changed as needed (to match the VMOC configuration above)
- GRAM Aggregate Manager
- Stop the GRAM Aggregate Manager: sudo service gram-am stop
- Start the GRAM Aggregate Manager: python /home/gram/gram/src/gram-am.py -V3 -p 8001
- The port can be modified as needed but should match the [aggregate_manager] entry in ~gram/.gcf/gcf_config.
- The GRAM Aggregate Manager V2 can be run (and port modified) by this command: python /home/gram/gram/src/gram-am.py -V2 -p 8002
KVM virtualization
- Verify KVM is installed and able to use hardware virtualization:
- NOTE: kvm-ok is part of the cpu-checker package
$ kvm -version
QEMU emulator version 1.0 (qemu-kvm-1.0), Copyright (c) 2003-2008 Fabrice Bellard
$ kvm-ok
INFO: /dev/kvm exists
KVM acceleration can be used
$ sudo service libvirt-bin status
libvirt-bin start/running, process 2537
Metadata service requirements
- Nova should have set up a NAT rule for metadata services
$ sudo iptables -t nat -L
...
Chain quantum-l3-agent-PREROUTING (1 references)
target     prot opt source               destination
DNAT       tcp  --  anywhere             169.254.169.254      tcp dpt:http to:10.10.8.71:8775
...
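The presence of the metadata NAT rule can also be confirmed by scanning the iptables listing for a DNAT entry whose destination is 169.254.169.254. The Python sketch below assumes the column layout of iptables -t nat -L shown above; the function name is hypothetical.

```python
import re


def find_metadata_dnat(nat_listing):
    """Return the 'host:port' targets of DNAT rules for 169.254.169.254."""
    targets = []
    for line in nat_listing.splitlines():
        fields = line.split()
        # Columns: target, prot, opt, source, destination, then match details
        if (len(fields) >= 5 and fields[0] == "DNAT"
                and fields[4] == "169.254.169.254"):
            match = re.search(r"to:(\S+)", line)
            if match:
                targets.append(match.group(1))
    return targets


# Sample taken from the listing in this guide.
SAMPLE = """\
Chain quantum-l3-agent-PREROUTING (1 references)
target     prot opt source               destination
DNAT       tcp  --  anywhere             169.254.169.254      tcp dpt:http to:10.10.8.71:8775
"""

if __name__ == "__main__":
    print(find_metadata_dnat(SAMPLE))
```

An empty result suggests the rule is missing and the metadata service will not be reachable from the VMs.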
Resource Allocation Problems
- Duplicate slice names: No two slices may have the same URN, which is composed of the project and slice name. If a user trying to create a slice gets a 'duplicate slice name' error, they should change the name of the slice they are trying to create, or delete the old slice. If the old slice is not evident in OpenStack, restart the aggregate manager. Otherwise, see the 'Cleanup Procedures' below.
- SSH proxy doesn't work.
- Make sure the NAT rule is in place on the control_node: sudo iptables -L -t nat
- Make sure gram_ssh_proxy is installed on control_node in /usr/local/bin with privileges: -rwsr-xr-x
- Out of resources: The rack has a limited set of CPU and memory resources and thus can only allocate a given number of VMs of particular flavors. If this problem occurs, the rack may be saturated. It may be that all slices are in use, or it may be that there are many old resources that can be harvested and reused. Look at the 'Cleanup Procedures' below.
- Isolation: GRAM provides no guarantees on network or CPU isolation. All resources are shared, and isolation is only as strong as what the KVM, Quantum, and OVS layers provide.
- "VM Build Error".
- The logs for the VMs that Nova/KVM tries to build are in /var/lib/nova/instances/<instance_name>/console.log. You can look at these and see what errors occurred in trying to boot the VM. If the log is empty, look in the nova-compute logs in /var/log/upstart.
- To tell what instance a VM is and where it is running, do a 'nova list --all-tenants' to find the instance id, and then do a 'nova show <instance_id>' to find the compute node and instance name:
gram@boscontroller:/usr/local/bin$ nova list --all-tenants
+--------------------------------------+------+--------+---------------------------------------------------------------------+
| ID                                   | Name | Status | Networks                                                            |
+--------------------------------------+------+--------+---------------------------------------------------------------------+
| 01782225-00b1-4ab8-bba3-c7452833b8c2 | VM-1 | ACTIVE | cntrlNet-marilac:SPOON+slice+SPORK=10.10.108.100; lan0=10.0.109.100 |
| 1aa8ba40-63a2-4a58-b533-faf18c674b77 | VM-2 | ACTIVE | cntrlNet-marilac:SPOON+slice+SPORK=10.10.108.101; lan0=10.0.109.101 |
+--------------------------------------+------+--------+---------------------------------------------------------------------+
gram@boscontroller:/usr/local/bin$ nova show 01782225-00b1-4ab8-bba3-c7452833b8c2
+--------------------------------------------+----------------------------------------------------------+
| Property                                   | Value                                                    |
+--------------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig                          | MANUAL                                                   |
| OS-EXT-SRV-ATTR:host                       | boscompute4                                              |
| OS-EXT-SRV-ATTR:hypervisor_hostname        | boscompute4 # This is the VM host                        |
| OS-EXT-SRV-ATTR:instance_name              | instance-0000000f # This is the VM name                  |
| OS-EXT-STS:power_state                     | 1                                                        |
| OS-EXT-STS:task_state                      | None                                                     |
| OS-EXT-STS:vm_state                        | active                                                   |
| accessIPv4                                 |                                                          |
| accessIPv6                                 |                                                          |
| cntrlNet-marilac:SPOON+slice+SPORK network | 10.10.108.100                                            |
| config_drive                               |                                                          |
| created                                    | 2013-04-19T13:29:56Z                                     |
| flavor                                     | m1.small (2)                                             |
| hostId                                     | b590c0756658c24a3aea56372b5c71d2649f16fabad174ee796f40d0 |
| id                                         | 01782225-00b1-4ab8-bba3-c7452833b8c2                     |
| image                                      | ubuntu-12.04 (93779c42-a5d7-4144-ac78-4a597c74a92a)      |
| key_name                                   | None                                                     |
| lan0 network                               | 10.0.109.100                                             |
| metadata                                   | {}                                                       |
| name                                       | VM-1                                                     |
| progress                                   | 0                                                        |
| security_groups                            | [{u'name': u'marilac:SPOON+slice+SPORK_secgrp'}]         |
| status                                     | ACTIVE                                                   |
| tenant_id                                  | 88e8b222da0349528e5864ba60220cfa                         |
| updated                                    | 2013-04-19T13:30:11Z                                     |
| user_id                                    | 77c695dfaf2640a38db0352f4a771828                         |
+--------------------------------------------+----------------------------------------------------------+
gram@boscontroller:/usr/local/bin$ ssh boscompute4
gram@boscompute4:~$ ls -l /var/lib/nova/instances/instance-0000000f/console.log
-rw-rw---- 1 libvirt-qemu kvm 0 Apr 19 09:30 /var/lib/nova/instances/instance-0000000f/console.log
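The lookup from instance id to console log location can be scripted by parsing the two-column table that nova show prints. The following Python sketch assumes that table layout and the /var/lib/nova/instances/<instance_name>/console.log path given above. The function names are illustrative, and a value that itself contains a "|" character would not be parsed correctly.

```python
def parse_nova_table(table_text):
    """Turn the Property/Value table from 'nova show' into a dict."""
    props = {}
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines and shell prompts
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) == 2 and cells[0] != "Property":
            props[cells[0]] = cells[1]
    return props


def console_log_location(props):
    """Return (compute_host, console_log_path) for a parsed instance."""
    host = props["OS-EXT-SRV-ATTR:host"]
    name = props["OS-EXT-SRV-ATTR:instance_name"]
    return host, "/var/lib/nova/instances/%s/console.log" % name


# Trimmed from the example in this guide, without the inline annotations.
SAMPLE = """\
+-------------------------------+-------------------+
| Property                      | Value             |
+-------------------------------+-------------------+
| OS-EXT-SRV-ATTR:host          | boscompute4       |
| OS-EXT-SRV-ATTR:instance_name | instance-0000000f |
| status                        | ACTIVE            |
+-------------------------------+-------------------+
"""

if __name__ == "__main__":
    print(console_log_location(parse_nova_table(SAMPLE)))
```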
Adding a Flavor
Use the flavor-create subcommand of nova: nova flavor-create <name> <ID> <RAM in MB> <disk space in GB> <num virtual cores>
nova flavor-create m1.super 7 32768 160 16
nova flavor-list
+--------------------------------------+------------+-----------+------+-----------+------+-------+-------------+-----------+-------------+
| ID                                   | Name       | Memory_MB | Disk | Ephemeral | Swap | VCPUs | RXTX_Factor | Is_Public | extra_specs |
+--------------------------------------+------------+-----------+------+-----------+------+-------+-------------+-----------+-------------+
| 1                                    | m1.tiny    | 512       | 0    | 0         |      | 1     | 1.0         | True      | {}          |
| 2                                    | m1.small   | 2048      | 20   | 0         |      | 1     | 1.0         | True      | {}          |
| 3                                    | m1.medium  | 4096      | 40   | 0         |      | 2     | 1.0         | True      | {}          |
| 4                                    | m1.large   | 8192      | 80   | 0         |      | 4     | 1.0         | True      | {}          |
| 6                                    | default-vm | 2048      | 20   | 0         |      | 1     | 1.0         | True      | {}          |
| 7                                    | m1.super   | 32768     | 160  | 0         |      | 16    | 1.0         | True      | {}          |
| 76b049db-7f84-4fa0-8202-a31432af34d7 | m1.xlarge  | 16384     | 160  | 0         |      | 8     | 1.0         | True      | {}          |
+--------------------------------------+------------+-----------+------+-----------+------+-------+-------------+-----------+-------------+
Managing Quotas
OpenStack enforces quotas at the slice (project) level and at the server level.
To view the default quotas, use the following command:
source /etc/novarc
nova quota-defaults
To update the default quota, use the following command:
nova quota-class-update --key value default
See here for more details: http://docs.openstack.org/user-guide-admin/content/cli_set_quotas.html
There is also a notion of absolute limits on quota. The default quota cannot exceed the absolute limits. To view the absolute limit quota, use the following command:
nova absolute-limits
To change the value of the absolute-limits you must edit /etc/nova/nova.conf on the control node and restart the nova-api service. For example, you can add these lines to set the number of cores and RAM:
quota_ram=512000
quota_cores=150
Further, there is a notion of overcommitting in OpenStack and this too has a limit. See here for details: http://docs.openstack.org/openstack-ops/content/compute_nodes.html
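To make the effect of overcommitting concrete, the sketch below estimates how many instances of a flavor fit on one compute node. It is a rough Python illustration, assuming the scheduler multiplies physical cores and RAM by nova.conf's cpu_allocation_ratio and ram_allocation_ratio. The default values 16.0 and 1.5 used here are the OpenStack defaults of this era as far as we know, so check the rack's own nova.conf. The host size in the example is hypothetical, and memory reserved for the host itself is ignored.

```python
def max_instances(host_cores, host_ram_mb, flavor_vcpus, flavor_ram_mb,
                  cpu_allocation_ratio=16.0, ram_allocation_ratio=1.5):
    """Estimate how many instances of one flavor fit on a compute node."""
    # Virtual capacity is physical capacity scaled by the overcommit ratio.
    by_cpu = int(host_cores * cpu_allocation_ratio) // flavor_vcpus
    by_ram = int(host_ram_mb * ram_allocation_ratio) // flavor_ram_mb
    # Whichever resource runs out first sets the limit.
    return min(by_cpu, by_ram)


if __name__ == "__main__":
    # Hypothetical node: 16 physical cores, 64 GB of RAM.
    print(max_instances(16, 65536, 1, 2048))     # m1.small from this guide
    print(max_instances(16, 65536, 16, 32768))   # m1.super from this guide
```

In both example calls RAM is the limiting resource, which is why large-memory flavors exhaust a node long before its virtual core count is reached.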
Renewing Keystone Certs
The keystone certificates are used to authenticate the various OpenStack services. They are valid for 1 year, after which new certificates should be generated. Instructions are here: http://groups.geni.net/geni/wiki/GENIRacksHome/OpenGENIRacks/RenewKeystoneKeys If you are seeing authentication errors from the various OpenStack services, expired certificates might be the problem.
Cleanup Procedures
- Cleanup.py. This is a script that cleans up all OpenStack resources associated with a given slice (by URN or tenant ID).
- Manual Cleanup. If there is a slice or resource that needs to be deleted from OpenStack, here's how:
- keystone user-list
- keystone user-delete <user_id from above>
- keystone tenant-list
- keystone tenant-delete <tenant_id from above>
- nova list --all-tenants
- nova delete <instance_id from above>
- quantum net-list
- quantum net-delete <net_id from above> NOTE: Be careful not to delete the public network
- Then restart the gram servers:
sudo service gram-am restart
sudo service gram-amv2 restart