= GRAM Administration Guide =

[[PageOutline(2-100,Table of Contents,inline,unnumbered)]]

== Introduction ==

This document describes procedures and provides context for administrators of GRAM-controlled racks. We provide details on:

 * '''Software maintenance''': Upgrading GRAM software to new releases
 * '''Troubleshooting''': Diagnosing and investigating unexpected GRAM behavior

== Software Maintenance ==

Software installation comes in the form of a new GRAM .deb file, provided directly by the GENI Program Office or via a public URL. Separate .deb files are provided for control and compute nodes (e.g. gram_control.deb and gram_compute.deb).

These can be installed using the dpkg utility (e.g. dpkg -i <deb file>) or using gdebi (e.g. gdebi <deb file>).
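
For example, an upgrade on a control node might look like the following (a sketch; the actual .deb filename will vary by release):
{{{
# install or upgrade the control-node package
sudo dpkg -i gram_control.deb

# or use gdebi, which also resolves package dependencies
sudo gdebi gram_control.deb
}}}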

=== Software Versions ===

The version of the GRAM install can be determined as follows:
 * dpkg --info gram_control.deb (or gram_compute.deb) will report the version of that .deb release
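
For example (a sketch; dpkg --info prints the package control information, including the Version field):
{{{
$ dpkg --info gram_control.deb | grep Version
 Version: <version string>
}}}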

== Troubleshooting ==

=== Connectivity Problems ===
 * Make sure you can ping on both the external network and the control network (example ping and tcpdump invocations are sketched after the OVS listings below).
 * Trunking: make sure the ports on the router and switch allow VLAN-tagged traffic to flow.
 * tcpdump and wireshark are very useful tools to see where traffic is being generated and where it is getting stuck. Be careful about interfaces. Data plane traffic is what the VMs speak (except for SSH traffic into them), and it goes through the !OpenFlow-controlled switch. Control traffic for !OpenStack occurs on the control plane interfaces, while SSH traffic into the VMs occurs on the management plane interfaces. These both flow through the non-!OpenFlow-controlled switch.
 * !OpenFlow control applies only to traffic flowing through the switch. Collocated VMs cannot be !OpenFlow controlled. We try to place VMs on different nodes but make no guarantees.
 * It often helps to reboot VMs (sudo nova reboot <instance_id>) to see the traffic going from the VM back to the control node while tracing in tcpdump or wireshark.
 * Make sure the controller is up and running, and connected to the switch.
 * Check that the controller registered for the slice matches the controller that is actually running.
  * echo 'dump' | nc localhost 7001 queries the VMOC management port and dumps its current state.
 * The data plane interface may not be up (even if the quantum interfaces to that interface are up). If ''ifconfig eth1'' shows the interface is not up (assuming eth1 is the data plane interface), then run ''sudo ip link set eth1 up''.
 * Check that the OVS configuration is appropriate on the Control and Compute nodes:

  * Verify the OVS configuration on the Controller node:
      * There should be a qg port on br-ex for each external network

{{{
$ sudo ovs-vsctl show
107352c3-a0bb-4598-a3a3-776c5da0b62b
    Bridge "br-eth1"
        Port "phy-br-eth1"
            Interface "phy-br-eth1"
        Port "eth1"
            Interface "eth1"
        Port "br-eth1"
            Interface "br-eth1"
                type: internal
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
        Port "eth2"
            Interface "eth2"
        Port "qg-9816149f-9c"
            Interface "qg-9816149f-9c"
                type: internal
    Bridge br-int
        Port "int-br-eth1"
            Interface "int-br-eth1"
        Port br-int
            Interface br-int
                type: internal
    ovs_version: "1.4.0+build0"
}}}

  * Verify the OVS configuration on the Compute nodes:
      * The following assumes no VMs are running on the compute node
{{{
$ sudo ovs-vsctl show
4ec3588c-5c8f-4d7f-8626-49909e0e4e02
    Bridge br-int
        Port br-int
            Interface br-int
                type: internal
        Port "int-br-eth1"
            Interface "int-br-eth1"
    Bridge "br-eth1"
        Port "phy-br-eth1"
            Interface "phy-br-eth1"
        Port "br-eth1"
            Interface "br-eth1"
                type: internal
        Port "eth1"
            Interface "eth1"
    ovs_version: "1.4.0+build0"
}}}
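
As a starting point for the ping and tcpdump checks above, something like the following can be run on the control node (a sketch; the interface name eth1 and the addresses shown are assumptions for illustration):
{{{
# reachability checks (substitute addresses appropriate to your rack)
ping -c 3 8.8.8.8            # external network
ping -c 3 10.10.8.71         # a control-plane address on another node

# watch VLAN-tagged data plane traffic on the (assumed) data plane interface
sudo tcpdump -i eth1 -e -n vlan
}}}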

=== Service Problems ===

 * Make sure your !OpenStack environment is set. Any command you run by hand or in a script needs the variables established in /etc/novarc. Do a ''source /etc/novarc'' to be sure.
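
For example (a sketch; the exact variable names depend on what /etc/novarc defines, and the OS_* names shown are an assumption based on typical !OpenStack credential files):
{{{
source /etc/novarc
env | grep OS_        # confirm the OpenStack credential variables are now set
}}}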

 * Verify that all expected services are registered with Nova.
   * Expect to see nova-cert, nova-consoleauth, and nova-scheduler on the controller node and nova-compute on each compute node. All should have State = :-) (not XXX).
{{{
   $ sudo nova-manage service list
   Binary           Host                                 Zone             Status     State Updated_At
   nova-cert        pridevcontrol                        nova             enabled    :-)   2013-02-07 20:47:38
   nova-consoleauth pridevcontrol                        nova             enabled    :-)   2013-02-07 20:47:37
   nova-scheduler   pridevcontrol                        nova             enabled    :-)   2013-02-07 20:47:38
   nova-compute     pridevcompute1                       nova             enabled    :-)   2013-02-07 20:47:33
   nova-compute     pridevcompute2                       nova             enabled    :-)   2013-02-07 20:47:35
}}}

More generally, make sure that all !OpenStack and GRAM services are up and running by running ''service <service_name> status'' (an example loop is sketched after the lists below):

'''Control Node:'''
 * Nova (Control)
   * nova-api
   * nova-cert
   * nova-consoleauth
   * nova-novncproxy
   * nova-scheduler
 * Glance (Control)
   * glance-api
   * glance-registry
 * Keystone (Control)
   * keystone
 * Quantum (Control)
   * quantum-server
   * quantum-plugin-openvswitch-agent
   * quantum-dhcp-agent
   * quantum-l3-agent
   * openvswitch-switch
   * rabbitmq-server
 * MySQL (Control)
   * mysql
 * GRAM (Control)
   * gram-am
   * gram-amv2
   * gram-ch
   * gram-cni
   * gram-ctrl
   * gram-vmoc

'''Compute Node:'''
 * Nova (Compute)
   * nova-api-metadata
   * nova-compute
 * Quantum (Compute)
   * quantum-plugin-openvswitch-agent
   * openvswitch-switch
 * KVM
   * qemu-kvm
   * libvirt-bin
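
A quick way to walk the control-node list above is a small shell loop (a sketch; trim the service names to match the node you are checking, and use the compute-node list analogously):
{{{
for svc in nova-api nova-cert nova-consoleauth nova-novncproxy nova-scheduler \
           glance-api glance-registry keystone \
           quantum-server quantum-plugin-openvswitch-agent quantum-dhcp-agent quantum-l3-agent \
           openvswitch-switch rabbitmq-server mysql \
           gram-am gram-amv2 gram-ch gram-cni gram-ctrl gram-vmoc; do
    sudo service $svc status
done
}}}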

 * On the control node, make sure that network servers are listening on the following ports (that is, do a ''netstat -na | grep <port>'' and look for a line that says "LISTEN"; an example loop follows the list):

      * 8000: GRAM Clearinghouse (unless you are using a different clearinghouse)
      * 8001: GRAM Aggregate Manager
      * 8002: GRAM Aggregate Manager V2
      * 9000: VMOC Default Controller
      * 7001: VMOC Management
      * 6633: VMOC
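
For example, the port checks can be scripted as follows (a sketch):
{{{
for port in 8000 8001 8002 9000 7001 6633; do
    echo "== port $port =="
    netstat -na | grep ":$port " | grep LISTEN
done
}}}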

==== Starting Services on Different Ports ====

For debugging, it is often helpful to start a service on a different port in a command window.

 * VMOC
  * Stop the VMOC service: sudo service gram-vmoc stop
  * Start VMOC: /opt/pox/pox.py -- log.level --DEBUG vmoc.VMOC --management_port=7001 --default_controller_url=https://<default_controller_host>:9000
  * These URLs and ports can be changed on the command line as needed.

 * Default Controller
  * Stop the VMOC Default Controller: sudo service gram-ctrl stop
  * Start the Default Controller: /opt/pox/pox.py -- log.level --DEBUG openflow.of_01 --port=9000 vmoc.l2_simple_learning
  * This port number can be changed as needed (to match the VMOC configuration above).

 * GRAM Aggregate Manager
  * Stop the GRAM Aggregate Manager: sudo service gram-am stop
  * Start the GRAM Aggregate Manager: python /home/gram/gram/src/gram-am.py -V3 -p 8001
  * The port can be modified as needed but should match the [aggregate_manager] entry in ~gram/.gcf/gcf_config.
  * The GRAM Aggregate Manager V2 can be run (and its port modified) with: python /home/gram/gram/src/gram-am.py -V2 -p 8002

==== KVM virtualization ====
   * Verify KVM is installed and able to use hardware virtualization:
       * NOTE: kvm-ok is part of the cpu-checker package
{{{
   $ kvm -version
   QEMU emulator version 1.0 (qemu-kvm-1.0), Copyright (c) 2003-2008 Fabrice Bellard
   $ kvm-ok
   INFO: /dev/kvm exists
   KVM acceleration can be used
   $ sudo service libvirt-bin status
   libvirt-bin start/running, process 2537
}}}

==== Metadata service requirements ====
   * Nova should have set up a NAT rule for metadata services
{{{
$ sudo iptables -t nat -L
...
Chain quantum-l3-agent-PREROUTING (1 references)
target     prot opt source               destination
DNAT       tcp  --  anywhere             169.254.169.254      tcp dpt:http to:10.10.8.71:8775
...
}}}
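
To confirm the metadata path end to end, a request can be issued from inside a running VM (a sketch; 169.254.169.254 is the standard !OpenStack metadata address, as in the NAT rule above):
{{{
# run from inside a VM on the rack
curl http://169.254.169.254/latest/meta-data/
}}}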

=== Resource Allocation Problems ===

 * ''Duplicate slice names'': Two slices may not share the same URN, which is composed of the project and slice name. If a user trying to create a slice gets a 'duplicate slice name' error, they should change the name of the slice they are trying to create, or delete the old version. If the old version is not evident in !OpenStack, restart the aggregate manager. Otherwise, see the 'Cleanup Procedures' below.

 * ''SSH proxy doesn't work'' (see the check commands sketched below):
   * Make sure the NAT rule is in place on the control node: ''sudo iptables -L -t nat''
   * Make sure gram_ssh_proxy is installed on the control node in /usr/local/bin with permissions -rwsr-xr-x (setuid root)
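
For example (a sketch; the specific DNAT rules added by gram_ssh_proxy will depend on your configuration):
{{{
sudo iptables -L -t nat -n                # look for the DNAT rules that forward proxy ports to VMs
ls -l /usr/local/bin/gram_ssh_proxy       # expect permissions -rwsr-xr-x
}}}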

 * ''Out of resources'': The rack has a limited set of CPU and memory resources and thus can only allocate a given number of VMs of particular flavors. If this problem occurs, the rack may be saturated. It may be that all slices are in use, or it may be that there are many old resources that can be harvested and reused. See the 'Cleanup Procedures' below.

 * ''Isolation'': GRAM provides no guarantees on network or CPU isolation (everything is shared; isolation depends on what the KVM, Quantum, and OVS layers provide).

 * ''VM Build Error'':
   * The logs for the VMs that Nova/KVM tries to build are in /var/lib/nova/instances/<instance_name>/console.log. You can look at these to see what errors occurred while trying to boot the VM. If the log is empty, look in the nova-compute logs in /var/log/upstart.
   * To tell what instance a VM is and where it is running, do a ''nova list --all-tenants'' to find the instance ID, and then do a ''nova show <instance_id>'' to find the compute node and instance name:

{{{
gram@boscontroller:/usr/local/bin$ nova list --all-tenants
+--------------------------------------+------+--------+---------------------------------------------------------------------+
| ID                                   | Name | Status | Networks                                                            |
+--------------------------------------+------+--------+---------------------------------------------------------------------+
| 01782225-00b1-4ab8-bba3-c7452833b8c2 | VM-1 | ACTIVE | cntrlNet-marilac:SPOON+slice+SPORK=10.10.108.100; lan0=10.0.109.100 |
| 1aa8ba40-63a2-4a58-b533-faf18c674b77 | VM-2 | ACTIVE | cntrlNet-marilac:SPOON+slice+SPORK=10.10.108.101; lan0=10.0.109.101 |
+--------------------------------------+------+--------+---------------------------------------------------------------------+
gram@boscontroller:/usr/local/bin$ nova show 01782225-00b1-4ab8-bba3-c7452833b8c2
+--------------------------------------------+----------------------------------------------------------+
| Property                                   | Value                                                    |
+--------------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig                          | MANUAL                                                   |
| OS-EXT-SRV-ATTR:host                       | boscompute4                                              |
| OS-EXT-SRV-ATTR:hypervisor_hostname        | boscompute4         # This is the VM host                |
| OS-EXT-SRV-ATTR:instance_name              | instance-0000000f   # This is the VM name                |
| OS-EXT-STS:power_state                     | 1                                                        |
| OS-EXT-STS:task_state                      | None                                                     |
| OS-EXT-STS:vm_state                        | active                                                   |
| accessIPv4                                 |                                                          |
| accessIPv6                                 |                                                          |
| cntrlNet-marilac:SPOON+slice+SPORK network | 10.10.108.100                                            |
| config_drive                               |                                                          |
| created                                    | 2013-04-19T13:29:56Z                                     |
| flavor                                     | m1.small (2)                                             |
| hostId                                     | b590c0756658c24a3aea56372b5c71d2649f16fabad174ee796f40d0 |
| id                                         | 01782225-00b1-4ab8-bba3-c7452833b8c2                     |
| image                                      | ubuntu-12.04 (93779c42-a5d7-4144-ac78-4a597c74a92a)      |
| key_name                                   | None                                                     |
| lan0 network                               | 10.0.109.100                                             |
| metadata                                   | {}                                                       |
| name                                       | VM-1                                                     |
| progress                                   | 0                                                        |
| security_groups                            | [{u'name': u'marilac:SPOON+slice+SPORK_secgrp'}]         |
| status                                     | ACTIVE                                                   |
| tenant_id                                  | 88e8b222da0349528e5864ba60220cfa                         |
| updated                                    | 2013-04-19T13:30:11Z                                     |
| user_id                                    | 77c695dfaf2640a38db0352f4a771828                         |
+--------------------------------------------+----------------------------------------------------------+
gram@boscontroller:/usr/local/bin$ ssh boscompute4
gram@boscompute4:~$ ls -l /var/lib/nova/instances/instance-0000000f/console.log
-rw-rw---- 1 libvirt-qemu kvm 0 Apr 19 09:30 /var/lib/nova/instances/instance-0000000f/console.log
}}}


=== Cleanup Procedures ===

 * Cleanup.py: a script that cleans up all !OpenStack resources associated with a given slice (by URN or tenant ID).
 * Manual cleanup. If there is a slice or resource that needs to be deleted from !OpenStack, here's how:
   * keystone user-list
   * keystone user-delete <user_id from above>
   * keystone tenant-list
   * keystone tenant-delete <tenant_id from above>
   * nova list --all-tenants
   * nova delete <instance_id from above>
   * quantum net-list
   * quantum net-delete <net_id from above> ''NOTE: Be careful not to delete the public network''
   * Then restart the GRAM servers:
{{{
     sudo service gram-am restart
     sudo service gram-amv2 restart
}}}