| 1 | [[PageOutline(1-2)]] |
| 2 | |
| 3 | = OPS-001-B Adding Monitoring Sites = |
| 4 | |
| 5 | This procedure defines the steps to add a new site to the [http://genimon.uky.edu/ GENI Monitoring System]. |
| 6 | |
| 7 | A request to add a new site can be made by the GENI Rack Team or by an operations group. Adding a new site is part of creating and setting up a new rack at a site, for which GMOC has a master ticket. |
| 8 | |
| 9 | Regardless of the source of the request, a ticket must be written to track the completion of this task. The ticket must copy the issue reporter and __does not__ generate notifications to GENI users. |
| 10 | |
| 11 | = 1. Issue Reported = |
| 12 | |
| 13 | GMOC gathers technical details for the addition request including: |
| 14 | |
| 15 | * Requester Organization |
| 16 | * Requester Name |
| 17 | * Requester email |
| 18 | * New Site Name (i.e. the Aggregate Manager nickname) |
| 19 | * New Site Aggregate Manager URN |
| 20 | * New Site Aggregate Manager API URLs |
| 21 | * New Site Data Store URL |
| 22 | |
| 23 | ''Note:'' These details should be part of the new site master ticket to begin with. |
| 24 | |
| 25 | == 1.1 GENI Event Type Prioritization == |
| 26 | |
| 27 | GMOC should classify the addition of new GENI sites to the [http://genimon.uky.edu/ GENI Monitoring System] as `Normal` priority. |
| 28 | |
| 29 | == 1.2 Create Ticket == |
| 30 | |
| 31 | The GMOC ticketing system is used to capture the information above. GMOC may follow up to request additional information as the site is added. This operation results in the requester getting a ticket email. |
| 32 | |
| 33 | For the moment, the response steps involve a very short outage of the Ops Config data store, which causes at most a very minor disturbance, if it is noticeable at all. As a result, there is no need to schedule maintenance for that system. |
| 34 | This assessment may be revised in the future. |
| 35 | |
| 36 | = 2. Investigate and Identify Response = |
| 37 | |
| 38 | == 2.1 Investigate the Problem == |
| 39 | |
| 40 | * N/A |
| 41 | |
| 42 | == 2.2 Identify the Response == |
| 43 | |
| 44 | |
| 45 | The ops monitoring distributed system follows this [http://groups.geni.net/geni/wiki/OperationalMonitoring/Overview architecture]. Therefore, adding a site to monitoring primarily means adding it to the Ops Config data store. |
| 46 | |
| 47 | The instructions for accessing the ops monitoring software repository are detailed [http://trac.gpolab.bbn.com/ops-monitoring/wiki/GettingSourceCode here]. |
| 48 | |
| 49 | The party responsible for maintaining the Ops Config Data store (currently GPO) should execute the following steps. |
| 50 | |
| 51 | |
| 52 | === 2.2.1 Add a section to the JSON file config/opsconfig.json.jwc in the repository === |
| 53 | |
| 54 | An entry for the new site needs to be added to the "aggregatestores" section of the config/opsconfig.json.jwc file in the repository. |
| 55 | |
| 56 | As an example, here's such an entry for a typical production site (utah-ig here): |
| 57 | {{{ |
| 58 | { |
| 59 | "urn": "urn:publicid:IDN+utah.geniracks.net+authority+cm", |
| 60 | "amtype": "instageni", |
| 61 | "href": "https://www.utah.geniracks.net:5001/info/aggregate/utah-ig" |
| 62 | }, |
| 63 | }}} |
| 64 | |
| 65 | Here's an entry for a site that is not yet in production (i.e. the AM API URLs and nickname are not known to the portal - vt-ig here): |
| 66 | {{{ |
| 67 | { |
| 68 | "urn": "urn:publicid:IDN+instageni.arc.vt.edu+authority+cm", |
| 69 | "amtype": "instageni", |
| 70 | "href": "https://www.instageni.arc.vt.edu:5001/info/aggregate/vt-ig", |
| 71 | "amurl": "https://instageni.arc.vt.edu:12369/protogeni/xmlrpc/am", |
| 72 | "amurl1": "https://instageni.arc.vt.edu:12369/protogeni/xmlrpc/am/1.0", |
| 73 | "amurl2": "https://instageni.arc.vt.edu:12369/protogeni/xmlrpc/am/2.0", |
| 74 | "amurl3": "https://instageni.arc.vt.edu:12369/protogeni/xmlrpc/am/3.0", |
| 75 | "am_nickname": "vt-ig" |
| 76 | }, |
| 77 | }}} |
| 78 | |
| 79 | * The file should be edited and committed either on the develop branch or on a ticket branch. For the develop branch: '''git checkout develop''' and use your editor of choice. |
| 80 | * At the top of the repository, run make to create config/opsconfig.json. Copy and paste the content into http://jsonlint.com/ and make sure the JSON file passes validation. (It is very easy to forget a quote, colon or comma, or to add one too many.) |
| 81 | * Once committed, you can check the build results at http://lemongrass.gpolab.bbn.com:8080/ to make sure the commit didn't break anything. |
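
The validation step above can also be done locally before committing. The following is a minimal sketch, not part of the ops monitoring repository: the function name `check_opsconfig` is hypothetical, and it assumes that "aggregatestores" is a top-level list of objects and that the three keys shared by both example entries (urn, amtype, href) are required.

```python
import json

# Keys that appear in both example entries above. Treating them as
# required is an assumption of this sketch, not a documented schema.
REQUIRED_KEYS = ("urn", "amtype", "href")


def check_opsconfig(text):
    """Parse opsconfig JSON text and return a list of problems (empty if none)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        # Typical causes: a missing quote or colon, or a trailing comma.
        return ["invalid JSON at line %d, column %d: %s"
                % (err.lineno, err.colno, err.msg)]

    if not isinstance(config, dict):
        return ["top level of the file is not a JSON object"]

    # Assumption: "aggregatestores" is a top-level list of entries.
    stores = config.get("aggregatestores")
    if not isinstance(stores, list):
        return ['no "aggregatestores" list found']

    problems = []
    for index, entry in enumerate(stores):
        for key in REQUIRED_KEYS:
            if key not in entry:
                problems.append("aggregatestores[%d] is missing %r" % (index, key))
    return problems
```

For example, after running make, `check_opsconfig(open("config/opsconfig.json").read())` should return an empty list if the file parses and every entry carries the three keys.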
| 82 | |
| 83 | === 2.2.2 Create a minor Ops Monitoring release === |
| 84 | |
| 85 | * Checkout the master branch of the repository: '''git checkout master''' |
| 86 | * Merge the branch you used to commit the ops config changes. For example if you used the develop branch: '''git merge develop --no-ff && git push''' |
| 87 | * Create a new minor release. If the previous release was 2.0.n, create 2.0.n+1. In the top directory of the repository, do: '''cm/make_release v2.0.n+1''' |
| 88 | * When asked to enter the tag description, use: |
| 89 | {{{ |
| 90 | Ops Monitoring version 2.0.n+1 |
| 91 | - Added site X to ops config json file. |
| 92 | }}} |
| 93 | * When the release script has finished executing, two files have been built: ops-monitoring/release/ops-monitoring.v2.0.n+1.tar.gz and ops-monitoring/release/ops-monitoring.gpo-only.v2.0.n+1.tar.gz. |
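
The release numbering above can be sketched as follows. This is an illustration only: the helper names `next_minor_tag` and `release_files` are hypothetical, and the tag format is assumed to be exactly `vX.Y.Z`, as in the make_release example.

```python
import re


def next_minor_tag(previous_tag):
    """Return the tag following previous_tag, i.e. v2.0.n -> v2.0.n+1."""
    match = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", previous_tag)
    if match is None:
        raise ValueError("unexpected tag format: %r" % previous_tag)
    major, minor, patch = (int(part) for part in match.groups())
    return "v%d.%d.%d" % (major, minor, patch + 1)


def release_files(tag):
    """Return the two tarball paths the release script is expected to build."""
    return [
        "ops-monitoring/release/ops-monitoring.%s.tar.gz" % tag,
        "ops-monitoring/release/ops-monitoring.gpo-only.%s.tar.gz" % tag,
    ]
```

The last component is incremented as an integer, so a hypothetical previous release v2.0.9 is followed by v2.0.10 rather than v2.1.0.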
| 94 | |
| 95 | === 2.2.3 Install the new release on the Ops Config data store === |
| 96 | |
| 97 | Currently, the ops config data store is hosted at opsconfigdatastore.gpolab.bbn.com. |
| 98 | |
| 99 | * Copy the release files to opsconfigdatastore.gpolab.bbn.com: '''scp ops-monitoring/release/ops-monitoring.v2.0.n+1.tar.gz ops-monitoring/release/ops-monitoring.gpo-only.v2.0.n+1.tar.gz opsconfigdatastore.gpolab.bbn.com:''' |
| 100 | * Log on to the ops config data store system: '''ssh opsconfigdatastore.gpolab.bbn.com''' |
| 101 | * Untar the release: '''cd /usr/local; sudo tar xvzf ~/ops-monitoring.v2.0.n+1.tar.gz; sudo tar xvzf ~/ops-monitoring.gpo-only.v2.0.n+1.tar.gz''' |
| 102 | * Modify the ownership: '''sudo chown -R root:root /usr/local/ops-monitoring-v2.0.n+1''' |
| 103 | * Stop the apache server: '''sudo service apache2 stop''' |
| 104 | * Change the soft link: '''sudo rm /usr/local/ops-monitoring && sudo ln -s /usr/local/ops-monitoring-v2.0.n+1 /usr/local/ops-monitoring''' |
| 105 | * Restart apache: '''sudo service apache2 start''' |
| 106 | |
| 107 | === 2.2.4 Verify that the new site is monitored === |
| 108 | |
| 109 | Immediately after restarting apache on the ops config data store, the json response from the data store should include the new section. |
| 110 | |
| 111 | '''curl -k --cert <collector cert> https://opsconfigdatastore.gpolab.bbn.com/info/opsconfig/geni-prod''' |
| 112 | |
| 113 | ''Note'': the collector cert is one of the crypto certificates issued by the Clearing House to each of the different interested parties of the Ops monitoring project. |
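
Rather than reading the response by eye, the check can be scripted. The following is a minimal sketch with a hypothetical function name, `site_is_listed`; it assumes the data store response carries the same "aggregatestores" list of entries as config/opsconfig.json.

```python
import json


def site_is_listed(response_text, urn):
    """Return True if an aggregate store entry with the given URN is present."""
    config = json.loads(response_text)
    # Assumption: the response mirrors the layout of config/opsconfig.json,
    # with a top-level "aggregatestores" list of objects.
    stores = config.get("aggregatestores", [])
    return any(entry.get("urn") == urn for entry in stores)
```

One way to use it is to save the curl output to a file and pass the file content together with the New Site Aggregate Manager URN recorded in the ticket.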
| 114 | |
| 115 | |
| 116 | Within an hour or so of the new minor release being installed on the ops config data store, the [http://genimon.uky.edu/ GENI Monitoring System] should list the new aggregate. |
| 117 | |
| 118 | = 3. GMOC Response = |
| 119 | |
| 120 | The GMOC implements the actions outlined and updates the ticket to capture the actions taken. |
| 121 | |
| 122 | == 3.1 Implement Response == |
| 123 | |
| 124 | The GMOC executes the steps outlined. If GMOC finds the procedure to be lacking, steps should be taken to get it updated. |
| 125 | |
| 126 | == 3.2 Procedure Updates == |
| 127 | |
| 128 | When instructions in a procedure are found to miss symptoms, required actions, or potential impact, then action must be taken by the GMOC to provide feedback to enhance the procedure for future use. |
| 129 | |
| 130 | = 4. Resolution = |
| 131 | |
| 132 | GMOC verifies the site has been added by checking the [http://genimon.uky.edu/ GENI Monitoring System] list of aggregates. |
| 133 | |
| 134 | == 4.1 Document Resolution and Close Ticket == |
| 135 | |
| 136 | GMOC captures how the site was added in the ticket and closes the ticket. This should result in a notification back to the issue reporter. |