Diagnosing TripleO Failures Redux

Datetime: 2016-08-23 00:11:29

Steven Hardy has provided an invaluable reference with his troubleshooting blog post. However, I recently hit a problem that didn’t quite match what he showed. Zane Bitter got me oriented.

Upon a redeploy, I got a failure.

$ openstack stack list
+--------------------------------------+------------+---------------+---------------------+---------------------+
| ID                                   | Stack Name | Stack Status  | Creation Time       | Updated Time        |
+--------------------------------------+------------+---------------+---------------------+---------------------+
| 816c67ab-d360-4f9b-8811-ed2a346dde01 | overcloud  | UPDATE_FAILED | 2016-08-16T13:38:46 | 2016-08-16T14:41:54 |
+--------------------------------------+------------+---------------+---------------------+---------------------+

Listing the failed resources:

$ heat resource-list --nested-depth 5 overcloud | grep FAILED
| ControllerNodesPostDeployment                 | 7ae99682-597f-4562-9e58-4acffaf7aaac          | OS::TripleO::ControllerPostDeployment                                           | UPDATE_FAILED   | 2016-08-16T14:44:42 | overcloud
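When the grep returns many rows, it can help to reduce the table to just the stack and resource names. The following is a minimal sketch, not part of the original workflow: the function name failed_resources is hypothetical, and it assumes the column order seen in the row above (resource name, physical ID, type, status, updated time, stack name).

```shell
# Print "<stack_name> <resource_name>" for each FAILED row of
# `heat resource-list --nested-depth N <stack>` table output read on stdin.
# Assumption: columns are name, physical id, type, status, updated time,
# stack name, as in the row shown above.
failed_resources() {
    awk -F'|' '$5 ~ /FAILED/ {
        gsub(/ /, "", $2)   # strip table padding from the resource name
        gsub(/ /, "", $7)   # strip table padding from the stack name
        print $7, $2
    }'
}
```

It would be used as `heat resource-list --nested-depth 5 overcloud | failed_resources`, and each printed pair is in the argument order that heat resource-show expects.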

The list shows no failed deployment, only the ControllerNodesPostDeployment resource itself. To display the error, show that resource on the overcloud stack:

$ heat resource-show overcloud ControllerNodesPostDeployment
+------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Property               | Value                                                                                                                                                               |
+------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| attributes             | {}                                                                                                                                                                  |
| creation_time          | 2016-08-16T13:38:46                                                                                                                                                 |
| description            |                                                                                                                                                                     |
| links                  | http://192.0.2.1:8004/v1/7ec16202298c41f696b3f326790bebd3/stacks/overcloud/816c67ab-d360-4f9b-8811-ed2a346dde01/resources/ControllerNodesPostDeployment (self)      |
|                        | http://192.0.2.1:8004/v1/7ec16202298c41f696b3f326790bebd3/stacks/overcloud/816c67ab-d360-4f9b-8811-ed2a346dde01 (stack)                                             |
|                        | http://192.0.2.1:8004/v1/7ec16202298c41f696b3f326790bebd3/stacks/overcloud-ControllerNodesPostDeployment-qelkqyung4xr/7ae99682-597f-4562-9e58-4acffaf7aaac (nested) |
| logical_resource_id    | ControllerNodesPostDeployment                                                                                                                                       |
| physical_resource_id   | 7ae99682-597f-4562-9e58-4acffaf7aaac                                                                                                                                |
| required_by            | BlockStorageNodesPostDeployment                                                                                                                                     |
|                        | CephStorageNodesPostDeployment                                                                                                                                      |
| resource_name          | ControllerNodesPostDeployment                                                                                                                                       |
| resource_status        | UPDATE_FAILED                                                                                                                                                       |
| resource_status_reason | Engine went down during resource UPDATE                                                                                                                             |
| resource_type          | OS::TripleO::ControllerPostDeployment                                                                                                                               |
| updated_time           | 2016-08-16T14:44:42                                                                                                                                                 |
+------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Note the resource_status_reason:

Engine went down during resource UPDATE
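The resource-show table is wide, so the status reason is easy to miss. As a rough sketch, a small filter can pull out just that row. The function name status_reason is hypothetical, and the filter assumes the value itself contains no `|` character.

```shell
# Extract the resource_status_reason value from `heat resource-show`
# table output read on stdin.
# Assumption: the value does not contain a "|" character.
status_reason() {
    awk -F'|' '$2 ~ /resource_status_reason/ {
        gsub(/^ +| +$/, "", $3)   # trim the table padding around the value
        print $3
    }'
}
```

For example: `heat resource-show overcloud ControllerNodesPostDeployment | status_reason`.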

The journal on the undercloud shows why the engine went down; the kernel's OOM killer terminated heat-engine:

Aug 16 15:16:15 undercloud kernel: Out of memory: Kill process 17127 (heat-engine) score 60 or sacrifice child
Aug 16 15:16:15 undercloud kernel: Killed process 17127 (heat-engine) total-vm:834052kB, anon-rss:480936kB, file-rss:1384kB
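To check quickly whether the OOM killer has struck, the kernel messages can be filtered for the "Killed process" lines shown above. This is a sketch under assumptions: the function name oom_victims is hypothetical, and the pattern relies on the message format seen in these two log lines, which may differ on other kernel versions.

```shell
# Print the unique names of processes the kernel OOM killer has killed,
# given journal text on stdin.
# Assumption: messages look like "Killed process <pid> (<name>) ...".
oom_victims() {
    sed -n 's/.*Killed process [0-9]* (\([^)]*\)).*/\1/p' | sort -u
}

# Usage on a systemd host (-k limits output to kernel messages):
#   sudo journalctl -k --no-pager | oom_victims
```

If heat-engine appears in the output, the undercloud ran out of memory while Heat was working.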

Just like Brody said, we are going to need a bigger boat.