We will be attempting to use TripleO to deploy both the OpenStack environment and the baremetal OpenShift environment.
OpenStack Deployment
We will deploy with 3 Controllers + 2 Neutron Networker nodes; this allows us to distribute the control-plane load across additional nodes, and additionally to move L3 traffic onto the dedicated Networkers.
Note: We have never deployed in this fashion within the PerfScale DFG - this is new to Newton.
We plan on deploying Ceph through Director/OOO as well.
Baremetal Deployment
We plan on using Director/OOO to deploy the RHEL 7.3 hosts for baremetal OpenShift.
Note: Again, we have never used Director/OOO to deploy just baremetal nodes - this is new to Newton.
My very crude test to time the cleaning. We need to get better insight into cleaning (I have an RFE on this).
Cleaning a 10-disk machine took 2:22:
[stack@undercloud ~]$ while true ; do date ; ironic node-list | grep 2d35caec-0452-49ef-b382-a0a1b5c553f1; sleep 1; done
Fri Mar 10 12:46:08 EST 2017
| 2d35caec-0452-49ef-b382-a0a1b5c553f1 | None | None | power on | clean wait | False |
...
Fri Mar 10 12:48:30 EST 2017
| 2d35caec-0452-49ef-b382-a0a1b5c553f1 | None | None | power off | available | False |
ironic cleaning on large import
For 181 nodes, two nodes failed to clean when everything switched from manageable -> available.
Started Mistral Workflow. Execution ID: 7982326b-1c9e-4457-a03f-0667af66b4ae
Successfully set all nodes to available.
real 32m46.525s
user 0m0.778s
sys 0m0.088s
The Ceph templates were configured for HCI and were a direct copy and paste from Sai's HCI deployment, minus the storage-environmental.yaml, which was updated with the right OSD/journal information.
Even if we provide disk hints to use sdX, CNCF still installs a boot loader on sda, which causes inconsistent boots in our environment. The best solution is to simply reuse sda (even though it is an SSD) as the system disk to avoid this.
Ironic node cleanup
Node stuck in clean wait:
| 96948bd4-5bc2-4e2c-aabc-af1ab3356cb0 | None | None | power on | clean wait | True |
Networker kernel Panic
The kernel panic seems to be due to an update to the kernel that was done outside of OOO (by mistake).
Deployment wedged
Since the baremetal nodes are currently in use, and the installation of OpenShift nuked two of my controllers (see the section about the kernel panic), I needed to scale the OpenStack side of the house to zero to rebuild the cluster. This has never been done before.
This will populate /etc/hosts. In my case, running it manually would work just fine; however, there was another operation overwriting /etc/hosts with what we see in [2], causing Pacemaker to fail.
Run the 51-hosts script manually, then set chattr +i on /etc/hosts (currently trying this) - this failed. See the section on "Deployment (After total chaos)".
[root@overcloud-openstack-controller-0 heat-admin]# ceph -s
cluster 04526bc4-0504-11e7-8885-f8bc121144a0
health HEALTH_ERR
1 pgs are stuck inactive for more than 300 seconds
2 pgs backfill_wait
15 pgs degraded
1 pgs recovering
12 pgs recovery_wait
1 pgs stuck inactive
14 pgs stuck unclean
3 pgs undersized
42 requests are blocked > 32 sec
recovery 156889/60263413 objects degraded (0.260%)
recovery 311984/60263413 objects misplaced (0.518%)
recovery 1/20061806 unfound (0.000%)
too many PGs per OSD (320 > max 300)
pool metrics has many more objects per pg than average (too few pgs?)
monmap e1: 3 mons at {overcloud-openstack-controller-0=172.18.0.18:6789/0,overcloud-openstack-controller-1=172.18.0.16:6789/0,overcloud-openstack-controller-2=172.18.0.14:6789/0}
election epoch 6, quorum 0,1,2 overcloud-openstack-controller-2,overcloud-openstack-controller-1,overcloud-openstack-controller-0
osdmap e3761: 90 osds: 87 up, 87 in; 3 remapped pgs
flags sortbitwise
pgmap v558382: 9280 pgs, 6 pools, 818 GB data, 19591 kobjects
2749 GB used, 155 TB / 158 TB avail
156889/60263413 objects degraded (0.260%)
311984/60263413 objects misplaced (0.518%)
1/20061806 unfound (0.000%)
9265 active+clean
12 active+recovery_wait+degraded
2 active+undersized+degraded+remapped+wait_backfill
1 recovering+undersized+degraded+remapped+peered
client io 0 B/s rd, 46767 B/s wr, 5 op/s rd, 5 op/s wr
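The "too many PGs per OSD (320 > max 300)" warning above is consistent with the cluster numbers shown: 9280 PGs spread over 87 up OSDs. A quick sanity check, assuming a replica size of 3 (the `ceph -s` output does not show the pool sizes, so that factor is an assumption):

```shell
# PG replicas per OSD ~= total_pgs * replica_size / up_osds
# replica_size=3 is assumed; check the actual pool sizes with `ceph osd pool ls detail`
total_pgs=9280
replica_size=3
up_osds=87
echo $(( total_pgs * replica_size / up_osds ))   # → 320
```

That lands exactly on the 320 Ceph is complaining about, so the warning is a PG-count sizing issue rather than a symptom of the degraded/unfound objects.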
We are disabling Ceilometer and Gnocchi in hopes of getting past this issue. By disabling, we mean turning the services off and then removing the services from the roles data for the controllers.
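One way to remove them is an extra environment file that maps each telemetry service to OS::Heat::None; this is a sketch assuming the Newton-era composable-service names, so verify the names against the roles_data.yaml in your tripleo-heat-templates before using it:

```yaml
# disable-telemetry.yaml (sketch; service names are assumptions to verify)
resource_registry:
  OS::TripleO::Services::CeilometerApi: OS::Heat::None
  OS::TripleO::Services::CeilometerCollector: OS::Heat::None
  OS::TripleO::Services::CeilometerAgentCentral: OS::Heat::None
  OS::TripleO::Services::CeilometerAgentNotification: OS::Heat::None
  OS::TripleO::Services::CeilometerExpirer: OS::Heat::None
  OS::TripleO::Services::GnocchiApi: OS::Heat::None
  OS::TripleO::Services::GnocchiMetricd: OS::Heat::None
  OS::TripleO::Services::GnocchiStatsd: OS::Heat::None
```

Passed with `-e` on the deploy command, this drops the services from the generated configuration rather than just stopping them on the nodes.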
CNCF provided a spreadsheet, which we converted to CSV; we then captured em1's MAC address with:
$ ansible -o -i inv.35 test -a "cat /sys/class/net/em1/bonding_slave/perm_hwaddr"
We then mapped the IPMI addresses to the MAC addresses captured above.
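The mapping can be done with a plain `join` on hostname. The file names, sample rows, and column layout below are illustrative assumptions, not the exact data we had:

```shell
# Hypothetical sample data: the real ipmi.csv came from the CNCF
# spreadsheet, and macs.csv from the ansible command above.
printf 'node1,10.1.1.5\nnode2,10.1.1.6\n' > ipmi.csv
printf 'node1,aa:bb:cc:dd:ee:01\nnode2,aa:bb:cc:dd:ee:02\n' > macs.csv

# join(1) needs both inputs sorted on the key (hostname, column 1)
sort -t, -k1,1 ipmi.csv > ipmi.sorted
sort -t, -k1,1 macs.csv > macs.sorted

# Emit ipmi_address,em1_mac pairs keyed by hostname
join -t, -j1 -o 1.2,2.2 ipmi.sorted macs.sorted
# → 10.1.1.5,aa:bb:cc:dd:ee:01
# → 10.1.1.6,aa:bb:cc:dd:ee:02
```

The resulting pairs are what instackenv.json needs: each node's pm_addr plus the MAC Ironic should PXE-boot.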
CNCF provided us with OPERATOR accounts via IPMI, but OpenStack Director/OOO assumes the account will be ADMINISTRATOR; we needed to update the instackenv.json to reflect this:
Since we will be using Director for the entire deployment, we need to map out all of our hosts. I created mapping files that contain the CSV data for each node type:
[stack@undercloud nodes]$ ls -tlarh
total 24K
-rw-rw-r--. 1 stack stack 879 Feb 27 13:48 ceph-storage
-rw-rw-r--. 1 stack stack 273 Feb 27 13:48 compute
-rw-rw-r--. 1 stack stack 524 Feb 27 14:23 control
-rw-rw-r--. 1 stack stack 7.6K Feb 27 14:40 baremetal
drwx------. 13 stack stack 4.0K Feb 27 14:40 ..
drwxrwxr-x. 2 stack stack 69 Feb 27 14:40 .
[stack@undercloud nodes]$
Then I constructed a loop to iterate through the nodes, determine each one's IPMI address, and map it to the right profile.
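The lookup at the heart of that loop can be sketched as a small shell function; this is a reconstruction, not the exact script, and it assumes each mapping file from the listing above holds one CSV line per node containing that node's IPMI address:

```shell
# Return the role whose mapping file mentions the given IPMI address.
# File names match the listing above; CSV layout is an assumption.
profile_for() {
  local ipmi_addr=$1
  local role
  for role in control compute ceph-storage baremetal; do
    if grep -qF -- "$ipmi_addr" "$role" 2>/dev/null; then
      echo "$role"
      return 0
    fi
  done
  echo "unknown"
  return 1
}
```

Each node can then be tagged with the standard TripleO capabilities syntax, e.g. `openstack baremetal node set $uuid --property capabilities="profile:$(profile_for $ipmi_addr),boot_option:local"`, where `$uuid` is the node's Ironic UUID.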
Hosts with multiple disks can come with pre-installed OSes; to avoid multiple boot loaders, we need to clean the hosts. The configuration below will make Ironic clean a host when it moves back into the available state.
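A sketch of that configuration, hedged because the exact knob depends on the release in play:

```ini
# undercloud.conf -- assumption: releases that expose this flag enable
# automated cleaning through it; on releases without it, the equivalent
# is automated_clean = true under [conductor] in the undercloud's ironic.conf
[DEFAULT]
clean_nodes = true
```

With cleaning enabled, every node passes through the clean wait state on its way back to available, which is exactly the transition timed in the "2:22 per 10-disk machine" measurement above.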