We will be attempting to use TripleO to deploy both the OpenStack environment and the baremetal OpenShift environment.
OpenStack Deployment
We will deploy with 3 Controllers + 2 Neutron Networker nodes; this allows us to distribute the control-plane load across additional nodes, and additionally to move L3 traffic onto the dedicated Networkers.
Note: We have never deployed in this fashion within the PerfScale DFG - this is new to Newton.
We plan on deploying Ceph through Director/OOO as well.
Baremetal Deployment
We plan on using Director/OOO to deploy the RHEL 7.3 hosts for baremetal OpenShift.
Note: Again, we have never used Director/OOO to deploy just baremetal nodes - this is new to Newton.
My very crude test to time the cleaning. We need to get better insight into cleaning (I have an RFE on this).
Cleaning a 10-disk machine took 2:22:
[stack@undercloud ~]$ while true ; do date ; ironic node-list | grep 2d35caec-0452-49ef-b382-a0a1b5c553f1; sleep 1; done
Fri Mar 10 12:46:08 EST 2017
| 2d35caec-0452-49ef-b382-a0a1b5c553f1 | None | None | power on | clean wait | False |
...
Fri Mar 10 12:48:30 EST 2017
| 2d35caec-0452-49ef-b382-a0a1b5c553f1 | None | None | power off | available | False |
ironic cleaning on large import
For 181 nodes, two nodes failed to clean when everything switched from manageable -> available.
Started Mistral Workflow. Execution ID: 7982326b-1c9e-4457-a03f-0667af66b4ae
Successfully set all nodes to available.
real 32m46.525s
user 0m0.778s
sys 0m0.088s
The Ceph templates were configured for HCI and were a direct copy and paste from Sai's HCI deployment, minus the storage-environmental.yaml, which was updated with the right OSD/journal information.
Even if we provide disk hints to use sdX, CNCF still installs a boot loader on sda, which causes inconsistent boots in our environment. The best solution is to simply reuse sda (even though it is an SSD) as the system disk to avoid this.
Ironic node cleanup
Node stuck in clean wait:
| 96948bd4-5bc2-4e2c-aabc-af1ab3356cb0 | None | None | power on | clean wait | True |
Networker kernel Panic
The kernel panic seems to be due to an update to the kernel that was done outside of OOO (by mistake).
Deployment wedged
Since the baremetal nodes are currently in use, and the installation of OpenShift nuked two of my controllers (see the section about the kernel panic), I needed to scale the OpenStack side of the house to zero to rebuild the cluster. This has never been done before.
This will populate /etc/hosts. In my case, running it manually would work just fine; however, there was another operation overwriting /etc/hosts with what we see in [2], causing Pacemaker to fail.
Run the 51-hosts script manually, then set chattr +i on /etc/hosts (currently trying this) - this failed. See the section on "Deployment (After total chaos)".
[root@overcloud-openstack-controller-0 heat-admin]# ceph -s
cluster 04526bc4-0504-11e7-8885-f8bc121144a0
health HEALTH_ERR
1 pgs are stuck inactive for more than 300 seconds
2 pgs backfill_wait
15 pgs degraded
1 pgs recovering
12 pgs recovery_wait
1 pgs stuck inactive
14 pgs stuck unclean
3 pgs undersized
42 requests are blocked > 32 sec
recovery 156889/60263413 objects degraded (0.260%)
recovery 311984/60263413 objects misplaced (0.518%)
recovery 1/20061806 unfound (0.000%)
too many PGs per OSD (320 > max 300)
pool metrics has many more objects per pg than average (too few pgs?)
monmap e1: 3 mons at {overcloud-openstack-controller-0=172.18.0.18:6789/0,overcloud-openstack-controller-1=172.18.0.16:6789/0,overcloud-openstack-controller-2=172.18.0.14:6789/0}
election epoch 6, quorum 0,1,2 overcloud-openstack-controller-2,overcloud-openstack-controller-1,overcloud-openstack-controller-0
osdmap e3761: 90 osds: 87 up, 87 in; 3 remapped pgs
flags sortbitwise
pgmap v558382: 9280 pgs, 6 pools, 818 GB data, 19591 kobjects
2749 GB used, 155 TB / 158 TB avail
156889/60263413 objects degraded (0.260%)
311984/60263413 objects misplaced (0.518%)
1/20061806 unfound (0.000%)
9265 active+clean
12 active+recovery_wait+degraded
2 active+undersized+degraded+remapped+wait_backfill
1 recovering+undersized+degraded+remapped+peered
client io 0 B/s rd, 46767 B/s wr, 5 op/s rd, 5 op/s wr
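The "too many PGs per OSD (320 > max 300)" warning above is consistent with the cluster numbers shown: 9280 PGs spread over 87 up OSDs. A quick sanity check, assuming a replica size of 3 (the `ceph -s` output does not show the pool sizes, so that factor is an assumption):

```shell
# PG replicas per OSD ~= total_pgs * replica_size / up_osds
# replica_size=3 is assumed; check the actual pool sizes with `ceph osd pool ls detail`
total_pgs=9280
replica_size=3
up_osds=87
echo $(( total_pgs * replica_size / up_osds ))   # → 320
```

That lands exactly on the 320 Ceph is complaining about, so the warning is a PG-count sizing issue rather than a symptom of the degraded/unfound objects.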
We are disabling Ceilometer and Gnocchi in hopes of getting past this issue. By disabling, we mean turning the services off and then removing the services from the roles data for the controllers.
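One way to remove them is an extra environment file that maps each telemetry service to OS::Heat::None; this is a sketch assuming the Newton-era composable-service names, so verify the names against the roles_data.yaml in your tripleo-heat-templates before using it:

```yaml
# disable-telemetry.yaml (sketch; service names are assumptions to verify)
resource_registry:
  OS::TripleO::Services::CeilometerApi: OS::Heat::None
  OS::TripleO::Services::CeilometerCollector: OS::Heat::None
  OS::TripleO::Services::CeilometerAgentCentral: OS::Heat::None
  OS::TripleO::Services::CeilometerAgentNotification: OS::Heat::None
  OS::TripleO::Services::CeilometerExpirer: OS::Heat::None
  OS::TripleO::Services::GnocchiApi: OS::Heat::None
  OS::TripleO::Services::GnocchiMetricd: OS::Heat::None
  OS::TripleO::Services::GnocchiStatsd: OS::Heat::None
```

Passed with `-e` on the deploy command, this drops the services from the generated configuration rather than just stopping them on the nodes.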
CNCF provided a spreadsheet, which we converted to CSV; we then captured em1's MAC address with:
$ ansible -o -i inv.35 test -a "cat /sys/class/net/em1/bonding_slave/perm_hwaddr"
We then mapped the IPMI addresses to the MAC addresses captured above.
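The mapping can be done with a plain `join` on hostname. The file names, sample rows, and column layout below are illustrative assumptions, not the exact data we had:

```shell
# Hypothetical sample data: the real ipmi.csv came from the CNCF
# spreadsheet, and macs.csv from the ansible command above.
printf 'node1,10.1.1.5\nnode2,10.1.1.6\n' > ipmi.csv
printf 'node1,aa:bb:cc:dd:ee:01\nnode2,aa:bb:cc:dd:ee:02\n' > macs.csv

# join(1) needs both inputs sorted on the key (hostname, column 1)
sort -t, -k1,1 ipmi.csv > ipmi.sorted
sort -t, -k1,1 macs.csv > macs.sorted

# Emit ipmi_address,em1_mac pairs keyed by hostname
join -t, -j1 -o 1.2,2.2 ipmi.sorted macs.sorted
# → 10.1.1.5,aa:bb:cc:dd:ee:01
# → 10.1.1.6,aa:bb:cc:dd:ee:02
```

The resulting pairs are what instackenv.json needs: each node's pm_addr plus the MAC Ironic should PXE-boot.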
CNCF provided us with OPERATOR accounts via IPMI, but OpenStack Director/OOO assumes the account will be ADMINISTRATOR; we needed to update the instackenv.json to reflect this:
Since we will be using Director for the entire deployment, we need to map out all of our hosts. I created mapping files that contain the CSV data for each node type:
[stack@undercloud nodes]$ ls -tlarh
total 24K
-rw-rw-r--. 1 stack stack 879 Feb 27 13:48 ceph-storage
-rw-rw-r--. 1 stack stack 273 Feb 27 13:48 compute
-rw-rw-r--. 1 stack stack 524 Feb 27 14:23 control
-rw-rw-r--. 1 stack stack 7.6K Feb 27 14:40 baremetal
drwx------. 13 stack stack 4.0K Feb 27 14:40 ..
drwxrwxr-x. 2 stack stack 69 Feb 27 14:40 .
[stack@undercloud nodes]$
Then I constructed a loop to iterate through the nodes, determine each one's IPMI address, and map it to the right profile.
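The lookup at the heart of that loop can be sketched as a small shell function; this is a reconstruction, not the exact script, and it assumes each mapping file from the listing above holds one CSV line per node containing that node's IPMI address:

```shell
# Return the role whose mapping file mentions the given IPMI address.
# File names match the listing above; CSV layout is an assumption.
profile_for() {
  local ipmi_addr=$1
  local role
  for role in control compute ceph-storage baremetal; do
    if grep -qF -- "$ipmi_addr" "$role" 2>/dev/null; then
      echo "$role"
      return 0
    fi
  done
  echo "unknown"
  return 1
}
```

Each node can then be tagged with the standard TripleO capabilities syntax, e.g. `openstack baremetal node set $uuid --property capabilities="profile:$(profile_for $ipmi_addr),boot_option:local"`, where `$uuid` is the node's Ironic UUID.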
Hosts with multiple disks can come with pre-installed OSes; to avoid multiple boot loaders, we need to clean the hosts. The configuration below will make Ironic clean a host when it moves back into the available state.
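A sketch of that configuration, hedged because the exact knob depends on the release in play:

```ini
# undercloud.conf -- assumption: releases that expose this flag enable
# automated cleaning through it; on releases without it, the equivalent
# is automated_clean = true under [conductor] in the undercloud's ironic.conf
[DEFAULT]
clean_nodes = true
```

With cleaning enabled, every node passes through the clean wait state on its way back to available, which is exactly the transition timed in the "2:22 per 10-disk machine" measurement above.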