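OpenShift Container Platform 4.13: Notes on failures during an Any Platform install

These are notes on the places where I got stuck while installing OpenShift 4.13 in my home lab. I changed the configuration slightly, but the procedure is basically the same as in the references below.

References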

https://docs.openshift.com/container-platform/4.13/installing/installing_platform_agnostic/installing-platform-agnostic.html
https://github.com/team-ohc-jp-place/OpenShift-ADP/tree/4.11/AnyPlatform
https://qiita.com/loftkun/items/03ca887ace88ef10496f
https://rheb.hatenablog.com/entry/openshift41-baremetal-upi

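There were four main problems.

Problem 1: the iPXE script had a defect

Following the references, I wrote an iPXE script and started the install. The iPXE boot menu appeared, but selecting master or worker printed "Could not boot image: No such file or directory No more network devices" and the file download never started.

The bootstrap node installed normally; for some reason only master and worker failed. Because bootstrap worked and the nginx log showed no GET requests arriving, I suspected the iPXE script and checked it for mistakes, but at first glance it looked fine. After going through it patiently, I noticed that a full-width space (U+3000) followed the :master and :worker labels.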

:master　<--- ★ a full-width space was here
kernel http://bastion01.ocplab.test:8008/rhcos-4.12.17-x86_64-live-kernel-x86_64 ip=dhcp rd.neednet=1 console=tty0 console=ttyS0 coreos.inst=yes coreos.inst.install_dev=sda initrd=rhcos-4.12.17-x86_64-live-initramfs.x86_64.img coreos.live.rootfs_url=http://bastion01.ocplab.test:8008/rhcos-4.12.17-x86_64-live-rootfs.x86_64.img coreos.inst.ignition_url=http://bastion01.ocplab.test:8008/master.ign
initrd http://bastion01.ocplab.test:8008/rhcos-4.12.17-x86_64-live-initramfs.x86_64.img
boot

:worker　<--- ★ a full-width space was here
kernel http://bastion01.ocplab.test:8008/rhcos-4.12.17-x86_64-live-kernel-x86_64 ip=dhcp rd.neednet=1 console=tty0 console=ttyS0 coreos.inst=yes coreos.inst.install_dev=sda initrd=rhcos-4.12.17-x86_64-live-initramfs.x86_64.img coreos.live.rootfs_url=http://bastion01.ocplab.test:8008/rhcos-4.12.17-x86_64-live-rootfs.x86_64.img coreos.inst.ignition_url=http://bastion01.ocplab.test:8008/worker.ign
initrd http://bastion01.ocplab.test:8008/rhcos-4.12.17-x86_64-live-initramfs.x86_64.img
boot

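After I deleted the full-width spaces, both the master and worker installs proceeded. Presumably iPXE had read the space as part of the label name, so the jump from the menu could not find :master or :worker. This is basic stuff, but it is what happens when you copy and paste carelessly, so be careful. An editor extension that makes full-width spaces visible, or one that strips stray whitespace automatically on save, should keep the same mistake from happening again.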
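
A check along these lines could catch the problem before booting. This is only a minimal sketch: the function name find_fullwidth_space is made up for illustration, and it assumes the script is saved as UTF-8, where U+3000 is the byte sequence E3 80 80.

```shell
# Hypothetical helper: print every line of a file that contains a
# full-width space (U+3000, bytes E3 80 80 in UTF-8), with line numbers.
# Exit status is 0 when at least one such line is found, 1 otherwise.
find_fullwidth_space() {
    # printf with octal escapes builds the byte pattern portably;
    # LC_ALL=C makes grep compare raw bytes.
    LC_ALL=C grep -n "$(printf '\343\200\200')" "$1"
}
```

Called as find_fullwidth_space boot.ipxe (the file name is just an example), it lists the offending lines, and no output means the script is clean.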

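Problem 2: a required port was not open in firewalld

Once the master and worker installs got going, an error was printed over and over and the install stalled. Checking the logs and a packet capture showed that port 22623 was not open in firewalld. This is the machine config server port, which master and worker nodes use to fetch their Ignition config. A beginner's mistake, but a common one. This one was no big deal.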
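
For reference, opening the port with firewall-cmd could look like the sketch below. The function name open_tcp_port is made up, and the sketch assumes that firewalld's default zone is the one facing the cluster nodes and that it runs as root.

```shell
# Hypothetical helper: permanently open a TCP port in firewalld's default
# zone, reload so the change takes effect, then confirm it is open.
# Assumes root privileges and that the default zone is the right one.
open_tcp_port() {
    port="$1"
    firewall-cmd --permanent --add-port="${port}/tcp" &&
        firewall-cmd --reload &&
        firewall-cmd --query-port="${port}/tcp"
}
```

Running open_tcp_port 22623 would cover the port that was missing here; the final query prints yes when the port is open.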

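Problem 3: duplicate IP address

PXE boot failed. This happened after several rounds of trial and error.
It had never stalled at this point before, so I was puzzled, but the cause turned out to be that dhcpd had leased the IP address reserved for one node to a different node. The lease state can be checked in /var/lib/dhcpd/dhcpd.leases.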
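
The lease file is long, so a small summary helper makes it easier to see which MAC address holds which IP. The following is a sketch under assumptions: the function name list_leases is made up, and it only understands the usual ISC dhcpd lease block layout.

```shell
# Hypothetical helper: summarize an ISC dhcpd lease file as
# "IP MAC state", one line per lease block.
# The lease file is an append-only log, so the same IP can appear more
# than once; the last entry for an IP is the current one.
list_leases() {
    awk '
        $1 == "lease"                        { ip = $2 }
        $1 == "hardware" && $2 == "ethernet" { mac = $3; sub(/;$/, "", mac) }
        $1 == "binding"  && $2 == "state"    { state = $3; sub(/;$/, "", state) }
        $1 == "}" && ip != ""                { print ip, mac, state; ip = mac = state = "" }
    ' "${1:-/var/lib/dhcpd/dhcpd.leases}"
}
```

With no argument it reads /var/lib/dhcpd/dhcpd.leases. Piping the result through grep for the address in question shows which MAC currently holds it.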

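Problem 4: cluster operators were False

Near the very end of the install, the following command failed. Looking into it, the cause appeared to be that the authentication and console operators were False (not Available).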

[root@bastion01 openshift]# ./openshift-install --dir=installdir wait-for install-complete
INFO Waiting up to 40m0s (until 3:37PM) for the cluster at https://api.ocp4130.ocplab.test:6443 to initialize...
ERROR Cluster operator authentication Degraded is True with OAuthServerRouteEndpointAccessibleController_SyncError: OAuthServerRouteEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.ocp4130.ocplab.test/healthz": EOF
ERROR Cluster operator authentication Available is False with OAuthServerRouteEndpointAccessibleController_EndpointUnavailable: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ocp4130.ocplab.test/healthz": EOF
INFO Cluster operator baremetal Disabled is False with :
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
INFO Cluster operator console Progressing is True with SyncLoopRefresh_InProgress: SyncLoopRefreshProgressing: Working toward version 4.13.4, 0 replicas available
ERROR Cluster operator console Available is False with Deployment_InsufficientReplicas::RouteHealth_FailedGet: DeploymentAvailable: 0 replicas available for console deployment
ERROR RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ocp4130.ocplab.test): Get "https://console-openshift-console.apps.ocp4130.ocplab.test": EOF
INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
ERROR Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
INFO Cluster operator ingress EvaluationConditionsDetected is False with AsExpected:
INFO Cluster operator insights ClusterTransferAvailable is False with NoClusterTransfer: no available cluster transfer
INFO Cluster operator insights Disabled is False with AsExpected:
INFO Cluster operator insights SCAAvailable is False with NotFound: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 404: {"code":"ACCT-MGMT-7","href":"/api/accounts_mgmt/v1/errors/7","id":"7","kind":"Error","operation_id":"c29b5877-fad8-4773-83bc-e91712b4dba8","reason":"The organization (id= 1eMTNJtEsI58UIvnjQyhhHz1gCQ) does not have any certificate of type sca. Enable SCA at https://access.redhat.com/management."}
INFO Cluster operator network ManagementStateDegraded is False with :
ERROR Cluster initialization failed because one or more operators are not functioning properly.
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
ERROR failed to initialize the cluster: Cluster operators authentication, console are not available

[root@bastion01 openshift]# oc get clusteroperator
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.13.4    False       False         True       5h16m   OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ocp4130.ocplab.test/healthz": EOF
baremetal                                  4.13.4    True        False         False      5h14m
cloud-controller-manager                   4.13.4    True        False         False      5h19m
cloud-credential                           4.13.4    True        False         False      5h20m
cluster-autoscaler                         4.13.4    True        False         False      5h15m
config-operator                            4.13.4    True        False         False      5h15m
console                                    4.13.4    False       True          False      5h5m    DeploymentAvailable: 0 replicas available for console deployment...
control-plane-machine-set                  4.13.4    True        False         False      5h14m
csi-snapshot-controller                    4.13.4    True        False         False      5h15m
dns                                        4.13.4    True        False         False      5h15m
etcd                                       4.13.4    True        False         False      5h13m
image-registry                             4.13.4    True        False         False      5h5m
ingress                                    4.13.4    True        False         True       5h7m    The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
insights                                   4.13.4    True        False         False      5h2m
kube-apiserver                             4.13.4    True        False         False      5h5m
kube-controller-manager                    4.13.4    True        False         False      5h12m
kube-scheduler                             4.13.4    True        False         False      5h12m
kube-storage-version-migrator              4.13.4    True        False         False      5h15m
machine-api                                4.13.4    True        False         False      5h14m
machine-approver                           4.13.4    True        False         False      5h15m
machine-config                             4.13.4    True        False         False      5h
marketplace                                4.13.4    True        False         False      5h15m
monitoring                                 4.13.4    True        False         False      5h4m
network                                    4.13.4    True        False         False      5h15m
node-tuning                                4.13.4    True        False         False      5h14m
openshift-apiserver                        4.13.4    True        False         False      5h5m
openshift-controller-manager               4.13.4    True        False         False      5h11m
openshift-samples                          4.13.4    True        False         False      5h7m
operator-lifecycle-manager                 4.13.4    True        False         False      5h15m
operator-lifecycle-manager-catalog         4.13.4    True        False         False      5h15m
operator-lifecycle-manager-packageserver   4.13.4    True        False         False      5h7m
service-ca                                 4.13.4    True        False         False      5h15m
storage                                    4.13.4    True        False         False      5h15m
[root@bastion01 openshift]#
[root@bastion01 openshift]# oc -n openshift-console get pods -o wide
NAME                        READY   STATUS             RESTARTS         AGE    IP            NODE                           NOMINATED NODE   READINESS GATES
console-7f5b95d995-dsgfz    0/1     CrashLoopBackOff   53 (4m32s ago)   5h1m   10.128.0.37   master03.ocp4130.ocplab.test   <none>           <none>
console-7f5b95d995-xng76    0/1     CrashLoopBackOff   53 (4m29s ago)   5h1m   10.129.0.60   master01.ocp4130.ocplab.test   <none>           <none>
console-867db595c-nvvg8     0/1     CrashLoopBackOff   54 (101s ago)    5h4m   10.130.0.15   master02.ocp4130.ocplab.test   <none>           <none>
downloads-f6b75d65c-j9445   1/1     Running            0                5h4m   10.129.0.39   master01.ocp4130.ocplab.test   <none>           <none>
downloads-f6b75d65c-q8fsp   1/1     Running            0                5h4m   10.130.0.14   master02.ocp4130.ocplab.test   <none>           <none>
[root@bastion01 openshift]#

[root@bastion01 openshift]# oc logs console-867db595c-nvvg8  -n openshift-console | tail -n 3
E0720 11:02:54.033112       1 auth.go:239] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocp4130.ocplab.test/oauth/token failed: Head "https://oauth-openshift.apps.ocp4130.ocplab.test": EOF
E0720 11:03:04.038234       1 auth.go:239] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocp4130.ocplab.test/oauth/token failed: Head "https://oauth-openshift.apps.ocp4130.ocplab.test": EOF
E0720 11:03:14.043150       1 auth.go:239] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocp4130.ocplab.test/oauth/token failed: Head "https://oauth-openshift.apps.ocp4130.ocplab.test": EOF

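I did not really understand what was going on, but searching on the log messages turned up the following knowledge base article.
In short, applying it (deleting the router-default pods in openshift-ingress, as shown below) brought everything back to normal.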
https://access.redhat.com/solutions/5691661

[root@bastion01 openshift]# oc get pods -n openshift-ingress
NAME                              READY   STATUS    RESTARTS        AGE
router-default-5768f5c985-ct5ds   1/1     Running   1 (5h24m ago)   5h26m
router-default-5768f5c985-tvrt7   1/1     Running   2 (5h21m ago)   5h26m


[root@bastion01 openshift]# oc delete pod router-default-5768f5c985-ct5ds router-default-5768f5c985-tvrt7 -n openshift-ingress
pod "router-default-5768f5c985-ct5ds" deleted
pod "router-default-5768f5c985-tvrt7" deleted
[root@bastion01 openshift]#          <<< this takes a little while, so wait patiently
[root@bastion01 openshift]# oc get clusteroperators
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.13.4    True        False         False      38s
baremetal                                  4.13.4    True        False         False      5h28m
cloud-controller-manager                   4.13.4    True        False         False      5h33m
cloud-credential                           4.13.4    True        False         False      5h33m
cluster-autoscaler                         4.13.4    True        False         False      5h29m
config-operator                            4.13.4    True        False         False      5h29m
console                                    4.13.4    True        False         False      18s
control-plane-machine-set                  4.13.4    True        False         False      5h28m
csi-snapshot-controller                    4.13.4    True        False         False      5h28m
dns                                        4.13.4    True        False         False      5h28m
etcd                                       4.13.4    True        False         False      5h27m
image-registry                             4.13.4    True        False         False      5h18m
ingress                                    4.13.4    True        False         False      7s
insights                                   4.13.4    True        False         False      5h16m
kube-apiserver                             4.13.4    True        False         False      5h19m
kube-controller-manager                    4.13.4    True        False         False      5h26m
kube-scheduler                             4.13.4    True        False         False      5h26m
kube-storage-version-migrator              4.13.4    True        False         False      5h29m
machine-api                                4.13.4    True        False         False      5h28m
machine-approver                           4.13.4    True        False         False      5h28m
machine-config                             4.13.4    True        False         False      5h14m
marketplace                                4.13.4    True        False         False      5h29m
monitoring                                 4.13.4    True        False         False      5h18m
network                                    4.13.4    True        False         False      5h29m
node-tuning                                4.13.4    True        False         False      5h28m
openshift-apiserver                        4.13.4    True        False         False      5h19m
openshift-controller-manager               4.13.4    True        False         False      5h25m
openshift-samples                          4.13.4    True        False         False      5h21m
operator-lifecycle-manager                 4.13.4    True        False         False      5h28m
operator-lifecycle-manager-catalog         4.13.4    True        False         False      5h28m
operator-lifecycle-manager-packageserver   4.13.4    True        False         False      5h21m
service-ca                                 4.13.4    True        False         False      5h29m
storage                                    4.13.4    True        False         False      5h29m
[root@bastion01 openshift]#
[root@bastion01 openshift]# oc -n openshift-console get pods
NAME                        READY   STATUS    RESTARTS       AGE
console-7f5b95d995-dsgfz    1/1     Running   56 (11m ago)   5h20m
console-7f5b95d995-xng76    1/1     Running   56 (10m ago)   5h20m
downloads-f6b75d65c-j9445   1/1     Running   0              5h22m
downloads-f6b75d65c-q8fsp   1/1     Running   0              5h22m

Once every operator was True, I tried the console and was able to log in normally.
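
Scanning that long table by eye is tedious, so a filter like the one below can pick out the operators that have not settled. It is a sketch, not part of the original procedure: the function name not_ready_operators is made up, and it assumes the column order shown above (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED) with every column filled in.

```shell
# Hypothetical helper: read `oc get clusteroperators` output on stdin and
# print the name of every operator that is not Available=True,
# Progressing=False, Degraded=False. The header line is skipped.
not_ready_operators() {
    awk 'NR > 1 && ($3 != "True" || $4 != "False" || $5 != "False") { print $1 }'
}
```

Used as oc get clusteroperators | not_ready_operators, empty output means every operator has settled. Against the first table above it would list authentication, console and ingress.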