Upgrading the Base OS in an M3 Anvil! Cluster

From Alteeve Wiki

Revision as of 18:35, 19 August 2024


Overview

This article covers rebuilding an Anvil! to upgrade or change the base operating system of all the machines in the cluster.

In short, the process will be to remove each machine from the cluster, reinstall the new OS, and integrate it back into the cluster. We'll do Strikers, Subnodes and a DR host.

This tutorial will also serve as a guide to rebuilding a machine that failed and was replaced. The only difference will be that the installed OS will stay the same.

For this tutorial, we will be updating from CentOS Stream 8 to AlmaLinux 9. The process should also work with RHEL as the source OS, the destination OS, or both.

Updating

It is critical to update the cluster first, to ensure all machines are running the latest versions of the various cluster components. This will maximise the chance that there are no compatibility issues as each machine is rebuilt.

If A Machine Has Failed

If you are replacing a failed machine, follow the "Replacing a Failed Machine in an M3 Anvil! Cluster" tutorial. You can rebuild the lost machine with the new OS version, so long as the new OS is one major version newer.

That is to say, if you have an EL 8 based M3 cluster, you can use EL 9 as the OS of the machine being replaced. However, if you do this, please proceed with this tutorial to upgrade the rest of the machines in the cluster as soon as possible after completing the replacement of the lost machine.
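
The "one major version newer" rule can be checked before a rebuild. Below is a minimal sketch, assuming the machine has a standard /etc/os-release file that sets VERSION_ID. The function names are hypothetical and are not part of the Anvil! tools. Allowing the same major version is an interpretation, covering the case where a failed machine is rebuilt without changing the OS.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of the Anvil! tool set.

# Print the major OS version from an os-release style file.
# Defaults to /etc/os-release when no file is given.
os_major_version() {
    local file="${1:-/etc/os-release}"
    # os-release files use shell-compatible syntax, so source the file in a
    # subshell to avoid leaking its variables into the caller.
    ( . "$file" && echo "${VERSION_ID%%.*}" )
}

# Succeed if the new major version equals the current one, or is exactly
# one major version newer (for example, EL 8 to EL 9).
is_supported_jump() {
    local current="$1" new="$2"
    [ "$new" -eq "$current" ] || [ "$new" -eq $((current + 1)) ]
}
```

For example, on a machine still in the cluster, 'is_supported_jump "$(os_major_version)" 9' would succeed on an EL 8 or EL 9 based system, and fail on anything else.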

Whether replacing a failed machine or upgrading, it is critically important to update the cluster first. This ensures that the versions of applications on the machines still in the cluster match, or are compatible with, the versions that will be installed on the rebuilt systems.

If All Machines Are Online

If you're doing a planned update, and all machines in the Anvil! cluster are online, you can use striker-update-cluster.

This is the scenario we will be covering in this tutorial.

Pre-Upgrade Update

Note: This next step will include migrating servers between subnodes. This causes some amount of performance impact on the guests. As such, consider this potential impact when scheduling this update.

We can run the cluster update from either striker dashboard. For this tutorial, we will use an-striker01.

striker-update-cluster
16:53:36; - Verifying access to: [an-a03dr01]...
16:53:37; Connected on: [10.201.14.3] via: [bcn1]
16:53:37; - Verifying access to: [an-a03n01]...
16:53:37; Connected on: [10.201.14.1] via: [bcn1]
16:53:37; - Verifying access to: [an-a03n02]...
16:53:37; Connected on: [10.201.14.2] via: [bcn1]
16:53:37; - Verifying access to: [an-striker01]...
16:53:37; Connected on: [10.201.4.1] via: [bcn1]
16:53:37; - Verifying access to: [an-striker02]...
16:53:38; Connected on: [10.201.4.2] via: [bcn1]
16:53:38; - Success!
16:53:38; [  Note   ] - All nodes need to be up and running for the update to run on nodes. 
16:53:38; [  Note   ] - Any out-of-sync storage needs to complete before a node can be updated. 
16:53:38; [ Warning ] - Servers will be migrated between subnodes, which can cause reduced performance during
16:53:38; [ Warning ] - these migrations. If a sub-node is not active, it will be activated as part of the
16:53:38; [ Warning ] - upgrade process.

Here, we see that the update program verified access to all known machines. With this confirmed, you will be asked to proceed.

Proceed? [y/N]

If you're ready, enter 'y'.

y
Thank you, proceeding.
16:55:06; Disabling Anvil! daemons on all hosts...
16:55:06; - Disabling daemons on: [an-a03dr01]... 
16:55:06; anvil-daemon stopped.
16:55:07; scancore stopped.
16:55:07; - Done!
16:55:07; - Disabling daemons on: [an-a03n01]... 
16:55:07; anvil-daemon stopped.
16:55:07; scancore stopped.
16:55:07; - Done!
16:55:07; - Disabling daemons on: [an-a03n02]... 
16:55:08; anvil-daemon stopped.
16:55:08; scancore stopped.
16:55:08; - Done!
16:55:08; - Disabling daemons on: [an-striker01]... 
16:55:09; anvil-daemon stopped.
16:55:09; scancore stopped.
16:55:09; - Done!
16:55:09; - Disabling daemons on: [an-striker02]... 
16:55:09; anvil-daemon stopped.
16:55:09; scancore stopped.
16:55:09; - Done!
16:55:09; Enabling 'anvil-safe-start' on nodes to prevent hangs on reboot.
16:55:10; - Done!
16:55:10; Starting the update of the Striker dashboard: [an-striker01].
16:55:10; - Beginning OS update of: [an-striker01].
16:55:10; - Calling update now.
16:55:10; - NOTE: This can seem like it's hung! You can watch the progress using 'journalctl -f' on another terminal to
16:55:10; -       watch the progress via the system logs. You can also check with 'ps aux | grep dnf'.
<...snip...>

From here, the update will continue to run. How long this takes depends on several factors:

  • How many updates are needed, and how long they take to install.
  • How much data needs to be downloaded, and how fast the Internet connection is.
  • How long it takes to migrate servers between subnodes in each node pair.
  • How many nodes and DR hosts you have.

After the update completes

<...snip...>
16:57:35; - Both subnodes are online, will now check replicated storage.
16:57:35; - This is the second node, no need to wait for replication to complete.
16:57:35; - Running 'anvil-version-changes'.
16:57:37; - Done!
16:57:37; Enabling Anvil! daemons on all hosts...
16:57:37; - Enabling dameons on: [an-a03dr01]... 
16:57:37; anvil-daemon started.
16:57:38; scancore started.
16:57:38; - Done!
16:57:38; - Enabling dameons on: [an-a03n01]... 
16:57:38; anvil-daemon started.
16:57:39; scancore started.
16:57:39; - Done!
16:57:39; - Enabling dameons on: [an-a03n02]... 
16:57:39; anvil-daemon started.
16:57:39; scancore started.
16:57:39; - Done!
16:57:39; - Enabling dameons on: [an-striker01]... 
16:57:39; anvil-daemon started.
16:57:40; scancore started.
16:57:40; - Done!
16:57:40; - Enabling dameons on: [an-striker02]... 
16:57:40; anvil-daemon started.
16:57:40; scancore started.
16:57:40; - Done!
16:57:40; Updates complete!

We're ready to proceed with rebuilding each machine!

Rebuilding Machines

Each physical machine in the cluster will need its operating system reinstalled with the new version. In this tutorial, that means we'll install AlmaLinux 9.

Note: Doing an in-place update of the operating system is not recommended. Upgrading from RHEL 8 to 9 is supported by Red Hat, but it is not tested by Alteeve.

Rebuild Strikers

Start by rebuilding the Striker dashboards, one at a time.

Power off the first dashboard, boot it from the install media, and follow the install instructions from the main install guide. Follow that guide until you have configured the network. When you reach the 'Peering Striker Dashboards' stage, return here.

Once the dashboard has been configured, use the rebuilt dashboard's web interface to peer the other Striker. This will update the rebuilt dashboard's anvil.conf to add the other dashboard.

Once re-peered, do the following to ensure the database has been resynced on the rebuilt Striker.

On both Striker dashboards, run (as root or via sudo):

systemctl stop anvil-daemon.service scancore.service
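
Before starting the resync, it may be worth confirming that both daemons have actually stopped. Below is a minimal sketch using 'systemctl is-active'. The function name is hypothetical and is not part of the Anvil! tools, and it only checks the host it is run on.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; not part of the Anvil! tool set.

# Succeed only if neither Anvil! daemon is reported as active on this host.
daemons_stopped() {
    local unit
    for unit in anvil-daemon.service scancore.service; do
        # 'is-active --quiet' returns 0 when the unit is active.
        if systemctl is-active --quiet "$unit"; then
            echo "Still running: [$unit]" >&2
            return 1
        fi
    done
}
```

Running 'daemons_stopped && echo "Safe to resync"' on each Striker would print the message only when both units are inactive.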

Then on the rebuilt striker, please run:

anvil-daemon --run-once --resync-db
WARNING:  skipping "pg_toast_1260" --- only superuser can vacuum it
WARNING:  skipping "pg_toast_1260_index" --- only superuser can vacuum it
<...snip...>
WARNING:  skipping "pg_replication_origin" --- only superuser can vacuum it
WARNING:  skipping "pg_shseclabel" --- only superuser can vacuum it

This could take some time to complete, so please be patient.

Once the above command completes, restart the daemons.

systemctl start anvil-daemon.service scancore.service

Done! The first Striker has been upgraded. Repeat the process for the second Striker!

Rebuild DR

Host Key Issue

In some cases, other hosts will not update to the new key. When this happens, you will see a message like the following in anvil.log:

2024/08/19 14:31:47:Remote.pm:683; The target's host key has changed. If the target has been rebuilt, or the target IP reused, the old key will need to be removed. If this is the case, remove line: [5] from: [/root/.ssh/known_hosts].

If you see this, manually edit the referenced known_hosts file and delete the referenced line number.
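
If you prefer not to edit the file by hand, the same removal can be scripted. Below is a minimal sketch, assuming GNU sed (as shipped on EL based systems). The function name is hypothetical and is not part of the Anvil! tools. It saves a backup copy of the file before deleting the line.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; not part of the Anvil! tool set.

# Remove one line, by number, from a known_hosts file.
# A backup copy is saved as '<file>.bak' first.
remove_known_hosts_line() {
    local line="$1"
    local file="${2:-/root/.ssh/known_hosts}"

    # Refuse anything that is not a positive integer.
    case "$line" in
        ''|*[!0-9]*|0) echo "Invalid line number: [$line]" >&2; return 1 ;;
    esac

    cp -p "$file" "${file}.bak" || return 1
    # GNU sed: delete the given line in place.
    sed -i "${line}d" "$file"
}
```

For the log message shown above, the call would be 'remove_known_hosts_line 5 /root/.ssh/known_hosts'. Always use the line number and file from your own log entry, as they will differ between hosts.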

 

© Alteeve's Niche! Inc. 1997-2024   Anvil! "Intelligent Availability®" Platform
legal stuff: All info is provided "As-Is". Do not use anything here unless you are willing and able to take responsibility for your own actions.