Upgrading the Base OS in an M3 Anvil! Cluster: Difference between revisions

From Alteeve Wiki


= Updating =
It is critical to update the cluster to ensure all machines are running the latest versions of the various cluster components. This will maximise the chance that there are no compatibility issues as each machine is rebuilt.
== If A Machine Has Failed ==
If you are replacing a failed machine, follow the "[[Replacing a Failed Machine in an M3 Anvil! Cluster]]" tutorial. You can rebuild the lost machine with the new OS version, so long as the new OS is one major version newer.
That is to say, if you have an [[EL]] 8 based M3 cluster, you can use [[EL]] 9 as the OS of the machine being replaced. However, if you do this, please proceed with this tutorial to upgrade the rest of the machines in the cluster as soon as possible after completing the replacement of the lost machine.


Whether replacing a machine that failed, or upgrading, it is critically important to update the cluster first. This ensures that the versions of the applications already running in the cluster match, or are compatible with, the versions that will be installed on the rebuilt machines.
This is the scenario we will be covering in this tutorial.


=== Pre-Upgrade Update ===
 
{{note|1=This next step will include migrating servers between subnodes. This causes some amount of performance impact on the guests. As such, consider this potential impact when scheduling this update.}}
 
We can run the cluster update from either striker dashboard. For this tutorial, we will use <span class="code">an-striker01</span>.
 
<syntaxhighlight lang="bash">
striker-update-cluster
</syntaxhighlight>
<syntaxhighlight lang="text">
16:53:36; - Verifying access to: [an-a03dr01]...
16:53:37; Connected on: [10.201.14.3] via: [bcn1]
16:53:37; - Verifying access to: [an-a03n01]...
16:53:37; Connected on: [10.201.14.1] via: [bcn1]
16:53:37; - Verifying access to: [an-a03n02]...
16:53:37; Connected on: [10.201.14.2] via: [bcn1]
16:53:37; - Verifying access to: [an-striker01]...
16:53:37; Connected on: [10.201.4.1] via: [bcn1]
16:53:37; - Verifying access to: [an-striker02]...
16:53:38; Connected on: [10.201.4.2] via: [bcn1]
16:53:38; - Success!
16:53:38; [  Note  ] - All nodes need to be up and running for the update to run on nodes.
16:53:38; [  Note  ] - Any out-of-sync storage needs to complete before a node can be updated.
16:53:38; [ Warning ] - Servers will be migrated between subnodes, which can cause reduced performance during
16:53:38; [ Warning ] - these migrations. If a sub-node is not active, it will be activated as part of the
16:53:38; [ Warning ] - upgrade process.
</syntaxhighlight>
 
Here, we see that the update program verified access to all known machines. With this confirmed, you will be asked to proceed.
 
<syntaxhighlight lang="text">
Proceed? [y/N]
</syntaxhighlight>
 
If you're ready, enter '<span class="code">y</span>'.
 
<syntaxhighlight lang="text">
y
</syntaxhighlight>
<syntaxhighlight lang="text">
Thank you, proceeding.
16:55:06; Disabling Anvil! daemons on all hosts...
16:55:06; - Disabling daemons on: [an-a03dr01]...
16:55:06; anvil-daemon stopped.
16:55:07; scancore stopped.
16:55:07; - Done!
16:55:07; - Disabling daemons on: [an-a03n01]...
16:55:07; anvil-daemon stopped.
16:55:07; scancore stopped.
16:55:07; - Done!
16:55:07; - Disabling daemons on: [an-a03n02]...
16:55:08; anvil-daemon stopped.
16:55:08; scancore stopped.
16:55:08; - Done!
16:55:08; - Disabling daemons on: [an-striker01]...
16:55:09; anvil-daemon stopped.
16:55:09; scancore stopped.
16:55:09; - Done!
16:55:09; - Disabling daemons on: [an-striker02]...
16:55:09; anvil-daemon stopped.
16:55:09; scancore stopped.
16:55:09; - Done!
16:55:09; Enabling 'anvil-safe-start' on nodes to prevent hangs on reboot.
16:55:10; - Done!
16:55:10; Starting the update of the Striker dashboard: [an-striker01].
16:55:10; - Beginning OS update of: [an-striker01].
16:55:10; - Calling update now.
16:55:10; - NOTE: This can seem like it's hung! You can watch the progress using 'journalctl -f' on another terminal to
16:55:10; -      watch the progress via the system logs. You can also check with 'ps aux | grep dnf'.
<...snip...>
</syntaxhighlight>
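
The note in the output above suggests two ways to confirm that the update has not hung. A minimal sketch of that check, meant to be run in a second terminal on the machine being updated, might look like the following. The function name <span class="code">dnf_running</span> is made up for illustration, and <span class="code">pgrep</span> is used here in place of '<span class="code">ps aux | grep dnf</span>' so that the check cannot match itself. It assumes the update runs as a process named exactly <span class="code">dnf</span>.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report whether a dnf process is currently running.
# Assumption: the update runs as a process named exactly 'dnf'.
dnf_running() {
    # 'pgrep -x' matches on the exact process name.
    if pgrep -x dnf >/dev/null 2>&1; then
        echo "dnf is running; the update is still in progress."
        return 0
    fi
    echo "No dnf process found."
    return 1
}

# Run the check once; '|| true' keeps a "not found" result from aborting a script.
dnf_running || true

# To follow the system logs live instead, run: journalctl -f
```

If no <span class="code">dnf</span> process is found while the update is still reporting progress, fall back to the '<span class="code">ps aux | grep dnf</span>' form shown in the output, as the process may be running under a different name.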
 
From here, the update will continue to run. How long this takes depends on several factors:
 
* How many updates are needed, and how long they take to install.
* How much data needs to be downloaded, and how fast the Internet connection is.
* How long it takes to migrate servers between subnodes in each node pair.
* How many nodes and DR hosts you have.
 
When the update completes, the end of the output will look like this:
 
<syntaxhighlight lang="text">
<...snip...>
16:57:35; - Both subnodes are online, will now check replicated storage.
16:57:35; - This is the second node, no need to wait for replication to complete.
16:57:35; - Running 'anvil-version-changes'.
16:57:37; - Done!
16:57:37; Enabling Anvil! daemons on all hosts...
16:57:37; - Enabling dameons on: [an-a03dr01]...
16:57:37; anvil-daemon started.
16:57:38; scancore started.
16:57:38; - Done!
16:57:38; - Enabling dameons on: [an-a03n01]...
16:57:38; anvil-daemon started.
16:57:39; scancore started.
16:57:39; - Done!
16:57:39; - Enabling dameons on: [an-a03n02]...
16:57:39; anvil-daemon started.
16:57:39; scancore started.
16:57:39; - Done!
16:57:39; - Enabling dameons on: [an-striker01]...
16:57:39; anvil-daemon started.
16:57:40; scancore started.
16:57:40; - Done!
16:57:40; - Enabling dameons on: [an-striker02]...
16:57:40; anvil-daemon started.
16:57:40; scancore started.
16:57:40; - Done!
16:57:40; Updates complete!
</syntaxhighlight>
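
Before moving on, it can be reassuring to confirm that the daemons really are running again on every machine. The sketch below is one way to do that from a striker. It assumes that <span class="code">anvil-daemon</span> and <span class="code">scancore</span> are managed as systemd units with those names, and that password-less root SSH access to each host is available. The function name <span class="code">check_daemons</span> is hypothetical, and the host names are the ones from the example output above.

```bash
#!/usr/bin/env bash
# Host names are taken from the example output above; adjust for your cluster.
HOSTS=(an-a03dr01 an-a03n01 an-a03n02 an-striker01 an-striker02)

# Hypothetical helper: print the state of both daemons on each host given.
# Assumption: the daemons are systemd units named 'anvil-daemon' and 'scancore'.
# Returns non-zero if any host is unreachable or any daemon is not active.
check_daemons() {
    local host state failed=0
    for host in "$@"; do
        # 'systemctl is-active' prints one state per unit, one per line;
        # 'paste' joins them onto a single line.
        state=$(ssh -n "root@${host}" systemctl is-active anvil-daemon scancore 2>/dev/null | paste -sd ' ' -)
        echo "${host}: ${state:-unreachable}"
        [[ "$state" == "active active" ]] || failed=1
    done
    return "$failed"
}

# Usage:
# check_daemons "${HOSTS[@]}" && echo "All daemons are active."
```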
 
We're ready to proceed with rebuilding each machine!
 
= Rebuilding Machines =
 
Each physical machine in the cluster must have its operating system reinstalled with the new version. In this tutorial, that means we'll install AlmaLinux 9.
 
{{note|1=Doing an in-place update of the operating system is not recommended. Upgrading from RHEL 8 to 9 is supported by Red Hat, but it is not tested by Alteeve.}}
 
== Rebuild Strikers ==
 
Start by rebuilding the Striker dashboards, one at a time.
 
Power off the dashboard and reboot using the install media, and follow the install instructions from the [[Build_an_M3_Anvil!_Cluster#Installation_of_Base_OS|main install guide]]. Follow the tutorial until you configure the network. When you reach the 'Peering Striker Dashboards' stage, return here.
 
Once the dashboard has been configured, ''using the rebuilt dashboard's web interface'', [[Build_an_M3_Anvil!_Cluster#Peering_Striker_Dashboards|peer the other striker]]. This will update the rebuilt dashboard's <span class="code">anvil.conf</span> to add the other dashboard.
 
Once re-peered, please wait about five minutes before proceeding. This will give the database on the rebuilt striker time to synchronise.
 
{{note|1=If you notice any trouble, stop '<span class="code">anvil-daemon</span>' on both strikers, then manually run '<span class="code">anvil-daemon --run-once --resync-db</span>'. When it completes, restart '<span class="code">anvil-daemon</span>'.}}
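
The sequence in the note above can be sketched as a small script. This is an illustration only: it assumes <span class="code">anvil-daemon</span> is managed as a systemd unit, that root SSH access works to both strikers, and that the resync is run on the striker where the script is executed. The <span class="code">run</span> and <span class="code">resync_db</span> function names and the <span class="code">DRY_RUN</span> variable are hypothetical. Only the '<span class="code">anvil-daemon --run-once --resync-db</span>' call comes from this tutorial.

```bash
#!/usr/bin/env bash
# Sketch of the manual resync described in the note above.
# Assumptions: 'anvil-daemon' is a systemd unit, root SSH access works to both
# strikers, and the resync itself is run on the local striker.
STRIKERS=(an-striker01 an-striker02)   # host names used in this tutorial

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [[ "${DRY_RUN:-0}" == "1" ]]; then
        echo "+ $*"
    else
        "$@"
    fi
}

resync_db() {
    local host
    # Stop the daemon on both strikers first.
    for host in "${STRIKERS[@]}"; do
        run ssh -n "root@${host}" systemctl stop anvil-daemon || return 1
    done
    # Run a single resync pass, as given in the note.
    run anvil-daemon --run-once --resync-db || return 1
    # Restart the daemon on both strikers.
    for host in "${STRIKERS[@]}"; do
        run ssh -n "root@${host}" systemctl start anvil-daemon || return 1
    done
}

# Preview the commands without running anything:
# DRY_RUN=1 resync_db
```

Running with <span class="code">DRY_RUN=1</span> first prints each command, which makes it easy to confirm the host names before anything is stopped.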



Revision as of 22:53, 16 August 2024


Overview

This article covers rebuilding an Anvil! to upgrade or change the base operating system of all the machines in the cluster.

In short, the process will be to remove each machine from the cluster, reinstall the new OS, and integrate it back into the cluster. We'll do Strikers, Subnodes and a DR host.

This tutorial will also serve as a guide to rebuilding a machine that failed and was replaced. The only difference will be that the installed OS will stay the same.

For this tutorial, we will be updating from CentOS Stream 8 to AlmaLinux 9. The process should also work with RHEL as the source OS, the destination OS, or both.

Updating

It is critical to update the cluster to ensure all machines are running the latest versions of the various cluster components. This will maximise the chance that there are no compatibility issues as each machine is rebuilt.

If A Machine Has Failed

If you are replacing a failed machine, follow the "Replacing a Failed Machine in an M3 Anvil! Cluster" tutorial. You can rebuild the lost machine with the new OS version, so long as the new OS is one major version newer.

That is to say, if you have an EL 8 based M3 cluster, you can use EL 9 as the OS of the machine being replaced. However, if you do this, please proceed with this tutorial to upgrade the rest of the machines in the cluster as soon as possible after completing the replacement of the lost machine.

Whether replacing a machine that failed, or upgrading, it is critically important to update the cluster first. This ensures that the versions of the applications already running in the cluster match, or are compatible with, the versions that will be installed on the rebuilt machines.

If All Machines Are Online

If you're doing a planned update, and all machines in the Anvil! cluster are online, you can use striker-update-cluster.

This is the scenario we will be covering in this tutorial.

Pre-Upgrade Update

Note: This next step will include migrating servers between subnodes. This causes some amount of performance impact on the guests. As such, consider this potential impact when scheduling this update.

We can run the cluster update from either striker dashboard. For this tutorial, we will use an-striker01.

striker-update-cluster
16:53:36; - Verifying access to: [an-a03dr01]...
16:53:37; Connected on: [10.201.14.3] via: [bcn1]
16:53:37; - Verifying access to: [an-a03n01]...
16:53:37; Connected on: [10.201.14.1] via: [bcn1]
16:53:37; - Verifying access to: [an-a03n02]...
16:53:37; Connected on: [10.201.14.2] via: [bcn1]
16:53:37; - Verifying access to: [an-striker01]...
16:53:37; Connected on: [10.201.4.1] via: [bcn1]
16:53:37; - Verifying access to: [an-striker02]...
16:53:38; Connected on: [10.201.4.2] via: [bcn1]
16:53:38; - Success!
16:53:38; [  Note   ] - All nodes need to be up and running for the update to run on nodes. 
16:53:38; [  Note   ] - Any out-of-sync storage needs to complete before a node can be updated. 
16:53:38; [ Warning ] - Servers will be migrated between subnodes, which can cause reduced performance during
16:53:38; [ Warning ] - these migrations. If a sub-node is not active, it will be activated as part of the
16:53:38; [ Warning ] - upgrade process.

Here, we see that the update program verified access to all known machines. With this confirmed, you will be asked to proceed.

Proceed? [y/N]

If you're ready, enter 'y'.

y
Thank you, proceeding.
16:55:06; Disabling Anvil! daemons on all hosts...
16:55:06; - Disabling daemons on: [an-a03dr01]... 
16:55:06; anvil-daemon stopped.
16:55:07; scancore stopped.
16:55:07; - Done!
16:55:07; - Disabling daemons on: [an-a03n01]... 
16:55:07; anvil-daemon stopped.
16:55:07; scancore stopped.
16:55:07; - Done!
16:55:07; - Disabling daemons on: [an-a03n02]... 
16:55:08; anvil-daemon stopped.
16:55:08; scancore stopped.
16:55:08; - Done!
16:55:08; - Disabling daemons on: [an-striker01]... 
16:55:09; anvil-daemon stopped.
16:55:09; scancore stopped.
16:55:09; - Done!
16:55:09; - Disabling daemons on: [an-striker02]... 
16:55:09; anvil-daemon stopped.
16:55:09; scancore stopped.
16:55:09; - Done!
16:55:09; Enabling 'anvil-safe-start' on nodes to prevent hangs on reboot.
16:55:10; - Done!
16:55:10; Starting the update of the Striker dashboard: [an-striker01].
16:55:10; - Beginning OS update of: [an-striker01].
16:55:10; - Calling update now.
16:55:10; - NOTE: This can seem like it's hung! You can watch the progress using 'journalctl -f' on another terminal to
16:55:10; -       watch the progress via the system logs. You can also check with 'ps aux | grep dnf'.
<...snip...>

From here, the update will continue to run. How long this takes depends on several factors:

  • How many updates are needed, and how long they take to install.
  • How much data needs to be downloaded, and how fast the Internet connection is.
  • How long it takes to migrate servers between subnodes in each node pair.
  • How many nodes and DR hosts you have.

When the update completes, the end of the output will look like this:

<...snip...>
16:57:35; - Both subnodes are online, will now check replicated storage.
16:57:35; - This is the second node, no need to wait for replication to complete.
16:57:35; - Running 'anvil-version-changes'.
16:57:37; - Done!
16:57:37; Enabling Anvil! daemons on all hosts...
16:57:37; - Enabling dameons on: [an-a03dr01]... 
16:57:37; anvil-daemon started.
16:57:38; scancore started.
16:57:38; - Done!
16:57:38; - Enabling dameons on: [an-a03n01]... 
16:57:38; anvil-daemon started.
16:57:39; scancore started.
16:57:39; - Done!
16:57:39; - Enabling dameons on: [an-a03n02]... 
16:57:39; anvil-daemon started.
16:57:39; scancore started.
16:57:39; - Done!
16:57:39; - Enabling dameons on: [an-striker01]... 
16:57:39; anvil-daemon started.
16:57:40; scancore started.
16:57:40; - Done!
16:57:40; - Enabling dameons on: [an-striker02]... 
16:57:40; anvil-daemon started.
16:57:40; scancore started.
16:57:40; - Done!
16:57:40; Updates complete!

We're ready to proceed with rebuilding each machine!

Rebuilding Machines

Each physical machine in the cluster must have its operating system reinstalled with the new version. In this tutorial, that means we'll install AlmaLinux 9.

Note: Doing an in-place update of the operating system is not recommended. Upgrading from RHEL 8 to 9 is supported by Red Hat, but it is not tested by Alteeve.

Rebuild Strikers

Start by rebuilding the Striker dashboards, one at a time.

Power off the dashboard and reboot using the install media, and follow the install instructions from the main install guide. Follow the tutorial until you configure the network. When you reach the 'Peering Striker Dashboards' stage, return here.

Once the dashboard has been configured, using the rebuilt dashboard's web interface, peer the other striker. This will update the rebuilt dashboard's anvil.conf to add the other dashboard.

Once re-peered, please wait about five minutes before proceeding. This will give the database on the rebuilt striker time to synchronise.

Note: If you notice any trouble, stop 'anvil-daemon' on both strikers, then manually run 'anvil-daemon --run-once --resync-db'. When it completes, restart 'anvil-daemon'.



 

© Alteeve's Niche! Inc. 1997-2024   Anvil! "Intelligent Availability®" Platform
legal stuff: All info is provided "As-Is". Do not use anything here unless you are willing and able to take responsibility for your own actions.