Node Removal - Storage Availability in AWS with Nutanix Clusters - Part 2

Part 1 - Storage Availability in AWS with Nutanix Clusters

Disk and Node Failure in AWS

The Nutanix unified component called Stargate receives and processes data. All read and write requests for a node are sent to the Stargate process on that node. The Hades service, which also runs on the cluster, simplifies break-fix procedures for disks and automates several tasks that previously required manual user action. Hades helps fix failing devices before they become unrecoverable.

When Stargate sees delayed responses to I/O requests sent to a disk, it marks the disk offline. Hades then automatically removes the disk from the data path and runs smartctl checks against it. If the checks pass, Hades marks the disk online and returns it to service. If the checks fail, or if Stargate marks a disk offline three times in one hour (regardless of the smartctl check results), Hades automatically starts the EC2 removal process. Removing the EC2 instance triggers an API call that notifies the Nutanix Clusters portal. The portal allocates a new instance, adds it to the cluster, and marks the instance with the unresponsive disk for removal. The cluster software automatically replicates the data on the bad EC2 instance to other instances, then finishes removing the bad EC2 instance.
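The decision described above can be sketched in a few lines of Python. This is only an illustrative model, not Hades code: the class name `DiskHealthTracker`, the method `on_marked_offline`, and the returned strings are hypothetical. Only the two thresholds (three offline events, one hour) come from the text, and treating an event exactly one hour old as outside the window is an assumption.

```python
from collections import deque

OFFLINE_LIMIT = 3        # offline marks that force removal (from the text)
WINDOW_SECONDS = 3600    # one hour (from the text)


class DiskHealthTracker:
    """Hypothetical model of the offline/removal decision for one disk."""

    def __init__(self):
        # Timestamps (in seconds) of recent offline events for this disk.
        self._offline_times = deque()

    def on_marked_offline(self, now, smartctl_passed):
        """Record an offline event and return the next action.

        Returns "return_to_service" or "start_instance_removal".
        """
        self._offline_times.append(now)
        # Drop events that fall outside the one-hour window (assumed boundary).
        while now - self._offline_times[0] >= WINDOW_SECONDS:
            self._offline_times.popleft()

        # Failed smartctl checks start removal immediately.
        if not smartctl_passed:
            return "start_instance_removal"
        # Three offline marks within an hour start removal even if checks pass.
        if len(self._offline_times) >= OFFLINE_LIMIT:
            return "start_instance_removal"
        return "return_to_service"


tracker = DiskHealthTracker()
actions = [tracker.on_marked_offline(t, smartctl_passed=True) for t in (0, 1800, 3599)]
print(actions)
```

In this sketch the third offline event inside the hour triggers removal even though every smartctl check passed, which mirrors the "regardless of the smartctl check results" rule.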


Nutanix Clusters On AWS - Cluster Portal




Dwayne Lessner

Principal Technical Marketing Engineer at Nutanix
