Node Removal - Storage Availability in AWS with Nutanix Clusters - Part 2
Part 1 - Storage Availability in AWS with Nutanix Clusters
Disk and Node Failure in AWS
The Nutanix unified component called Stargate receives and processes data. All read and write requests for a node are sent to the Stargate process on that node. The Hades service, which also runs on the cluster, simplifies the break-fix procedures for disks and automates several tasks that previously required manual user actions. Hades helps fix failing devices before they become unrecoverable.
Once Stargate sees delays in responses to I/O requests to a disk, it marks the disk offline. Hades then automatically removes the disk from the data path and runs smartctl checks against it. If the checks pass, Hades marks the disk online and returns it to service. If the checks fail, or if Stargate marks a disk offline three times in one hour (regardless of the smartctl check results), Hades automatically starts the EC2 removal process. Removing the EC2 instance triggers an API call that notifies the Nutanix Clusters portal. The portal allocates a new instance, adds it to the cluster, and marks the instance with the unresponsive disk for removal. The cluster software automatically replicates the data on the bad EC2 instance to other instances, then finishes the removal of the bad EC2 instance.
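The decision rule in the paragraph above can be sketched in a few lines of Python. This is only an illustrative model, not Hades code: the class name, method name, and returned action strings are hypothetical, and treating "one hour" as a sliding window of 3600 seconds is an assumption about how the three offline events are counted.

```python
from collections import deque

WINDOW_SECONDS = 3600   # "one hour" from the text; sliding window is an assumption
MAX_OFFLINE_EVENTS = 3  # "three times" from the text


class DiskHealthTracker:
    """Hypothetical model of the offline/online decision; not actual Hades code."""

    def __init__(self):
        # Timestamps (in seconds) of recent "marked offline" events for one disk.
        self._offline_times = deque()

    def on_marked_offline(self, now, smart_check_passed):
        """Record an offline event at time `now` and return the resulting action."""
        self._offline_times.append(now)
        # Forget events that are older than the one-hour window.
        while now - self._offline_times[0] > WINDOW_SECONDS:
            self._offline_times.popleft()
        # A failed smartctl check always starts the removal process.
        if not smart_check_passed:
            return "remove_instance"
        # Three offline events within the window also start removal,
        # regardless of the smartctl result.
        if len(self._offline_times) >= MAX_OFFLINE_EVENTS:
            return "remove_instance"
        return "return_to_service"
```

For example, a disk that goes offline at 0, 600, and 1200 seconds and passes smartctl each time is returned to service twice and then triggers removal on the third event, while a disk that fails its first smartctl check triggers removal immediately.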
Nutanix Clusters On AWS - Cluster Portal
Principal Technical Marketing Engineer at Nutanix
Please share if you feel it's worthwhile.