Chapter 12. Deploying an AWS Lambda to disable a non-responding site
Deploy an AWS Lambda as part of the load-balancer building block in a multi-site deployment.
This chapter explains how to resolve split-brain scenarios between two sites in a multi-site deployment. The procedure also disables replication if one site fails, so that the other site can continue to serve requests.
This deployment is intended to be used with the setup described in the Concepts for multi-site deployments chapter. Use this deployment with the other building blocks outlined in the Building blocks multi-site deployments chapter.
We provide these blueprints to show a minimal, functionally complete example with good baseline performance for regular installations. You still need to adapt them to your environment and to your organization’s standards and security best practices.
12.1. Architecture
If network communication between the sites of a multi-site deployment fails, the two sites can no longer replicate data between them. The Data Grid is configured with a FAIL failure policy, which favors consistency over availability. Consequently, all user requests are served with an error message until the failure is resolved, either by restoring the network connection or by disabling cross-site replication.
In such scenarios, a quorum is commonly used to determine which sites are marked as online or offline. However, because multi-site deployments consist of only two sites, this is not possible. Instead, we use “fencing” to ensure that when one of the sites is unable to connect to the other site, only one site remains in the load balancer configuration, and hence only this site is able to serve subsequent user requests.
In addition to the load balancer configuration, the fencing procedure disables replication between the two Data Grid clusters to allow serving user requests from the site that remains in the load balancer configuration. As a result, the sites will be out-of-sync once the replication has been disabled.
To recover from the out-of-sync state, a manual re-sync is necessary, as described in Synchronizing sites. This is why a site that is removed via fencing is not re-added automatically when the network communication failure is resolved. The removed site should only be re-added once the two sites have been synchronized, using the procedure outlined in Bringing a site online.
In this chapter we describe how to implement fencing using a combination of Prometheus Alerts and AWS Lambda functions. A Prometheus Alert is triggered when split-brain is detected by the Data Grid server metrics, which results in the Prometheus AlertManager calling the AWS Lambda based webhook. The triggered Lambda function inspects the current Global Accelerator configuration and removes the site reported to be offline.
In a true split-brain scenario, where both sites are still up but network communication is down, it is possible that both sites will trigger the webhook simultaneously. We guard against this by ensuring that only a single Lambda instance can be executed at a given time. The logic in the AWS Lambda ensures that one site entry always remains in the load balancer configuration.
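The guarantee that one site always remains can be illustrated with a small sketch. The following Python function is not the fencing logic shipped with this deployment. It is a minimal illustration that assumes the endpoint group is represented as a list of dictionaries with hypothetical site and endpoint_id keys, and that invocations are already serialized as described above.

```python
from typing import Dict, List


def endpoints_after_fencing(
    endpoints: List[Dict[str, str]], offline_site: str
) -> List[Dict[str, str]]:
    """Return the endpoints that should remain after fencing offline_site.

    The "site" and "endpoint_id" keys are hypothetical names used only
    for this illustration.
    """
    remaining = [e for e in endpoints if e["site"] != offline_site]
    # Never return an empty configuration. If removing the reported site
    # would leave nothing (for example, the other site was already fenced
    # by an earlier alert), keep the current configuration unchanged.
    if not remaining:
        return list(endpoints)
    return remaining


if __name__ == "__main__":
    both = [
        {"site": "site-a", "endpoint_id": "nlb-a"},
        {"site": "site-b", "endpoint_id": "nlb-b"},
    ]
    # site-a reports site-b as offline: site-b is removed.
    first = endpoints_after_fencing(both, "site-b")
    # site-b reports site-a as offline afterwards: nothing changes.
    second = endpoints_after_fencing(first, "site-a")
    print(first)
    print(second)
```

In a true split-brain, the second alert arrives after the first one has been processed, so the guard leaves the surviving site in place instead of removing it.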
12.2. Prerequisites
- ROSA HCP based multi-site Keycloak deployment
- AWS CLI installed
- AWS Global Accelerator load balancer
- jq tool installed
12.3. Procedure
Enable OpenShift user alert routing
Command:
Decide upon a username/password combination which will be used to authenticate the Lambda webhook and create an AWS Secret storing the password
Command:
aws secretsmanager create-secret \
  --name webhook-password \
  --secret-string changeme \
  --region eu-west-1
Create the Role used to execute the Lambda.
Command:
Create and attach the 'LambdaSecretManager' Policy so that the Lambda can access AWS Secrets
Command:
Attach the ElasticLoadBalancingReadOnly policy so that the Lambda can query the provisioned Network Load Balancers
Command:
aws iam attach-role-policy \
  --role-name ${FUNCTION_NAME} \
  --policy-arn arn:aws:iam::aws:policy/ElasticLoadBalancingReadOnly
Attach the GlobalAcceleratorFullAccess policy so that the Lambda can update the Global Accelerator EndpointGroup
Command:
aws iam attach-role-policy \
  --role-name ${FUNCTION_NAME} \
  --policy-arn arn:aws:iam::aws:policy/GlobalAcceleratorFullAccess
Create a Lambda ZIP file containing the required fencing logic
Command:
Create the Lambda function.
Command:
- 1: The AWS Region hosting your Kubernetes clusters
Expose a Function URL so the Lambda can be triggered as a webhook
Command:
aws lambda create-function-url-config \
  --function-name ${FUNCTION_NAME} \
  --auth-type NONE \
  --region eu-west-1
- 1 (--region): The AWS Region hosting your Kubernetes clusters
Allow public invocations of the Function URL
Command:
- 1: The AWS Region hosting your Kubernetes clusters
Configure the Lambda’s Environment variables:
In each Kubernetes cluster, retrieve the exposed Data Grid URL endpoint:
oc -n ${NAMESPACE} get route infinispan-external -o jsonpath='{.status.ingress[].host}'
- 1: Replace ${NAMESPACE} with the namespace containing your Data Grid server
Upload the desired Environment variables
- 1: The name of the AWS Global Accelerator used by your deployment
- 2: The AWS Region hosting your Kubernetes cluster and Lambda function
- 3: The name of one of your Data Grid sites as defined in Deploying Data Grid for HA with the Data Grid Operator
- 4: The Data Grid endpoint URL associated with the CLUSTER_1_NAME site
- 5: The name of the second Data Grid site
- 6: The Data Grid endpoint URL associated with the CLUSTER_2_NAME site
- 7: The username of a Data Grid user which has sufficient privileges to perform REST requests on the server
- 8: The name of the AWS secret containing the password associated with the Data Grid user
- 9: The username used to authenticate requests to the Lambda Function
- 10: The name of the AWS secret containing the password used to authenticate requests to the Lambda function
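Because the Function URL is exposed without AWS authentication, the webhook username and password are the only check on incoming requests. The sketch below shows one way such a check could be written with HTTP Basic authentication. It is an assumption-laden illustration, not the deployed Lambda code: it assumes the request headers arrive as a mapping with lowercase names, and it takes the expected password as an argument instead of reading it from AWS Secrets Manager. The function name is_authorized and the example user name are hypothetical.

```python
import base64
import hmac
from typing import Mapping


def is_authorized(
    headers: Mapping[str, str], expected_user: str, expected_password: str
) -> bool:
    """Check an HTTP Basic 'authorization' header against expected credentials."""
    header = headers.get("authorization", "")
    scheme, _, encoded = header.partition(" ")
    if scheme.lower() != "basic" or not encoded:
        return False
    try:
        # validate=True rejects characters outside the base64 alphabet.
        decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    except ValueError:
        # Covers both invalid base64 and invalid UTF-8.
        return False
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # compare_digest avoids leaking information through timing differences.
    user_ok = hmac.compare_digest(user.encode(), expected_user.encode())
    password_ok = hmac.compare_digest(password.encode(), expected_password.encode())
    return user_ok and password_ok


if __name__ == "__main__":
    # "keycloak" is a hypothetical webhook user name.
    token = base64.b64encode(b"keycloak:changeme").decode()
    print(is_authorized({"authorization": "Basic " + token}, "keycloak", "changeme"))
```

A handler would typically reject the request with an HTTP 401 or 403 response when this check fails, before touching the load balancer configuration.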
Retrieve the Lambda Function URL
Command:
aws lambda get-function-url-config \
  --function-name ${FUNCTION_NAME} \
  --query "FunctionUrl" \
  --region eu-west-1 \
  --output text
- 1 (--region): The AWS region where the Lambda was created
Output:
https://tjqr2vgc664b6noj6vugprakoq0oausj.lambda-url.eu-west-1.on.aws
In each Kubernetes cluster, configure Prometheus Alert routing to trigger the Lambda on split-brain
Command:
- 1: The username required to authenticate Lambda requests
- 2: The password required to authenticate Lambda requests
- 3: The Lambda Function URL
- 4: The namespace value should be the namespace hosting the Infinispan CR and the site should be the remote site defined by spec.service.sites.locations[0].name in your Infinispan CR
- 5: The name of your local site defined by spec.service.sites.local.name in your Infinispan CR
- 6: The DNS of your Global Accelerator
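When the alert fires, AlertManager sends the webhook a JSON document containing an alerts array, where each alert carries a status and a set of labels. The sketch below shows how a receiver might extract the sites reported as offline from such a body. It is an illustration under assumptions: the alert name SiteOffline is taken from the verification steps in this chapter, while the use of a label called site to carry the remote site name is assumed here and should be checked against your alert rule. The function name offline_sites is hypothetical.

```python
import json
from typing import List


def offline_sites(body: str, alert_name: str = "SiteOffline") -> List[str]:
    """Return the sites named by firing alerts in an AlertManager webhook body.

    The "site" label name is an assumption made for this illustration.
    """
    payload = json.loads(body)
    sites: List[str] = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        # Ignore resolved alerts and alerts with a different name.
        if alert.get("status") != "firing":
            continue
        if labels.get("alertname") != alert_name:
            continue
        site = labels.get("site")
        if site and site not in sites:
            sites.append(site)
    return sites


if __name__ == "__main__":
    example = json.dumps(
        {
            "status": "firing",
            "alerts": [
                {
                    "status": "firing",
                    "labels": {"alertname": "SiteOffline", "site": "site-b"},
                }
            ],
        }
    )
    print(offline_sites(example))
```

Filtering on the firing status matters because AlertManager can also notify the same webhook when an alert resolves, and a resolved alert should not cause a site to be fenced.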
12.4. Verify
To test that the Prometheus alert triggers the webhook as expected, perform the following steps to simulate a split-brain:
In each of your clusters execute the following:
Command:
oc -n openshift-operators scale --replicas=0 deployment/infinispan-operator-controller-manager
oc -n openshift-operators rollout status -w deployment/infinispan-operator-controller-manager
oc -n ${NAMESPACE} scale --replicas=0 deployment/infinispan-router
oc -n ${NAMESPACE} rollout status -w deployment/infinispan-router
- Verify the SiteOffline event has been fired on a cluster by inspecting the Observe → Alerting menu in the OpenShift console
- Inspect the Global Accelerator EndpointGroup in the AWS console; there should only be a single endpoint present
Scale up the Data Grid Operator and Gossip Router to re-establish a connection between sites:
Command:
oc -n openshift-operators scale --replicas=1 deployment/infinispan-operator-controller-manager
oc -n openshift-operators rollout status -w deployment/infinispan-operator-controller-manager
oc -n ${NAMESPACE} scale --replicas=1 deployment/infinispan-router
oc -n ${NAMESPACE} rollout status -w deployment/infinispan-router
- 1: Replace ${NAMESPACE} with the namespace containing your Data Grid server
- Inspect the vendor_jgroups_site_view_status metric in each site. A value of 1 indicates that the site is reachable.
- Update the Accelerator EndpointGroup to contain both Endpoints. See the Bringing a site online chapter for details.
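If you scrape the metric as text rather than viewing it in a console, the check for a value of 1 can be automated. The sketch below parses Prometheus text-format samples of vendor_jgroups_site_view_status and reports, per site, whether the value is 1. It is a minimal illustration and assumes the samples carry a label named site. How you obtain the metrics text is left out, and the function name site_reachability is hypothetical.

```python
import re
from typing import Dict

# Matches one sample line of the metric, capturing its labels and value.
_SAMPLE = re.compile(
    r"^vendor_jgroups_site_view_status\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)"
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def site_reachability(metrics_text: str) -> Dict[str, bool]:
    """Map each site label to True when its sample value equals 1."""
    result: Dict[str, bool] = {}
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match:
            # Skips comments (# HELP, # TYPE) and unrelated metrics.
            continue
        labels = dict(_LABEL.findall(match.group("labels")))
        site = labels.get("site")  # assumed label name
        if site is None:
            continue
        try:
            result[site] = float(match.group("value")) == 1.0
        except ValueError:
            result[site] = False
    return result


if __name__ == "__main__":
    sample = (
        "# TYPE vendor_jgroups_site_view_status gauge\n"
        'vendor_jgroups_site_view_status{site="site-a"} 1.0\n'
        'vendor_jgroups_site_view_status{site="site-b"} 0.0\n'
    )
    print(site_reachability(sample))
```

Both sites should report as reachable before you add the removed endpoint back to the Accelerator EndpointGroup.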