14.3. Procedure
Enable OpenShift user alert routing
Command:
oc apply -f - << EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      enabled: true
      enableAlertmanagerConfig: true
EOF
oc -n openshift-user-workload-monitoring rollout status --watch statefulset.apps/alertmanager-user-workload
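Decide on a username/password combination that will be used to authenticate the Lambda webhook, and create an AWS Secret storing the password
Command:
aws secretsmanager create-secret \
  --name webhook-password \
  --secret-string changeme \
  --region eu-west-1
- --name: the name of the secret
- --secret-string: the password to be used for authentication
- --region: the AWS region that hosts the secret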
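Create the Role used to execute the Lambda.
Command:
FUNCTION_NAME=
ROLE_ARN=$(aws iam create-role \
  --role-name ${FUNCTION_NAME} \
  --assume-role-policy-document \
  '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": "lambda.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }
    ]
  }' \
  --query 'Role.Arn' \
  --region eu-west-1 \
  --output text
)
- FUNCTION_NAME: a name of your choice to associate with the Lambda and its related resources
- --region: the AWS region hosting your Kubernetes clusters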
Create and attach the 'LambdaSecretManager' policy so that the Lambda can access AWS Secrets
Command:
POLICY_ARN=$(aws iam create-policy \
  --policy-name LambdaSecretManager \
  --policy-document \
  '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "secretsmanager:GetSecretValue"
        ],
        "Resource": "*"
      }
    ]
  }' \
  --query 'Policy.Arn' \
  --output text
)
aws iam attach-role-policy \
  --role-name ${FUNCTION_NAME} \
  --policy-arn ${POLICY_ARN}
Attach the ElasticLoadBalancingReadOnly policy so that the Lambda can query the provisioned Network Load Balancers
Command:
aws iam attach-role-policy \
  --role-name ${FUNCTION_NAME} \
  --policy-arn arn:aws:iam::aws:policy/ElasticLoadBalancingReadOnly
Attach the GlobalAcceleratorFullAccess policy so that the Lambda can update the Global Accelerator EndpointGroup
Command:
aws iam attach-role-policy \
  --role-name ${FUNCTION_NAME} \
  --policy-arn arn:aws:iam::aws:policy/GlobalAcceleratorFullAccess
Create a Lambda ZIP file containing the required fencing logic
Command:
LAMBDA_ZIP=/tmp/lambda.zip
cat << EOF > /tmp/lambda.py
import boto3
import jmespath
import json
import os
import urllib3

from base64 import b64decode
from urllib.parse import unquote
# urllib3's HTTPError is a plain Exception subclass that accepts a single message
from urllib3.exceptions import HTTPError

# Prevent unverified HTTPS connection warning
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)


class MissingEnvironmentVariable(Exception):
    pass


class MissingSiteUrl(Exception):
    pass


class DecodeError(Exception):
    pass


def env(name):
    if name in os.environ:
        return os.environ[name]
    raise MissingEnvironmentVariable(f"Environment Variable '{name}' must be set")


def handle_site_offline(labels):
    a_client = boto3.client('globalaccelerator', region_name='us-west-2')

    acceleratorDNS = labels['accelerator']
    accelerator = jmespath.search(f"Accelerators[?(DnsName=='{acceleratorDNS}'|| DualStackDnsName=='{acceleratorDNS}')]", a_client.list_accelerators())
    if not accelerator:
        print(f"Ignoring SiteOffline alert as accelerator with DnsName '{acceleratorDNS}' not found")
        return

    accelerator_arn = accelerator[0]['AcceleratorArn']
    listener_arn = a_client.list_listeners(AcceleratorArn=accelerator_arn)['Listeners'][0]['ListenerArn']

    endpoint_group = a_client.list_endpoint_groups(ListenerArn=listener_arn)['EndpointGroups'][0]
    endpoints = endpoint_group['EndpointDescriptions']

    # Only update accelerator endpoints if two entries exist
    if len(endpoints) > 1:
        # If the reporter endpoint is not healthy then do nothing for now
        # A Lambda will eventually be triggered by the other offline site for this reporter
        reporter = labels['reporter']
        reporter_endpoint = [e for e in endpoints if endpoint_belongs_to_site(e, reporter)][0]
        if reporter_endpoint['HealthState'] == 'UNHEALTHY':
            print(f"Ignoring SiteOffline alert as reporter '{reporter}' endpoint is marked UNHEALTHY")
            return

        offline_site = labels['site']
        endpoints = [e for e in endpoints if not endpoint_belongs_to_site(e, offline_site)]
        # HealthState is read-only and must not be sent back in the update
        del reporter_endpoint['HealthState']
        a_client.update_endpoint_group(
            EndpointGroupArn=endpoint_group['EndpointGroupArn'],
            EndpointConfigurations=endpoints
        )
        print(f"Removed site={offline_site} from Accelerator EndpointGroup")

        take_infinispan_site_offline(reporter, offline_site)
        print(f"Backup site={offline_site} caches taken offline")
    else:
        print("Ignoring SiteOffline alert only one Endpoint defined in the EndpointGroup")


def endpoint_belongs_to_site(endpoint, site):
    lb_arn = endpoint['EndpointId']
    region = lb_arn.split(':')[3]
    client = boto3.client('elbv2', region_name=region)
    tags = client.describe_tags(ResourceArns=[lb_arn])['TagDescriptions'][0]['Tags']
    for tag in tags:
        if tag['Key'] == 'site':
            return tag['Value'] == site
    return False


def take_infinispan_site_offline(reporter, offlinesite):
    endpoints = json.loads(INFINISPAN_SITE_ENDPOINTS)
    if reporter not in endpoints:
        raise MissingSiteUrl(f"Missing URL for site '{reporter}' in 'INFINISPAN_SITE_ENDPOINTS' json")

    endpoint = endpoints[reporter]
    password = get_secret(INFINISPAN_USER_SECRET)
    url = f"https://{endpoint}/rest/v2/container/x-site/backups/{offlinesite}?action=take-offline"
    http = urllib3.PoolManager(cert_reqs='CERT_NONE')
    headers = urllib3.make_headers(basic_auth=f"{INFINISPAN_USER}:{password}")
    try:
        rsp = http.request("POST", url, headers=headers)
        if rsp.status >= 400:
            raise HTTPError(f"Unexpected response status '{rsp.status}' when taking site offline")
        rsp.release_conn()
    except HTTPError as e:
        print(f"HTTP error encountered: {e}")


def get_secret(secret_name):
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=SECRETS_REGION
    )
    return client.get_secret_value(SecretId=secret_name)['SecretString']


def decode_basic_auth_header(encoded_str):
    split = encoded_str.strip().split(' ')
    if len(split) == 2:
        if split[0].strip().lower() == 'basic':
            try:
                username, password = b64decode(split[1]).decode().split(':', 1)
            except ValueError:
                # Covers invalid base64, invalid UTF-8 and a missing ':' separator
                raise DecodeError
        else:
            raise DecodeError
    else:
        raise DecodeError

    return unquote(username), unquote(password)


def handler(event, context):
    print(json.dumps(event))

    authorization = event['headers'].get('authorization')
    if authorization is None:
        print("'Authorization' header missing from request")
        return {
            "statusCode": 401
        }

    expectedPass = get_secret(WEBHOOK_USER_SECRET)
    username, password = decode_basic_auth_header(authorization)
    # Reject the request if either the username or the password does not match
    if username != WEBHOOK_USER or password != expectedPass:
        print('Invalid username/password combination')
        return {
            "statusCode": 403
        }

    body = event.get('body')
    if body is None:
        raise Exception('Empty request body')

    body = json.loads(body)
    print(json.dumps(body))

    if body['status'] != 'firing':
        print("Ignoring alert as status is not 'firing', status was: '%s'" % body['status'])
        return {
            "statusCode": 204
        }

    for alert in body['alerts']:
        labels = alert['labels']
        if labels['alertname'] == 'SiteOffline':
            handle_site_offline(labels)

    return {
        "statusCode": 204
    }


INFINISPAN_USER = env('INFINISPAN_USER')
INFINISPAN_USER_SECRET = env('INFINISPAN_USER_SECRET')
INFINISPAN_SITE_ENDPOINTS = env('INFINISPAN_SITE_ENDPOINTS')
SECRETS_REGION = env('SECRETS_REGION')
WEBHOOK_USER = env('WEBHOOK_USER')
WEBHOOK_USER_SECRET = env('WEBHOOK_USER_SECRET')
EOF
zip -FS --junk-paths ${LAMBDA_ZIP} /tmp/lambda.py
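The endpoint update inside handle_site_offline can be reasoned about without calling AWS. The following is a minimal sketch, not the shipped implementation: remaining_endpoint_configs and the site_of callable are hypothetical names, the ARNs are made up, and the tag lookup that the Lambda performs through elbv2 describe_tags is replaced by a plain dictionary. It shows which entries would be passed as EndpointConfigurations once the offline site is removed, copying only writable keys so that read-only fields such as HealthState are never sent back.

```python
# Keys assumed to be accepted in each EndpointConfigurations entry.
CONFIG_KEYS = ("EndpointId", "Weight", "ClientIPPreservationEnabled")


def remaining_endpoint_configs(endpoint_descriptions, site_of, offline_site):
    """Return configurations for every endpoint not belonging to offline_site.

    site_of is a hypothetical callable that maps an endpoint description to
    the value of its 'site' tag.
    """
    configs = []
    for endpoint in endpoint_descriptions:
        if site_of(endpoint) == offline_site:
            continue  # drop the load balancer of the site being fenced
        # Copy only writable keys; HealthState and similar fields are read-only.
        configs.append({k: endpoint[k] for k in CONFIG_KEYS if k in endpoint})
    return configs


# Illustrative data shaped like an EndpointGroup's 'EndpointDescriptions'.
endpoints = [
    {
        "EndpointId": "arn:aws:elasticloadbalancing:eu-west-1:111111111111:loadbalancer/net/site-a/aaaa",
        "Weight": 128,
        "HealthState": "HEALTHY",
        "ClientIPPreservationEnabled": False,
    },
    {
        "EndpointId": "arn:aws:elasticloadbalancing:eu-west-1:111111111111:loadbalancer/net/site-b/bbbb",
        "Weight": 128,
        "HealthState": "UNHEALTHY",
        "ClientIPPreservationEnabled": False,
    },
]
site_tags = {
    endpoints[0]["EndpointId"]: "site-a",
    endpoints[1]["EndpointId"]: "site-b",
}

result = remaining_endpoint_configs(endpoints, lambda e: site_tags[e["EndpointId"]], "site-b")
print(result)
```

Building the list from an explicit set of keys is a design choice of this sketch; it avoids depending on which optional fields a given endpoint description happens to contain.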
Create the Lambda function.
Command:
aws lambda create-function \
  --function-name ${FUNCTION_NAME} \
  --zip-file fileb://${LAMBDA_ZIP} \
  --handler lambda.handler \
  --runtime python3.12 \
  --role ${ROLE_ARN} \
  --region eu-west-1
- --region: the AWS region hosting your Kubernetes clusters
Expose a Function URL so that the Lambda can be triggered as a webhook
Command:
aws lambda create-function-url-config \
  --function-name ${FUNCTION_NAME} \
  --auth-type NONE \
  --region eu-west-1
- --region: the AWS region hosting your Kubernetes clusters
Allow public invocations of the Function URL
Command:
aws lambda add-permission \
  --action "lambda:InvokeFunctionUrl" \
  --function-name ${FUNCTION_NAME} \
  --principal "*" \
  --statement-id FunctionURLAllowPublicAccess \
  --function-url-auth-type NONE \
  --region eu-west-1
- --region: the AWS region hosting your Kubernetes clusters
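Because the Function URL is created with auth type NONE and is publicly invocable, the HTTP Basic check performed by the handler is the only thing standing between a caller and the fencing logic. The sketch below is an illustration under assumptions rather than the deployed code: is_authorized is a hypothetical helper, the credentials are the placeholder values used in this procedure, and hmac.compare_digest is swapped in for plain string comparison to avoid timing differences. A request is accepted only when both the username and the password match.

```python
import hmac
from base64 import b64decode, b64encode
from urllib.parse import unquote


class DecodeError(Exception):
    """Raised when an Authorization header is not valid HTTP Basic auth."""


def decode_basic_auth_header(header):
    parts = header.strip().split(' ')
    if len(parts) != 2 or parts[0].lower() != 'basic':
        raise DecodeError("expected 'Basic <base64>'")
    try:
        # ValueError covers bad base64, bad UTF-8 and a missing ':' separator.
        username, password = b64decode(parts[1]).decode().split(':', 1)
    except ValueError as e:
        raise DecodeError("malformed credentials") from e
    return unquote(username), unquote(password)


def is_authorized(header, expected_user, expected_password):
    """Hypothetical helper: True only if both username and password match."""
    try:
        username, password = decode_basic_auth_header(header)
    except DecodeError:
        return False
    user_ok = hmac.compare_digest(username.encode(), expected_user.encode())
    pass_ok = hmac.compare_digest(password.encode(), expected_password.encode())
    return user_ok and pass_ok


# Placeholder credentials matching the example values in this procedure.
header = "Basic " + b64encode(b"keycloak:changeme").decode()
print(is_authorized(header, "keycloak", "changeme"))
print(is_authorized(header, "keycloak", "wrong-password"))
print(is_authorized("Bearer abc", "keycloak", "changeme"))
```

Returning False for a malformed header, instead of letting the exception propagate, lets a caller answer with a 4xx status rather than an unhandled error.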
Configure the Lambda's environment variables:
In each Kubernetes cluster, retrieve the exposed Data Grid URL endpoint:
oc -n ${NAMESPACE} get route infinispan-external -o jsonpath='{.status.ingress[].host}'
- Replace ${NAMESPACE} with the namespace containing your Data Grid server
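Upload the desired environment variables
ACCELERATOR_NAME=
LAMBDA_REGION=
CLUSTER_1_NAME=
CLUSTER_1_ISPN_ENDPOINT=
CLUSTER_2_NAME=
CLUSTER_2_ISPN_ENDPOINT=
INFINISPAN_USER=
INFINISPAN_USER_SECRET=
WEBHOOK_USER=
WEBHOOK_USER_SECRET=

INFINISPAN_SITE_ENDPOINTS=$(echo "{\"${CLUSTER_1_NAME}\":\"${CLUSTER_1_ISPN_ENDPOINT}\",\"${CLUSTER_2_NAME}\":\"${CLUSTER_2_ISPN_ENDPOINT}\"}" | jq tostring)
aws lambda update-function-configuration \
  --function-name ${ACCELERATOR_NAME} \
  --region ${LAMBDA_REGION} \
  --environment "{
    \"Variables\": {
      \"INFINISPAN_USER\" : \"${INFINISPAN_USER}\",
      \"INFINISPAN_USER_SECRET\" : \"${INFINISPAN_USER_SECRET}\",
      \"INFINISPAN_SITE_ENDPOINTS\" : ${INFINISPAN_SITE_ENDPOINTS},
      \"WEBHOOK_USER\" : \"${WEBHOOK_USER}\",
      \"WEBHOOK_USER_SECRET\" : \"${WEBHOOK_USER_SECRET}\",
      \"SECRETS_REGION\" : \"eu-west-1\"
    }
  }"
- ACCELERATOR_NAME: the name of the AWS Global Accelerator used by your deployment. The command passes this value as --function-name, so it must match the FUNCTION_NAME chosen when the Lambda was created
- LAMBDA_REGION: the AWS region hosting your Kubernetes clusters and the Lambda function
- CLUSTER_1_NAME: the name of one of your Data Grid sites as defined in Deploy Data Grid for HA with the Data Grid Operator
- CLUSTER_1_ISPN_ENDPOINT: the Data Grid endpoint URL associated with the CLUSTER_1_NAME site
- CLUSTER_2_NAME: the name of the second Data Grid site
- CLUSTER_2_ISPN_ENDPOINT: the Data Grid endpoint URL associated with the CLUSTER_2_NAME site
- INFINISPAN_USER: the username of a Data Grid user with sufficient privileges to perform REST requests on the server
- INFINISPAN_USER_SECRET: the name of the AWS secret containing the password associated with the Data Grid user
- WEBHOOK_USER: the username used to authenticate requests to the Lambda function
- WEBHOOK_USER_SECRET: the name of the AWS secret containing the password used to authenticate requests to the Lambda function
- SECRETS_REGION: the AWS region that hosts the secrets; it must be the region in which the secrets were created (eu-west-1 in the create-secret step of this procedure)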
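The `jq tostring` step exists because Lambda environment variable values must be strings, so the site-to-endpoint map has to be embedded as JSON text inside the outer JSON document passed to `--environment`. The sketch below reproduces that double encoding in Python to make the shape of the payload visible. All values are hypothetical placeholders (host names, user names, secret names, region); substitute those of your own deployment.

```python
import json

# Hypothetical example values, not taken from a real deployment.
site_endpoints = {
    "site-a": "infinispan-external-site-a.example.com",
    "site-b": "infinispan-external-site-b.example.com",
}

environment = {
    "Variables": {
        "INFINISPAN_USER": "developer",
        "INFINISPAN_USER_SECRET": "infinispan-password",
        # Inner encoding: the map becomes a JSON string, as `jq tostring` does.
        "INFINISPAN_SITE_ENDPOINTS": json.dumps(site_endpoints),
        "WEBHOOK_USER": "keycloak",
        "WEBHOOK_USER_SECRET": "webhook-password",
        "SECRETS_REGION": "eu-west-1",
    }
}

# Outer encoding: this text has the shape of the --environment argument.
environment_arg = json.dumps(environment)
print(environment_arg)

# The Lambda undoes the inner encoding with json.loads(INFINISPAN_SITE_ENDPOINTS).
stored_value = json.loads(environment_arg)["Variables"]["INFINISPAN_SITE_ENDPOINTS"]
decoded = json.loads(stored_value)
print(decoded["site-a"])
```

Generating the argument this way sidesteps the manual quote escaping in the shell command, which is where typos are easiest to make.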
Retrieve the Lambda Function URL
Command:
aws lambda get-function-url-config \
  --function-name ${FUNCTION_NAME} \
  --query "FunctionUrl" \
  --region eu-west-1 \
  --output text
- --region: the AWS region where the Lambda was created
Output:
https://tjqr2vgc664b6noj6vugprakoq0oausj.lambda-url.eu-west-1.on.aws
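To see what the Lambda expects at this URL, it can help to construct a request by hand. The following sketch only builds the request object and does not send it; build_test_alert is a hypothetical helper, the credentials are placeholders, and the payload contains just the fields the handler reads (status, and the alertname, site, reporter and accelerator labels), which is an assumption about the minimum needed rather than the full Alertmanager webhook format.

```python
import json
import urllib.request
from base64 import b64encode

# Example Function URL from the output above; replace it with your own.
FUNCTION_URL = "https://tjqr2vgc664b6noj6vugprakoq0oausj.lambda-url.eu-west-1.on.aws/"


def build_test_alert(function_url, username, password, site, reporter, accelerator):
    """Hypothetical helper: build (but do not send) a SiteOffline webhook call."""
    payload = {
        "status": "firing",
        "alerts": [
            {
                "labels": {
                    "alertname": "SiteOffline",
                    "site": site,
                    "reporter": reporter,
                    "accelerator": accelerator,
                }
            }
        ],
    }
    token = b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        function_url,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_test_alert(
    FUNCTION_URL,
    username="keycloak",   # placeholder webhook user
    password="changeme",   # placeholder webhook password
    site="site-b",
    reporter="site-a",
    accelerator="a3da6a6cbd4e27b02.awsglobalaccelerator.com",
)
print(req.get_method(), req.full_url)
print(req.data.decode())
```

Sending the request (for example with urllib.request.urlopen(req)) against a live deployment would trigger real fencing of the named site, so it should only be done deliberately and in a test environment.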
In each Kubernetes cluster, configure Prometheus Alert routing to trigger the Lambda on split-brain
Command:
NAMESPACE= # The namespace containing your deployments
oc apply -n ${NAMESPACE} -f - << EOF
apiVersion: v1
kind: Secret
type: kubernetes.io/basic-auth
metadata:
  name: webhook-credentials
stringData:
  username: 'keycloak' # <1>
  password: 'changeme' # <2>
---
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: example-routing
spec:
  route:
    receiver: default
    groupBy:
      - accelerator
    groupInterval: 90s
    groupWait: 60s
    matchers:
      - matchType: =
        name: alertname
        value: SiteOffline
  receivers:
    - name: default
      webhookConfigs:
        - url: 'https://tjqr2vgc664b6noj6vugprakoq0oausj.lambda-url.eu-west-1.on.aws/' # <3>
          httpConfig:
            basicAuth:
              username:
                key: username
                name: webhook-credentials
              password:
                key: password
                name: webhook-credentials
            tlsConfig:
              insecureSkipVerify: true
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: xsite-status
spec:
  groups:
    - name: xsite-status
      rules:
        - alert: SiteOffline
          expr: 'min by (namespace, site) (vendor_jgroups_site_view_status{namespace="default",site="site-b"}) == 0' # <4>
          labels:
            severity: critical
            reporter: site-a # <5>
            accelerator: a3da6a6cbd4e27b02.awsglobalaccelerator.com # <6>
EOF