F5 BIG-IP deployment with OpenShift - platform and networking options
Introduction

This article is an architectural overview of how F5 BIG-IP can be used with Red Hat OpenShift. Several topics are covered, including:

- 1-tier or 2-tier arrangements, where the BIG-IP load balances the workload PODs directly or load balances ingress controllers (such as NGINX+ or OpenShift's built-in router), respectively.
- Multi-cluster arrangements, where the BIG-IP can load balance or do route sharding across two or more clusters.
- Multi-tenancy and IP address management options.

While this article has a NetOps/infrastructure focus, the follow-up article "BIG-IP deployment with OpenShift - application publishing" focuses on DevOps/applications.

Overall architecture

When using BIG-IP with Red Hat OpenShift, the Container Ingress Services (CIS from now on) container is used to connect the BIG-IP APIs with the Kubernetes APIs. The source of truth is OpenShift: when a user configuration is applied or when a change occurs in the OpenShift cluster, CIS automatically updates the configuration in the BIG-IP. Under the hood, CIS updates the BIG-IP configuration using the AS3 declarative API. It is not necessary to know AS3, as all the configuration can be applied using Kubernetes resource types.

IP Address Management (IPAM from now on) is important when the DevOps teams are expected to operate independently from the infrastructure administrators. CIS supports IPAM by making use of the F5 IPAM Controller (FIC from now on), which is deployed as a container as well.

The next picture shows how these components fit together. CIS and FIC are PODs deployed in the OpenShift cluster, and AS3 is deployed in the BIG-IP.

In the next sections, we cover the different deployment options and the considerations to be taken into account. The full documentation can be found in F5 clouddocs.
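To give a feel for what "declarative" means in the architecture described above, the following is a minimal sketch of an AS3-style declaration built from a list of POD IPs. It is not the declaration that CIS actually generates. The class names (AS3, ADC, Tenant, Application, Service_HTTP, Pool) follow the public AS3 schema, while the function name, tenant and application names, and all addresses are hypothetical and chosen only for illustration.

```python
import json


def build_declaration(tenant, app, virtual_ip, port, pod_ips):
    """Return an AS3-style declaration describing one HTTP service.

    Hypothetical helper: it only illustrates that the whole desired
    state is sent every time, rather than individual add/remove calls.
    """
    return {
        "class": "AS3",
        "action": "deploy",
        "declaration": {
            "class": "ADC",
            "schemaVersion": "3.0.0",
            tenant: {
                "class": "Tenant",
                app: {
                    "class": "Application",
                    "template": "generic",
                    "vs": {
                        "class": "Service_HTTP",
                        "virtualAddresses": [virtual_ip],
                        "virtualPort": port,
                        "pool": "pool",
                    },
                    "pool": {
                        "class": "Pool",
                        "members": [
                            {
                                "servicePort": port,
                                # Sorted so the same set of PODs always
                                # yields the same declaration.
                                "serverAddresses": sorted(pod_ips),
                            }
                        ],
                    },
                },
            },
        },
    }


if __name__ == "__main__":
    # Example addresses only; a POD was added between the two states.
    before = build_declaration("demo", "web", "192.0.2.10", 80, ["10.128.0.5"])
    after = build_declaration(
        "demo", "web", "192.0.2.10", 80, ["10.128.0.5", "10.129.0.7"]
    )
    print(json.dumps(after, indent=2))
    print("changed:", before != after)
```

When a POD is added or removed, the complete desired state is rebuilt and submitted again, and the receiving side works out the difference. This is why the cluster can remain the source of truth.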
F5 BIG-IP container integrations are Open Source Software (OSS) and can be found in this GitHub repository, where you will find additional technical details and examples.

Networking - CNI options

Kubernetes' networking is provided by Container Networking Interface plugins (CNI from now on), and F5 BIG-IP supports all of OpenShift's native CNIs:

- OVNKubernetes - This is the preferred option. GA since OpenShift 4.6, it makes use of Geneve encapsulation, but BIG-IP interacts with this CNI in a routed mode in which the packets from/to the BIG-IP don't use encapsulation. Additionally, the PODs' cluster IPs are discovered dynamically by CIS when OpenShift nodes are added or removed. The latter also makes this method the easiest from a BIG-IP management point of view. Check "CIS configuration for OVNKubernetes" for details.
- OpenShiftSDN - Supported since OpenShift 3.x, it is being phased out in favour of OVNKubernetes. It makes use of VXLAN encapsulation between the nodes, and between the nodes and the BIG-IPs. This requires manual configuration of VXLAN tunnels in the BIG-IPs when OpenShift nodes are added or removed. Check "CIS configuration for OpenShiftSDN" for details.

Feature-wise, these CNIs can be compared using the next table from the OpenShift documentation.

Besides the above features, performance should also be taken into consideration. The NICs used in the OpenShift cluster should do encapsulation off-loading to reduce the CPU load in the nodes. Increasing the MTU is recommended, especially for encapsulating CNIs; this is suggested in OpenShift's documentation as well, and needs to be set at installation time in the install-config.yaml file. See this OpenShift.com link for details.

Networking - the importance of supporting clusters' CNI

There are basically three modes to interact with a Kubernetes workload from outside the cluster:

- Using NodePort Service type. In this case, external hosts access the PODs using any of the cluster's node IPs.
When a request reaches a node, Kubernetes' kube-proxy is responsible for forwarding the request to a POD in the local node or in a remote node. Sending to a remote node adds noticeable overhead. In 2-tier deployments, externalTrafficPolicy: Local could be used, together with appropriate monitoring, to avoid this additional hop. NodePort is popular for other external load balancers because it is an easy method to access the PODs without having to support the CNI; as the name indicates, it uses the IP addresses of the Kubernetes nodes. This has the drawback of an additional indirection. This drawback is especially relevant for 1-tier deployments because application PODs cannot be accessed directly, eliminating the advantages of this deployment type. BIG-IP, on the other hand, supports OpenShift's CNIs, both OpenShiftSDN and OVNKubernetes.
- Using LoadBalancer Service type. The packet path in this mode is equivalent to NodePort: the external load balancers need an intermediate kube-proxy hop before reaching the POD. An alternative for bypassing kube-proxy is the use of hostNetwork access, but this is discouraged in general because of its security implications.
- Using ClusterIP Service type. This is the preferred mode because a request is sent directly to the destination POD. This requires supporting OpenShift's CNIs, which is the case for BIG-IP. It is worth noting that BIG-IP also supports other CNIs, such as Calico or Cilium. This arrangement can be seen next.

Please note in the above figure the traffic path from the BIG-IP, where the arrow reaches the inside of the CNI area. This indicates that the BIG-IP can address the ingress controllers' or the workload PODs' IPs within the cluster network.

Using the ClusterIP Service type is also more flexible because it allows CIS to use 1-tier and 2-tier arrangements simultaneously.

Networking - Load Balancer arrangement options

There are basically two arrangement options, 1-tier and 2-tier.
In a nutshell:

- A 2-tier arrangement is the typical way in which Kubernetes clusters are deployed. In this arrangement, the BIG-IP has only the role of External Load Balancer (first tier only) and sends the client requests to the Ingress Controller instances (second tier). The Ingress Controllers ultimately forward the requests to the workload PODs.
- In a 1-tier arrangement, the BIG-IP sends the requests to the workload PODs directly. This is a much simplified arrangement, in which the BIG-IP performs the role of both External Load Balancer and Ingress Controller.

Next, we will see the advantages of each arrangement. Please note that when using ClusterIP, this selection can be done on a per-Service basis. From the BIG-IP's point of view, it is irrelevant what the endpoints are.

Load Balancer arrangement option - 2-tier arrangement

Unlike most External Load Balancers, the BIG-IP can expose services with either Layer 4 or Layer 7 functionalities. In Layer 7 mode, SSL/TLS off-loading, HSM, Advanced WAF, and other advanced services can be used.

A 2-tier arrangement provides greater scalability than a 1-tier arrangement in terms of the number of L7 routes exposed or the number of Kubernetes PODs, because the control plane workload (the related Kubernetes events that are generated for these PODs and Routes) is split between BIG-IP/CIS and the in-cluster Ingress Controller.

This arrangement also has strong isolation between the two tiers, which is ideal when each tier is managed by a different team (e.g., platform and developer teams).

A BIG-IP 2-tier arrangement is shown next:

Load Balancer arrangement option - 1-tier arrangement

In this arrangement, the BIG-IP typically operates in L7 mode and sends the traffic directly to the final workload POD. This is done by sending traffic to Services in ClusterIP mode. In this arrangement, persistence is handled easily and the workload PODs can be directly monitored by the BIG-IP, providing an accurate view of the application's health.
A BIG-IP 1-tier arrangement is shown next:

This arrangement is simpler to troubleshoot, has less latency, and potentially offers higher per-session performance. Isolation between platform and developer teams can be achieved with CIS and FIC, yet it is not as strong as in 2-tier arrangements. This is described in "BIG-IP deployment with OpenShift - application publishing options".

BIG-IP platform flexibility: deployment, scalability, and multi-tenancy options

Using BIG-IP, the deployment options are independent of the BIG-IP being an appliance, a scale-out chassis, or a Virtual Edition. The configuration is always the same down to the L2 (VLAN/tunnel) config level. Only the L1 (physical interface) configuration changes.

This platform flexibility also opens the possibility of using different options for scalability, multi-tenancy, hardware accelerators, or Hardware Security Modules (HSMs). The latter are especially important to keep the SSL/TLS private keys in a FIPS-compliant manner. The HSMs can be onboard, on-prem Network HSMs, or cloud SaaS HSMs.

Multi-tenancy Options

In this section, multi-tenancy refers to the case in which different projects from one or more OpenShift clusters are serviced by a single BIG-IP. The different CIS deployment options are outlined next:

- A CIS instance can manage all namespaces on a given OpenShift cluster or a subset of these. Namespaces can be specified with a list or a label selector (e.g., environment=test or environment=production).
- Multiple CIS instances, handling different namespaces, can share a single BIG-IP or use different BIG-IPs. Each CIS instance will own a dedicated partition in a BIG-IP. For example, it is feasible to set up an OpenShift cluster with development, pre-production, and production labeled namespaces, and have these serviced by different CIS instances in the same or different BIG-IPs for each environment.
- Multiple CIS instances in a single BIG-IP can also handle different OpenShift clusters.
This is thanks to the soft isolation provided by BIG-IP partitions. Network isolation between these partitions can be achieved with route domains.

Some of these deployment options are shown next:

IP address management (IPAM)

CIS has the capability of dynamically allocating IP addresses using the F5 IPAM Controller (FIC) companion. At the time of writing, it is possible to retrieve IP addresses from the following providers:

- Infoblox
- F5 local DB provider, which makes use of a PVC for persistence.

For the DevOps team, it is transparent which provider is used; it is only required to specify an ipamLabel attribute in the exposed L7 or L4 service. The DevOps team can also indicate when it wants to share IP addresses between different L7 or L4 services by means of the hostGroup attribute. This is described in the follow-up article.

BIG-IP data plane scalability options

A single BIG-IP cluster can scale horizontally with up to 8 BIG-IP instances and have the different projects distributed among them. This is referred to as Scale-N in the BIG-IP documentation. This mode is often not used because it requires additional orchestration or manual operation for optimal load distribution. In this mode, projects would have soft isolation from each other by means of BIG-IP partitions.

When ultimate scalability or hard isolation is required, TMOS vCMP technology or, in newer versions, F5OS tenant facilities can be used in larger appliances and scale-out chassis. These multi-tenant facilities allow running independent BIG-IP instances, isolated at hardware level, even allowing the use of different versions of BIG-IP. The tenant BIG-IP instances can be allocated different amounts of hardware resources. In the next picture, the different tenants are shown as different colored bars using several blades (grey bars).
Using chassis-based platforms allows scaling data plane performance and increasing redundancy by adding blades to the system, without the need for any reconfiguration on the CIS/OpenShift side.

BIG-IP control plane scalability options

Very large OpenShift clusters, with either a large number of services exposed or a large number of PODs, and with a high number of changes, will trigger many events in the Kubernetes API. These events are processed by CIS and ultimately by the BIG-IP's control plane. In these cases, the following strategies can be used to improve the BIG-IP's control plane scalability:

- Disaggregate the different projects into different BIG-IPs. These might be multiple BIG-IP VEs, or instances in F5 vCMP or F5OS tenants when using hardware platforms.
- Use a 2-tier architecture, which reduces the number of Kubernetes objects and events that the BIG-IP is exposed to.

In the upcoming months, CIS will be available in BIG-IP Next. This is a re-architecture of BIG-IP and incorporates major scalability improvements in the control plane.

Multi-cluster OpenShift

Since CIS version 2.14, it is also possible for the BIG-IP to load balance between 2 or more clusters in Active-Active, Active-Standby, or Ratio modes. 1-tier or 2-tier arrangements are possible. The next figure shows a single BIG-IP exposing workloads from 2 OpenShift clusters. Please note that the OpenShift clusters are not required to run the same version, so this arrangement is also interesting for performing OpenShift upgrades.

When using CIS in multi-cluster mode, an additional CIS instance in a secondary cluster is needed for redundancy. If there are more than 2 OpenShift clusters, no additional CIS instances are needed. Therefore, a typical BIG-IP cluster of 2 units load balancing 2 or more OpenShift clusters will always require 4 CIS instances. For each BIG-IP, one of the CIS instances has the (P)rimary role and is in charge of making changes in the BIG-IP by default.
The (S)econdary CIS will be on standby. Both CIS instances access all OpenShift clusters. A more comprehensive view of this can be seen in the next diagram, which considers having more than 2 OpenShift clusters. OpenShift clusters that don't host a CIS instance are referred to as remotely managed.

Conclusion

F5 BIG-IP provides unmatched deployment options and features with OpenShift; these include:

- Support for OpenShift's CNIs, which allows sending the traffic directly to the PODs instead of using hostNetwork (which implies a security risk) or the common NodePort (which incurs the additional kube-proxy indirection).
- Both 1-tier and 2-tier arrangements (or both types simultaneously) are possible.
- F5's Container Ingress Services provides the ability to handle multiple OpenShift clusters, exposing their services in a single VIP. This is a unique feature in the industry.
- To complete the circle, this integration also provides IP address management (IPAM), which gives great flexibility to DevOps teams.
- All these are available regardless of whether the BIG-IP is a Virtual Edition, an appliance, or a chassis platform, allowing great scalability and multi-tenancy options.

The follow-up article "BIG-IP deployment with OpenShift - application publishing" focuses on DevOps and applications. It describes how CIS can also unleash all traffic management and security features in a Kubernetes native way.

We are driven by your requirements. If you have any, please provide feedback through this post's comments section, your sales engineer, or via our GitHub repository.

Active/Active load balancing examples with F5 BIG-IP and Azure load balancer
Background

A couple years ago I wrote an article about some practical considerations using Azure Load Balancer. Over time it's been used by customers, so I thought to add a further article that specifically discusses Active/Active load balancing options. I'll use Azure's standard load balancer as an example, but you can apply this to other cloud providers. In fact, the customer I helped most recently with this very question was running in Google Cloud. This article focuses on using standard TCP load balancers in the cloud.

Why Active/Active?

Most customers run 2x BIG-IPs in an Active/Standby cluster on-premises, and it's extremely common to do the same in public cloud. Since simplicity and supportability are key to successful migration projects, often it's best to stick with architectures you know and can support. However, if you are confident in your cloud engineering skills or if you want more than 2x BIG-IPs processing traffic, you may consider running them all Active. Of course, if your total throughput for N number of BIG-IPs exceeds the throughput that N-1 can support, the loss of a single VM will leave you with more traffic than the remaining device(s) can handle. I recommend choosing Active/Active only if you're confident in your purpose and skillset.

Let's define Active/Active

Sometimes this term is used with ambiguity. I'll cover three approaches using Azure load balancer, each slightly different:

- multiple standalone devices
- Sync-Only group using Traffic Group None
- Sync-Failover group using Traffic Group None

Each of these will use a standard TCP cloud load balancer. This article does not cover other ways to run multiple Active devices, which I've outlined at the end for completeness.

Multiple standalone appliances

This is a straightforward approach and an ideal target for cloud architectures.
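As an aside, the N versus N-1 capacity caveat from "Why Active/Active?" above comes down to simple arithmetic. The sketch below is only an illustration: the function name and the throughput figures are hypothetical, and it assumes every device has the same usable capacity, which a real sizing exercise would need to verify.

```python
def survives_single_failure(total_load_gbps, per_device_capacity_gbps, devices):
    """Return True if N-1 devices can still carry the total load.

    Assumes identical devices and an even spread of traffic by the
    cloud load balancer (both are simplifying assumptions).
    """
    if devices < 2:
        raise ValueError("at least two devices are needed to tolerate a failure")
    remaining_capacity = (devices - 1) * per_device_capacity_gbps
    return total_load_gbps <= remaining_capacity


if __name__ == "__main__":
    # Two devices of 2 Gbps each carrying 3 Gbps: losing one overloads the other.
    print(survives_single_failure(3.0, 2.0, devices=2))
    # Adding a third device leaves 4 Gbps after a failure, which is enough.
    print(survives_single_failure(3.0, 2.0, devices=3))
```

In other words, an Active/Active group is only failure-tolerant while the traffic stays within what N-1 devices can carry.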
When multiple devices each receive and process traffic independently, the overhead work of disaggregating traffic to spread between the devices can be done by other solutions, like a cloud load balancer. (Other out-of-scope solutions could be ECMP, BGP, DNS load balancing, or gateway load balancers.) Scaling out horizontally can be a matter of simple automation, and there is no cluster configuration to maintain. The only limit to the number of BIG-IPs will be any limits of the cloud load balancer.

The main disadvantage to this approach is the fear of misconfiguration by human operators. Often a customer is not confident that they can configure two separate devices consistently over time. This is why automation for configuration management is ideal. In the real world, it's also a reason customers consider our next approach.

Clustering with a sync-only group

A Sync-Only device group allows us to sync some configuration data between devices, but not fail over configuration objects in floating traffic groups between devices, as we would in a Sync-Failover group. With this approach, we can sync traffic objects between devices, assign them to Traffic Group None, and both devices will be considered Active. Both devices will process traffic, but changes only need to be made to a single device in the group.

In the example pictured above:

- The 2x BIG-IP devices are in a Sync-Only group called syncGroup
- The /Common partition is not synced between devices
- The /app1 partition is synced between devices
- The /app1 partition has Traffic Group None selected
- The /app1 partition has the Sync-Only group syncGroup selected
- Both devices are Active and will process traffic received on Traffic Group None

The disadvantage to this approach is that you can create an invalid configuration by referring to objects that are not synced. For example, if Nodes are created in /Common, they will exist on the device on which they were created, but not on other devices.
If a Pool in /app1 then references Nodes from /Common, the resulting configuration will be invalid for devices that do not have these Nodes configured.

Another consideration is that an operator must use and understand partitions. These are simple and should be embraced. However, not all customers understand the use of partitions, and many prefer to use /Common only, if possible.

The big advantage here is that changes only need to be made on a single device, and they will be replicated to other devices (up to 32 devices in a Sync-Only group). The risk of inconsistent configuration due to human error is reduced. Each device has a small green "Active" icon in the top left-hand corner of the console, reminding operators that each device is Active and will process incoming traffic on Traffic Group None.

Failover clustering using Traffic Group None

Our third approach is very similar to our second approach. However, instead of a Sync-Only group, we will use a Sync-Failover group. A Sync-Failover group will sync all traffic objects in the default /Common partition, allowing us to keep all traffic objects in the default partition and avoid the use of additional partitions. This creates a traditional Active/Standby pair for a failover traffic group, and a Standby device will not respond to data plane traffic. So how do we make this Active/Active?

When we create our VIPs in Traffic Group None, all devices will process traffic received on these Virtual Servers. One device will show "Active" and the other "Standby" in their console, but this is only the status for the floating traffic group. We don't need to use the floating traffic group, and by using Traffic Group None we have an Active/Active configuration in terms of traffic flow.

The advantage here is similar to the previous example: human operators only need to configure objects in a single device, and all changes are synced between device group members (up to 8 in a Sync-Failover group).
Another advantage is that you can use the /Common partition, which was not possible with the previous example.

The main disadvantage here is that the console will show the words "Active" and "Standby" on devices, and this can confuse an operator who is familiar only with Active/Standby clusters using traffic groups for failover. While this third approach is legitimate and technically sound, it's worth considering whether your daily operations and support teams have the knowledge to support it.

Other considerations

Source NAT (SNAT)

It is almost always a requirement that you SNAT traffic when using an Active/Active architecture, and this especially applies to the public cloud, where our options for other networking tricks are limited. If you have a requirement to see the true source IP and need to use multiple devices in Active/Active fashion, consider using Azure or AWS Gateway Load Balancer options. Alternative solutions like NGINX and F5 Distributed Cloud may also be worth considering in high-value, hard-requirement situations.

Alternatives to a cloud load balancer

This article is not referring to F5 with Azure Gateway Load Balancer, or to F5 with AWS Gateway Load Balancer. Those gateway load balancer solutions are another way for customers to run appliances as multiple standalone devices in the cloud. However, they typically require routing, not proxying, the traffic (i.e., they don't allow destination NAT, which many customers intend with BIG-IP).

This article is also not referring to other ways you might achieve Active/Active architectures, such as DNS-based high availability, or using routing protocols, like BGP or ECMP.

Note that using multiple traffic groups to achieve Active/Active BIG-IPs - the traditional approach on-prem or in private cloud - is not practical in public cloud, as briefly outlined below.
Failover of traffic groups with Cloud Failover Extension (CFE)

One option for Active/Standby high availability of BIG-IP is to use the CFE, which can programmatically update IP addresses and routes in Azure at the time of device failure. Since CFE does not support Active/Active scenarios, it is appropriate only for failover of a single traffic group (i.e., Active/Standby).

Conclusion

Thanks for reading! In general I see that Active/Standby solutions work for many customers, but if you are confident in your skills and have a need for Active/Active F5 BIG-IP devices in the cloud, please reach out if you'd like me to walk you through these options and explore any other possibilities.

Related articles

- Practical Considerations using F5 BIG-IP and Azure Load Balancer
- Deploying F5 BIG-IP with Azure Cross-Region Load Balancer

Customer driven Site Deployment Using AWS and F5 Distributed Cloud Terraform Modules
Introduction and Problem Scope

F5 Distributed Cloud Mesh's Secure Networking provides connectivity and security services for your applications running on the Edge, Private Clouds, or Public Clouds. This simplifies the deployment and configuration of connectivity and security services for your Multi-Cloud and Edge Cloud deployment needs across heterogeneous environments.

F5 Distributed Cloud Services leverages the "Site" construct to deploy our Secure Mesh or AppStack Site instances to manage workloads. A Site could be a customer location like AWS, Azure, Google Cloud Platform (GCP), private cloud, or an edge site. To run F5 Distributed Cloud Services, the site needs to be deployed with one or more instances of F5 Distributed Cloud Node, a software appliance that is managed by F5 Distributed Cloud Console. This site is where customer applications and F5 Distributed Cloud services are running.

To deploy a Node, different options are available:

- Use F5 Distributed Cloud Services Console to deploy a site
- Leverage F5 Distributed Cloud Services Terraform provider to deploy a site following the F5 Distributed Cloud Services Console user experience
- Use F5 Distributed Cloud Services Terraform modules

Documentation of all the different deployment patterns can be found at https://docs.cloud.f5.com/docs-v2/multi-cloud-network-connect/how-to/site-management

A customer may not want to leverage the first two options, since they rely on using F5 Distributed Cloud Services Console. Reasons not to use these two options could be:

Security and Privacy Concerns

- Data Security: Reluctance to share sensitive data with another organization.
- Access Keys: Not willing to share cloud provider access keys or credentials.
- Compliance: Need to comply with specific regulatory requirements (e.g., GDPR (General Data Protection Regulation), HIPAA) that require control over data.
Control and Customization

- Customization: Need for a highly customized orchestration solution tailored to specific requirements, to create networking and service topologies considering brownfield realities

Cost and Resource Management

- Resource Allocation: Better control over resource allocation and optimization

Operational Considerations

- Support: Preference for internal support and troubleshooting over relying on external support.
- Uptime and SLAs (Service Level Agreements): Concerns about meeting service level agreements (SLAs) and uptime requirements

To be able to roll out a site despite the points mentioned above, it is possible for the customer to manage the lifecycle of a site outside the F5 Distributed Cloud Services Console.

F5 Distributed Cloud Services created a set of Terraform modules to help customers manage the lifecycle of a site outside of F5 Distributed Cloud Services Console. Those modules are available at:

- AWS module
- Azure module
- GCP module

The F5 Distributed Cloud Services Site Management documentation provides an overview of all available site types and their documentation on the topic of provisioning. Though many topologies could be deployed via F5 Distributed Cloud Services Console, the following AWS, Azure, and GCP topologies can only be realized using Terraform modules:

- Single Node Single NIC existing VPC / subnet and 3rd party NAT GW
- Single Node Multi NIC existing VPC / subnet and 3rd party NAT GW
- Three Node Single NIC existing VPC / subnet and 3rd party NAT GW
- Three Node Multi NIC existing VPC / subnet and 3rd party NAT GW
- Any other external resource and its attributes that are to be used, e.g. credentials from Vault systems, IAM policies, SSH keys

Deployment Scenario in AWS

The F5 DevCentral GitHub project contains Terraform templates to provision greenfield and / or brownfield Customer Edge (CE) topologies in AWS, GCP, and Azure with multiple use case script templates in respective repositories.
To exemplify one of the scenarios, in this article, we walk through the journey a customer would undertake to provision a CE site in AWS using Terraform modules.

High-level Sequence workflow

All of the AWS, GCP, and Azure scenarios follow similar high-level steps, as shown in Fig.1.

Step 1: The F5 Distributed Cloud Services tenant needs to be ready, with user access to the tenant set up.

Step 2: Clone the desired AWS, GCP, or Azure repo from the F5 DevCentral GitHub project. For AWS, this is https://github.com/f5devcentral/terraform-xc-aws-ce. Each of these repositories contains multiple deployment scenarios called topologies. Each topology is described by its own "readme.md" file. The description includes:

- The resource objects that are created
- Usage instructions and all requirements to be able to create the topology, especially in brownfield environments

Step 3: Customize the "terraform.tfvars" file to the customer's specific context. This includes Distributed Cloud specific parameters. Each parameter in this file is described in relation to the function it serves for the specific scenario.

Step 4: Run through the Init/Plan/Apply workflow of the Terraform deployment and verify the status of the CE Site using F5 Distributed Cloud Services Console. The Terraform reconciliation functions ensure that the intended objectives are met.

Fig. 1: High-Level Sequence Diagram

Customer deployment topology description

We will explain the above steps in the context of a greenfield deployment, the Terraform scripts of which are available here. The corresponding logical topology view of this deployment is shown in Fig.2.

This deployment scenario instantiates the following resources:

- Single-node CE cluster
- AWS SLO interface
- AWS VPC
- AWS SLO interface subnet
- AWS route tables
- AWS Internet Gateway
- Assign AWS EIP to SLO

The objective of this deployment is to create a Site with a single CE node in a new VPC for the provided AWS region and availability zone. The CE will be created as an AWS EC2 instance.
An AWS subnet is created within the VPC. The CE Site Local Outside (SLO) interface will be attached to the VPC subnet and the created EC2 instance. SLO is a logical interface of a site (CE node) through which external destinations (e.g. the Internet or other services outside the public cloud site) are reached. To enable reachability to the Internet, the default route of the CE node will point to the AWS Internet gateway. Also, the SLO will be configured with an AWS External IP address (Elastic IP).

Fig.2. Customer Deployment Topology in AWS

Description of input parameters in Terraform vars file

Parameters must be customized to adapt to the customer's environment. The definition of the parameters in the "terraform.tfvars" file is as follows:

Parameters                  Definitions
owner                       Identifies the email of the IT manager used to authenticate to the AWS system
project_prefix              Prefix that will be used to identify the resource objects in AWS and XC
project_suffix              Suffix that will be used to identify the site resources in AWS and XC
ssh_public_key_file         Local file system path to the SSH public key file
f5xc_tenant                 Full F5XC tenant name
f5xc_api_url                F5XC API URL
f5xc_cluster_name           Name of the Cluster
f5xc_api_p12_file           Local file system path to api_cert_file (downloaded from XC Console)
aws_region                  AWS region for the XC Site
aws_existing_vpc_id         Existing VPC ID (brownfield)
aws_vpc_cidr_block          CIDR Block of the VPC
aws_availability_zone       AWS Availability Zone (a)
aws_vpc_slo_subnet_node0    AWS Subnet in the VPC for the SLO subnet

Configuring other environmental variables

Export the following environment variables in the working shell, setting them to the customer's deployment context.
Environment Variables Definitions AWS_ACCESS_KEY AWS Access key for authentication AWS_SECRET_ACCESS_KEY AWS Secret key for authentication VES_P12_PASSWORD XC P12 Password from Console TF_VAR_f5xc_api_p12_cert_password Same as VES_P12_PASSWORD Deploy Topology Deploy the topology with: terraform init terraform plan terraform deploy –auto-approve and monitor the status of the Sites on the F5 Distributed Cloud Services Console. Created site object will be available in Secure Mesh Site section of the F5Distributed CloudServices Console. Video-based description of the deployment Scenario This demonstration video shows the procedure for provisioning the deployment topology described above in three steps. <p><iframe src="https://www.youtube.com/watch?v=8_T3dQSEdhc" width="750" height="422" frameborder="0" allowfullscreen></iframe></p> References https://docs.cloud.f5.com/docs-v2/platform/services/mesh/secure-networking https://docs.cloud.f5.com/docs-v2/platform/concepts/site https://docs.cloud.f5.com/docs-v2/multi-cloud-network-connect/how-to/site-management https://docs.cloud.f5.com/docs-v2/multi-cloud-network-connect/how-to/site-management/deploy-aws-site-terraform https://docs.cloud.f5.com/docs-v2/multi-cloud-network-connect/troubleshooting/troubleshoot-manual-ce-deployment-registration-issues Note: This project is open source and actively monitored by F5 XC on a best-effort basis. While there is no formal commitment regarding service level agreements (SLA) or support assistance, we encourage the community to report any issues through GitHub. Customers and partners are warmly invited to contribute to the code, fostering a collaborative environment that enhances the project's development and usability.108Views0likes0CommentsCustomer-driven Site Deployment Using AWS and F5 Distributed Cloud Terraform Modules
Introduction and Problem Scope

F5 Distributed Cloud Mesh’s Secure Networking provides connectivity and security services for your applications running on the Edge, Private Clouds, or Public Clouds. This simplifies the deployment and configuration of connectivity and security services for your Multi-Cloud and Edge Cloud deployment needs across heterogeneous environments. F5 Distributed Cloud Services leverages the “Site” construct to deploy our Secure Mesh or AppStack Site instances to manage workloads. A Site could be a customer location like AWS, Azure, GCP (Google Cloud Platform), private cloud, or an edge site. To run F5 Distributed Cloud Services, the site needs to be deployed with one or more instances of F5 Distributed Cloud Node, a software appliance that is managed by F5 Distributed Cloud Console. This site is where customer applications and F5 Distributed Cloud services are running. To deploy a Node, different options are available:

Customer deployment topology description

We will explain the above steps in the context of a greenfield deployment, the Terraform scripts of which are available here. The corresponding logical topology view of this deployment is shown in Fig.2. This deployment scenario instantiates the following resources:

Single-node CE cluster
AWS SLO interface
AWS VPC
AWS SLO interface subnet
AWS route tables
AWS Internet Gateway
Assign AWS EIP to SLO

The objective of this deployment is to create a Site with a single CE node in a new VPC for the provided AWS region and availability zone. The CE will be created as an AWS EC2 instance. An AWS subnet is created within the VPC. The CE Site Local Outside (SLO) interface will be attached to the VPC subnet and the created EC2 instance. SLO is a logical interface of a site (CE node) through which external networks (e.g. the Internet or other services outside the public cloud site) are reached. To enable reachability to the Internet, the default route of the CE node will point to the AWS Internet gateway.
Also, the SLO will be configured with an AWS External IP address (Elastic IP).

Fig.2. Customer Deployment Topology in AWS

List of terraform input parameters provided in vars file

Parameters must be customized to adapt to the customer environment. The definitions of the parameters in the “terraform.tfvars” are shown in the table below.

Parameters: Definitions
owner: Identifies the email of the IT manager used to authenticate to the AWS system
project_prefix: Prefix that will be used to identify the resource objects in AWS and XC
project_suffix: The suffix that will be used to identify the site’s resources in AWS and XC
ssh_public_key_file: Local file system’s path to ssh public key file
f5xc_tenant: Full F5XC tenant name
f5xc_api_url: F5XC API url
f5xc_cluster_name: Name of the Cluster
f5xc_api_p12_file: Local file system path to api_cert_file (downloaded from XC Console)
aws_region: AWS region for the XC Site
aws_existing_vpc_id: Existing VPC ID (brownfield)
aws_vpc_cidr_block: CIDR Block of the VPC
aws_availability_zone: AWS Availability Zone (a)
aws_vpc_slo_subnet_node0: AWS Subnet in the VPC for the SLO subnet

Configuring other environmental variables

Export the following environment variables in the working shell, setting them to match the customer’s deployment context.

Environment Variables: Definitions
AWS_ACCESS_KEY: AWS Access key for authentication
AWS_SECRET_ACCESS_KEY: AWS Secret key for authentication
VES_P12_PASSWORD: XC P12 Password from Console
TF_VAR_f5xc_api_p12_cert_password: Same as VES_P12_PASSWORD

Deploy Topology

Deploy the topology with:

terraform init
terraform plan
terraform apply -auto-approve

and monitor the status of the Sites on the F5 Distributed Cloud Services Console. The created site object will be available in the Secure Mesh Site section of the F5 Distributed Cloud Services Console.

Video-based description of the deployment Scenario

This demonstration video shows the procedure for provisioning the deployment topology described above in three steps.
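The export and deploy steps above can be sketched as a small shell helper. This is a minimal sketch rather than part of the published Terraform scripts: the function names check_deploy_env and deploy_site are hypothetical, the variable names are copied from the environment-variable table, and the final step uses terraform apply -auto-approve, the standard Terraform command for applying a plan without an interactive prompt.

```shell
#!/bin/sh
# Sketch only: check_deploy_env and deploy_site are hypothetical helper names.
# The variable names are taken from the environment-variable table above.

# Return 0 when every required variable is set and non-empty; otherwise
# list the missing ones on stderr and return 1.
check_deploy_env() {
    missing=""
    for name in AWS_ACCESS_KEY AWS_SECRET_ACCESS_KEY VES_P12_PASSWORD \
                TF_VAR_f5xc_api_p12_cert_password; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing environment variables:$missing" >&2
        return 1
    fi
    return 0
}

# Run the three Terraform steps in order, stopping at the first failure.
deploy_site() {
    check_deploy_env || return 1
    terraform init &&
    terraform plan &&
    terraform apply -auto-approve
}
```

After exporting the four variables (for example, export TF_VAR_f5xc_api_p12_cert_password="$VES_P12_PASSWORD"), calling deploy_site from the directory that holds terraform.tfvars runs the whole sequence. Checking the environment first avoids a partially initialized run that fails at authentication.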
References

https://docs.cloud.f5.com/docs-v2/platform/services/mesh/secure-networking
https://docs.cloud.f5.com/docs-v2/platform/concepts/site
https://docs.cloud.f5.com/docs-v2/multi-cloud-network-connect/how-to/site-management
https://docs.cloud.f5.com/docs-v2/multi-cloud-network-connect/how-to/site-management/deploy-aws-site-terraform
https://docs.cloud.f5.com/docs-v2/multi-cloud-network-connect/troubleshooting/troubleshoot-manual-ce-deployment-registration-issues

Running F5 with managed Azure RedHat OpenShift
Summary

In early 2020, Microsoft and Red Hat announced a new release of Azure Red Hat OpenShift. This article shows how to set up F5 to integrate with this offering. This is also an easy demo.

Background

OpenShift is now available as a managed service in Azure called ARO (as in, Azure Red Hat OpenShift). Microsoft has published a tutorial to deploy a cluster into an existing virtual network, but this article shows a way to deploy an environment with F5 integrated in a single deployment. Use this for demo or learning purposes.

Deploying Azure Red Hat OpenShift (ARO)

You can run OpenShift on your own servers on-premises or in the cloud. For example, these instructions were the way I first learned to deploy a cluster on AWS. Eric Ji from F5 recently published a guide that walks through these instructions and he includes deployment of F5 Container Ingress Services. This method is supported and gives you a high level of control. ARO is a deployment option where your servers are managed by Azure. Patching, upgrading, repair, and DR are all handled for you, along with joint support from Microsoft and Red Hat. Microsoft have done a great job of documenting the process to deploy ARO in the tutorial already mentioned. If you were to follow their instructions, after about 35 minutes your deployment would produce something like this (image taken straight from OpenShift's announcement article):

Microsoft's instructions to create the demo above require that you have the User Access Administrator role, or that you pass in the credentials of a ServicePrincipal that has contributor rights over the Resource Group in which the existing VNET resides.

Deploying F5 + ARO

Another way to build out the same environment in Azure is this automated demo, which will include the deployment of F5 and also takes around 35 minutes to complete.
Click here to deploy this demo: https://github.com/mikeoleary/azure-redhat-openshift-f5

This does not require a User Access Administrator, but does require that you have a ServicePrincipal with Contributor permissions on the subscription. A ServicePrincipal is a principal in Azure Active Directory to which you can assign roles at a scope like Resource Group or Subscription. For this demo, I recommend creating a ServicePrincipal and then assigning it the role of Contributor over your Subscription, or the Resource Group in which you intend to deploy. If you follow this demo, you'll have an environment that looks more like this:

This demo adds the following resources to the environment. You could add these resources manually yourself, if you have an existing OpenShift environment.

Adds 3x subnets for the F5 BIG-IP VM
Deploys F5 VMs into those subnets using this ARM template
Adds the BIG-IP into the OpenShift network following these instructions
Installs CIS in OpenShift following these instructions
Deploys an app into OpenShift
    This includes a Route resource that is detected by CIS
    CIS then populates the app's pod IP addresses as pool members in BIG-IP
Output values are added to the deployment, for users to verify successful completion

Post-deployment verification

This demo will deploy an app in OpenShift that is exposed by an OpenShift Route, and this requires that you manually change your DNS record on the Internet to point to the IP address value of the deployment output called publicExternalLoadBalancerAddress. After you have made this DNS change (optionally, use a local hosts record), you should see your demo app available on the Internet, like this:

The outputs of this demo will also give you the public URLs of the BIG-IPs and your OpenShift cluster. You can log in to all of these to see the configuration at work.

Deleting your environment

Don't forget to delete your environment if you are just testing.
I find the easiest way to do this is just to delete the Resource Group into which you deployed originally. You can delete individual resources via the Azure portal if you choose, but do remember that the Read-Only Resource Group that is created by ARO is deleted by deleting the OpenShift cluster resource, which is in the Resource Group into which you originally deployed.

Conclusion

To summarize, ARO allows us to deploy an OpenShift environment quickly. Integration with F5 is much like an on-prem installation of OpenShift. You integrate the BIG-IP with the OpenShift network, then deploy CIS so that it can configure the BIG-IP to expose your applications. Thanks for reading! Any questions, please leave a comment and I'll respond, thanks!

F5 Distributed Cloud Customer Edge Migration Centos to RHEL
In this article, I will introduce a process to migrate a Customer Edge site from the End of Life CentOS operating system to the RHEL operating system.

Introduction:

Since December 2023, the F5 Distributed Cloud Customer Edge image has been based on the Red Hat Enterprise Linux (RHEL) operating system. Prior to that, the Customer Edge ran on the CentOS 7.x operating system, which has been announced End of Life. In this article, I will provide a migration strategy from CentOS to RHEL for Customer Edge sites that are in a SaaS-Hybrid Edge Deployment pattern (#2 in the slide below), where the VIP is on the Regional Edge and the tunnel termination and SNAT are on the Customer Edge. While we are using this deployment pattern as an example, the concepts for other patterns are the same, with a few caveats which I will include at the end of this article.

High-Level Concepts:

Before we discuss the migration phases, I want to introduce a few concepts that we will be utilizing. The first concept is what we call a Virtual Site. A Virtual Site provides us the ability to perform a given configuration on a set (or group) of Sites. The second term is Origin Pool. An origin pool is a mechanism to configure a set of endpoints grouped together into a resource pool used in the load balancer configuration. The typical CE Site deployment consists of an HA cluster that discovers endpoints via an origin pool picked via the CE Site. This discovery is typically via Private DNS or RFC-1918 IP ranges, although other methods are available. When we introduce the Virtual Site construct, we will perform this discovery via a "Virtual Site" and not the original "CE Site". As depicted below on the right-hand side of the drawing, you will see the origin pool is now discovered from all 6 nodes in the Virtual Site and will route traffic to the endpoint per the LB algorithm.
Also, the Virtual Site construct can be utilized for more advanced HA design scenarios and even for additional bandwidth between RE and CE, but this will be discussed in other articles.

Virtual Site Setup:

Prerequisites:
Current CentOS Customer Edge Site.
New RHEL OS Customer Edge Site.

We first start to set up the Virtual Site construct by logging into our Distributed Cloud tenant. Once logged in:

Navigate to "Shared Configuration"
Under "Manage" choose "Virtual Site"
Provide a Name, Description, Site Type (in this case CE), and a Site Expression

Once the Virtual Site label is created, we navigate to the existing CentOS CE Cluster and add the Site Expression that we created in the previous step to the site Labels section:

Go to the Multi-Cloud Network Connect tile
Go to "Manage", "Site Management" and choose the Site, Cloud Deployment site, or Secure Mesh Site. This will depend on how and where the site was deployed.
Once you have the correct site, click on the ellipsis (three dots) at the right and go to Manage Configuration and Edit
Add the Virtual Site label
Type in the Key from “Site Selector Expression” (in my example, “netta-az-vsite”) and click Assign a Custom Key ‘netta-az-vsite’
Type in the Value from “Site Selector Expression” (in my example, “true”) and click Assign a Custom Key ‘true’

Proceed with adding this same label to all sites that will be in the Virtual Site.

Virtual Site Origin Pool Configuration:

Now that we have our Virtual Site configured, we need to configure the origin pool and discover from the Virtual Site.

Go to Multi-Cloud Application Connect
In the origin pool configuration, choose the discovery method, IP or DNS of Origin on given sites
Under Site or Virtual Site, choose Virtual Site and pick your Virtual Site from the drop-down: choose the "Virtual Site" configured in the previous step. The rest of the config should be the same
Validate that the origin is successfully discovered from the newly created Virtual Site.
Go to HTTP LB Performance
Click on Origin Servers and you should see 2 origins, one from each site (CentOS and RHEL) in the Virtual Site

Migration:

Now that we have the Virtual Site and the Virtual Site origin pool discovery method built, we can start the migration.

Go to the HTTP LB and add the additional Virtual Site origin pool under the Origins section
Leverage Weights and Priorities with the 2 origin pools to start the migration from the CentOS Site origin pool to the Virtual Site origin pool. A typical starting point is for both origin pools to have a Priority of 1 and Weights that sum to 100. For example, the CentOS origin pool starts with a weight of 95 and the Virtual Site origin pool with 5; decrement the former and increment the latter as you migrate.
Once 100% of traffic is on the Virtual Site origin pool, remove the Virtual Site label from the CentOS site.
Remove the original CentOS Site origin pool from the HTTP LB
Delete the CentOS Cluster

Additional Info:

In the above example for the Customer Edge (CE) deployment, we were leveraging the REs to publish VIPs to the Internet, and the CEs were used as tunnel termination points as well as SNAT to origin members. If you move the VIP to the CE, there are a few caveats with the way to advertise that VIP to the network. For example, to leverage all nodes within the cluster, you will need to provide a VIP advertisement policy that consists of an out-of-band DNS LB option or a nested LB option. Also, as mentioned earlier in this article, there can be HA and bandwidth advantages to leveraging Virtual Sites, as depicted below in the last slide. For more info on the migration process or CE design options, reach out to your F5 sales specialist.

Configure Generic Webhook Alert Receiver using F5 Distributed Cloud Platform
The Generic Webhook Alerts feature in F5 Distributed Cloud (F5 XC) makes it easy to configure and send alert notifications related to application infrastructure (IaaS) to a specified URL receiver. The F5 XC SaaS console platform sends alert messages to the receiving web servers as soon as the events are triggered.

The Case of the Missing F5 AMI : F5 BIG-IP AMI Lifecycle Events
Today, many F5 customers use AWS, and use the AWS Marketplace to procure F5 BIG-IP software. Customers that follow this route receive multiple benefits, such as a simplified procurement process and the ability to use their Enterprise Discount Program (EDP) committed spend for AWS and non-AWS software, such as F5, that can be consumed in the marketplace. When a customer uses AWS hourly billing for F5 software, they will use the F5-provided prebuilt machine images (AMI). Other customers may procure a license key for BIG-IP via other means and leverage the AMIs we provide in AWS for such scenarios. When using AMIs that are created by a third party (any organization other than yours), there are lifecycle events that may make it seem like the AMI has simply vanished. The AMI is there. You just need to use the right tools and workflows to continue to see and use it.

Lifecycle Event Simplified

A lifecycle event is where an organization moves a version of their software from one state to another. For example, an AMI can be in a state of public and be moved to a state of restricted or archived. The event that causes this change could be an end of sale, or the release of a patch for a CVE, after which F5 restricts the older version of the software from being sold to new customers. New customers will only be able to access the BIG-IP builds that are listed as public. Customers that have already subscribed to an offer will still be able to access the previous versions. Let's review what AWS has to say about a deprecated AMI.

After an AMI is deprecated:

For AMI users, the deprecated AMI does not appear in DescribeImages API calls unless you specify its ID or specify that deprecated AMIs must appear. AMI owners continue to see deprecated AMIs in DescribeImages API calls.
For AMI users, the deprecated AMI is not available to select via the EC2 console. For example, a deprecated AMI does not appear in the AMI catalog in the launch instance wizard.
AMI owners continue to see deprecated AMIs in the EC2 console.
For AMI users, if you know the ID of a deprecated AMI, you can continue to launch instances using the deprecated AMI by using the API, CLI, or the SDKs.
Launch services, such as launch templates and Auto Scaling groups, can continue to reference deprecated AMIs.
EC2 instances that were launched using an AMI that is subsequently deprecated are not affected, and can be stopped, started, and rebooted.

Accessing the Software

Once an image has been deprecated, the user experience to locate the software can vary based on how you normally interact with AWS. In our example, we will look for a version of software that we recently had to deprecate: 15.1.5.1-0.0.14. Below you can see I am running a deprecated version; it could also be something you previously used. Please note that F5 recommends that customers always move to patch releases when there is a security fix. With that in mind, if you cannot move to a new version yet, what are the options to continue to use this software?

Marketplace Wizard

This path works without any changes to the normal workflow. Locate the software in the marketplace and click through the subscription (you must have already subscribed prior to the deprecation) and configuration. Select your version and continue to launch. Complete the form and launch the instance.

My Subscriptions

This path works with moderate changes to the normal workflow. If you are looking at the My Subscriptions page, you will only see the most recently published version of the AMI. If you are an F5 user, this could be a major version that you do not use, since the display filter is based on publication date. Let's navigate to the Better 200 Mbps subscription. Clicking in, we can see the subscription information and can launch another instance of it. But when we click in, we can only see the latest version by publication date. Just below the software version drop-down, you can see a link to use other versions.
That link takes you to a screen where you can select a different version of the software. Select the version you want to deploy.

EC2 Launch Instance Wizard

This path does not work. In this path, if we search for the AMI ID of the previous version, we will find the listing. However, following the launch wizard only takes us to the latest version (not even all public versions), and we cannot access the AMI version we want.

AWS CLI

This path works with changes to the CLI flags. By default, the AWS CLI will not show an AMI version that has been deprecated.

    [cloudshell-user@ip-10-136-48-97 ~]$ aws ec2 describe-images --owners 679593333241 --filters 'Name=name,Values=F5 BIGIP-15.1.5.1-0.0.14*' --query 'Images[*].[ImageId,Name]' --output yaml
    []

To locate the AMI, you will need to add the --include-deprecated flag.

    [cloudshell-user@ip-10-136-48-97 ~]$ aws ec2 describe-images --owners 679593333241 --filters 'Name=name,Values=F5 BIGIP-15.1.5.1-0.0.14*' --query 'Images[*].[ImageId,Name]' --include-deprecated --output yaml
    - - ami-0a25c6b80ecaf6b81
      - F5 BIGIP-15.1.5.1-0.0.14 BYOL-LTM 1Boot Loc-220328012805-8f2ed1fb-93bb-4f06-a8f5-eb84757d5fab
    - - ami-08b9e9627f579bee6
      - F5 BIGIP-15.1.5.1-0.0.14 PAYG-Good 1Gbps-220328013426-7fb2f9db-2a12-4915-9abb-045b6388cccd
    - - ami-0a3aa4f2b6a3cdeb2
      - F5 BIGIP-15.1.5.1-0.0.14 PAYG-Best 25Mbps-220328014320-3e567b08-20a9-444f-a72a-7e8da3c2cbdf
    - - ami-0de86f325238540d8
      - F5 BIGIP-15.1.5.1-0.0.14 PAYG-Better 200Mbps-220328014315-bfe1c762-fc65-48ef-a205-29e2770cb15b
    - - ami-07dc37ae1b50682ac
      - F5 BIGIP-15.1.5.1-0.0.14 PAYG-Adv WAF Plus 3Gbps-220328014327-fd904f36-3781-4002-8075-a1ce0da76185

Once you have the AMI ID, you can launch from the CLI (or CFT).
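Because AMI IDs differ from region to region, the deprecated-image lookup can also be wrapped in a helper that records the ID for each region you deploy into. The following is a minimal sketch, not an F5-provided tool: find_f5_ami is a hypothetical function name, the owner ID and filter are the ones used in the describe-images calls above, and the region names in the usage note are only examples.

```shell
#!/bin/sh
# Sketch only: find_f5_ami is a hypothetical helper name.
# 679593333241 is the owner ID used in the describe-images calls above.

# Print "region<TAB>ami-id<TAB>name" for every image whose name matches
# the pattern, including images that have been deprecated.
find_f5_ami() {
    region="$1"
    pattern="$2"
    aws ec2 describe-images \
        --region "$region" \
        --owners 679593333241 \
        --filters "Name=name,Values=$pattern" \
        --include-deprecated \
        --query 'Images[*].[ImageId,Name]' \
        --output text |
    # Text output is one image per line, columns separated by tabs.
    while IFS="$(printf '\t')" read -r ami_id ami_name; do
        printf '%s\t%s\t%s\n' "$region" "$ami_id" "$ami_name"
    done
}
```

A loop such as for r in us-east-1 eu-west-1; do find_f5_ami "$r" 'F5 BIGIP-15.1.5.1-0.0.14*'; done >> ami-inventory.tsv would then keep a per-region record of the IDs in use, so they remain available after the version disappears from the console.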
For example, using the PAYG-Better 200Mbps image:

    - - ami-0de86f325238540d8
      - F5 BIGIP-15.1.5.1-0.0.14 PAYG-Better 200Mbps-220328014315-bfe1c762-fc65-48ef-a205-29e2770cb15b

    [cloudshell-user@ip-10-132-62-194 ~]$ aws ec2 run-instances --image-id ami-0de86f325238540d8 --instance-type m5.2xlarge --subnet-id subnet-0a9daa849fb5f1075
    {
        "Groups": [],
        "Instances": [
            {
                "AmiLaunchIndex": 0,
                "ImageId": "ami-0de86f325238540d8",
                "InstanceId": "i-0a7e6854fdce7c850",
                "InstanceType": "m5.2xlarge",

F5 CloudFormation Templates

This path works, but you will need to provide an AMI ID. F5 provides example CloudFormation templates that customers can use. In our templates, we allow customers to specify an AMI ID via the custom image ID field:

    bigIpCustomImageId    No    string    Provide a custom BIG-IP AMI ID you wish to deploy. Otherwise, can leave empty.

If we specify our AMI, the templates will launch as expected.

F5 BIG-IP Terraform Module

This path works, but you will need to customize your Terraform files. By default, the F5 BIG-IP Terraform module uses a variable and a data search function to find an AMI. You will need to make changes that make sense in your Terraform tools. Let's take a look at the main.tf file that deploys BIG-IP in AWS.

    resource "aws_instance" "f5_bigip" {
      instance_type = var.ec2_instance_type
      ami           = data.aws_ami.f5_ami.id
      key_name      = var.ec2_key_name

      root_block_device {
        delete_on_termination = true
        encrypted             = var.ebs_volume_encryption
        kms_key_id            = var.ebs_volume_kms_key_arn
        volume_size           = var.ebs_volume_size
        volume_type           = var.ebs_volume_type
      }

Let's trace the logic. We have a variable that holds the name we want to search for in the variables.tf file.
    variable "f5_ami_search_name" {
      description = "BIG-IP AMI name to search for"
      type        = string
      default     = "F5 BIGIP-15.1.5* PAYG-Better*"
    }

This variable goes into a data resource via the data.tf file:

    data "aws_ami" "f5_ami" {
      most_recent = true
      // owners = ["679593333241"]
      owners = ["aws-marketplace"]

      filter {
        name   = "description"
        values = [var.f5_ami_search_name]
      }
    }

Reviewing the Terraform module used to locate an AMI, the flags are similar to the CLI: you need to add include_deprecated if you are not using an explicit AMI reference in your Terraform files.

    include_deprecated - (Optional) If true, all deprecated AMIs are included in the response. If false, no deprecated AMIs are included in the response. If no value is specified, the default value is false.

Your Terraform data.tf file will need to be updated to include deprecated images.

    data "aws_ami" "f5_ami" {
      most_recent        = true
      owners             = ["679593333241"]
      include_deprecated = true

      filter {
        name   = "description"
        values = [var.f5_ami_search_name]
      }
    }

Proper Planning Required

Many customers are able to use the latest version, but if you need to use a specific version, you need a plan. Lifecycle events will happen and they may happen quickly, such as for a CVE that has a high CVSS score. If you cannot automatically move to the patched build, then you need to plan and document the workflows you will use to ensure you can continue operations until such time as you can move to the new AMI.

To stay two steps ahead:

CLI, API, and automated solutions that search for AMIs: you will need to include deprecated AMIs
GUI use cases: make sure that your users know how to find different versions in the web portal
Always document the AMI ID that you use in each region.

F5 Distributed Cloud – CE High Availability Options: A Comparative Exploration
This article explores an alternative approach to achieve HA across single CE nodes, catering for use cases requiring higher performance and granular control over redundancy and failover management.

Introduction

F5 Distributed Cloud offers different techniques to achieve High Availability (HA) for Customer Edge (CE) nodes in an active-active configuration to provide redundancy, on-demand scaling, and simplified management. By default, F5 Distributed Cloud uses a method for clustering CE nodes in which CEs keep track of peers by sending heartbeats and facilitating traffic exchange among themselves. This method also handles the automatic transfer of traffic, virtual IPs, and services between CE peers, which is excellent for simplified deployment and for running App Stack sites hosting Kubernetes workloads. However, if CE nodes are deployed mainly to manage L3/L7 traffic and application security, this default model might lack the flexibility needed for certain scenarios.

Many of our customers tell us that achieving high availability is not so straightforward with the current clustering model. These customers often have a lot of experience in managing redundancy and high availability across traditional network devices. They like to manage everything themselves, from scheduling when to switch over to a redundant pair (planned failover) to choosing how many network paths (tunnels) to use between CEs and REs (Regional Edges) or other CEs. They also want to handle any issues device by device, decide the number of CE nodes in a redundancy group, and be able to direct traffic to different CEs when one is being updated. Their feedback inspired us to write this article, where we explore a different approach to achieve high availability across CEs.
The default clustering model is explained in this document: https://docs.cloud.f5.com/docs/ves-concepts/site#cluster-of-nodes

Throughout this article, we will dive into several key areas:

An overview of the default CE clustering model, highlighting its inherent challenges and advantages.
Introduction to an alternative clustering strategy: Single Node Clustering, including:
    An analysis of its challenges and benefits.
    Identification of scenarios where this approach is most applicable.
    A guide to the configuration steps necessary to implement this model.
    An exploration of failover behavior within this framework.
A comparison table showing how this new method differs from the default clustering method.

By the end of this article, readers will gain an understanding of both clustering approaches, enabling informed decisions on the optimal strategy for their specific needs.

Default CE Clustering Overview

In a standard CE clustering setup, a cluster must have at least three Master nodes, with subsequent additions acting as Worker nodes. A CE cluster is configured as a "Site," centralizing operations like pool configuration and software upgrades to simplify management. In this clustering method, frequent communication is required between control plane components of the nodes on a low latency network. When a failover happens, the VIPs and services, including the customer’s compute workloads, will transition to the other active nodes.

As shown in the picture above, a CE cluster is treated as a single site, regardless of the number of nodes it contains. In a Mesh Group scenario, each mesh link is associated with one single tunnel connected to the cluster. These tunnels are distributed among the master nodes in the cluster, optimizing the total number of tunnels required for a large-scale Mesh Group. It also means that the site will be connected to REs only via 2 tunnels, one to each RE.
Design Considerations for Default CE Clustering model:

Best suited for:

1- App Stack Sites: Running Kubernetes workloads necessitates the default clustering method for container orchestration across nodes.
2- Large-scale Site-Mesh Groups (SMG)
3- Cluster-wide upgrade preference: Customers who favour managing nodes collectively will find cluster-wide upgrades more convenient, although without control over the upgrade sequence of individual nodes.

Challenges:

o Network Bottleneck for Ingress Traffic: A cluster connected to two Regional Edge (RE) sites via only 2 tunnels can lead to only two nodes processing external (ingress) traffic, limiting the use of additional nodes to processing internal traffic only.
o Three-master node requirement: Some customers are accustomed to dual-node HA models and may find the requirement for three master nodes resource-intensive.
o Hitless upgrades: Controlled, phased upgrades are preferred by some customers for testing before widespread deployment, which is challenging with cluster-wide upgrades.
o Cross-site deployments: High network latency between remote data centers can impact cluster performance due to the latency sensitivity of the etcd daemon, the backbone of cluster state management. If the network connection across the nodes gets disconnected, all nodes will most likely stop operating due to the quorum requirements of etcd. Therefore, F5 recommends deploying separate clusters for different physical sites.
o Service Fault Sprawl and limited Node fault tolerance: Default clusters can sometimes experience a cascading effect where a fault in a node spreads throughout the cluster. Additionally, a standard 3-node cluster can generally only tolerate the failure of one node. A cluster originally configured with three nodes may lose functionality if it is reduced to a single active node. These limitations stem from the underlying clustering design and its dependency on etcd for maintaining cluster state.
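The etcd quorum requirement mentioned above follows from majority voting: a cluster of n voting members needs floor(n/2) + 1 of them reachable to keep operating. The following is a small sketch of that arithmetic only; the function names are hypothetical and it does not query a real CE cluster.

```shell
#!/bin/sh
# Sketch only: etcd_quorum and etcd_fault_tolerance are hypothetical names.

# Majority quorum for a cluster of n voting members: floor(n/2) + 1.
etcd_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while a quorum is still reachable.
etcd_fault_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 5; do
    echo "members=$n quorum=$(etcd_quorum "$n") tolerated_failures=$(etcd_fault_tolerance "$n")"
done
```

For three master nodes this gives a quorum of 2 and a tolerance of 1, which is why a three-node cluster reduced to a single active node can no longer maintain cluster state.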
The Alternative Solution: HA Between Multiple Single Nodes

The good news is that we can achieve the key objectives of clustering, which are streamlined management and high availability, without the dependency on the control plane clustering mechanisms.

Streamlined management using “Virtual Site”: F5 Distributed Cloud provides a mechanism called “Virtual Site” to perform operations on a group of sites (site = node or cluster of nodes), reducing the need to repeat the same set of operations for each site. The “Virtual Site” acts as an abstraction layer, grouping nodes tagged with a unique label and allowing these nodes to be addressed collectively as a single entity. Configuration of origin pools and load balancers can reference Virtual Sites instead of individual sites/nodes, facilitating cluster-like management for two or more nodes and enabling controlled day 2 operations. When a node is disassociated from the Virtual Site by removing the label, it is no longer eligible for new connections, and its listeners are simultaneously deactivated. Upgrading nodes is streamlined: simply remove the node's label to exclude it from the Virtual Site, perform the upgrade, and then reapply the label once the node is operational again. This procedure offers you a controlled failover process, ensuring minimal disruption and enhanced manageability by minimizing the blast radius and limiting the scope of downtime. As traffic is rerouted to other CEs, if something goes wrong with an upgrade of a CE node, the services will not be impacted.

HA/Redundancy across multiple nodes: Each single node in a Virtual Site connects to dual REs through IPSec or SSL/TLS tunnels, ensuring even load distribution and true active-active redundancy.

External (Ingress) Traffic: In the Virtual Site model, the Regional Edges (REs) distribute external traffic evenly across all nodes. This contrasts with the default clustering approach where only two CE nodes are actively connected to the REs.
The main Virtual Site advantage lies in its true active/active configuration for CEs, increasing the total ingress traffic capacity. If a node becomes unavailable, the REs will automatically reroute the new connections to another operational node within the Virtual Site, and the services (connection to origin pools) remain uninterrupted.

Internal (East-West) Traffic: For managing internal traffic within a single CE node in a Virtual Site (for example, when LB objects are configured to be advertised within the local site), all network techniques applicable to the default clustering model can be employed in this model as well, except for the Layer 2 attachment (VRRP) method.

Preferred load distribution method for internal traffic across CEs: Our preferred methods for load balancing across CE nodes are either DNS based load balancing or Equal-Cost Multi-Path (ECMP) routing utilizing BGP for redundancy.

DNS Load Balancer Behavior: If a node is detached from a Virtual Site, its associated listeners and Virtual IPs (VIPs) are automatically withdrawn. Consequently, the DNS load balancer's health checks will mark those VIPs as down and prevent them from receiving internal network traffic.

Current limitation for custom VIP and BGP: When using BGP, please note a current limitation that prevents configuring a custom VIP address on the Virtual Site. As a workaround, custom VIPs should be advertised on individual sites instead. The F5 product team is actively working to address this gap.
For a detailed exploration of traffic routing options to CEs, please refer to the following article: https://community.f5.com/kb/technicalarticles/f5-distributed-cloud---customer-edge-site---deployment--routing-options/319435

Design Considerations for Single Node HA Model:

Best suited for:

1- Customers with high throughput requirements: This clustering model ensures that all Customer Edge (CE) nodes are engaged in managing ingress traffic from Regional Edges (REs), which allows for scalable expansion by adding CEs as required. In contrast, the default clustering model limits ingress traffic processing to only two CE nodes per cluster, and more precisely, to a single node from each RE, regardless of the number of worker nodes in the cluster. Consequently, this model is more advantageous for customers who have high throughput demands.

2- Customers who prefer controlled failover and software upgrades: This clustering model enables a sequential upgrade process, where nodes are updated individually to ensure each node upgrades successfully before moving on to the other nodes. The process involves detaching the node from the cluster by removing its site label, which redirects traffic to the remaining nodes during the upgrade. Once the node is upgraded, the label is reapplied, and this process is repeated for each node in turn. This is the upgrade procedure customers have known for 20+ years, with the small added wrinkle of the label.

3- Customers who prefer to distribute the load across remote sites: Nodes are deployed independently and do not require inter-node heartbeat communication, unlike the default clustering method. This independence allows for their deployment across various data centers and availability zones while being managed as a single entity. They are compatible with both Layer 2 (L2) spanned and Layer 3 (L3) spanned data centers, where nodes in different L3 networks utilize distinct gateways.
As long as the nodes can access the origin pools, they can be integrated into the same "Virtual Site". This flexibility caters to customers' traditional preferences, such as deploying two CE nodes per location, which is fully supported by this clustering model.

Challenges:

Lack of VRRP Support: The primary limitation of this clustering method is the absence of VRRP support for internal VIPs. However, there are some alternative methods to distribute internal traffic across CE nodes. These include DNS-based routing, BGP with Equal-Cost Multi-Path (ECMP) routing, or the implementation of CEs behind another Layer 4 (L4) load balancer capable of traffic distribution without source address alteration, such as F5 BIG-IPs or the standard load balancers provided by Azure or AWS.

Limitation on Custom VIP IP Support: Currently, the F5 Distributed Cloud Console has a restriction preventing the configuration of custom virtual IPs for load balancer advertisements on Virtual Sites. We anticipate this limitation will be addressed in future updates to the F5 Distributed Cloud platform. As a temporary solution, you can advertise the LB across multiple individual sites within the Virtual Site. This approach enables the configuration of custom VIPs on those sites.

Requires extra steps for upgrading nodes: Unlike the default clustering model where upgrades can be performed collectively on a group of nodes, this clustering model requires upgrading nodes on an individual basis. This may introduce more steps, especially in larger clusters, but it remains significantly simpler than traditional network device upgrades.

Large-Scale Mesh Group: In F5 Distributed Cloud, the "Mesh Group" feature allows for direct connections between sites (whether individual CE sites or clusters of CEs) and other selected sites through IPSec tunnels. For CE clusters, tunnels are established on a per-cluster basis. However, for single-node sites, each node creates its own tunnels to connect with remote CEs.
This setup can lead to an increased number of tunnels needed to establish the mesh. For example, in a network of 10 sites configured with dual-CE Virtual Sites, each CE is required to establish 18 IPSec tunnels to connect with other sites, or 19 for a full mesh configuration. Comparatively, a 10-site network using the default clustering method (with a minimum of 3 CEs per site) would only need up to 9 tunnels from each CE for full connectivity. Opting for Virtual Sites with dual CEs, a common choice, effectively doubles the number of required tunnels from each CE when compared to the default clustering setup. However, despite this increase in tunnels, opting for a Mesh configuration with single-node clusters can offer advantages in terms of performance and load distribution.

Note: Use DC Cluster Group as an alternative to Site Mesh Group for CE connectivity: For customers with existing private connectivity between their CE nodes, running Site Mesh Group (SMG) with numerous IPSec tunnels can be less than optimal. As a more scalable alternative for these customers, we recommend using DC Cluster Group (DCG). This method utilizes IP-in-IP tunnels over the existing private network, eliminating the need for individual encrypted IPSec tunnels between each node and streamlining communication between CE nodes.

Configuration Steps

The configuration for creating single node clusters involves the following steps:

1- Creating a Label
2- Creating a Virtual Site
3- Applying the label to the CE nodes (sites)
4- Reviewing and validating the configuration

The detailed configuration guide for the above steps can be found here: https://docs.cloud.f5.com/docs/how-to/fleets-vsites/create-virtual-site

Example Configuration: In this example, you can create a label called "my-vsite" to group CE nodes that belong to the same Virtual Site.
Within this label, you can then define different values to represent different environments or clusters, such as a specific Azure region or an on-premises data center. Then a Virtual Site of “CE” type can be created to represent the CE cluster in “Azure-AustraliaEast-vSite” and tied to any CE that is tagged with the label “my-vsite=Azure-AustraliaEast-vSite”:

Any CE node that should join the cluster (Virtual Site) must now be given this label:

Verification: To confirm the Virtual Site configuration is functioning as intended, we joined two CEs (k1-azure-ce2 and k1-azure-ce03) into the Virtual Site and evaluated the routing and load balancing behavior.

Test 1: Public Load Balancer (Virtual Site referenced in the pool)

The diagram shows a public "Load Balancer" advertised on the RE referencing a pool that uses the newly created Virtual Site to access the private application:

As shown below, the pool member was configured to be accessed through the Virtual Site:

Analysis of the request logs in the Performance dashboard confirmed that all requests to the public website were evenly distributed across both CEs.

Test 2: Internal Load Balancer (LB advertised on the Virtual Site)

We deployed an internal Load Balancer and advertised it on the newly created Virtual Site, utilizing the pool that also references the same Virtual Site (k1-azure-ce2 and k1-azure-ce03). As shown below, the Load Balancer was configured to be advertised on the Virtual Site.

Note: Here we couldn't use a "shared" custom VIP across the Virtual Site due to a current platform constraint. If a custom VIP is required, we should use "site" as opposed to "Virtual Site" and advertise the Load Balancer on all sites, as in the picture below:

Request logs revealed that when traffic reached either CE node within the Virtual Site, the request was processed and forwarded locally to the pool member. In the example below:

src_site: Indicates the CE (k1-azure-ce2) that processed the request.
src_ip: Represents the client's source IP address (192.168.1.68).
dst_site: Indicates the CE (k1-azure-ce2) from which the pool member is accessed.
dst_ip: Represents the IP address of the pool member (192.168.1.6).

Resilience Testing: To assess the Virtual Site's resilience, we intentionally blocked network access from the k1-azure-ce2 CE to the pool member (192.168.1.6). The CE automatically rerouted traffic to the pool member via the other CE (k1-azure-ce03) in the Virtual Site.

Note: By default, CEs can communicate with each other via the F5 Global Network. This can be customized to use direct connectivity through tunnels if the CEs are members of the same DC Cluster Group (IP-in-IP tunneling) or Site Mesh Group (IPSec tunneling).

The following picture shows the traffic flow via the F5 Global Network.

The following picture shows the traffic flow via the IP-in-IP tunnel when a DC Cluster Group (DCG) is configured across the CE nodes.

Failover Behaviour

When a CE node is tied to a Virtual Site, all internal Load Balancers (VIPs) advertised on that Virtual Site will be deployed in the CE. Additionally, the Regional Edge (RE) begins to use this node as one of the potential next hops for connections to the origin pool. Should the CE become unavailable, or if it lacks the necessary network access to the origin server, the RE will almost seamlessly reroute connections through the other operational CEs in the Virtual Site.

Uncontrolled Failover: During instances of uncontrolled failover, such as when a node is unexpectedly shut down from the hypervisor, we have observed a handful of new connections experiencing timeouts. However, these issues were resolved by implementing health checks within the origin pool, which prevented any subsequent connection drops.

Note: Irrespective of the clustering model in use, it's always recommended to configure health checks for the origin pool.
This practice enhances failover responsiveness and mitigates any additional latency incurred during traffic rerouting.

Controlled Failover: The moment a CE node is disassociated from the Virtual Site by the removal of its label, the RE stops using that CE node to connect to origin pools. At the same time, all Load Balancer listeners associated with that Virtual Site are withdrawn from the node. This effectively halts traffic processing for those applications, preventing the node from receiving related traffic. During controlled failover scenarios, we have observed seamless service continuity on externally advertised services (to REs).

On-Demand Scaling: F5 Distributed Cloud provides a flexible solution that enables customers to scale the number of active CE nodes according to demand. This allows you to easily add more powerful CE nodes during peak periods (such as promotional events) and then remove them when demand subsides. With the Virtual Sites method, you can even mix and match node sizes within your cluster (Virtual Site), providing granular control over resources. It's advisable to monitor CE node performance and implement node-related alerts. These alerts notify you when nodes are operating at high capacity, allowing for timely addition of extra nodes as needed. Moreover, you can monitor the nodes' health in the dashboard. The CPU, memory, and disk utilization of the nodes are good indicators of whether more nodes are needed. Furthermore, the use of Virtual Sites makes managing this process even easier, thanks to labels.

Node Based Alerts: Node-based alerts are essential for maintaining efficient CE operations.

Accessing the alerts in the Console: To view alerts, go to Multi-Cloud Network Connect > Notifications > Alerts. Here, you can see both "Active Alerts" and "All Alerts." Alerts related to node health fall under the "infrastructure" alert group. The following screenshot shows alerts indicating high loads on the nodes.
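As a rough illustration of how CPU, memory, and disk utilization could feed a scale-out decision, here is a minimal Python sketch. It does not query the F5 Distributed Cloud Console or any F5 API: the `NodeUtilization` class, the node names, the sample percentages, the 80% threshold, and the "half of the nodes" rule are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class NodeUtilization:
    """Utilization percentages for one CE node (sample values, not real data)."""
    name: str
    cpu: float
    memory: float
    disk: float


def overloaded_nodes(nodes, threshold=80.0):
    """Names of nodes whose busiest resource is at or above the threshold."""
    return [n.name for n in nodes if max(n.cpu, n.memory, n.disk) >= threshold]


def needs_more_nodes(nodes, threshold=80.0, min_fraction=0.5):
    """Suggest scaling out when at least `min_fraction` of the nodes are overloaded."""
    if not nodes:
        return False
    return len(overloaded_nodes(nodes, threshold)) / len(nodes) >= min_fraction


if __name__ == "__main__":
    # Hypothetical members of one Virtual Site.
    site = [
        NodeUtilization("ce-a", cpu=91.0, memory=62.0, disk=40.0),
        NodeUtilization("ce-b", cpu=55.0, memory=48.0, disk=35.0),
    ]
    print(overloaded_nodes(site))
    print(needs_more_nodes(site))
```

With Virtual Sites, acting on such a signal comes down to labeling an additional CE node so that it joins the Virtual Site, and removing the label again when demand subsides.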
Configuring Alert Policies: Alert policies determine the notification process for raised alerts. To set up an alert policy, navigate to Multi-Cloud Network Connect > Alerts Management > Alert Policies. An alert policy consists of two main elements: the alert receiver configuration and the policy rules.

Configuring Alert Receiver: The configuration allows for integration with platforms like Slack and PagerDuty, among others, facilitating notifications through commonly used channels.

Configuring Alert Rules: For alert selection, we recommend configuring notifications for alerts with severity of “Major” or “Critical” at a minimum. Alternatively, the “infrastructure” group, which includes node-based alerts, can be selected.

Comparison Table

| Criteria | Default Cluster | Single Node HA |
| --- | --- | --- |
| Minimum number of nodes in HA | 3 | 2 |
| Upgrade operations | Per cluster | Per node |
| Network redundancy and client side routing for east-west traffic | VRRP, BGP, DNS, L4/7 LB | DNS, L4/7 LB, BGP* |
| Tunnels to RE | 2 tunnels per cluster | 2 tunnels per node |
| Tunnels to other CEs (SMG or DCG) | 1 tunnel from each cluster | 1 tunnel from each node |
| External traffic processing | Limited to 2 nodes | All nodes will be active |
| Internal traffic processing | All nodes can be active | All nodes can be active |
| Scale management in Public Cloud Sites | Straightforward, by configuring ingress interfaces in Azure/AWS/GCP sites | Straightforward, by adding or removing the labels |
| Scale management in Secure Mesh Sites | Requires reconfiguring the cluster (secure mesh site); may cause interruption | Straightforward, by adding or removing the labels |
| Custom VIP IP | Available | Not available (planned for future releases); workaround available |
| Node sizes | All nodes should be the same size. Upgrading node size in a cluster is a disruptive operation. | Any node sizes or clusters can join the Virtual Site |

* When using BGP, please note a current limitation that prevents configuring a custom VIP address on the Virtual Site.
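The "Tunnels to other CEs" row, combined with the 10-site example given earlier in this article, can be checked with simple arithmetic. The Python sketch below is only a counting aid: the function names and parameters are hypothetical, and it assumes every site has the same number of CEs and that each tunnel endpoint connects to every remote endpoint.

```python
def tunnels_per_ce_virtual_site(sites, ces_per_site, include_local_peers=False):
    """Tunnels each single-node CE builds when every site is a Virtual Site.

    Each CE connects to every CE in the other sites; with
    include_local_peers=True it also connects to the other CEs in its
    own site (full mesh).
    """
    remote = (sites - 1) * ces_per_site
    local = (ces_per_site - 1) if include_local_peers else 0
    return remote + local


def tunnels_per_cluster_default(sites):
    """Tunnels in the default model: one per remote cluster."""
    return sites - 1


if __name__ == "__main__":
    print(tunnels_per_ce_virtual_site(10, 2))                            # 18
    print(tunnels_per_ce_virtual_site(10, 2, include_local_peers=True))  # 19
    print(tunnels_per_cluster_default(10))                               # 9
```

The three printed values match the 18, 19, and 9 tunnels quoted for the 10-site example, and show that the dual-CE Virtual Site count is twice the default clustering count.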
Conclusion: F5 Distributed Cloud offers a flexible approach to High Availability (HA) across CE nodes, allowing customers to select the redundancy model that best fits their specific use cases and requirements. While we continue to advocate for the default clustering approach because of its operational simplicity and the benefits of a shared VRRP VIP and unified network configuration, especially for routine tasks like upgrades, the Virtual Site and single node HA model has some compelling use cases. It not only addresses the limitations and challenges of the default clustering model, but also introduces a solution that is both scalable and adaptable. While Virtual Sites offer their own benefits, we recognize they also present trade-offs. The overall benefits, particularly for scenarios demanding high ingress (RE to CE) throughput and controlled failover capabilities, cater to specific customer demands. The F5 product and development team remains committed to addressing the limitations of both default clustering and Virtual Sites discussed throughout this article. Their focus is on continuous improvement and finding the solutions that best serve our customers' needs.

References and Additional Links:

Default Clustering model: https://docs.cloud.f5.com/docs/ves-concepts/site#cluster-of-nodes
Configuration guide for Virtual Sites: https://docs.cloud.f5.com/docs/how-to/fleets-vsites/create-virtual-site
Routing Options for CEs: https://community.f5.com/kb/technicalarticles/f5-distributed-cloud---customer-edge-site---deployment--routing-options/319435
Configuration guide for DC Cluster Group: https://docs.cloud.f5.com/docs/how-to/advanced-networking/configure-dc-cluster-group