Compliant Infrastructure: Network Configuration
August 30, 2023
Robert Konarskis, Co-founder and CTO, Savages Corp.
This is the second article in a 7-piece series on compliant IT infrastructure. It is intended for software engineers, technical leaders and decision makers to improve the understanding of networking when operating in regulated environments with tight security requirements.
- Organization setup
- Network configuration
- Perimeter security
- Data protection
- Logging and monitoring
- Source code security
- Change management and auditing
Network Configuration
Of all the available topics, partners, auditors, penetration testers, and everyone else involved in ensuring regulatory compliance tend to spend a disproportionately large amount of time on IT network security. However, I believe that the main reason is not the network itself, but rather that network setups can be audited, enforced, and strictly monitored, whereas humans cannot.
According to Verizon’s 2023 Data Breach Investigations Report, 74% of breaches involved the human element, which includes social engineering attacks, errors or misuse, while 83% of breaches involved external actors, the majority of them financially motivated. The two key takeaways are: humans are exploited more often than code, and attackers are more interested in regulated industries because the potential catch is usually bigger than in other industries. The latter we can take as a constant and a part of reality when operating in industries such as financial services, but there is a lot we can do to make people harder to exploit by adding a layer of network security in front of otherwise vulnerable systems.
Think of networks as a concierge in an apartment building, systems with user data as apartments with valuables in that building, and tenants as your employees. You have limited control over tenants losing the key to their apartment or even leaving the door open, and in the absence of a concierge controlling who gets to enter and exit the building, the apartments are vulnerable. If the concierge does their job well, nobody except the tenants will even be able to enter the building, making the individual apartments inaccessible from the outside. In this example, the concierge is still human, and that is on purpose: it illustrates that no individual measure is strong enough to guarantee 100% safety, but multiple measures together make the overall system much less vulnerable.
By the way, if you read the previous article about IT Organization Setup, you can imagine building security cameras being an equivalent of audit logs in IT systems.
Three Kinds of Resources
Since the majority of cyber attacks on IT organizations come from the internet, it’s essential to introduce the distinction between the kinds of resources relative to their access to the internet. A resource can be a database server, load balancer, application server, etc.
TL;DR Public resources have ingress and egress internet access, private resources only have egress internet access, and isolated resources have no internet access.
Public resources are components that are both accessible from the internet, i.e. have a public IP address by which they can be reached, and able to reach other public resources on the internet themselves. For example, most load balancers, client-facing web servers, bastion hosts and NAT appliances are public, internet-facing resources.
Private resources, just like the public ones, are able to reach other public resources on the internet. The difference between public and private resources is that private resources are not assigned a public IP, hence they are not directly reachable from the internet. This makes private resources harder to access since they can only directly be reached within the network they reside in. However, they can still initiate a connection to the internet, meaning they can theoretically download and/or execute malicious code from inside the private portion of the network.
Lastly, isolated resources are neither accessible from the internet, nor can they access the internet themselves. Not being able to access the internet means that any code running in such isolated resources cannot download a virus or make external API calls from within the network.
You may have noticed that the resource kinds described above are not actually attributes of the resource itself, but rather its networking capabilities. A database server, just like any other resource, can be configured to be either isolated, private, or public depending on its placement within the network. For convenience, networks are split into subnets – portions of the network, with different configurations depending on the kind of resources that are intended to be placed there.
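The three kinds can be written down as a tiny classification over two networking capabilities. This is only an illustrative sketch: the names `ResourceKind` and `classify` are made up for this example and do not correspond to any cloud provider API.

```python
from enum import Enum


class ResourceKind(Enum):
    PUBLIC = "public"      # reachable from the internet and can reach it
    PRIVATE = "private"    # can reach the internet, not reachable from it
    ISOLATED = "isolated"  # no internet access in either direction


def classify(internet_ingress: bool, internet_egress: bool) -> ResourceKind:
    """Map the networking capabilities of a placement to a resource kind."""
    if internet_ingress and internet_egress:
        return ResourceKind.PUBLIC
    if internet_egress:
        return ResourceKind.PRIVATE
    if not internet_ingress:
        return ResourceKind.ISOLATED
    # Ingress without egress is not one of the three kinds described here.
    raise ValueError("ingress-only placement is not covered by this model")


# The same database server ends up in a different kind depending on placement.
print(classify(internet_ingress=False, internet_egress=False))
```

Note that the function takes no information about the resource itself, only about where it sits in the network, which is exactly the point made above.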
Subnets and Resource Placement
TL;DR The principle of least privilege is to be followed. Public subnets should only be used for resources that must have full internet access, such as load balancers, NAT appliances, bastion hosts. Most of the application code can reside in the private subnets with egress-only internet connectivity, typically to reach the APIs of other systems. Systems that store any kind of user data such as databases, caches, search indexes, and others should stay in isolated subnets for minimum exposure.
Let us walk through the common setup evolution, explaining the needs, advantages and disadvantages every step of the way. Physical network separation using availability zones is excluded from the examples for the sake of simplicity and is introduced later in the article.
The simplest setup would involve a network with a single public subnet hosting a single resource, such as a web server. It can serve HTTP traffic on port 80, perhaps also HTTPS on port 443, but it can equally well be pinged via ICMP or connected to via SSH. It’s very simple, but difficult to scale, and the server can be reached directly from the internet.
As the application matures, it may need to be able to serve more traffic and provide better security and availability guarantees. This is a good time to introduce a load balancer fronting the application, and move the web server “behind” the load balancer. This way, only the load balancer would have a public IP address, and it would only forward traffic on the desired ports to the web server. At this point, nobody outside the network can connect directly to the application server, even if they have credentials.
At some point, usually at the very beginning, the need to store user data arises, and we need to decide where to place a database, a cache, or a similar resource. Since these systems only store data and do not need to be connected to the internet, it’s best practice to place them in isolated subnets. Only other resources within the same network can reach anything in the isolated subnets, hence, in this example, access to the database is limited to our application servers.
The examples above are largely simplified on purpose to illustrate the key principles behind resource placement in networks, and how this provides an additional layer of security to the system. The database still needs its own set of credentials that can be exposed one way or another, but even if that happens, the attacker would have to find a way to first get inside the network to even be able to reach the database.
In practice, a large variety of configurations are valid depending on the specific use case, as long as the principles of least access are respected, which is also what the auditors are looking for. Some compliance frameworks such as PCI-DSS are more strict than others, in which case it’s best to fully isolate the cardholder data environment from the rest of the system not just on the network level (separate VPC), but, depending on the cloud provider of choice, on the organization level, e.g. by hosting the setup in a separate AWS account.
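The placement rule in this section can be turned into a simple audit: for each resource, work out the least exposed subnet kind that still satisfies its needs, and flag anything placed somewhere more exposed. The sketch below is a hypothetical model, with invented names (`Resource`, `least_exposed_kind`, `over_exposed`) and invented sample data, and is not tied to any provider tooling.

```python
from dataclasses import dataclass

# Higher number means more exposure to the internet.
EXPOSURE = {"isolated": 0, "private": 1, "public": 2}


@dataclass(frozen=True)
class Resource:
    name: str
    subnet_kind: str     # where the resource is actually placed
    needs_ingress: bool  # must be reachable from the internet
    needs_egress: bool   # must be able to call out to the internet


def least_exposed_kind(resource: Resource) -> str:
    """Return the most restrictive subnet kind that still meets the needs."""
    if resource.needs_ingress:
        return "public"
    if resource.needs_egress:
        return "private"
    return "isolated"


def over_exposed(resources):
    """Names of resources placed in a more exposed subnet than they need."""
    return [
        r.name
        for r in resources
        if EXPOSURE[r.subnet_kind] > EXPOSURE[least_exposed_kind(r)]
    ]


# Hypothetical inventory for illustration only.
inventory = [
    Resource("load-balancer", "public", needs_ingress=True, needs_egress=True),
    Resource("app-server", "private", needs_ingress=False, needs_egress=True),
    Resource("database", "public", needs_ingress=False, needs_egress=False),
]
print(over_exposed(inventory))
```

In this made-up inventory only the database is flagged, since it needs no internet access at all yet sits in a public subnet.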
Access to Private and Isolated Resources
Following the principle of least access, it has been established that any resources that do not require internet connectivity should be placed in non-public subnets. However, there are cases when engineers, admins, or other team members are required to connect directly to these resources for troubleshooting the system. Since nobody can connect to them directly from outside the network, additional resources need to be deployed to enable such access.
The simpler setup involves deploying a bastion host in a public subnet, which is a server both accessible from the internet, and able to access private and isolated resources inside the network at the same time. Its sole purpose is to provide access to private resources within the network from outside the network itself, e.g. a developer’s computer.
Since deploying a bastion host in a public subnet creates an attack vector to all resources within the network, there is a set of best practices to follow:
- Do not share the IP address or the domain name outside the organization.
- Configure routing rules in such a way that the bastion host can only access the resources it is meant to access within the network. For example, you may have a password-protected database, as well as some applications with open SSH ports running inside the network: make sure that the bastion host cannot access any unprotected resources in case it has been compromised.
- Configure source IP whitelisting to ensure that nobody can access the bastion host unless their IP has been permitted. Even better, the whitelisting should be temporary, e.g. for a maximum of one hour, and behind an approval process.
- Ensure that the bastion host does not accept any unauthorized SSH connections. Computers with a need to access the network must have their public SSH keys uploaded to the bastion host ahead of time.
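The temporary whitelisting practice from the list above could be modelled roughly as follows. This is a sketch of the decision logic only: the `TemporaryAllowlist` class and the one-hour cap constant are assumptions made for illustration, and in a real setup the equivalent rules would live in the firewall or security group configuration, behind the approval process.

```python
from datetime import datetime, timedelta, timezone
from ipaddress import ip_address, ip_network

MAX_GRANT = timedelta(hours=1)  # assumed cap, following the one-hour example


class TemporaryAllowlist:
    """Source IP allowlist whose entries expire automatically."""

    def __init__(self):
        self._grants = []  # list of (network, expires_at) tuples

    def grant(self, cidr: str, now: datetime, duration: timedelta = MAX_GRANT):
        if duration > MAX_GRANT:
            raise ValueError("grants are capped at one hour")
        self._grants.append((ip_network(cidr), now + duration))

    def is_allowed(self, source_ip: str, now: datetime) -> bool:
        addr = ip_address(source_ip)
        return any(addr in net and now < expires for net, expires in self._grants)


# 198.51.100.7 is an address from a range reserved for documentation.
start = datetime(2023, 8, 30, 12, 0, tzinfo=timezone.utc)
allowlist = TemporaryAllowlist()
allowlist.grant("198.51.100.7/32", now=start)
print(allowlist.is_allowed("198.51.100.7", now=start + timedelta(minutes=30)))
print(allowlist.is_allowed("198.51.100.7", now=start + timedelta(minutes=61)))
```

Passing `now` explicitly keeps the logic easy to test; the second check falls outside the one-hour window, so access is no longer granted.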
Bastion hosts are a proven solution for reaching private resources inside the network, but they have a few drawbacks. A bastion host has to be, arguably, the most secure part of the system, and maintaining whitelisted IPs, keeping the OS up to date, managing SSH keys and other operations can be time-consuming and complicated.
Another way to securely access resources inside the network is using a VPN (Virtual Private Network) solution. A VPN is a way of creating a secure connection between a computer and the network (or two networks) by giving the computer its own IP address inside the network. It is a little trickier to set up than a bastion host and requires more network planning, but it offers several advantages over the latter:
- It enables the use of SSO (Single Sign-On) for authentication, the benefits of which were explored in the previous article. Most importantly, it allows managing access from a single place as opposed to additionally handling IP addresses and SSH keys in multiple places, which, in turn, not only increases security but simplifies compliance at the same time.
- It greatly reduces the human intervention required to provision access to the resources, consequently, reducing the risk of human error.
- For an additional layer of protection, VPN can be used together with a bastion host, where users would first need to connect using VPN, then bastion host, and only then access the resources.
Cloud service providers know the benefits and offer “out-of-the-box” solutions such as AWS Client VPN, but at a cost that can be orders of magnitude higher than that of a tiny EC2 bastion host. As with everything, the best solution depends on the specific needs and abilities of an IT organization.
Connections to External Systems
So far, we have covered networking within a single system and between a system and the user outside the network. More often than not, a need to connect to external systems with their own networks arises, such as calling external APIs, receiving webhooks, subscribing to events, etc. With respect to internet connectivity, the same principle of least access should be applied.
By far the simplest and most secure approach is using the cloud provider’s private connectivity options, such as AWS PrivateLink. This allows two systems to only expose their endpoints to each other on the provider’s private network without exposing them to the internet. The downside is that both systems must be using the same cloud provider, which is not always possible.
In case the above approach is not possible because the two systems use different cloud providers, or not desirable for vendor lock-in or other reasons, there is an equally secure although more complicated solution: site-to-site VPN. In this case, an encrypted connection is established between the two networks, but the traffic flows over the public internet instead of the cloud provider’s private network.
The solutions described so far are ideal since neither system has to expose any resources to the public internet. However, in practice, they are usually limited to connections between systems within the same organization, enterprise companies with tight long-term business relationships, and very highly regulated environments such as aerospace. In the vast majority of cases, we build systems that integrate with SaaS solutions’ public endpoints, or receive webhooks from such systems over the public internet. In this case, while predominantly relying on application credentials, there are ways to add a layer of network security into the mix.
When calling public APIs from within the system, the risk is usually lower since our system initiates the communication and does not expose any public endpoints. However, in highly regulated environments, it is best to implement an egress firewall solution with whitelisted domains that the system is allowed to access. Such a firewall would inspect all outgoing traffic, perform reverse DNS lookups, and refuse connections to any non-whitelisted systems.
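The allow/deny decision at the heart of such an egress firewall can be sketched as below. The sketch assumes the firewall has already determined the destination hostname of a connection; the domain names are placeholders, and treating subdomains of a listed domain as allowed is a design choice made for this example, not a rule from any particular product.

```python
# Hypothetical allowlist; real entries would be the external APIs in use.
ALLOWED_DOMAINS = frozenset({"api.example.com", "example.org"})


def is_egress_allowed(hostname: str, allowed=ALLOWED_DOMAINS) -> bool:
    """Allow a destination only if it is a listed domain or a subdomain of one."""
    host = hostname.rstrip(".").lower()
    return any(host == domain or host.endswith("." + domain) for domain in allowed)


print(is_egress_allowed("api.example.com"))   # listed explicitly
print(is_egress_allowed("evil-example.org"))  # only looks similar
```

Matching on a leading dot avoids the classic mistake of accepting `evil-example.org` just because it ends with `example.org`.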
When receiving webhooks from external public systems, we must expose a public endpoint that those systems can reach, immediately making the system more vulnerable to attacks. Besides webhook verification in-code, it is best to maintain a whitelist of source IPs that are allowed to reach the public endpoint, to block traffic from anywhere other than the intended system.
Stripe takes this seriously and encourages their partners to take advantage of the additional network-level security available: https://stripe.com/docs/ips. However, not all SaaS providers have such options in place.
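Combining both layers for an incoming webhook might look like the sketch below: the source IP is checked against an allowlist first, and only then is the payload signature verified. Everything specific here is an assumption for illustration: the allowlisted range is a documentation-only range, the HMAC-SHA256 hex signature is a generic scheme and not Stripe's or any other provider's actual format, and in practice the IP filter would usually sit in a firewall rather than in application code.

```python
import hashlib
import hmac
from ipaddress import ip_address, ip_network

# Placeholder range reserved for documentation; use the provider's published IPs.
ALLOWED_SOURCES = [ip_network("203.0.113.0/24")]


def accept_webhook(source_ip: str, body: bytes, signature_hex: str, secret: bytes) -> bool:
    """Accept a webhook only from an allowlisted IP with a valid signature."""
    addr = ip_address(source_ip)
    if not any(addr in net for net in ALLOWED_SOURCES):
        return False  # network-level check failed, do not even look at the body
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(expected, signature_hex)


secret = b"hypothetical-shared-secret"
body = b'{"event": "payment.succeeded"}'
signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
print(accept_webhook("203.0.113.10", body, signature, secret))
print(accept_webhook("198.51.100.10", body, signature, secret))
```

The second call is rejected even though the signature is valid, which is the extra layer the source IP whitelist provides if the signing secret ever leaks.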
Two or Three Availability Zones
Besides security, most systems have high availability requirements, meaning they must remain operational if a part of the system experiences a failure of some sort. For this reason, most corporate and cloud data centers span multiple physical locations to account for fires, floods, earthquakes, electricity or internet outages, configuration mistakes, or any other disasters. Most cloud providers allow choosing between one and three AZs (availability zones), each of which spans one or more physical locations, to use within a single network. Any single physical location is likely to experience issues, so it is advisable to spread resources across multiple availability zones, at least two, for the system to be considered highly available.
In practice, for a vast majority of systems, running in two AZs is enough and can provide minor cost savings over the three-AZ setup. It allows failing over the database and applications and running message brokers in an active-standby configuration, while remaining resilient to a single AZ failure.
However, there are two main reasons to consider starting off with a 3 AZ setup, even if it may come at a slightly higher cost. The most important reason is distributed systems such as Kafka, Elasticsearch and others that require a quorum to operate (a majority of nodes being available and agreeing on decisions). Even if there is no need for such systems initially, once the need arises, they must be deployed across at least 3 locations for maximum availability. If the network is configured to use 2 AZs in the beginning, switching to a 3 AZ setup can be very time-consuming and risky.
Additionally, the system (or most of its parts) is now able to withstand the failure of two independent physical locations at the same time while remaining operational, further increasing availability guarantees. Such simultaneous failures are rare, though.
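The quorum argument for three AZs comes down to simple arithmetic, sketched below. The function names are invented for this example, and the model only counts nodes: it assumes a plain majority quorum and ignores product-specific details of Kafka, Elasticsearch or similar systems.

```python
def quorum(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return total_nodes // 2 + 1


def survives_az_loss(nodes_per_az) -> bool:
    """True if a majority remains after losing the AZ holding the most nodes."""
    total = sum(nodes_per_az)
    remaining = total - max(nodes_per_az)
    return remaining >= quorum(total)


print(survives_az_loss([2, 1]))     # 3 nodes over 2 AZs
print(survives_az_loss([2, 2]))     # 4 nodes over 2 AZs
print(survives_az_loss([1, 1, 1]))  # 3 nodes over 3 AZs
```

With only two AZs, one of them always holds at least half of the nodes, so losing it leaves no majority however many nodes are added; with three AZs and an even spread, any single AZ can fail.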
IP Allocation
This section is more technical and may be hard to follow, but it is crucial to understand: IP allocation choices often have to be made early, and their consequences are significant because, with most cloud providers, networks and subnets can be difficult and risky to modify in production.
Networks can be large, but not unlimited in size, meaning there are only a certain number of IP addresses available for resources within the network. Since most networks are divided into subnets based on the resource kinds they are intended for, as well as availability zones for high availability, IP address allocation to subnets is one of the most important decisions that must be made upfront and can often hurt if done without enough forward thinking. For example, a typical VPC in AWS has a CIDR block of 10.0.0.0/16, making 65,536 IP addresses usable (slightly fewer in reality, as some IP addresses in each of the subnets are reserved for internal usage). In the case of AWS, additional CIDR blocks can later be added to expand the network for particularly large systems.
Since we have established that only a limited number of resources need to be exposed publicly, public subnets can usually be small in size, limited to 16 or 32 IP addresses (CIDR /28 or /27). Private, or application, subnets are meant to host most of the application code, hence it is recommended to reserve a large IP space for resources in these subnets, anywhere from 1024 to 8192 IP addresses (CIDR /22 to /19). Lastly, isolated subnets can contain databases, caches, and perhaps code not requiring any internet access, meaning these can be kept relatively small, but not too small. Our recommendation is leaving anywhere between 256 and 1024 IP addresses for this space (CIDR /24 to /22). For more granular resource isolation and access control, it is not uncommon to create subnet groups per resource type, e.g. separate subnets for Kafka, Postgres, Elastic, etc. In this case, the subnets can be sized appropriately based on the requirements of the given system.
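The sizes quoted above follow from the prefix length: an IPv4 block with prefix /n contains 2^(32 - n) addresses. A short sketch using Python's standard `ipaddress` module confirms the numbers; the helper name `addresses_in` is made up for this example.

```python
import ipaddress


def addresses_in(prefix_length: int) -> int:
    """Number of IPv4 addresses in a block with the given prefix length."""
    return 2 ** (32 - prefix_length)


for prefix in (16, 19, 22, 24, 27, 28):
    print(f"/{prefix}: {addresses_in(prefix)} addresses")

# The standard library agrees with the formula.
network = ipaddress.ip_network("10.0.0.0/16")
print(network.num_addresses == addresses_in(16))
```

Keep in mind that the number of addresses actually assignable to resources is a little lower, since providers reserve a few addresses in every subnet.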
In addition to public, private and isolated subnets, it is common to need at least two more kinds of subnets – Firewall and VPN subnets, to host the aforementioned resources, even if they are not used from the beginning. Firewall subnets can be small, 16 to 32 IPs, as they will only host packet inspection software, while VPN subnets should remain larger, 256 to 1024 IPs, since in many cases, each client connection would take up a single IP address. The larger the organization, the larger these subnets need to be.
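Going in the other direction, from a required number of addresses to a subnet size, can be sketched as follows. The function name and the `reserved` parameter are assumptions for illustration; how many addresses a provider reserves per subnet should be taken from its documentation.

```python
def smallest_prefix_for(address_count: int, reserved: int = 0) -> int:
    """Longest IPv4 prefix whose block still holds the requested addresses."""
    needed = address_count + reserved
    if needed < 1:
        raise ValueError("at least one address is required")
    host_bits = (needed - 1).bit_length()  # bits needed to number the addresses
    if host_bits > 32:
        raise ValueError("does not fit in the IPv4 address space")
    return 32 - host_bits


# Hypothetical VPN sizing: one address per concurrent client connection.
print(smallest_prefix_for(300))   # 300 clients
print(smallest_prefix_for(1000))  # 1000 clients
```

For 300 concurrent clients this yields a /23 (512 addresses) and for 1000 clients a /22 (1024 addresses), which is in line with the 256 to 1024 range suggested for VPN subnets.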
Overall, a safe starting point for most networks is the following:
- /16 CIDR for the whole network
- /27 CIDR for public subnets
- /19 CIDR for private subnets
- /22 CIDR for isolated subnets
- /27 CIDR for firewall subnets
- /22 CIDR for VPN subnets
In total, this reserves a space of 10,304 IPs per availability zone. Multiplied by 3 AZs, this means that 30,912 IP addresses, roughly half of the total CIDR block, are allocated, leaving space both for scale and for changes to network configuration in the future.
Tip: it is recommended to use overlapping CIDR ranges for production, staging, development, and any other environment stages, meaning that if the production network uses the 10.0.0.0/16 CIDR, the others should use the same. Not only does this simplify configuration, but it also makes it impossible to peer the networks together and expose production data to non-production resources, and vice versa.
Moreover, it is recommended to always leave space for expansion of the network. If a need for a separate environment arises, e.g. for storing cardholder data under the PCI-DSS regulatory framework, it’s best not to assign the next available CIDR (10.1.0.0/16) to it, but at least 10.6.0.0/16. This way, the original network at 10.0.0.0/16 can still grow by up to five additional /16 blocks (10.1.0.0/16 through 10.5.0.0/16) while maintaining a contiguous IP address space.
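The totals above can be checked, and a concrete layout derived, with a short sketch based on Python's `ipaddress` module. The allocation strategy here (largest subnets first, packed back to back from the start of the network) is just one simple way to keep every subnet aligned; it is an assumption of this example, not a layout prescribed by the article, and the names `PREFIX_BY_KIND` and `plan_subnets` are hypothetical.

```python
import ipaddress

# Starting point from the list above: subnet kind -> prefix length.
PREFIX_BY_KIND = {"public": 27, "private": 19, "isolated": 22, "firewall": 27, "vpn": 22}


def plan_subnets(vpc_cidr: str, prefix_by_kind: dict, az_count: int) -> dict:
    """Carve one subnet per kind and AZ out of the network, largest first."""
    vpc = ipaddress.ip_network(vpc_cidr)
    # Sorting by prefix length puts the largest blocks first, which keeps
    # every subsequent block aligned to its own size.
    requests = sorted(
        (prefix, kind, az)
        for kind, prefix in prefix_by_kind.items()
        for az in range(az_count)
    )
    cursor = int(vpc.network_address)
    plan = {}
    for prefix, kind, az in requests:
        subnet = ipaddress.ip_network((cursor, prefix))
        if not subnet.subnet_of(vpc):
            raise ValueError("the requested subnets do not fit in the network")
        plan[(kind, az)] = subnet
        cursor += subnet.num_addresses
    return plan


plan = plan_subnets("10.0.0.0/16", PREFIX_BY_KIND, az_count=3)
per_az = sum(2 ** (32 - prefix) for prefix in PREFIX_BY_KIND.values())
total = sum(subnet.num_addresses for subnet in plan.values())

print(per_az, total)  # 10304 30912
for (kind, az), subnet in sorted(plan.items()):
    print(f"{kind:<9} az{az}  {subnet}")
```

The allocated total is about 47% of the 65,536 addresses in the /16, so roughly half of the block stays free for new subnet kinds or larger subnets later on.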
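The amount of room left between two networks can be checked with a few lines of Python. This is a small sketch with a made-up helper name, `free_blocks_between`, and it assumes both networks use the same prefix length.

```python
import ipaddress


def free_blocks_between(first_cidr: str, second_cidr: str) -> int:
    """Count same-sized blocks that fit between two equally sized networks."""
    first = ipaddress.ip_network(first_cidr)
    second = ipaddress.ip_network(second_cidr)
    if first.prefixlen != second.prefixlen:
        raise ValueError("this sketch only handles networks of equal size")
    gap = int(second.network_address) - int(first.broadcast_address) - 1
    return gap // first.num_addresses


print(free_blocks_between("10.0.0.0/16", "10.6.0.0/16"))  # 5
print(free_blocks_between("10.0.0.0/16", "10.1.0.0/16"))  # 0
```

Placing the second environment at 10.6.0.0/16 leaves five adjacent /16 blocks for the first network to grow into, whereas 10.1.0.0/16 leaves none.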
A Balanced Starting Point
Concluding the best practices and approaches mentioned throughout this article, below is a recommended generic starting point for most production systems running in the cloud. The example uses AWS, but the same principles apply to any major cloud provider. For non-production accounts, most resources can be scaled down or deployed in single-instance configuration, e.g. NAT appliance, RDS instance, etc. for lower costs.
In Practice
Networking best practices do not only apply during audits, but also during normal business due diligence. One of our clients had a digital product acting as a lead generation tool for life insurance sales, and was about to partner up with an insurance underwriter, which required integrating with their system. As part of the underwriter’s due diligence process, they required network diagrams of our client’s system, and discovered that a database server was hosted on a public subnet, meaning that it was accessible from the public internet and only protected by user credentials. It was a no-go for the underwriter, and the integration had to be postponed until the database server was migrated to isolated subnets, causing business delays.
Wisdom is prevention, and disasters rarely happen due to a single mistake, but rather a series of unfortunate circumstances on different levels. When it comes to IT organization security, it is our responsibility to create multiple security layers, making individual mistakes less costly, and to use forward thinking to reduce the overhead of managing and scaling such systems over time.
In the next article, we will look at the network perimeter security concepts and best practices.