This is Part 3 of a blog series on the integration between Google Cloud External Key Manager (EKM) and Fortanix DSM™. If you have not read Part 1 and Part 2, we encourage you to read them first, as they introduce the topic.
The Google Cloud External Key Manager (EKM) service is now generally available, and so is the integration of Fortanix DSM with Google Cloud EKM as an external key manager. Since announcing the integration in November 2019, we have worked with many large enterprises to incorporate this new service and have learned a lot in the process. In this blog, we explain some of the common requirements large enterprises have for implementing external key management and describe how Fortanix DSM meets them. We also discuss best practices for configuring your Fortanix account when enabling the integration with Google Cloud EKM.
Cloud EKM enables enterprises to mitigate their security and compliance risk by retaining control of their encryption keys and securing them outside of the public cloud. We recommend the following requirements when implementing external key management:
- High availability: The external key manager (EKM) used with the Cloud EKM service should be at least as available as the Google Cloud KMS service with which it integrates.
- Disaster recovery: If the EKM is destroyed and keys are lost, there should be a way to recover the keys.
- Performance: Latency and throughput should be within acceptable limits.
- Role-based access control: Access to the EKM keys should be tightly controlled in the external key manager.
- Auditability: While GCP logs activities inside the cloud, enterprises expect operations performed outside the cloud on the EKM to be logged with a similar level of granularity.
High Availability
High availability is critical for Google Cloud EKM: if the EKM key is not available to BigQuery or Google Compute Engine (GCE), the service becomes unavailable to the user. These are often business-critical services for an enterprise.
Fortanix DSM is architected to always run as a cluster of three or more nodes to provide high availability. The service stays fully available as long as a majority of nodes in the cluster is available. If a majority of nodes is lost, existing Google Cloud EKM keys remain usable, but new keys cannot be created.
The nodes of a cluster can be spread across physically separate data centers, so the cluster can tolerate the failure of an entire data center or of the connectivity between data centers. The service stays fully available as long as a majority of nodes remains available. In a multi-data-center deployment, a global load balancer routes incoming traffic to the nearest data center.
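The majority rule described above can be made concrete with a little arithmetic. The following is a minimal sketch, not Fortanix DSM code: the function names and the data center layout are hypothetical, and it only models the "strict majority of nodes" condition stated in the text.

```python
def has_quorum(total_nodes: int, available_nodes: int) -> bool:
    """Return True when a strict majority of cluster nodes is available."""
    if total_nodes < 1 or not 0 <= available_nodes <= total_nodes:
        raise ValueError("node counts are out of range")
    return available_nodes > total_nodes // 2


def tolerated_failures(total_nodes: int) -> int:
    """Largest number of nodes that can fail while a majority survives."""
    return (total_nodes - 1) // 2


# Hypothetical layout: number of nodes per data center (illustrative only).
sites = {"dc-east": 2, "dc-west": 2, "dc-central": 1}
total = sum(sites.values())

# Check whether the cluster keeps a majority if any single site goes down.
for lost_site, lost_nodes in sites.items():
    print(lost_site, "lost -> quorum kept:", has_quorum(total, total - lost_nodes))
```

Under this model, an even split of nodes across exactly two data centers cannot survive the loss of either one, which is one reason to consider a third site when planning a multi-data-center deployment.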
Disaster Prevention and Recovery
While HA provides continuous access to the EKM key, it would still be disastrous if the key were permanently lost or deleted, whether through human error, a software error, a hardware failure, or a disaster at the facilities. Services like BigQuery would lose all data in this scenario, so it is critical to have measures in place to recover from such a disaster.
Fortanix DSM provides several mechanisms to prevent disasters and to recover from them if they do happen.
- Protecting against accidental or malicious deletion of an EKM key or change of permissions: Fortanix DSM supports quorum-based policies that build an approval process in which multiple administrators must approve a sensitive action on a key, such as deleting it or changing its permissions, before the action is performed. We recommend that customers configure a quorum policy with multiple administrators on the group they use to create and store the Google Cloud EKM keys.
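The idea behind a quorum policy can be sketched as an "m of n" approval check. This is an illustrative model only and does not use the Fortanix DSM API: the class names, the administrator names, and the key name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class QuorumPolicy:
    approvers: frozenset  # administrators allowed to approve (hypothetical names)
    required: int         # number of distinct approvals needed (m of n)

    def __post_init__(self):
        if not 1 <= self.required <= len(self.approvers):
            raise ValueError("required approvals must be between 1 and the number of approvers")


@dataclass
class ApprovalRequest:
    action: str           # e.g. "delete_key" or "change_permissions"
    key_name: str
    policy: QuorumPolicy
    approvals: set = field(default_factory=set)

    def approve(self, admin: str) -> None:
        """Record an approval; repeat approvals by the same admin count once."""
        if admin not in self.policy.approvers:
            raise PermissionError(f"{admin} is not an approver for this policy")
        self.approvals.add(admin)

    def is_approved(self) -> bool:
        return len(self.approvals) >= self.policy.required


policy = QuorumPolicy(approvers=frozenset({"alice", "bob", "carol"}), required=2)
request = ApprovalRequest(action="delete_key", key_name="ekm-key-1", policy=policy)
request.approve("alice")
print(request.is_approved())  # one approval is not enough for a 2-of-3 policy
request.approve("bob")
print(request.is_approved())  # the sensitive action may now proceed
```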
If a key is deleted or lost despite a quorum policy being configured, there are two ways to recover it.
- Soft deletion of keys: A policy can be configured on a Fortanix DSM account or group to ensure that keys are not deleted permanently. With this "soft deletion" policy enabled, a key that a user accidentally tries to delete is only archived and can be recovered from the archive.
- Restore from full backup: In a worst-case scenario where a key is deleted and the above measures were not in place, or where there is some other form of data corruption, a Fortanix DSM cluster can still be restored from an earlier backup. This reverts the state of the cluster to the time of the backup, so data created in the interim is lost. However, if all other measures fail, this method ensures that an EKM key can still be restored.
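The difference between soft deletion and permanent deletion can be illustrated with a toy in-memory key store. This is a sketch under stated assumptions, not how Fortanix DSM stores keys: the `KeyStore` class, its methods, and the key name are hypothetical, and real key material would never be held in a plain dictionary.

```python
class KeyStore:
    """Toy key store showing archive-on-delete (soft deletion)."""

    def __init__(self, soft_delete: bool = True):
        self.soft_delete = soft_delete
        self._active = {}    # keys that can be used
        self._archived = {}  # soft-deleted keys that can still be restored

    def create(self, name: str, material: bytes) -> None:
        self._active[name] = material

    def delete(self, name: str) -> None:
        """Remove a key from use; archive it when soft deletion is enabled."""
        material = self._active.pop(name)
        if self.soft_delete:
            self._archived[name] = material

    def restore(self, name: str) -> None:
        """Bring an archived key back; raises KeyError if it was hard-deleted."""
        self._active[name] = self._archived.pop(name)

    def is_usable(self, name: str) -> bool:
        return name in self._active


store = KeyStore(soft_delete=True)
store.create("ekm-key-1", b"placeholder-material")
store.delete("ekm-key-1")           # accidental deletion
print(store.is_usable("ekm-key-1"))  # the key is archived, not usable
store.restore("ekm-key-1")
print(store.is_usable("ekm-key-1"))  # the key is usable again
```

Without the soft-deletion policy, the `restore` step has nothing to recover from, which is the situation where a full backup becomes the last resort.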
Performance
There are three important aspects of performance: throughput, latency, and quality of service. Enterprises using Google Cloud EKM typically expect a certain peak throughput and maximum latency from the external key manager. They also want these numbers to be predictable and consistent, which imposes quality of service (QoS) requirements.
- Throughput: The clustered architecture of Fortanix DSM allows capacity to increase linearly with the size of the cluster. Capacity planning is done by measuring the current load and estimating the capacity of a given cluster size for traffic coming through the Google Cloud EKM interface.
- Latency: BigQuery and Google Compute Engine (GCE) expect consistently low latency from the EKM to operate efficiently. The largest component of this latency is often the network latency between Google Cloud KMS and the EKM, so the best latency for Cloud EKM is usually obtained from the Google Cloud region closest to the Fortanix DSM deployment. Conversely, Fortanix DSM should be deployed at a site close to the Google Cloud region where BigQuery or GCE is running. The clustered architecture of Fortanix DSM supports this by replicating keys across geographically distributed sites and using a global load balancer to route traffic to the nearest one.
- Quality of Service: A multi-tenant external key manager may be susceptible to a denial-of-service attack in which a rogue tenant consumes all of the key manager's resources. Fortanix DSM lets system administrators configure a minimum quality of service to assure predictable and consistent performance for Google Cloud EKM.
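The throughput point above lends itself to a back-of-the-envelope calculation. The sketch below assumes the linear scaling described in the text and the three-node minimum mentioned earlier; the function name, the headroom parameter, and all numbers are hypothetical and would need to be replaced with measurements from your own deployment.

```python
import math


def nodes_needed(peak_ops_per_sec: float,
                 ops_per_node_per_sec: float,
                 headroom: float = 0.3,
                 min_nodes: int = 3) -> int:
    """Estimate cluster size, assuming capacity grows linearly with node count.

    headroom is the fraction of spare capacity kept above the measured peak.
    min_nodes reflects the minimum cluster size of three nodes.
    """
    if peak_ops_per_sec < 0 or ops_per_node_per_sec <= 0 or headroom < 0:
        raise ValueError("inputs must be non-negative and per-node capacity positive")
    target = peak_ops_per_sec * (1 + headroom)
    return max(min_nodes, math.ceil(target / ops_per_node_per_sec))


# Hypothetical figures: 4000 wrap/unwrap requests per second at peak,
# 500 requests per second measured per node, 25% headroom.
print(nodes_needed(4000, 500, headroom=0.25))
```

This estimate covers throughput only. A real plan would also account for the nodes that may be unavailable during a site failure, so that the surviving majority can still carry the peak load.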
Role-based access control (RBAC)
As we have seen above, the Google Cloud EKM key is highly sensitive, and access to it should be strictly managed by a small set of administrators in an organization. Fortanix DSM achieves this with a fine-grained RBAC system. Users and keys can be organized in separate groups, and a user's access to keys is determined by their group membership and their roles in those groups. By following the principle of least privilege, access can be restricted to a small number of users who can create and manage the EKM keys. As mentioned above, actions can be further restricted by configuring a quorum-based policy.
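The group-and-role model described above can be sketched as a deny-by-default permission check. This is a simplified illustration rather than the Fortanix DSM data model: the role names, permission names, group, users, and key are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical roles and the actions each one permits.
ROLE_PERMISSIONS = {
    "administrator": {"create_key", "delete_key", "use_key", "change_permissions"},
    "member": {"use_key"},
    "auditor": {"view_logs"},
}


@dataclass
class Group:
    name: str
    members: dict = field(default_factory=dict)  # user name -> role in this group
    keys: set = field(default_factory=set)       # key names owned by this group


def is_allowed(user: str, action: str, key: str, groups: list) -> bool:
    """Allow an action only if the user holds a role that permits it
    in a group that contains the key. Everything else is denied."""
    for group in groups:
        role = group.members.get(user)
        if role is None or key not in group.keys:
            continue
        if action in ROLE_PERMISSIONS.get(role, set()):
            return True
    return False


ekm_group = Group(
    name="gcp-ekm-keys",
    members={"alice": "administrator", "bigquery-app": "member"},
    keys={"ekm-key-1"},
)

print(is_allowed("alice", "delete_key", "ekm-key-1", [ekm_group]))
print(is_allowed("bigquery-app", "delete_key", "ekm-key-1", [ekm_group]))
print(is_allowed("mallory", "use_key", "ekm-key-1", [ekm_group]))
```

Keeping the EKM keys in a dedicated group with very few administrators is what makes least privilege practical here: a user outside that group has no path to the keys regardless of their roles elsewhere.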
Auditability
Auditability of operations is often a major requirement for enterprises. While this may seem simple, getting a consistent set of logs from a cluster deployed across multiple sites is challenging. Fortanix DSM provides a secure way to store and manage such cluster-wide logs, capturing several types of operations: administrative (user creation, group creation, etc.), authentication (all login and logout actions), and cryptographic (key creation, editing, usage, etc.).
The logs collected in Fortanix DSM can be forwarded in real time to log collection software or services, such as Splunk or any Syslog-based logging system. We also have native support for forwarding logs to Google Cloud Stackdriver, which lets users of Google Cloud EKM stay in the Google Cloud ecosystem and use native Google Cloud services.
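To show what a structured, forwardable audit record might look like, here is a minimal sketch using Python's standard `logging` module. The event fields, actor and node names, and the in-memory handler are assumptions made for illustration; this is not the Fortanix DSM log format. Only the three categories come from the text above.

```python
import json
import logging
from datetime import datetime, timezone

# The three operation types named in the text.
CATEGORIES = {"administrative", "authentication", "cryptographic"}


def format_audit_event(category, action, actor, node, timestamp=None):
    """Serialize one audit event as a JSON line (hypothetical field names)."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown audit category: {category}")
    ts = timestamp or datetime.now(timezone.utc)
    return json.dumps(
        {
            "timestamp": ts.isoformat(),
            "category": category,
            "action": action,
            "actor": actor,
            "node": node,  # which cluster node handled the operation
        },
        sort_keys=True,
    )


class ListHandler(logging.Handler):
    """Collects formatted records in memory, standing in for a log forwarder."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


audit_logger = logging.getLogger("audit-sketch")
audit_logger.setLevel(logging.INFO)
audit_logger.propagate = False
capture = ListHandler()
audit_logger.addHandler(capture)

audit_logger.info(format_audit_event("cryptographic", "key_usage", "bigquery-app", "node-2"))
print(capture.lines[0])
```

To send the same records to a Syslog collector, the in-memory handler could be swapped for the standard library's `logging.handlers.SysLogHandler`, pointed at the collector's address.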