Towards Monitoring Applications in a Multi-Cloud Environment

Modern applications, such as those targeted by the PrEstoCloud project, require components that are geographically distributed across several data centers (DCs). Typical examples include stream or batch processing of geo-distributed data, and the redirection of clients to the optimal data replica.

While little is known about inter-DC networks, the common belief is that they are not as well provisioned as intra-DC networks. As Amazon Web Services (AWS) is arguably the most popular Infrastructure-as-a-Service (IaaS) cloud provider, we are measuring, within the PrEstoCloud project, the WAN connectivity that AWS offers between its DCs, from an application perspective.

We aim to shed light on the WAN connectivity offered by AWS, which operates its own WAN between its data centers. The ability to correctly interpret runtime measurements from distributed applications relies on a thorough understanding of the design choices made by the cloud provider. For example, it is difficult to relate bandwidth measurements to the effective bandwidth available to application fragments if the cloud operator relies heavily on multi-path routing in its infrastructure. Multi-path routing is often implemented with hash-based functions over packet header fields, so measurement flows may take a different path from the one actually used by the application fragments.
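
To make the hash-based selection concrete, here is a minimal sketch of how a router might map a flow to one of several equal-cost paths. It is an illustration only: the function name pick_path, the use of SHA-256 over the 5-tuple, the addresses, the ports and the number of paths are all assumptions, and do not describe the mechanism actually deployed by AWS.

```python
import hashlib


def pick_path(src_ip, dst_ip, src_port, dst_port, proto, n_paths):
    """Map a flow 5-tuple to one of n_paths equal-cost paths (hypothetical)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 4 bytes of the digest as an integer, then reduce modulo
    # the number of candidate paths.
    return int.from_bytes(digest[:4], "big") % n_paths


N_PATHS = 4  # assumed number of equal-cost paths

# An application flow and a measurement flow between the same two hosts.
# They differ only in their ports, yet may be hashed onto different paths.
app_flow = ("10.0.0.1", "10.1.0.1", 40000, 443, "tcp")
probe_flow = ("10.0.0.1", "10.1.0.1", 40001, 5201, "tcp")
app_path = pick_path(*app_flow, N_PATHS)
probe_path = pick_path(*probe_flow, N_PATHS)

# Sweeping the source port, as a traceroute campaign can do, exercises
# the different paths hidden behind the hash.
paths_seen = {
    pick_path("10.0.0.1", "10.1.0.1", port, 443, "tcp", N_PATHS)
    for port in range(1024, 2048)
}
```

A given 5-tuple always maps to the same path, which is why a measurement flow only characterizes the application's path if both hash to the same value.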

We performed large-scale traceroute measurements to uncover the AWS network infrastructure. As a glimpse of our results, Illustration 1 below presents the cumulative number of edges and vertices discovered when running paris-traceroute with each of the 65,536 possible source ports between the Canada and California DCs of AWS. The resulting graph is unusually large, with close to 3k nodes and 100k links, which suggests that AWS relies heavily on multi-path routing.
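
The cumulative counts can be derived from the per-port traces as sketched below. This is a simplified illustration rather than our measurement pipeline: the function name cumulative_discovery and the toy hop names are hypothetical, each trace is assumed to be already parsed into an ordered list of responding hop addresses, and non-responding hops are assumed to have been dropped beforehand.

```python
def cumulative_discovery(paths):
    """Return (distinct nodes, distinct edges) seen after each successive trace."""
    nodes, edges = set(), set()
    curve = []
    for hops in paths:
        nodes.update(hops)
        # Each pair of consecutive hops forms a directed edge of the graph.
        edges.update(zip(hops, hops[1:]))
        curve.append((len(nodes), len(edges)))
    return curve


# Toy input: one trace per source port, each an ordered list of hops.
toy_paths = [
    ["a", "b", "d"],
    ["a", "c", "d"],
    ["a", "b", "d"],  # repeats the first path, so nothing new is discovered
]
curve = cumulative_discovery(toy_paths)
# curve == [(3, 2), (4, 4), (4, 4)]
```

A curve that keeps growing as more source ports are probed indicates that additional ports still reveal new paths.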

Our ongoing efforts now focus on: (i) understanding the structure of the graph, (ii) relating this structure to the technologies used by AWS (we observe a mix of Carrier-Grade NAT (CGNAT) and MPLS), and (iii) studying key metrics from an application perspective, especially delay.

Illustration 1: AWS interconnection graph between California and Canada