AppFormix Blog

Meet Your Noisy Neighbor, Container

By Parantap Roy on March 31, 2016


We’ve all had one at one point or another. Usually, the culprit is a person or a group of people who are misbehaving, acting out, and causing a ruckus in your neighborhood. Our development environments aren’t much different. The cloud is a shared environment, much like our neighborhoods. But while a noisy neighbor at home may cost you some rest, in the cloud it costs you performance. Workloads in the cloud compete with each other for resources, and it is typical to spin up hundreds of containers on a single server. In this environment, contention is inevitable and performance suffers.

With the AppFormix solution using Intel Resource Director Technology, you can both identify the noisy containers and control resource allocation on the server before the application suffers any loss in performance.

AppFormix provides a complete integration with Kubernetes and presents the container management system components in a usable and interactive dashboard. It allows you to easily analyze and control services, replication controllers, pods and containers running across a cluster.

In Figure 1 below, you see an Infrastructure Dashboard view that lists all components running on your infrastructure and component health in a single view. In this particular instance, you are running three replication controllers:

  • redis-nginx-rc2 has five replicas
  • webserver-1 has two replicas
  • m-noisy-22 is running one replica

Each of these replicas has a pod associated with it. Each pod has some workload containers associated with its pod container.

For example, *-redis-bench-set-1, *-redis-bench-get-1, and the associated *-k8s_POD container map to a single instance of a redis-nginx-rc2 pod.


Figure 1:  AppFormix - Container Management System Infrastructure Dashboard

The AppFormix dashboard also provides a navigation link to the host charts. In Figure 2 below, you see a view of the metric charts page for a particular system. This information is useful to a developer who observes a sudden slowdown in web server response times and requests per second. In this scenario, you can view the host and all container instances running on that host, along with their CPU, memory, disk, and normalized load usage as a time series (plus quite a few more metrics for debugging specific issues and cases of resource contention).


Figure 2:  AppFormix Metrics Chart View

In this particular case, the slowdown in response times and requests per second cannot be attributed to any of the metrics shown in Figure 2. Figure 3, however, tells a different story.


Figure 3:  CMT, MBM Counters show Noisy Aggressor

With the AppFormix solution on a container management system using Intel Resource Director Technology, you can identify the noisy containers in this workload scenario, which places 21 containers on a two-socket Intel Xeon E5 v4 server, each socket with 44 logical cores (SMT on) and 55MB of shared L3 cache. This workload is quite realistic, if not conservative, for a shared workload scheduled on a single compute host with isolation in terms of CPU, memory, and disk. Figure 3 shows how disruptive a noisy aggressor can be to the cache occupancy of webserver and in-memory message bus workloads. AppFormix leverages Cache Monitoring Technology (CMT) and Memory Bandwidth Monitoring (MBM) to surface the usage counters of these shared resources in a dashboard view. This real-time interface gives developers and infrastructure operators insight into the impact their shared workloads have on the limited resources of a particular system. Prior to this joint release, such issues were hard to debug (the charts in Figure 2, for example, show nothing unusual). Note how the noisy container m-noisy-22 starts around 10:35 with ~20% normalized CPU utilization on the host but using:

  • ~40MB of the L3 cache and generating
  • ~150-200k cache misses
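
The mismatch between CPU share and cache share described above can be turned into a simple check. The following is a minimal sketch, not AppFormix's actual detection logic: the `ContainerSample` type, the thresholds, and the webserver-1 reading are illustrative assumptions, while the m-noisy-22 figures (~20% CPU, ~40MB of a 55MB L3) come from the scenario above.

```python
from dataclasses import dataclass

L3_CACHE_MB = 55.0  # shared L3 per socket, as stated for the test server


@dataclass
class ContainerSample:
    name: str
    cpu_percent: float      # normalized CPU utilization on the host, 0-100
    l3_occupancy_mb: float  # cache occupancy reading (CMT-style counter)


def cache_to_cpu_ratio(sample, l3_total_mb=L3_CACHE_MB):
    """Share of L3 held divided by share of CPU used (higher means noisier)."""
    cache_share = sample.l3_occupancy_mb / l3_total_mb
    cpu_share = max(sample.cpu_percent, 1.0) / 100.0  # floor avoids division by zero
    return cache_share / cpu_share


def find_noisy(samples, ratio_threshold=2.0, min_cache_share=0.25,
               l3_total_mb=L3_CACHE_MB):
    """Names of containers holding far more cache than their CPU share suggests.

    Both thresholds are hypothetical tuning knobs chosen for illustration.
    """
    noisy = []
    for s in samples:
        holds_enough = s.l3_occupancy_mb / l3_total_mb >= min_cache_share
        if holds_enough and cache_to_cpu_ratio(s, l3_total_mb) >= ratio_threshold:
            noisy.append(s.name)
    return noisy


samples = [
    ContainerSample("m-noisy-22", cpu_percent=20.0, l3_occupancy_mb=40.0),
    # Illustrative reading only; not a figure taken from the charts.
    ContainerSample("webserver-1", cpu_percent=15.0, l3_occupancy_mb=4.0),
]
print(find_noisy(samples))
```

With these inputs only m-noisy-22 is flagged: it holds roughly 73% of the cache while using 20% of the CPU.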

AppFormix integrates with container management systems to identify noisy entities on a physical host in real time. We expose this information on a dashboard for manual debugging, and we allow you to configure alerts and events that notify a user of noisy behavior on the system whenever it occurs. Finally, we can take appropriate action based on those notifications without any manual intervention.

In addition to providing deep monitoring insights, AppFormix also provides real-time control of cache allocation through a REST endpoint exposed as part of its SDK. We use the Cache Allocation Technology (CAT) feature provided in the hardware and offer several policies for controlling the cache allocation of a particular VM, container, or application running on a physical server. In Figure 4, we apply a policy to the noisy container at 11:15 that reduces its cache allocation to around 10% of the cache on the system. As a result, the L3 cache footprint of the noisy aggressor drops sharply in real time. This directly improves webserver performance: requests per second increase, and average and 99th percentile latency decrease.

The policies can be the following:

  • Specifying cache lines by xMB in the system
  • Specifying cache lines by % of L3 Cache in the system
  • Specifying overlapped or isolated cache lines to applications based on their priority
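
Intel CAT expresses an allocation as a capacity bitmask over cache ways, with the set bits contiguous. The sketch below shows how the three policy styles above could map onto such masks. It is an illustration only, not AppFormix's implementation: the function names are hypothetical, and the 20-way cache is an assumption (the real way count should be read from the CPU).

```python
def ways_for_percentage(percent, total_ways):
    """Number of cache ways covering roughly `percent` of the L3 (at least one)."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return max(1, round(total_ways * percent / 100))


def ways_for_megabytes(size_mb, l3_total_mb, total_ways):
    """Number of cache ways covering roughly `size_mb` of the L3 (at least one)."""
    return ways_for_percentage(100 * size_mb / l3_total_mb, total_ways)


def contiguous_mask(num_ways, first_way=0):
    """Bitmask with `num_ways` consecutive bits set, starting at bit `first_way`."""
    return ((1 << num_ways) - 1) << first_way


def masks_overlap(a, b):
    """True if two allocations share at least one cache way."""
    return (a & b) != 0


TOTAL_WAYS = 20     # assumption for illustration only
L3_TOTAL_MB = 55.0  # L3 size of the server described in this post

# Policy by percentage: confine the noisy container to ~10% of the cache.
noisy_ways = ways_for_percentage(10, TOTAL_WAYS)
noisy_mask = contiguous_mask(noisy_ways)

# Policy by size: 5.5MB of a 55MB cache resolves to the same number of ways.
assert ways_for_megabytes(5.5, L3_TOTAL_MB, TOTAL_WAYS) == noisy_ways

# Isolated policy: a high-priority workload gets the remaining ways.
priority_mask = contiguous_mask(TOTAL_WAYS - noisy_ways, first_way=noisy_ways)

print(hex(noisy_mask), hex(priority_mask), masks_overlap(noisy_mask, priority_mask))
```

An overlapped policy would simply choose masks that share some bits, so that lower-priority applications compete only within the shared ways.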

And the policies can be applied while bootstrapping containers, during workload placement, or at runtime on identifying a noisy entity:

curl -i  \
  -H 'Content-Type: application/json'  \
  -X PUT  \
  -d '{ "InstanceCacheAllocationPercentage": 10 }'  \

Figure 4:  Cache Partitioned for the Noisy Aggressor to use 10% of the L3 Cache on the Host at 11:15
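
For automation, the same PUT request can be built from Python's standard library. This is a minimal sketch: the JSON field `InstanceCacheAllocationPercentage` comes from the curl example above, but the endpoint URL below is a placeholder, since the real path is not shown in this post, and the helper name is hypothetical.

```python
import json
import urllib.request


def build_cache_policy_request(endpoint_url, percentage):
    """Build (but do not send) a PUT request that sets a cache allocation percentage."""
    if not 0 < percentage <= 100:
        raise ValueError("percentage must be in (0, 100]")
    body = json.dumps({"InstanceCacheAllocationPercentage": percentage}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Placeholder URL: substitute the endpoint documented for your deployment.
req = build_cache_policy_request("http://localhost:9999/placeholder", 10)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually send the request:
#   with urllib.request.urlopen(req) as resp:
#       print(resp.status)
```

Building the request separately from sending it makes the payload easy to inspect before it is applied to a running container.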

Another key metric exposed with Intel Resource Director Technology is MBM, which allows monitoring of the local and remote memory bandwidth consumed by individual threads in the system. This information can be used to identify entities that consume excessive memory bandwidth. Alerts can be configured on such behavior at the scope of a host, a replication controller, a pod, or a container. Applications using more remote memory than local memory can be classified in real time as NUMA-unoptimized, which signals developers to make their memory allocation requests NUMA-aware.
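
The classification rule in the last sentence (more remote than local memory traffic) can be sketched as follows. The function names and the sample readings are hypothetical; the readings stand in for MBM-style local and remote bandwidth counters collected over a window.

```python
def remote_fraction(samples):
    """Fraction of memory bandwidth that went to remote NUMA nodes.

    `samples` is a sequence of (local_mb_per_s, remote_mb_per_s) pairs.
    """
    samples = list(samples)
    local = sum(l for l, _ in samples)
    remote = sum(r for _, r in samples)
    total = local + remote
    return remote / total if total else 0.0


def is_numa_unoptimized(samples):
    """True when the workload used more remote than local memory bandwidth."""
    return remote_fraction(samples) > 0.5


# Illustrative readings only, not measurements from this post.
mostly_remote = [(100.0, 900.0), (200.0, 800.0)]
mostly_local = [(900.0, 100.0), (950.0, 50.0)]

print(remote_fraction(mostly_remote), is_numa_unoptimized(mostly_remote))
print(remote_fraction(mostly_local), is_numa_unoptimized(mostly_local))
```

Aggregating over a window rather than a single reading keeps a brief burst of remote traffic from flipping the classification.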

Shared infrastructure workloads bring challenges like resource contention and performance impact from noisy entities in the system. Together, AppFormix and Intel build on hardware innovation that now provides control over another critical piece of the puzzle: the L3 cache. Monitoring resources like L3 cache occupancy and remote and local memory bandwidth provides key inputs for identifying application performance bottlenecks. Controlling cache allocation yields significant gains in throughput and reduces average and 99th percentile latency. Contention among shared workloads for a shared L3 cache is an issue in many settings.

Container isolation is not perfect. Containers cannot prevent interference in resources that the operating system kernel doesn't manage, such as level 3 processor caches and memory bandwidth, and containers need the support of an additional security layer (such as virtual machines) to protect against the kinds of malicious actors found in the cloud.

AppFormix provides solutions to the real and imminent contention issues you face when scheduling memory-sensitive workloads on shared infrastructure. AppFormix gives developers and cloud operators insight into issues that previously could only be speculated about. Moreover, we provide strategies to mitigate them and to improve workload placement.



