Plastic brain learning/un-learning

Posts Tagged ‘JeTM’

Java Observability, Manageability in the Cloud

In API, Manageability on January 30, 2009 at 10:14 pm

Have you done this fire-drill: you have a high traffic/volume web application, it is sluggish or unstable. Your team is called to figure out if the problem is in the application, middleware stack, operating system or somewhere else in the deployment configuration??

This is where Observability comes in: Being able to dynamically probe resource usage (across all levels of the infrastructure/application stack) at a granular level, control probing overheads and associating actions or triggers with those probes at all levels.At a system level,  DTrace is a good example, on Solaris & Mac OS X. There’s even a Java API for DTrace. Of course, there are other profiling libraries that offer low over-head, extremely fine-grained probes, aggregation capabilities (e.g. JETM).

Manageability is another important consideration: Being able to control and manage applications via standard management systems. This involves (hopefully) consistent instrumentation mechanisms, and standard isolation mechanisms between IT resources being managed, and external management systems. JMX is an example of one such standard. Other options such as JMX to SNMP bridge, MIBs compiled to MXBeans are some methods of integrating managed resources in to higher level (management) frameworks.

Flexible binding of computing infrastructure to workloads, is a key value proposition of the Cloud Computing model. Cloud computing providers like Amazon serve the needs of typical web workloads, by providing access to their dynamic infrastructure.

The problem is workload diversity. Vendors like Amazon, Google essentially offer Distributed computing workload platforms. Imagine tracing application performance issues on a distributed platform!

So, why is Observability and Manageability more important in the Cloud?

Because consumption of a fixed resource like CPU hrs is not optimal when you are on a “elastic”/dynamic infrastructure. You’d want to pay for exact utilization levels.

Because you want monitoring arrangements that work seamlessly, as they do on a single system, but on top of “elastic”/dynamic infrastructure.

There are many other reasons, of course: 

Being able to manage the lifecycle of applications in an automated manner, in a distributed computing environment is perhaps the biggest use case for Manageability in the Cloud.

Metering, Billing, Performance monitoring, Sizing and Capacity planning are some examples of activities in the Cloud computing model that leverage the same underlying Observability principles (Instrumentation, dynamic probes, support for aggregating metrics etc).

Lets look at a couple of examples of what Observability and Manageability capabilities can enable in the Cloud computing context:

  • Customers can pay at a more granular level of Throughput levels (Requests or Transactions processed per unit of time) for a given latency (and other SLA items), so your charge back model makes sense, in the context of your business activities.
  • Enable better  business opportunities, in a cost-effective manner. e.g. Mashery focuses metering/instrumentation at the API level, but the proposition in this context: Open up your API’s, meter it to provide you with an automated “business development filtering” mechanism….attractive if you’re in the right business, even on the “long tail” i.e customer traffic/volume is not high, but at least you have an ecosystem (100’s of developers or partners) to support the long tail, without burning your bank account. See my earlier post on RESTful business for more context here…these considerations are more valid in the Cloud computing paradigm.
  • Ensuring DoS style attacks don’t give you a heart attack (because your elastic cloud racked up a huge bill)

Observability and Manageability are key “infrastructure” capabilities in the Cloud computing model that enable features/value proposition such as the ones discussed above. These are not new ideas, but adaption of time tested ideas to a new computing paradigm (i.e predominantly distributed computing over adaptive/dynamic infrastructure, but other variations will crop up).