Plastic brain learning/un-learning

Archive for the ‘Cloud Computing’ Category

Puppet in the Clouds

In Cloud Computing, Manageability on March 25, 2009 at 9:52 pm



Java Observability, Manageability in the Cloud

In API, Manageability on January 30, 2009 at 10:14 pm

Have you been through this fire drill: a high-traffic web application turns sluggish or unstable, and your team is called in to figure out whether the problem is in the application, the middleware stack, the operating system, or somewhere else in the deployment configuration?

This is where Observability comes in: being able to dynamically probe resource usage (across all levels of the infrastructure/application stack) at a granular level, control probing overhead, and associate actions or triggers with those probes at all levels. At the system level, DTrace on Solaris and Mac OS X is a good example; there’s even a Java API for DTrace. Of course, other profiling libraries (e.g. JETM) offer low-overhead, extremely fine-grained probes and aggregation capabilities.
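To make the probe idea concrete, here is a minimal sketch in plain Java of the aggregation pattern such tools rely on (this is not DTrace or JETM, just the underlying idea; all names are illustrative): per-call recording stays cheap, and aggregated metrics are read out separately.

```java
import java.util.concurrent.atomic.LongAdder;

// Minimal aggregating probe: cheap, contention-friendly per-call recording,
// with aggregates (call count, average latency) read out separately.
public class Probe {
    private final LongAdder count = new LongAdder();
    private final LongAdder totalNanos = new LongAdder();

    // Record one completed call, given its start timestamp.
    public void record(long startNanos) {
        count.increment();
        totalNanos.add(System.nanoTime() - startNanos);
    }

    public long calls() { return count.sum(); }

    public long avgNanos() {
        long c = count.sum();
        return c == 0 ? 0 : totalNanos.sum() / c;
    }
}
```

You would wrap an interesting code path with `long t = System.nanoTime(); … ; probe.record(t);` and let a management thread poll the aggregates, which is roughly what the aggregation features of these libraries automate for you.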

Manageability is another important consideration: being able to control and manage applications via standard management systems. This calls for (hopefully) consistent instrumentation mechanisms, and standard isolation mechanisms between the IT resources being managed and the external management systems. JMX is one such standard. Options such as a JMX-to-SNMP bridge, or MIBs compiled to MXBeans, are ways of integrating managed resources into higher-level (management) frameworks.
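For a concrete feel, here is the standard JMX pattern: an MBean interface plus implementation, registered with the platform MBean server so any JMX-capable console (or an SNMP bridge) can read it. The object name and attribute here are made up for illustration.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Standard MBean: by JMX convention the interface must be named <Impl>MBean.
interface RequestStatsMBean {
    long getRequestCount();
}

class RequestStats implements RequestStatsMBean {
    private volatile long requestCount;
    @Override public long getRequestCount() { return requestCount; }
    public void increment() { requestCount++; }
}

public class JmxExample {
    // Registers the MBean with the platform MBean server so external
    // management systems can observe the attribute remotely.
    public static RequestStats register() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            RequestStats stats = new RequestStats();
            server.registerMBean(stats, new ObjectName("example:type=RequestStats"));
            return stats;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Once registered, the same attribute can be surfaced to SNMP-based management frameworks via a bridge, without the application knowing or caring which console is on the other end.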

Flexible binding of computing infrastructure to workloads is a key value proposition of the Cloud Computing model. Cloud computing providers like Amazon serve the needs of typical web workloads by providing access to their dynamic infrastructure.

The problem is workload diversity. Vendors like Amazon and Google essentially offer distributed-computing workload platforms. Imagine tracing application performance issues on a distributed platform!

So, why are Observability and Manageability even more important in the Cloud?

Because paying for a fixed resource like CPU-hours is not optimal when you are on an “elastic”/dynamic infrastructure; you’d want to pay for exact utilization levels.

Because you want monitoring arrangements that work as seamlessly on top of “elastic”/dynamic infrastructure as they do on a single system.

There are many other reasons, of course: 

Being able to manage the lifecycle of applications in an automated manner, in a distributed computing environment, is perhaps the biggest use case for Manageability in the Cloud.

Metering, billing, performance monitoring, sizing, and capacity planning are some of the activities in the Cloud computing model that leverage the same underlying Observability principles (instrumentation, dynamic probes, support for aggregating metrics, etc.).

Let’s look at a couple of examples of what Observability and Manageability capabilities can enable in the Cloud computing context:

  • Customers can pay at a more granular level, for throughput (requests or transactions processed per unit of time) at a given latency (and other SLA terms), so your chargeback model makes sense in the context of your business activities.
  • Enable better business opportunities in a cost-effective manner. For example, Mashery focuses metering/instrumentation at the API level; the proposition in this context is: open up your APIs and meter them, to get an automated “business development filtering” mechanism. That is attractive if you’re in the right business, even on the “long tail” (customer traffic/volume is not high, but you have an ecosystem of hundreds of developers or partners to support that tail without burning your bank account). See my earlier post on RESTful business for more context; these considerations are even more valid in the Cloud computing paradigm.
  • Ensuring DoS-style attacks don’t give you a heart attack (because your elastic cloud racked up a huge bill).
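A toy illustration of the first and last points together (every rate and threshold below is hypothetical): bill per processed request, but only for metering intervals that met the latency SLA, and cap total spend so a DoS-driven traffic spike cannot run up an unbounded bill.

```java
// Toy chargeback model: pay per request, only for SLA-compliant intervals,
// with a hard spending cap as a DoS safety valve. All numbers are made up.
public class Chargeback {
    static final double COST_PER_REQUEST = 0.0001; // $ per request (hypothetical)
    static final long   SLA_LATENCY_MS   = 200;    // latency SLA (hypothetical)
    static final double SPEND_CAP        = 1000.0; // hard cap on the bill

    /** requests[i] and latencyMs[i] describe one metering interval. */
    public static double bill(long[] requests, long[] latencyMs) {
        double total = 0;
        for (int i = 0; i < requests.length; i++) {
            // Only intervals that met the latency SLA are billable.
            if (latencyMs[i] <= SLA_LATENCY_MS) {
                total += requests[i] * COST_PER_REQUEST;
            }
            // Runaway-traffic (DoS) protection: never exceed the cap.
            if (total >= SPEND_CAP) return SPEND_CAP;
        }
        return total;
    }
}
```

The point is not the arithmetic but the dependency: none of this is possible unless throughput and latency are observable at a granular, per-interval level in the first place.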

Observability and Manageability are key “infrastructure” capabilities in the Cloud computing model, enabling features and value propositions like the ones discussed above. These are not new ideas, but adaptations of time-tested ideas to a new computing paradigm (i.e. predominantly distributed computing over adaptive/dynamic infrastructure, though other variations will crop up).

What color is your Cloud fabric?

In Business models, Cloud Computing, Use Cases on January 15, 2009 at 9:30 pm

Cloud Fabric is a phrase that conjures up myriad complex capabilities and unique features, depending on the vendor you talk to. Here, I take the wabi-sabi approach of simplifying things down to essentials, and then try to make sense of the different vendor variations within a simple, high-level framework or mental model.

Before we go into details, a caveat: any categorization of Cloud fabric in this framework is provisional, in the sense that technological improvements (e.g. Intel already has a research project with 80 cores on a single chip) will influence this space in significant ways.

Back to the topic: loosely defined, Cloud fabric is the abstraction describing the Cloud platform’s support. It should cover all aspects of the life-cycle and (automated) operational model for Cloud computing. While there are many different flavors of Cloud fabric from vendors, it is important to keep in mind the general context that drives features into the Cloud fabric: it must support, as transparently as possible, the binding of an increasing variety of Workloads to (varying degrees of) Dynamic infrastructure. Of course, vendors may specialize in fabric support for specific types of Workloads, with limited or no support for Dynamic infrastructure. By Dynamic infrastructure, I mean infrastructure that is not only “elastic” but also “adaptive” to your deployment optimization needs.

That means your compute, storage, and even networking capacity is sufficiently “virtualized” and “abstracted” that elasticity and adaptiveness directly address application workload needs, as transparently as possible (for example, is the PaaS or IaaS provider on converged data center networks, or do they still run I/O and storage networks in parallel universes?). If the VMs running your application components can migrate, so should storage and interconnection topologies (firewall, proxy, and load-balancer rules, etc.), seamlessly, and from a customer perspective, “transparently”.

Somewhere all of this virtualization meets the “un-virtualized” world! 

Whether you are a PaaS/IaaS, “Cloud-center”, or “Virtual private Data center” provider, variations in Cloud fabric capabilities stem from the degree of support across these two dimensions:

  • Breadth of Workload categories that can be supported. E.g. AWS and Google App platform support is geared towards workloads that fit the shared-nothing, distributed computing model, based on horizontal scalability using OSS, commodity servers, and (mostly) generic switched network infrastructure. Keep in mind that workload characterization is an active research area, so I won’t pretend to boil this ocean (I am not an expert by any stretch of the imagination), but will just represent the broad areas that seem to be driving the Cloud business models.
  • How “Dynamic” the Cloud infrastructure layer is in supporting the Workload categories above: hardware capacity, configuration, and the general end-to-end topology of the deployment infrastructure matter when it comes to performance. No single Cloud fabric can accommodate the performance needs of every type of Workload in a transparent manner. E.g. I wouldn’t try to dynamically provision VMs for my online gaming service on AWS unless I could also control routing tables and multicast traffic strategy (generally, in a distributed computing architecture, you want to push the processing of state information close to where your data is; in the online gaming case, that could mean your multicast packets need to reach only the limited set of nodes in the same “online neighborhood” as the players). Imagine provisioning VMs and storage dynamically for such an application across your data center without any thought for deployment optimization. Without an adaptive infrastructure under the covers, an application that has worked perfectly in the past might experience serious performance issues in the Cloud environment. Today, the PaaS programming and deployment model determines the Workload match.


So, in general, a Cloud fabric should support transparent scaling, metering/billing, “secure”, reliable and automated management, fault tolerance, deployment templates with good programming/deployment support, “adaptive” clustering, and so on. But the specific Cloud fabric features and use-case support you need depend on where your application sits at the intersection of Workload type and the need for Dynamic/adaptive Infrastructure. Here’s what I mean…


Example of functional capabilities required of Cloud fabric based on Workload type and Infrastructure/deployment needs


Let me explain the table above: the Y-axis represents a continuum of the deployment technologies and infrastructure used, from support for “task-level parallelism” (typically thread-parallel computation, where each core might be running different code) at the bottom to “data parallelism” (typified by multiple CPUs or cores running the same code against different data) at the top. The X-axis broadly represents the two major Workload categories:

  1. High Performance workloads: where you want to speed up the processing of a single task on a parallel architecture that can exploit data parallelism (e.g. rendering, astronomical simulations)
  2. High Throughput workloads: where you want to maximize the number of independent tasks completed per unit time on a distributed architecture (e.g. web-scale applications)
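The distinction can be sketched in a few lines of Java (both method bodies are illustrative, not benchmarks): data parallelism applies the same operation to different slices of one dataset, while a high-throughput workload runs many independent tasks and cares only about tasks completed per unit time.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.IntStream;

public class ParallelismSketch {
    // Data parallelism: the same code runs against different data partitions.
    public static long sumOfSquares(int n) {
        return IntStream.rangeClosed(1, n).parallel()
                        .mapToLong(i -> (long) i * i)
                        .sum();
    }

    // High throughput: many independent tasks, each trivially small here;
    // the goal is completed tasks per unit time, not speeding up one task.
    public static int runIndependentTasks(int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            CountDownLatch done = new CountDownLatch(tasks);
            for (int i = 0; i < tasks; i++) pool.submit(done::countDown);
            done.await();
            return tasks;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The fabric implications differ accordingly: the first pattern wants fast interconnects between cooperating workers; the second wants cheap horizontal scaling of independent ones.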

Where your application(s) fall into this matrix determines what type of support you need from your Cloud fabric. There’s room for lots of niche players, each exposing the advantages of their unique choice of deployment infrastructure (how dynamic is it?) and of PaaS features friendly to your Workload.

The above diagram shows some examples. Most of what we call Cloud computing falls into the lower-right quadrant, as many vendors are proving the viability of this business model for this type of Workload (market). Features are still evolving.

Of course, Utility computing (top right) and “mainframe”-class, massively parallel computing (top left) capacity has been available for hire for many, many years. What’s new is how this capacity can be accessed and used effectively over the internet, along with all that entails: a simple programming/deployment model, manageability, and a friendly Cloud fabric to help with all of this in the variable-cost model that is the promise of Cloud computing. Vendors will no doubt leverage their “Grid” management frameworks.

Personal HPC (bottom left) is another burgeoning area.

Many of these may not even be viable in the marketplace…that will be determined by market demand (OK, maybe with some marketing too).

Hope this provides a good framework for thinking about the Cloud Fabric support you need, depending on where your applications fall on this continuum. I’m sure I’ve missed other Workload categories; I’d always be interested in hearing your thoughts and insights, of course.

So, what color is your Cloud Fabric?

So, you want to migrate to the Cloud…

In Cloud Computing, Data center on January 1, 2009 at 2:15 am

As we head into 2009, I can’t help thinking about how Cloud computing will affect ITSM as more companies consider leveraging the variable-cost structure and dynamic infrastructure model it enables. While IT capabilities provide or enable competitive advantages (and hence require “agility” in terms of features and capabilities, and systemic qualities such as scalability), other business considerations, such as IT service criticality (is it mission-critical?), also influence the migration strategy. Here’s one way to look at this map:


High level mapping strategy for Enterprise app Cloud migration


You can imagine mapping your application portfolio along these dimensions.

Many of these applications could be considered mission-critical, depending on the business you’re in. You may even want some of them to be “agile”, in the sense that you want quick, constant changes or additional features and capabilities to stay ahead of your competition. The evaluation grid above is a framework to help you quickly identify Cloud migration candidates and priorities, based on both business/economic viability and technical feasibility.

Step 1:

First, let’s talk about the types of applications we deal with in the enterprise. From the perspective of the enterprise business owner, applications fall under:

  • Business infrastructure applications (generic): email, calendar, instant messaging, productivity apps (spreadsheets, etc.), wikis. These are applications every company needs to be in business, and this category might not necessarily provide any competitive advantage; you just expect it to work (i.e. available, supporting a frictionless, information-bonded organization).
  • Business platform applications: ERP, CRM, Content and Knowledge management etc. These are key IT capabilities on which core business processes are built. These capabilities need to evolve with the business process, as enterprises identify and target new market opportunities.
  • Business “vertical” applications, supporting specific business processes for your industry type. E.g. Online stores, Catalogs, Back-end services, Product life-cycle support services etc.

Once you identify applications falling into the above categories, you can map them into the evaluation grid, based on business needs (“agility”) versus current support levels (mission-critical, business-critical, business-operational, etc.) as well as architectural considerations (mainly scalability).

Generally, applications falling under Quadrants 1 and 5 are your Tier 1 migration candidates, requiring further due diligence (RoI; they’re already scalable, so you have fewer engineering challenges than business challenges, such as data privacy and IT control issues).

Applications falling in Quadrants 2 and 6 require extensive preparation.

Applications falling in Quadrants 3, 4, 7, and 8 could be your “Tier 2” candidates, with those in Quadrants 3 and 4 taking higher precedence. Whether for internal business process support or external service delivery, the primary drivers for enterprises considering Cloud computing are (among other things, of course):

  • Cost
  • Time to market considerations
  • Business Agility

Step 2:

Once you map your application portfolio, you can revisit your “Agile IT” strategy by reconsidering current practices with respect to:

  1. ITSM: While virtualization technologies introduce a layer of complexity, deploying to a Cloud computing platform on top requires revisiting our current Data Center/IT practices:
    • Change & Configuration Management: you can’t necessarily look at this in an application-specific way any more.
    • Incident Management: we need automated responses based on policies.
    • Service Provisioning: requires proactive automation (this is where “elastic” computing hits the road); with dynamic provisioning, you need business rules that enable automation here, in addition to billing and metering.
    • Network Management: one that is “distributed”-computing aware.
    • Disaster Recovery, Backups: how would these work in a Cloud environment?
    • Security, System Management: do you understand how your current model would work in a dynamic infrastructure environment?
    • Capacity planning, Procurement, Commissioning and de-commissioning: since it is easier to provision more instances of your IT “service” onto VMs, do you have solid business rules/SLAs driving automated provisioning policies? Does your engineering team understand how to validate their architecture (via performance qualification for the target Cloud environment)? Is there a seamless hand-off to the Operations team, so they can tune capacity plans based on performance-qualification inputs from Engineering?
  2. Revisiting Software engineering practices
    • Programming model – make sure the Engineering organization is ready across the board, in terms of both skills and attitude, to make the transition. Adopt common REST API standards and patterns, leverage good practices (remember Yahoo Pipes?), even a “cloud simulator” if possible…
    • Application packaging & installation – make sure Engineering and Operations agree on common standards for application-level packaging and installation, regardless of the Cloud platform you adopt.
    • Deployment model – ensure you have a standard “operating environment” down to the last detail (OS/patch levels, hardware configurations for each of your architecture tiers, middleware versions, your internal infrastructure platform software versions, etc.).
    • Performance qualification, infrastructure standards and templates – focus on measuring throughput (requests processed per unit time) and latency levels on the “deployment model” defined above. You need this process to ensure adequate service levels on the Cloud platform.
    • Performance/SLA maintenance and application monitoring – ensure you have Manageability (e.g. a JMX-to-SNMP bridge) and Observability (e.g. a JMX-to-DTrace bridge, or plain JMX interfaces providing application-specific Observability) built into all layers of your application stack at each architecture tier.
Happy migration! Let me know what your experiences are…
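As a footnote to the Service Provisioning point above (business rules driving automated provisioning), here is a minimal sketch of what such a rule can look like; the thresholds and window size are entirely hypothetical, and a real policy would also weigh billing and SLA terms.

```java
// Minimal scale-out/scale-in rule: act only when utilization stays beyond a
// watermark for several consecutive intervals. All numbers are hypothetical.
public class ScalingPolicy {
    static final double HIGH_WATER = 0.75; // scale out above this
    static final double LOW_WATER  = 0.20; // scale in below this
    static final int    WINDOW     = 3;    // consecutive intervals required

    /** Returns +1 (scale out), -1 (scale in), or 0 (hold), given recent utilization samples. */
    public static int decide(double[] utilization) {
        if (utilization.length < WINDOW) return 0;
        boolean allHigh = true, allLow = true;
        for (int i = utilization.length - WINDOW; i < utilization.length; i++) {
            if (utilization[i] <= HIGH_WATER) allHigh = false;
            if (utilization[i] >= LOW_WATER)  allLow  = false;
        }
        return allHigh ? 1 : (allLow ? -1 : 0);
    }
}
```

The window requirement is the important bit: it keeps a transient spike (or dip) from triggering provisioning churn, which is exactly the kind of business rule that has to exist before “elastic” provisioning can be automated safely.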

Path to greener Data Centers, Clouds and beyond….

In Cloud Computing, Data center, Energy efficient on December 7, 2008 at 7:49 am

Data center power consumption (with respect to IT equipment like servers and storage) is a hot topic these days. The cumulative installed base of servers around the globe is estimated to be in the range of 50 million units, and in the grand scheme of things might account for 1% of worldwide energy consumption (according to estimates from Google; let me know if you need a reference). If you’re wondering: yes, the worldwide installed base of PCs might consume several multiples of the data-center server power consumption. Why? Because the worldwide PC installed base is upwards of 600 million units!

Data center power consumption obviously gets more attention because of its concentration: 50 megawatts consumed at one data-center location is more conspicuous than the same amount of power consumed by PCs in millions of households and businesses.

Understanding the trends toward more eco-friendly computing requires understanding developments at many levels: from VLSI design innovations at the chip or processor level, to the board level, the software level (firmware, OS/virtualization, middleware, application), and finally the data-center level.

Chip or Processor level: Processor chips are already designed to run at a lower frequency based on load, in addition to providing support for virtualization. In multi-core chips, cores can be taken offline depending on need (or problems). Chips are designed with multiple power domains, so CPUs can draw less power based on utilization. The issue is with other parts of the computer system, such as memory and disks. Can you ask memory chips to offline pages and draw less power? Can you distribute data across flash or disks optimally, to allow similarly proportional power consumption based on utilization levels? These are some of the dominant design issues that need to be addressed, keeping in mind constraints such as low latency and little or no “wake-up” penalty.

Board level: Today, server virtualization falls short of end-to-end virtualization. When machine resources are carved up, guest VMs don’t necessarily get hardware resources in proportion. Network-level virtualization is just beginning to evolve, e.g. Crossbow in OpenSolaris. Another example is Intel’s VT technology, which enables allocation of specific I/O resources (a graphics card or network interface card) to guest VM instances. If chips and board-level hardware elements are power- (and virtualization-) savvy, you can ensure power consumption that is (almost) proportional to the utilization levels dictated by the workload.

Firmware level: Hypervisors, whether emulated or para-virtualized, present a single interface to the hardware, and can exploit all the chip- and board-level support for “proportional energy use” under a given workload.

OS level: Over a sufficiently long interval (months), server utilization is predominantly characterized by low-utilization periods; average utilization of servers in data centers is usually less than 50%. That means there is plenty of opportunity for servers to go into “low-power” mode. How can you design the OS to cooperate here?

System level: Manageability (e.g. responding to workload changes, migrating workloads seamlessly), Observability (e.g. DTrace), and APIs to manage the middleware or application stack in response to low-power modes of operation (again, power usage proportional to workload) are going to be paramount considerations.

Cloud level: Shouldn’t Clouds look like operating systems (seamless storage, networking, backups, replication, migration of apps/data, dependency handling similar to package dependencies)? 3Tera and RightScale solve only some of these problems; many areas still need to be addressed: dynamic, workload-based performance qualification; mapping application criticality to Cloud deployment models; leveraging virtualization technologies seamlessly…

Data center level: Several innovations outside of IT (HVAC systems, again enabled by IT/sensor technologies), as well as innovations at all the levels discussed above, will help drive PUE (Power Usage Effectiveness) at the data-center level closer to the holy grail of PUE = 1, i.e. all the energy supplied to the data center going to useful compute work. Microsoft’s Generation 4 effort represents a leap in this domain, as more and more companies realize this is a big change of paradigm as the computing business goes truly utility-scale.
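PUE itself is just a ratio — total facility power divided by the power delivered to IT equipment — so it is worth seeing how simple the metric is (the numbers in the usage note are made up for illustration):

```java
public class Pue {
    /** PUE = total facility power / IT equipment power; 1.0 is the ideal. */
    public static double pue(double totalFacilityKw, double itEquipmentKw) {
        return totalFacilityKw / itEquipmentKw;
    }
}
```

For example, a facility drawing 1,500 kW in total to deliver 1,000 kW to IT equipment has a PUE of 1.5, meaning a third of the energy goes to cooling, power conversion, and other overhead rather than compute.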

So, there are plenty of problems to be solved in the IT space….pick yours at any of these levels 🙂

Scalability, OSS, Cloud computing business models…

In Business models, Cloud Computing, Open Source Software, Scalability on November 17, 2008 at 5:37 pm

Just to get a perspective on how technology trends (cloud computing, open source software, the need for web-scale infrastructure) and network effects impact business dynamics, consider:

  1. Moore’s law (CPU power doubles every 18 months)
  2. Gilder’s law (Fiber bandwidth triples every 12 months)
  3. Storage capacity doubles every 12 months (is this called Storage law?)
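These growth rates compound quickly; a rough calculation makes the point (the six-year horizon is an arbitrary example):

```java
public class GrowthLaws {
    /**
     * Growth factor after `months`, when each `periodMonths` multiplies
     * capacity by `base` (e.g. base 2, period 18 for Moore's law).
     */
    public static double factor(double base, double periodMonths, double months) {
        return Math.pow(base, months / periodMonths);
    }
}
```

Over 72 months, Moore’s law gives a factor of 2^(72/18) = 16x in CPU power, while Gilder’s law gives 3^(72/12) = 729x in fiber bandwidth, which is why bandwidth-hungry, network-centric architectures keep getting more attractive.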

While these laws propel improvements such as CMT, greater I/O throughput, and greater storage capacity and performance at lower cost, they also enable other developments such as virtualization, resulting in lower cost, greater performance, and better utilization levels.

Now consider some of the business dynamics based on network effects:
  1. Metcalfe’s law: the value of a network is proportional to the square of the number of users (think LinkedIn, your very own Ning network, etc.)
  2. Power law: the proverbial long tail and the so-called “Freemium” business model (i.e. free software == zero-cost marketing); you pay when you need extra services.
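Metcalfe’s law fits in two lines; the quadratic is the whole point:

```java
public class Metcalfe {
    /** Metcalfe's law: network value grows as the square of the user count. */
    public static long value(long users) {
        return users * users;
    }
}
```

Doubling the user base quadruples the value, which is why the “head” of the power curve is so lucrative and why everyone races to build network effects first.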

While few companies enjoy the “head” of the power curve, services like AWS and Zembly speed up the deployment cycle all along the curve, even on the long tail. It is clear that helping companies move up the power curve is a good business to be in; faster hardware just makes it faster! Why? Because there is room for every type of community or customer base, however small it may be, and the marginal cost of delivering to those segments shrinks every day, thanks to web-scale frameworks and the variable-cost infrastructure these innovations make possible.

Open source software plus the Cloud compute model helps with economies of scope (i.e. the breadth of activities you can handle profitably) in an unprecedented way: you just don’t worry about IT needs in the traditional manner anymore.
Meanwhile, the hardware innovation trends discussed above, plus network effects, plus the Cloud compute model can also help with economies of scale (e.g. the pay-as-you-go model with AWS), by letting you build a business model on minuscule “conversion rates” off a very large online customer base.