Assessing the Security Benefits of Cloud Computing

Is the glass half empty or half full?

With all this talk and reporting about security concerns, let’s change the channel for a moment and assess the potential security benefits of Cloud Computing.

In my view, there are some strong technical security arguments in favour of Cloud Computing - assuming we can find ways to manage the risks.

With this new paradigm come challenges and opportunities. The challenges are getting plenty of attention - I’m regularly afforded the opportunity to comment on them, plus obviously I cover them on this blog. However, let’s not lose sight of the potential upside.

In this post, I walk through seven technical security benefits. Some are immediate, others may arise over time and have conditions attached (some unstated for the sake of brevity). However, I’m including the longer-range benefits now to raise awareness. Some of the outcomes listed are available today without the Cloud, but they are either complex and slow to implement (and thus less likely to happen) or prohibitive for capital cost reasons. I don’t claim this is a definitive list - it reflects where my thinking is today.

Some benefits depend on the Cloud service used and therefore do not apply across the board. For example, I see no solid forensic benefits with SaaS. Also, for space reasons, I’m purposely not including the ‘flip side’ to these benefits; however, if you read this blog regularly you should recognise some.

On a sidenote, I believe the Cloud offers Small and Medium Businesses major potential security benefits. Frequently SMBs struggle with limited or non-existent in-house INFOSEC resources and budgets. The caveat is that the Cloud market is still very new - security offerings are somewhat foggy - making selection tricky. Clearly, not all Cloud providers will offer the same security.

Seven Technical Security Benefits of the Cloud

1. Centralised Data

  • Reduced Data Leakage: this is the benefit I hear most from Cloud providers - and in my view they are right. How many laptops do we need to lose before we get this? How many backup tapes? The data “landmines” of today could be greatly reduced by the Cloud as thin client technology becomes prevalent. Small, temporary caches on handheld devices or Netbook computers pose less risk than transporting data buckets in the form of laptops. Ask the CISO of any large company if all laptops have company ‘mandated’ controls consistently applied; e.g. full disk encryption. You’ll see the answer by looking at the whites of their eyes. Despite best efforts around asset management and endpoint security we continue to see embarrassing and disturbing misses. And what about SMBs? How many use encryption for sensitive data, or even have a data classification policy in place?
  • Monitoring benefits: central storage is easier to control and monitor. The flipside is the nightmare scenario of comprehensive data theft. However, I would rather spend my time as a security professional figuring out smart ways to protect and monitor access to data stored in one place (with the benefit of situational advantage) than trying to figure out all the places where the company data resides across a myriad of thick clients! You can get the benefits of Thin Clients today but Cloud Storage provides a way to centralise the data faster and potentially cheaper. The logistical challenge today is getting Terabytes of data to the Cloud in the first place.

2. Incident Response / Forensics

  • Forensic readiness: with Infrastructure as a Service (IaaS) providers, I can build a dedicated forensic server in the same Cloud as my company and place it offline, ready for use when needed. I would only need to pay for storage until an incident happens and I need to bring it online. I don’t need to call someone to bring it online or install some kind of remote boot software - I just click a button in the Cloud Provider’s web interface. If I have multiple incident responders, I can give them a copy of the VM so we can distribute the forensic workload based on the job at hand or as new sources of evidence arise and need analysis. To fully realise this benefit, commercial forensic software vendors would need to move away from archaic, physical dongle based licensing schemes to a network licensing model.
  • Decrease evidence acquisition time: if a server in the Cloud gets compromised (i.e. broken into), I can now clone that server at the click of a mouse and make the cloned disks instantly available to my Cloud Forensics server. I didn’t need to “find” storage or have it “ready, waiting and unused” - it’s just there.
  • Eliminate or reduce service downtime: Note that in the above scenario I didn’t have to go tell the COO that the system needs to be taken offline for hours whilst I dig around in the RAID Array hoping that my physical acquisition toolkit is compatible (and that the version of RAID firmware is supported by my forensic software). Abstracting the hardware removes a barrier to even doing forensics in some situations.
  • Decrease evidence transfer time: In the same Cloud, bit for bit copies are super fast - made faster by that replicated, distributed filesystem my Cloud provider engineered for me. From a network traffic perspective, it may even be free to make the copy in the same Cloud. Without the Cloud, I would have to do a lot of time consuming and expensive provisioning of physical devices. I only pay for the storage as long as I need the evidence.
  • Eliminate forensic image verification time: Some Cloud Storage implementations expose a cryptographic checksum or hash. For example, Amazon S3 generates an MD5 hash automagically when you store an object. In theory you no longer need to generate time-consuming MD5 checksums using external tools - it’s already there.
  • Decrease time to access protected documents: Immense CPU power opens some doors. Did the suspect password protect a document that is relevant to the investigation? You can now test a wider range of candidate passwords in less time to speed investigations.
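
To make the image verification point concrete, here is a minimal sketch of comparing a local copy of an evidence image against the hash a storage provider reports. The function names are my own, and I am assuming the provider hands back a plain hex MD5 string for the object; check what your provider actually exposes before relying on it.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_matches(path, provider_md5_hex):
    """Compare a local image against the hash reported by the provider.

    provider_md5_hex is assumed to be a plain hex MD5 string; surrounding
    double quotes and case differences are tolerated.
    """
    expected = provider_md5_hex.strip().strip('"').lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use flat even for multi-gigabyte disk images.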

3. Password assurance testing (aka cracking)

  • Decrease password cracking time: if your organisation regularly tests password strength by running password crackers you can use Cloud Compute to decrease crack time and you only pay for what you use. Ironically, your cracking costs go up as people choose better passwords ;-).
  • Keep cracking activities to dedicated machines: if today you use a distributed password cracker to spread the load across non-production machines, you can now put those agents in dedicated Compute instances - and thus stop mixing sensitive credentials with other workloads.
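
As a rough illustration of spreading a password audit across dedicated Compute instances, the sketch below shards a candidate list between workers so that no two instances test the same candidate. Everything here is hypothetical: the function names, the round-robin split and the salted SHA-256 scheme are stand-ins, and a real audit would use whatever hash format your systems actually store.

```python
import hashlib
from itertools import islice


def shard(candidates, worker_index, worker_count):
    """Yield every worker_count-th candidate, starting at worker_index."""
    return islice(candidates, worker_index, None, worker_count)


def weak_password(candidates, salt, target_hex):
    """Return the first candidate whose salted SHA-256 matches, else None."""
    for candidate in candidates:
        digest = hashlib.sha256(salt + candidate.encode("utf-8")).hexdigest()
        if digest == target_hex:
            return candidate
    return None
```

Each instance would call something like weak_password(shard(wordlist, i, n), salt, target) with its own index i, so adding instances shortens the wall-clock time without duplicating work.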

4. Logging

  • “Unlimited”, pay per drink storage: logging is often an afterthought, consequently insufficient disk space is allocated and logging is either non-existent or minimal. Cloud Storage changes all this - no more ‘guessing’ how much storage you need for standard logs.
  • Improve log indexing and search: with your logs in the Cloud you can leverage Cloud Compute to index those logs in real-time and get the benefit of instant search results. What is different here? The Compute instances can be plumbed in and scale as needed based on the logging load - meaning a true real-time view.
  • Getting compliant with Extended logging: most modern operating systems offer extended logging in the form of a C2 audit trail. This is rarely enabled for fear of performance degradation and log size. Now you can ‘opt-in’ easily - if you are willing to pay for the enhanced logging, you can do so. Granular logging makes compliance and investigations easier.
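
To show what “index those logs and get instant search results” means in its simplest form, here is a toy inverted index over log lines. It is a sketch under my own assumptions (the tokeniser, the function names and the in-memory dictionary are all illustrative); a real deployment would shard an index like this across Compute instances as the logging load grows.

```python
from collections import defaultdict
import re

# Treat runs of letters, digits and a few punctuation marks as one token,
# so IP addresses and timestamps stay in one piece.
TOKEN = re.compile(r"[A-Za-z0-9_.:-]+")


def build_index(lines):
    """Map each lowercased token to the set of line numbers containing it."""
    index = defaultdict(set)
    for number, line in enumerate(lines):
        for token in TOKEN.findall(line.lower()):
            index[token].add(number)
    return index


def search(index, *terms):
    """Return sorted line numbers that contain every search term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))
```

The cost of a lookup depends on the number of matching lines rather than the total log volume, which is why indexing up front pays off once logs get large.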

5. Improve the state of security software (performance)

  • Drive vendors to create more efficient security software: Billable CPU cycles get noticed. More attention will be paid to inefficient processes; e.g. poorly tuned security agents. Process accounting will make a comeback as customers target ‘expensive’ processes. Security vendors that understand how to squeeze the most performance from their software will win.

6. Secure builds

  • Pre-hardened, change control builds: this is primarily a benefit of virtualization based Cloud Computing. Now you get a chance to start ‘secure’ (by your own definition) - you create your Gold Image VM and clone away. There are ways to do this today with bare-metal OS installs but frequently these require additional 3rd party tools, are time consuming to clone or add yet another agent to each endpoint.
  • Reduce exposure through patching offline: Gold images can be securely kept up to date. Offline VMs can be conveniently patched “off” the network.
  • Easier to test impact of security changes: this is a big one. Spin up a copy of your production environment, implement a security change and test the impact at low cost, with minimal startup time. This is a big deal and removes a major barrier to ‘doing’ security in production environments.

7. Security Testing

  • Reduce cost of testing security: a SaaS provider only passes on a portion of their security testing costs. By sharing the same application as a service, you don’t foot the expensive security code review and/or penetration test. Even with Platform as a Service (PaaS) where your developers get to write code, there are potential cost economies of scale (particularly around use of code scanning tools that sweep source code for security weaknesses).

Your Thoughts?

What benefits do you see that I haven’t included in the above list? Where do you agree/disagree and importantly, why?

If you are curious about Cloud Computing and security, don’t miss out on future posts: subscribe by RSS or subscribe by email.

Collaboration in the Cloud, Virtual Worlds and the Hacker Mindset

Collaboration in the Cloud

Forward thinking companies use collaboration technologies to melt away the physical distance between disparate offices, remote workers and suppliers.  Investments in R&D projects to create the next generation of business collaboration technologies are starting to bear early fruit and are worth paying attention to - especially if you get paid to “do security”.  One major focus area is Virtual Worlds.

Teleporting Virgins

The big news in the Second Life research community is that avatars (“virtual people”) have successfully teleported between distinct virtual worlds.  The virgin teleporters went from a Second Life Preview Grid - an experimental grid completely disconnected from the Main Grid - to a virtual world running IBM OpenSIM.

At this stage there is intentionally no asset transfer going on at all - in other words, you can’t take your “stuff” from one world to another - but that will come in time as the Open Grid Protocol is extended.  Today just login and teleport are supported.  No stealing those trade secret “assets” yet ;-).

Linden Lab speaks to this issue:

Q: How will Linden Lab prevent property from being copied into other virtual worlds?
We’re paying extremely close attention to that question. We will be designing this with the Second Life community to ensure their needs are met. We want to stress that when it does become possible to move avatars between worlds, we will take the utmost care to protect the rights of Second Life property owners and creators. Linden Lab will not design a system that lets people openly violate the permissions of SL goods and take them to other worlds. We recognize that intellectual property is the engine that drives Second Life, and we are completely committed to preserving the qualities that make Second Life the unique, innovative and dynamic place that it is today.

With my “hacker-vision” ™ enabled I see *all kinds* of opportunities for mischief here.  I’m betting we’ll see imaginative attacks as the usual cat and mouse game of vulnerability research and vendor response plays out.  “Sorry boss, someone hijacked my avatar and now I’m stuck on this desert island for who knows how long!”.

Threat Profiling Second Life

Getting back to reality, people are already exploring Virtual World security.  Michael Thumann of ERNW in Germany is a pen-tester and security researcher. In this 10 minute video, he shares the results of his security research on Second Life.

He covers:

  • In-game cheating
  • Identity theft
  • Attacking 3rd party servers using Linden Scripting Language (think about the liability issues and the provider’s ability to track abusers)

For those interested in more detail, the full presentation he gave at BlackHat Europe 2008 in Amsterdam is here (pdf).

Of particular note, Michael applied a formal threat model approach to the research - STRIDE from Microsoft.

In a future post I’ll talk more about threat profiling in the context of Cloud Computing vulnerability research and specific API security vulnerability classes we can expect to see exploited.

Is Your Amazon Machine Image Vulnerable to SSH Spoofing Attacks?

On the 23rd June, Amazon quietly rolled out a security fix for an issue originally discussed in the Amazon developer forums. Amazon documentation was revised to reflect the change as follows:

“Amazon EC2 public AMIs (Amazon Machine Image) generate unique SSH (Secure Shell) host keys each time you launch an instance. This enables you to get the host SSH keys from the console output and verify the host to which you are connecting.”

Important note: SSH host keys enable clients to verify the server identity (“are you really my server?”) and are separate from SSH user keys that allow the user to prove their identity to the server (“he really is Jeff”).

What does this mean?

It means that EC2 instances created from a public AMI after June 23rd have unique SSH host keys and thus are not vulnerable to a man in the middle attack against the SSH protocol, but only *if* you manually verify the host SSH key during your initial SSH connection.
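
For anyone who wants to script that manual check, here is a minimal sketch that computes the classic colon-separated MD5 fingerprint of an OpenSSH public key line and compares it with the fingerprint copied from the console output. The function names are my own, and I am assuming the console shows the MD5 style fingerprint; newer OpenSSH releases print SHA-256 fingerprints by default, so adjust accordingly.

```python
import base64
import hashlib


def md5_fingerprint(public_key_line):
    """Return the colon-separated MD5 fingerprint of an OpenSSH public key.

    The line is expected to look like "ssh-rsa <base64 blob> [comment]".
    """
    blob = base64.b64decode(public_key_line.split()[1])
    digest = hashlib.md5(blob).hexdigest()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def host_key_matches(console_fingerprint, presented_key_line):
    """True if the key the server presented matches the console fingerprint."""
    expected = console_fingerprint.strip().lower()
    return md5_fingerprint(presented_key_line) == expected
```

If the two values differ on first connection, stop and investigate before typing anything into that session.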

OK, but I created my AMI before June 23rd - am I vulnerable?

According to Amazon, yes.  Every EC2 instance copied from a public AMI will have the same SSH host keys as the original AMI.  The only exception to this is if the original AMI creator spotted this problem and used a hook to force SSH host key regeneration upon first boot. This means that an attacker who, say, uses a DNS cache poisoning attack can intercept the communication between your SSH client and your AMI.

How can I fix my pre-June 23rd AMIs?

Regenerate the SSH host key.  The exact commands will depend on your operating system (hint: ssh-keygen).
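
As a hedged sketch of what that looks like when scripted, the helper below builds one ssh-keygen command per key type and only runs them when you explicitly turn off the dry run. The /etc/ssh directory and the ssh_host_<type>_key file names are the usual OpenSSH defaults, but they are assumptions here; confirm the paths and key types for your own operating system first.

```python
import os
import subprocess


def regeneration_commands(key_dir="/etc/ssh", key_types=("rsa",)):
    """Build one ssh-keygen command per key type, with an empty passphrase."""
    commands = []
    for key_type in key_types:
        key_path = os.path.join(key_dir, "ssh_host_%s_key" % key_type)
        commands.append(
            ["ssh-keygen", "-q", "-t", key_type, "-N", "", "-f", key_path]
        )
    return commands


def regenerate_host_keys(key_dir="/etc/ssh", key_types=("rsa",), dry_run=True):
    """Replace old host keys with fresh ones; only acts if dry_run is False."""
    commands = regeneration_commands(key_dir, key_types)
    if not dry_run:
        for command in commands:
            key_path = command[-1]
            # Remove the shared keys first so ssh-keygen does not prompt.
            for stale in (key_path, key_path + ".pub"):
                if os.path.exists(stale):
                    os.remove(stale)
            subprocess.check_call(command)
    return commands
```

After regenerating, restart the SSH daemon and expect existing clients to warn about a changed host key, which in this one case is exactly what you want to see.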

Who is to blame?

Either the creators of the original AMI or Amazon - depends how you look at it.  If Amazon created the public AMI then it could be argued they are responsible.  However, anyone can submit a public AMI and Amazon makes no guarantee they are fit for use (Amazon do review the AMI listing according to their documentation).

Amazon can in fact make the argument they are acting in the interests of their users by implementing a shared solution to key regeneration (rather than requiring each user to manually regenerate the ssh host keys after booting an image).   That’s fine going forward but what of potential exposure to customers using the pre-June 23rd public AMI copies?

Just to be clear, it’s not the fault of SSH - ‘secure channels’ require proper key management and the need for unique host keys is well documented.

Are there any mitigating factors?

Yes, if you have used security groups to limit SSH access to your AMI from IP ranges you trust (rather than the entire Internet).  You’ll still want to regenerate the ssh host keys sooner rather than later.
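
To illustrate the idea behind such a rule (this is not the EC2 API, just the logic it applies), here is a small sketch that checks whether a source address falls inside a list of trusted CIDR ranges. The function name and the example ranges are hypothetical.

```python
import ipaddress


def ssh_source_allowed(source_ip, trusted_ranges):
    """True if source_ip falls inside any trusted CIDR range."""
    address = ipaddress.ip_address(source_ip)
    return any(
        address in ipaddress.ip_network(cidr) for cidr in trusted_ranges
    )
```

A rule of 0.0.0.0/0 matches every IPv4 address, which is the “entire Internet” case the mitigation is meant to avoid.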

Is the Amazon environment vulnerable to Man-in-the-middle attacks?

I don’t know.  But that isn’t the real question. The real question is whether the path between you and your AMI is immune to MITM attacks, and the answer is most definitely no.  If SSH on your AMI is only accessible from another AMI then it’s a fair question, but it’s unlikely Amazon are going to show you their network diagrams ;-).  From experience performing MITM attacks, I would assume most networks are vulnerable (one of the reasons why we use SSH).

Why Didn’t Amazon Tell Me I’m Vulnerable?  They know from their logs what AMIs I use!

Didn’t they?  Whoops - naughty Amazon :P.

But seriously, Amazon are not responsible for the configuration of the public AMIs you use.  It’s important not to confuse the AMI selection and cloning mechanism that Amazon provides, with the content of an AMI itself.

Does Amazon have a mailing list for customers to learn about new security problems (even if it’s not Amazon’s fault)?

Not that I know of.   Right now you have to search forum posts and monitor documentation updates - which is time consuming and makes it easy to miss something.  I also can’t find an area on the AWS website where they collect security related items together (e.g. best practices, advisories, key management).   In my view, this is a shame as it probably undermines the effort that Amazon are putting into their security  (for some customers, if they don’t “see it”, it doesn’t “exist”).

A ‘Security’ link on the main AWS homepage pointing to those resources would go a long way to improving the visibility of the AWS security related information.

Interview on IMI Tech Talk / KFNX: Cloud Computing and Security


A quick post to say a very warm welcome to IMI Tech Talk / KFNX listeners!

I was recently approached to take part in an interview about Cloud Computing and Security on IMI Tech Talk, broadcast on KFNX News Talk Radio.  KFNX is a US based radio station based out of Phoenix, Arizona.  More in-depth than the previous opportunity, a range of Cloud Computing technologies were discussed in the 30 minute segment:

  • Who am I?
  • What is cloud computing? (*that* question!).
  • Introduction to virtualization.
  • Examples of cloud computing services that exist today.
  • Barriers to entry.
  • Security issues of processing or storing data in the cloud
  • cloudsecurity.org

I will update this post when the audio archive of the show is posted.

I did mention I would provide links to useful Cloud Computing resources (as my mind went totally blank during the interview!) - watch for a post next week covering the blogs I read regularly.

Cloudsecurity.org was born as I couldn’t find any dedicated web resource discussing Cloud Computing and Security.  If there are subjects you want to see covered, feel free to leave a suggestion in the Skribit sidebar to the right.

I do welcome comments in response to blog posts on the blog itself - don’t be shy :-).

For private communications I can be reached at craig.balding@gmail.com.

My thanks to the IMI Tech Talk team, particularly Tom and Eric.

Enjoy the blog,

Craig

Cloudsecurity.org Interviews Guido van Rossum: Google App Engine, Python and Security


In this interview, cloudsecurity.org talks to Guido van Rossum about Python, Google App Engine and security.

Guido is the creator of the Python programming language and more recently, Google App Engine team member.  His involvement with the App Engine project was pretty late - the code “was almost ready for release” when he got involved.  The security architect of App Engine was primarily project lead, Kevin Gibbs, supported by the rest of the App Engine crew and the Google Security Team.

The Interview

cloudsecurity.org: What security principles did you follow for App Engine?

GvR: While I can’t share any specifics on what we’re doing to secure App Engine, I can say that the main principle we’ve followed could be called “defense in depth”. We’re not relying exclusively on a secure interpreter, or any other single security layer, to protect our users.

cloudsecurity.org: Please provide some examples of how those principles played out in terms of the current implementation?

GvR: Sorry, we don’t divulge such information.

cloudsecurity.org: What criteria did you apply to Python module selection?

GvR: We first looked for modules that were useful and straightforward to audit. If a module was large or complex, we’d only audit it (fixing things we found) if it was deemed essential or at least useful for a large number of users; otherwise we’d exclude it.

cloudsecurity.org: What do you see as the security risks inherent in exposing an interpreter runtime in a shared environment?

GvR: I presume you’re asking about risks to users, like providing accidental access to data belonging to another app. We’ve taken extensive measures to isolate different apps from each other. For example, each app runs in a separate process, and the datastore prevents an app from accessing data belonging to other apps.

cloudsecurity.org: I recently attended a fascinating talk by Justin Ferguson (a Seattle based security consultant) at eusecwest in London.  He gave a great talk exploring security vulnerabilities in language interpreters and specifically highlighted some security weaknesses in Python App Engine.  What are your thoughts on his research and specifically the Python issues he highlighted?  When do you anticipate they will get fixed?

GvR: We’ve anticipated all of the possibilities raised in Justin’s talk, and took measures to protect our users. Justin highlighted weaknesses in Python, but not in App Engine. Furthermore, our security model does not rely solely upon protections within the Python interpreter; there are additional protections that these external analyses have missed.

cloudsecurity.org: How do you contain an attacker that exploits bugs in App Engine from exploiting the underlying OS and potentially interfering with other users’ processes or attacking backend systems?

GvR: You are correct that there are strong measures in place, but I’m not at liberty to discuss details.

cloudsecurity.org: Python was the first language to get the App Engine treatment, what language is next and what are some of the language specific security challenges the team has had to deal with?

GvR: Although I can’t comment on what language is next, we are working on this, and have gotten a lot of great feedback from our developers. As far as language-specific security challenges, they stemmed mostly from the complexity of the Python interpreter. We spent a lot of time auditing this, and did a great deal more than just identifying buffer overflows.  I can also add that Google is actively researching the security of interpreted languages.  Google engineers routinely contribute security fixes to open source projects, including but not limited to Python.

cloudsecurity.org: How does the team decide when ‘enough is enough’ in terms of hardening the interpreter?

GvR: That’s not really how we approach it. We realize that security is an ongoing effort, and try to stay ahead of threats through continuous monitoring and testing.

cloudsecurity.org: Some commentators have suggested that perhaps the difficulty of auditing the implementation led to some modules being more heavily restricted than perhaps necessary.  What are your thoughts on that and what plans, if any, are there to bring back code objects/functions that were eliminated in the initial release?  (with the benefit of hindsight).

GvR: The only thing we are likely to put back is the _ast module, which was not audited based upon an underestimation of its usefulness (see my answer to question #3 above).  We will also put back some dummy functions and other objects whose absence currently prevents some popular frameworks from being loaded without modifications. For example, some harmless functionality in the imp module will come back. We’re also looking into making urllib2 work (to some extent), though that’s not really a security issue but merely a matter of API adjustment.

cloudsecurity.org: It is reported that Google encourages small groups to go off and create.  How involved were the Google security team with App Engine in terms of design and implementation review/testing?  Given the dynamics, is it possible to have a meaningful security process that shadows the development process?

GvR: The Google Security team is involved in everything we do. They have been extremely helpful.

cloudsecurity.org: How can people report security weaknesses they discover in App Engine?  What commitment does Google give in terms of dealing with vulnerability reports?

GvR: There is a standard process for submitting security issues. See http://www.google.com/corporate/security.html. Google moves very fast to protect its users when a verifiable security vulnerability is reported.

cloudsecurity.org: One concern is the potential misuse of App Engine to exploit security vulnerabilities in visitors’ browsers.  This is not a new problem per se, shared hosting providers know all about this.  But with Google and other Cloud providers, the scalability potential is much higher.  What are your thoughts on this and what pro-active steps is Google taking to detect and terminate evil apps?

GvR: This is high on our list of concerns. We deal with this through a combination of restrictions on what you can do (e.g. certain HTTP headers and ports are off-limits) and, again, monitoring.

cloudsecurity.org: Beyond App Engine, what role do you think Python will play in the Cloud both now and in the future?

GvR: Sorry, I’m not prone to philosophizing about the future.

cloudsecurity.org: Trust is often cited as a barrier to enterprise adoption of Cloud Computing.  What role do you personally think Google can play in building that trust?

GvR: I think trust is built up over a long period of experience. Our actions in terms of being open to our users will be the most important factor in establishing trust. Of course, Google’s reputation also helps: everybody understands that Google doesn’t want its name associated with a bad product.

cloudsecurity.org: Looking at the Cloud Computing landscape beyond Google, what are your thoughts on the current state of Cloud Computing and Security?

GvR: It’s obvious that Cloud Computing is only just taking off. The next few years will be very exciting.

cloudsecurity.org: Lastly, what are some of your favourite App Engine apps?

GvR: There are too many to enumerate. If you insist on a highlight, well, I like Rietveld (http://codereview.appspot.com), a tool for collaborative code review which I (largely) wrote myself. It is open source and includes some essential components from Mondrian, a similar internal tool which I created before I joined the App Engine team.

Thanks

My thanks to Guido for his time and sharing his views.

A Question of Integrity: To MD5 or Not to MD5

Cloud Storage offers pay per drink off-site storage. Data to be saved is shuffled from the customer to the Cloud Storage Provider by the network. This all works wonderfully most of the time: what you upload is what you get back later. But what happens when the gremlins strike and what you send is not what is received?

This happened recently to some Amazon S3 customers. There were complaints in the AWS forums about ‘S3 Corruption’. The first post in the forum was recorded at Jun 22, 2008 5:05 PM PDT (although in subsequent posts some people reported emailing Amazon prior to this):

we are having some serious S3 issues.

all data we store on S3 has gone through the same code path for months. starting a couple days ago a small percentage of the objects we are retrieving are not checksumming to the correct values. we hash and store objects by checksum and rehash the objects when we retrieve to ensure there is no data corruption. all the objects we’re having issues with were uploaded at approximately the same time period a few days ago.

we’ve stored 10’s of millions of objects in S3 and never encountered such problems. please let me know ASAP if you have any idea what could be going on here. thanks.

Amazon responded 6 minutes later (!) and started investigating. To troubleshoot they asked customers to email aws@amazon.com with the ‘Bucket-Name and few keys that you believe are having issues’.

Others weighed in reporting similar problems. Amazon provided status updates and on Monday Jun 23rd at 6:10pm PDT, provided the following explanation:

We’ve isolated this issue to a single load balancer that was brought into service at 10:55pm PDT on Friday, 6/20. It was taken out of service at 11am PDT Sunday, 6/22. While it was in service it handled a small fraction of Amazon S3’s total requests in the US. Intermittently, under load, it was corrupting single bytes in the byte stream. When the requests reached Amazon S3, if the Content-MD5 header was specified, Amazon S3 returned an error indicating the object did not match the MD5 supplied. When no MD5 is specified, we are unable to determine if transmission errors occurred, and Amazon S3 must assume that the object has been correctly transmitted. Based on our investigation with both internal and external customers, the small amount of traffic received by this particular load balancer, and the intermittent nature of the above issue on this one load balancer, this appears to have impacted a very small portion of PUTs during this time frame.

What are some of the takeaways?

  • If you are directly using the AWS S3 API, make sure to calculate and send MD5 checksums along with actual data. Check status return codes - an HTTP 400 error code means ‘something went wrong’ - respond appropriately.
  • If you are relying on 3rd party tools to access S3, be sure to check with your software vendor that they are following the advice from Amazon to use MD5. If they are not then your data can get silently corrupted…
  • Downloads, aka HTTP GETs, can also be affected. The thread in the forum continues and questions are asked as to whether the corruption caused by the loadbalancer was affecting both incoming and outgoing traffic. The conclusion was yes. If you are hosting media on S3, and the browser is using partial GET requests (to download in chunks) then the corruption will not be automatically detectable.
  • If your business relies on Cloud Storage, are you prepared to wait 36 hours for a resolution? This isn’t a swipe at Amazon, this is true for any provider. Check your SLAs, check the trouble ticket resolution times, ask about availability of experts for troubleshooting etc.
  • Cloud Providers will increasingly need to instrument their services such that they can ‘early detect’ negative operational events. In this case, Amazon has stated plans to use better logging and analysis to automate detection of unusual error patterns (i.e. anomaly detection).
  • This incident - caused by a malfunctioning Amazon load balancer - did not make it onto the AWS status page at http://status.aws.amazon.com/. Taking Amazon at face value, this incident only affected a small number of transfers, relative to the total number of S3 transfers. But this raises the question: what level of outage or service problem needs to happen before Amazon will flag the issue on their status page? On a sidenote, based on the timestamps, 31 hours passed between the load balancer being taken out of service and Amazon providing the explanation on the forum.
  • When Amazon update their S3 API documentation, it would be useful to have entries in the S3 API index for ‘checksum’, ‘MD5’, ‘integrity’ and ‘corruption’.
  • Stepping back, will customers hold Cloud Service Providers to a higher standard than their own internal IT teams?
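
For the first takeaway, here is a minimal sketch of the client side of that advice. The Content-MD5 header value is the base64 encoding of the raw 16-byte MD5 digest (not the hex string), and the helper names are my own. I have deliberately left out the actual HTTP call, since that depends on the S3 client library you use.

```python
import base64
import hashlib


def content_md5(body):
    """Return the Content-MD5 header value: base64 of the raw MD5 digest."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def upload_succeeded(status_code):
    """Treat only 2xx responses as a confirmed upload."""
    return 200 <= status_code < 300


def download_intact(body, expected_md5_hex):
    """Re-hash a downloaded object and compare with the hash kept at upload."""
    return hashlib.md5(body).hexdigest() == expected_md5_hex.lower()
```

Keeping your own record of each object’s hash at upload time is what lets download_intact catch corruption on the way back out, which is how the customer in the forum thread spotted the problem in the first place.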

I’m sure there are more takeaways I didn’t cover. What say you?

###

Kudos for the heads-up on the S3 issue goes to my friend and colleague Jason Harper - network supremo and crypto-head. Thanks Jason!