
Insight from DevOps Thought Leaders

12th APRIL 2013

Curated by Benjamin Wootton www.devopsfriday.com


Welcome to DevOps Friday!


Each week, we summarise the best content coming out of the DevOps community. The newsletter is curated by DevOps enthusiast Benjamin Wootton. I welcome any feedback or suggestions for newsletter content via Twitter @benjaminwootton. To sign up for future editions, please visit the DevOps Friday home page.

DevOps Friday is proudly supported by Server Density, a server and website monitoring tool. We're supporting DevOps Friday in its quest to continually publish bite-size insights and news from the DevOps world. If you like this, please visit Server Density and the Server Density blog.

contents

application support is perfect for devops
Matt Watson
Finally, organizations can embrace a DevOps approach that improves application support even if they don't have a formal DevOps team. Stackify is the only solution that provides the proper access, tools and intelligence to improve application support efficiency.

growing an ops team from one founder
David Mytton
In the early days of 2009, it was just me running the Server Density monitoring infrastructure. Over the last 4 years the service has grown in terms of team members, data volume, customers and infrastructure, so here are a few lessons from scaling the ops team and how things are run.

matthew skelton on software operability
Matthew Skelton
Operability is an engineering term concerning the qualities of a system which make it work well over its lifetime, and software operability applies these core engineering principles to software systems.

why devops matters (to developers)
Benjamin Wootton
DevOps stems from the idea that developers and operations should work more closely together to increase the quality of the systems that we build and operate, but most of the enthusiasm and thought leadership appears to come from the Operations side of the fence.

the state of the art monitoring stack
Sandy Walsh
For many of us starting in this area, our concept of monitoring consists of top, some apache, mysql and application log files. We were scared off monitoring by old monolithic products that required huge licensing fees and armies of professional-services people. Thankfully, times have changed.

the benchmark you're reading is probably wrong
RethinkDB Team
Unfortunately, most benchmarks published online have crucial flaws in their methodology, and since many people make decisions based on this information, software vendors are forced to modify the default configuration of their products to look good on these benchmarks.

this week in devops


relaunching devops friday

Around a year ago, I started a small newsletter summarising the best DevOps-related links of the week. Since then, interest in DevOps has continued to grow. Fantastic blog posts and articles are coming out each week, advancing the state of the art. Conferences are generating massive amounts of interesting content and discussion, and the conversation on Twitter and on the podcasts is entertaining and educational. Considering this, I want to take DevOps Friday to the next level, using it as a hub to capture and communicate the best content of each week. If you like this issue, please consider sharing it with friends and colleagues and encouraging them to sign up at the DevOps Friday home page.

Keith and Mario's Guide to Fast Websites
MongoDB: Large Scale Data Centric Applications
Hiring for the DevOps Toolchain: The Need for Generalists
How Badly Set Goals Create a Tug-of-War in Your DevOps Organization
Treating Servers as Cattle, Not as Pets
Achieving Awesomeness with Opscode Chef (Part 2)
Making A Point With SLAs
Amazon Cloud: A River Runs Through It
Using Message Queues in Cloud Applications
Are You Unknowingly Replicating Your Failure as a DBA?

application support is perfect for devops
Matt Watson @stackify


Matt Watson founded Stackify in 2012 and as CEO provides the vision and leadership for the direction of the company. Matt's goal is to simplify IT operations via Stackify's DevOps solution. Prior to founding Stackify, he was the founder and CTO of VinSolutions. Matt is an entrepreneur at heart and excels at product and software development.

When DevOps emerged in 2009, the gap between development and operations teams finally started to get the kind of media and vendor attention it deserved. DevOps gets developers more involved in IT operations so they can more rapidly resolve software issues that arise after deployment. Without access to production applications and servers, even development managers and system admins need help identifying and solving problems, which is horribly inefficient. Some of us were doing DevOps even before it had a name. At my last company, the lead developers were heavily involved in hardware purchases, setting hardware up, deploying code, monitoring systems and much more. The problem was that only three of the 40 developers had production access. The chosen three (including me) spent an inordinate amount of time helping others troubleshoot and fix application bugs. While I didn't trust the junior developers with the keys to the kingdom, I nevertheless would have preferred them to have the ability to fix their own bugs. Because our application support processes weren't very efficient, I wasted a lot of my own time fixing bugs instead of building new features.

Later, I started Stackify because I believe that more developers should be involved in production application support. That way, a couple of employees like the three of us at my old job don't become a bottleneck. Meanwhile, junior developers, QA and even less technical support people can get server access to view log files and other basic troubleshooting information. Sadly, in most companies today the lead developer or system admin ends up tracking down a log file or finding some minor bug in another developer's app when they should be working on more important projects.

Developers should be more involved in the design and support of the infrastructure our applications rely on, since we are ultimately responsible for the applications we create. We should be able to deploy our applications, monitor production systems, ensure everything is working properly, and be held responsible when our applications fail in production. Finding and fixing bugs is often more difficult than it sounds, however. Just think for a moment: what do your developers need access to? If your team is anything like mine was, they need:

A database of application exceptions
Application and server log files
Windows Event Viewer
Application and server config files
SQL databases to test queries
Scheduled jobs history
Server monitoring tools
Performance monitoring tools

... and the list goes on.

As nice as it sounds, giving developers access to the information they need has been difficult because:

The data resides in many locations
Too many tools exist to access different types of information
It can be difficult or impossible to control access rights and protect sensitive data
Developers should be prevented from making changes
It is difficult or impossible to audit what developers access

To overcome the challenges outlined in this post, my team and I at Stackify built a solution that gives developers access to all the information they need to provide effective application support. It also solves the problems that have prevented such information sharing in the past. With Stackify, you can eliminate bottlenecks in development teams and scale application support teams without additional head count. Finally, organizations can embrace a DevOps approach that improves application support even if they don't have a formal DevOps team. Stackify is the only solution that provides the proper access, tools and intelligence to improve application support efficiency.

When a developer is trying to fix a bug, nothing is more frustrating than lacking the details necessary to reproduce or fix the problem. Troubleshooting application problems can require access to a lot of information, which in turn involves a lot of screens and a lot of logins. Imagine getting all the information you need in a single screen, and then having the ability to drill down into it with a couple of mouse clicks.

growing an ops team from one founder
David Mytton @serverdensity
David Mytton is the founder of Server Density. He has been programming in PHP and Python for over 10 years, regularly speaks about MongoDB (including running the London MongoDB User Group), co-founded the Open Rights Group and can often be found cycling in London or drinking tea in Japan.

In the early days of 2009, it was just me running the Server Density monitoring infrastructure. The service came out of beta in the summer and immediately had a few paying customers, which helped to fund the rental of a couple of slices from Slicehost (fancy VPSs). The volume of traffic, simplicity of the service components and small number of servers meant that there were few problems. Over the last 4 years the service has grown in terms of team members, data volume, customers and infrastructure, so here are a few lessons from scaling the ops team and how things are run.

It also means trying to find the quickest way to do things


Time is something you don't have much of, and one of the slowest things is transferring large quantities of data over the internet. We had an unexpected failure where we had to do a full resync of a MongoDB slave in a different data centre, which would've taken 6 days. Instead, we copied the data onto a USB disk drive and had UPS ship it to the other facility. Network transfer speeds worked out at around 5MB/s, whereas UPS delivered at 11MB/s.
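Back-of-the-envelope arithmetic like this is worth writing down. The sketch below uses the figures from the incident above; the dataset size and the courier transit time are inferred from those figures, not stated in the article:

```python
DAY = 24 * 3600  # seconds in a day

def effective_mb_per_s(data_mb: float, seconds: float) -> float:
    """Effective throughput for moving data_mb megabytes in `seconds` seconds."""
    return data_mb / seconds

# Dataset size implied by the article: a 6-day sync at ~5 MB/s.
data_mb = 5 * 6 * DAY                           # roughly 2.6 TB

network = effective_mb_per_s(data_mb, 6 * DAY)  # the 6-day network resync
# Assumed ~2.7-day courier transit, consistent with the quoted ~11 MB/s.
courier = effective_mb_per_s(data_mb, 2.7 * DAY)

print(f"network: {network:.1f} MB/s, courier: {courier:.1f} MB/s")
```

Never underestimate the bandwidth of a courier carrying disks; its latency, of course, is measured in days rather than milliseconds.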

Consider support contracts


In addition to general sysadmin support from your hosting provider, you can buy commercial support contracts for the software products you're using. This could be Ubuntu Linux, Nginx or MongoDB. Depending on the level of support, you can get some pretty involved help when you need it most. However, they're often very expensive and unaffordable for a startup; even with the greater resources we now have, support contracts are aimed at enterprises with big budgets. One way to work around this is to be very involved with the projects you use. I was an early adopter of MongoDB and have a close relationship with 10gen, the company behind it, so I am able to get good deals on support. Also consider what support you really need. Our support contract with MongoDB was well used in the early days because it was a new technology. It's significantly more stable nowadays, and with other products, like Apache for example, we've never had an issue.

Let other people help


You really need at least one other person to be able to take on-call duties when you're away, but if that's not possible, or as a backup, you could make use of services provided by your hosting company or a third party. We quickly moved from Slicehost to managed servers at Rackspace, and they were able to do monitoring and respond to issues like servers being down or services not running. They took special instructions for different scenarios, and you could always phone them and ask them to perform certain actions. I remember several instances where I was away from my computer and was able to phone Rackspace support, asking them to perform some basic recovery actions whilst I got back online.

Bootstrapping often means leaving things to the last minute


Ideally you'll anticipate problems and have a solution well in advance, but that's not always possible. The most likely reason in the early days is cash, or lack of it. In August of 2009 I'd just completed our migration from MySQL to MongoDB, which still had problems with eagerly eating up disk space. This prompted setting up a new server with increased disk space, because resizing a Slicehost instance would've meant some hours of downtime. It went down to the very last few bytes of remaining disk space as the sync completed.
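A few lines of monitoring would raise the alarm long before the last few bytes. A minimal sketch using only the standard library; the 10% threshold and the idea of running it from cron are illustrative choices, not from the article:

```python
import shutil

def disk_free_fraction(path: str = "/") -> float:
    """Return the fraction of the filesystem at `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total

# Run this periodically (e.g. from cron) and alert well before the disk fills.
if disk_free_fraction("/") < 0.10:
    print("warning: less than 10% disk space free")
```

The point is to turn "we noticed when writes started failing" into an alert with hours or days of headroom.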

Figure out what you have to do and what can be outsourced


I consider keeping core engineering in-house very important for technology/software companies, but there are lots of things that need doing to run operations that could be outsourced to (trusted) individuals on an ad-hoc basis. Engineers are terrible at valuing their own time and often use the argument: "why pay for something I could build/install/configure myself?". Candidates for outsourcing are things like running through PCI compliance checklists, setting up centralised logging, reorganising servers (e.g. upgrading base OSs), researching CDN providers, integrating CI tools, etc. You always want someone technical managing the project to keep things on track and validate the end results, but these are things you don't need to do yourself.

Don't forget the human aspect

There are a lot of cool tools which help to automate processes, and these should be used as much as possible. However, it's still real people running things in the end. This is the really difficult part of having a small team, because everyone has to pitch in and it can be difficult to share the workload when just a few people know how things work.

You have to consider who will take the call when things break. How quickly can they get to a computer they can use to fix things? Are you out drinking on a Friday? What happens if someone falls ill? This could be a minor cold or a major emergency, and it could affect the individual engineer or their family members. Does the on-call engineer have enough phone battery? Can they hear their ringtone? Who is the backup if the primary doesn't pick up? This is especially the case with outages: they often happen at inconvenient times, and big incidents might require you to work for significant periods of time. Dealing with communicating with customers, fixing problems and recovering data can be exhausting, especially when there's nobody else to help. The ultimate goal is to build your team so that shift-based on-call cover can be provided, but that's difficult in the beginning with limited resources (for both people and multi-geographic redundancy).

Nobody is as invested in your service as you and your team

Although services like Rackspace's support are helpful in certain situations, they're never able to know the full story behind your service and how to deal with complex components. For example, MongoDB was a completely new database and didn't have single-server durability for some time: a bad shutdown could require a lengthy database repair, which it was important to take steps to avoid, such as by properly shutting the database down before powering off the server. Knowing about the weaknesses and how to deal with them requires greater knowledge of your setup than basic vendor support is going to provide. These things should be a stopgap for, or a supplement to, the end goal of growing your own team. The whole point of devops is that it's a mixture of engineering and operations, so you don't need to hire dedicated sysadmins. This works well for small startup teams, but you will eventually want someone (or multiple people) responsible for day-to-day operations. Engineers still engage with the team, can deploy, work on testing and debug problems, but things like dealing with a failed disk drive or implementing backups are really outside the remit of devops in a large team. You know you're there when you can start hiring site reliability engineers!

Hack traveling

As part of the founding team, and even as an engineer, you're likely to have to travel at some point: to conferences, to meet customers, to pitch vendors, or maybe on holiday! It's relaxing to be uncontactable on the plane, but it's also scary because you have no idea if everything is still running. On one of my trips to Japan, as soon as I stepped off the 12-hour flight to Tokyo Narita I had a flood of SMS alerts: one of our MongoDB servers had encountered a problem 4 hours previously. One of our engineers had been assigned on-call for my flight and had already worked with the guys at 10gen and resolved the problem. You'll realise you become a slave to connectivity, so trips to Japan are fine, but Tajikistan isn't really an option. You need to be able to get internet access and power wherever you are: tricks include visiting Starbucks, carrying external hotspots, and not running things like updates when you're away!

matthew skelton on software operability
Matthew Skelton @matthewskelton
Matthew has been building, deploying and operating commercial software systems for over 13 years. He has engineered software systems for organisations in finance, insurance, pharmaceuticals, travel and media, as well as for MRI brain scanners and oil and gas exploration. He looks after build and deployment at thetrainline.com, the UK's leading rail ticket vendor, which operates one of the country's busiest web infrastructures.

What is software operability and why is it important?


Operability is an engineering term concerning the qualities of a system which make it work well over its lifetime, and software operability applies these core engineering principles to software systems. An operable software system is one which not only delivers reliable end-user functionality, but also works well from the perspective of the operations team. Such software has been built to operate successfully without needing application restarts, server reboots, load-balancer hacks, or any of the countless other fixes and work-arounds which operations teams have to use to make many business software systems work in practice on a daily basis. Software systems which follow software operability good practice will tend to be simpler to operate and maintain, with a reduced cost of ownership and almost certainly fewer operational problems.

Where did your interest in operability come from?

Early in my career I built software systems for MRI (brain) scanners and oil and gas exploration. Operability for such systems is essential; it's no use building an MRI scanner which can produce 3D brain images if it needs rebooting after taking every second image. Likewise, it was cheaper to drill a new oil well than to extract a faulty down-hole pressure gauge; these systems had to operate reliably with minimal human intervention. Since then I have too often seen the negative effects of operational features being dropped before go-live, which usually results in significant operational costs and more incidents in Production. There is no good reason in 2013 why businesses should put up with (and pay for) second-rate software which needs arduous human attention every few hours or days just to maintain normal operation. In my experience, most modern business software is simple enough (at a systems level at least) that we can significantly reduce operational cost and downtime by introducing software operability as a key concern for software product delivery teams. Ultimately, it's about lower cost of ownership, better engineering, and fewer late nights debugging flaky software!

What are some of the low-hanging fruit a software team can tackle to make their software more operable?

The best thing a software team can do to make their software more operable is to write a draft operation manual alongside feature development. The operation manual (aka run book) eventually contains the full details of how the software system is operated in Production. By writing a draft operation manual, the software team can demonstrate to the operations folks that either all the major operability concerns have been addressed, or that some operability criteria are beyond the expertise of the software team; at least there will be no nasty surprises when the software is put into operation. The act of having to think about things like backups, time changes, health checks, and clear-down steps in the context of their software tends to mean that team members will implement small but crucial changes to the software to provide hooks for monitoring, alerting, backups, failover, etc., which improve the operability of the software.

Beyond that, what would represent a higher level of operability?

Software with a high level of operability is easy to deploy, test, and interrogate in the Production environment. Highly operable software provides the operations team with the right amount of good-quality information about the state of the service being provided, and will exhibit predictable and non-catastrophic failure modes when under high load or abnormal conditions. Systems with good software operability also lend themselves to rapid diagnosis and simple recovery following a problem, because they have been built with operational criteria as first-class concerns.

How do you make the case for operability when the main business focus is usually on features?

I think one of the most important changes to make is to stop using the term "non-functional requirements" for things like performance and stability requirements; instead, use the term "operational requirements", or even better, "operational features", and include these in the product backlog alongside end-user features. This gets away from the artificial (and unhelpful) contrast of functional vs non-functional requirements, and helps to communicate to the business that the operational aspects of the software also require specific features if the business requirements are going to be met. A useful approach (discussed at the excellent DevOpsDays 2013 event in London) is to make the product owner responsible not only for feature delivery but also for the operational success of the software; after a few early-morning Priority 1 call-outs due to the application servers needing a restart, the product owner will probably start to realise the importance of operational features! Making any operational problems more visible is also crucial. If the operations team needs to restart the app servers every night, make this visible, and include the product owner or business sponsor in the email notifications every day. Draw analogies with systems familiar to the product owner: if they had to have their car fixed by a mechanic every two days, they'd soon either buy a new car or pay to have the faulty part replaced. So don't hide the effort you're expending on keeping their software product running; make sure they see the cost (and the pain!).

Where should we look for further information on operability?

A good starting point to learn more about software operability is the excellent book Patterns for Performance and Operability by Ford, Gileadi, et al (ISBN 978-1420053340), which explains the core concepts and works through several real-world examples. In the 1980s and 90s the US space agency NASA did some really useful work on operability as part of the space shuttle programme, and much of the research is available online. Richard Crowley's talk on Developing Operability at SuperConf 2012 is also worth reading and understanding. I recently began a blog at softwareoperability.com, which I plan to turn into a book in late 2013 or early 2014, to help software teams get to grips with software operability. It's worth saying that teams with a DevOps approach will generally produce systems with better operability than teams split into the traditional Dev and Ops silos. I'm approaching software operability from this siloed world of Dev and Ops, mainly because this is where most organisations still are today, and in fact I hope that by gaining a better understanding of software operability, many engineering teams will move instinctively towards a DevOps model. More info can be found at softwareoperability.com, @Operability and #operability on Twitter.

why devops matters (to developers)
Benjamin Wootton @BenjaminWootton
Benjamin Wootton is the Principal Consultant at Autumn Devops, a London, UK based consultancy specialising in DevOps and software release automation. He has over 10 years' experience working at the intersection of agile Java software development and operations. He is the maintainer of the popular DevOps Friday newsletter.

evOps stems from the idea that developers and operations should work more closely together communicating, knowledge sharing, and collaborating to increase the quality of the systems that we build and operate.

DevOps Increases The Focus On Production


Though software teams might divide themselves into development, QA and operations, these can be slightly arbitrary distinctions. The business who are paying for all of this only care about the net output of what those three teams deliver the value that the finished production software is bringing to the organisation. Our goal as developers should be to deliver not just source code but a reliable product, feature or system that is in production and that people will gain business benefit from. Though we might be personally motivated by cutting code, it is all for nothing if our work never makes it to production, or if the users of the application have a bad time once its out. To my mind, the operational focus on production and delivery espoused by DevOps is a good thing which usually leads to much more net value for the business. DevOps oriented development teams have a focus on value and their user base, rather than their code base.

software bug, ahardware outage or a failed rollout. They wont care if it was a human error or some arbitrary combination of events. All they care about is that they cant use the system as intended. This might be a product of the systems on which Ive worked, but with good unit and integration testing and good QA testing, it is possible to catch most software bugs that would impact a large percentage of the user base. However, where things more typically go wrong is when the system comes into contact with the real world. For instance, we might find our code performs badly under real world load, that a disk fills up, or that users use the application in a way we didnt anticipate as they are prone to do! A DevOps oriented developer or team have a much more stringent focus on these issues and general site reliability. Theyll not only test their code; they will think about failure scenarios and mitigate them before code is even released. Theyll think carefully about detailed testing of their features to minimise the risk of them impacting the broader production system. They will plan and stage their upgrades to de-risk releases, and always have a rollback strategy. They will talk regularly with operations to ensure that they are taking into account their experience with keeping the site available. In short,

Though DevOps is an idea that is finding a lot of success and adoption, most of the enthusiasm and thought leadership appears to come from the Operations side of the fence. This is of course understandable. With Operations teams being on the front line and talking to end users daily, they have an obvious motivation not to upset customers through downtime, and an obvious personal motivation to avoid fire-fighting issues in favour of working on higher value projects. However, as a developer who has always worked at this intersection of the two teams, I feel that developers should also sit up and give more credence to what is coming out of the DevOps community. By opening up communication paths and adopting Operations-like skills and mindsets, we can likely all benefit both as individuals and as teams and in the quality of software that we deliver. Here are some of the reasons why I think this is the case:

DevOps Helps You Improve Your Site Reliability


If your application has downtime, customers wont care if its due to a 10

DevOps rightly places site reliability front and center, and almost all developers will benefit from this mindset. This focus on site reliability might mean that sometimes we churn out fewer pure lines of code in a day, but it means that we move forward more predictably and reliably keeping the system stable and available.

superior coding skills but without the same degree of production awareness. I believe that as a result of DevOps and other trends, this will continue, i.e. that the best developers will increasingly be those who are the most operationally aware, who can code but also have the knowledge, skills and experience to reliably deliver a working production system over the long haul. This is particularly true in these tough and resource constrained economic times. With fewer people having the luxury or saying its not my job, the generalist will get ahead. (I guess for some people, such as those in startups and small companies, it was ever thus, with developers pitching in on operations type stuff such as deployments and upgrades.)

environments should all be in line, and all infrastructure changes should also be versioned and tested alongside the code assets. Doing this well removes so many unknowns and can lead to massive improvements in efficiency and quality of software development.

DevOps Helps You Build Better Software


By being more operationally aware of the production context that our code lives within, developers will also design and build better software. It might be something simple like choosing to add that additional logging statement that you know will make troubleshooting easier later on, or something more complex such as designing a component for horizontal scalability for future growth scenarios. These kind of operationally aware decisions can lead to massive improvements in the net productivity of the team and the quality of their software. Its only by increasing communication with operations teams will we developers learn about these concerns and incorporate them into our designs and everyday coding decisions. Simple things such as joint production incident post-mortems or the inclusion of operations staff in your early design process can help you to move in the right direction. Again, these practises are core to the DevOps philosophy.

DevOps Helps You Manage Modern Infrastructure


DevOps has emerged at a time when cloud hosting, infrastructure as a service, and platform as a service are also reaching widespread adoption. Cloud and PAAS make the hosting environment much more fluid. For instance, over time operations might want to use these platforms to their full potential and scale capacity up or down dynamically. To do that, they will need to be working with development much more closely to work out how to support this in the applications. Because of this, I would argue that developers today need to be more aware of the operational environments in which their applications will operate. Increasingly, we will also find that cloud infrastructure will be managed through software. For instance, the ability to provision new boxes via APIs or deploy applications onto a PAAS. Managing large scale infrastructure in an automated fashion likely to start to look more and more like development work. Development and operations will increasingly start to look like one and the same role. So these are just a few of the reasons why I think developers need to look at DevOps in a lot more detail. Some of this is about a broadening of mindset from my job responsibility is to deliver good code to my job is to deliver and operate a successful system. Others are about acquiring the skills that will actually allow you to do that. With Operations staff then also actively moving towards more of a developer mindset and skillset, DevOps is likely to continue to grow in importance.

DevOps Helps Developers To Own Their Platforms And Infrastructure


A big element of the DevOps movement is the idea of infrastructure as code: that we can define our infrastructure and configuration in descriptive files and metadata, and then be able to test and repeatably deploy that infrastructure and our applications on top of it. This is such a compelling idea with many benefits, and yet developers do not always embrace and own configuration management tools as much as our operations colleagues. By moving towards infrastructure as code and configuration management, developers are given the ability to own and bring under their control the infrastructure that their code runs on. People often say that Apple computers are so reliable because they own the full hardware and software stack. Well, with infrastructure as code and repeatable deploys, developers also get to develop and deploy and own the whole platform on which their software is deployed. It worked on my machine or it worked in QA should be a thing of the past in a mature DevOps team making use of tools such as Vagrant and Puppet, because the development, test, and production 11

DevOps Improves Your Career Prospects


In addition to making you a more well-rounded developer with a focus on production, operations skills such as system administration, monitoring, scripting and change management, and the broader knowledge and experience required to maintain and run complex systems, are genuinely useful for developers to acquire. In most of the software teams and hiring decisions I have been involved in, a developer with this profile would have been more valuable than someone with equivalent coding skills alone.

The State Of The Art


Sandy Walsh @TheSandyWalsh

Monitoring Stack
Based in Nova Scotia, Canada, Alex "Sandy" Walsh is the owner of Dark Secret Software. He has been a senior professional developer for nearly 20 years and a Pythonista for 10 years. He is currently a developer on the OpenStack project with Rackspace. You can learn more about him at sandywalsh.com or follow @TheSandyWalsh.

Last week I had the pleasure of attending the first annual Monitorama conference. This was a conference aimed at advancing the state of open source monitoring and trending software.

For many of us starting in this area, our concept of monitoring consists of top, some Apache, MySQL and application log files, and perhaps an external ping service that tells us when our web site is unavailable. Anything beyond that generally ran into the commercial product realm. We were scared off monitoring by those old monolithic products that required huge licensing fees and armies of professional services people. Thankfully, times have changed, and our application footprint has grown.

No longer are we just deploying web servers and databases. Our application stack starts with our automated testing framework and runs through continuous integration and continuous deployment. Jenkins, Travis, Puppet/Chef, etc... they're all critical. It also includes our deployment partners... that army of SaaS applications we use to make our life easier. Any SaaS solution worth its salt has a status API available for tracking availability. Our monitoring needs are now wide and diverse. My first exposure to the next generation of monitoring tools came with the awesome Etsy post "Measure anything, measure everything".
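Those SaaS status APIs are themselves trivially scriptable. A hedged sketch in Python: the URL and the JSON shape below are hypothetical (each provider has its own format), so the parsing function is the part worth keeping.

```python
import json
from urllib.request import urlopen

def parse_status(payload):
    """Reduce a provider's status JSON to a single up/down answer.

    Assumes a hypothetical payload like {"status": {"indicator": "none"}},
    where any indicator other than "none" means some degree of degradation.
    """
    indicator = payload.get("status", {}).get("indicator", "unknown")
    return indicator == "none"

def check(url):
    """Fetch a status endpoint and return True if the service reports healthy."""
    with urlopen(url, timeout=5) as resp:
        return parse_status(json.load(resp))

# e.g. check("https://status.example-saas.com/api/v2/status.json")
```

Polling a handful of these on a schedule and feeding the results into your own dashboards is a small job compared to what it buys.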

The concept of "Measure Everything" wasn't new to me. I'd been working on StackTach for OpenStack around the same time and understood the value of getting a visual representation of the internals of an application. Even from my old management days we used to say "you can't manage what you can't measure". I lived this with my Google Analytics experiences from running various web sites, and my software development management interests were leaning towards Six Sigma techniques over hand-wavy agile methods. Essentially, numbers are good. But this was giving us a way to apply those same measurement techniques to running software. It was a lens into the black box. Could the days of parsing log files be over?

The first generation of these new monitoring tools included Zenoss, Nagios, RRDtool, Cacti, Munin and Ganglia, to name a few. They were built out of necessity and often have some really nasty warts that people just hate. The latest generation of tools has learned from their mistakes. The Etsy tool chain started with statsd and graphite. This introduced me to the concept of using UDP packets for instrumenting running applications... which was pretty brilliant.

For those unfamiliar with statsd and graphite, here's the flow: your application wants to measure something, so it sends a UDP packet to the statsd service. UDP packets are lossy and unreliable, but fast for large amounts of data (most large video networks send via UDP). statsd is a node.js in-memory data aggregator: it accumulates received data and every so often sends it to graphite. graphite is a Django app that archives the received data and gives a funky web interface for presenting and querying it. There are a number of cool things happening here:

- Adding statsd integration to an existing application is very easy. No special libraries are needed, and sockets are available in nearly all languages.
- Since statsd uses UDP, there is very little risk of the production application crashing if statsd fails. The packets just get lost.
- Since statsd is in-memory, it can process a lot of data very quickly. Rather than take on the task of archiving and disk access, it simply forwards the results to something that can do it better.
- graphite has an easy REST interface, which makes it accessible to technical product managers to create their own dashboards and status reports.
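The "no special libraries" point is easy to demonstrate. Here is an illustrative fire-and-forget statsd-style client in Python; the metric names are invented, and 8125 is statsd's conventional default port:

```python
import socket

def send_metric(metric, value, metric_type="c", host="127.0.0.1", port=8125):
    """Fire-and-forget statsd-style metric over UDP.

    metric_type: "c" for counter, "ms" for timer, "g" for gauge.
    The statsd wire format is "name:value|type".
    """
    payload = f"{metric}:{value}|{metric_type}".encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # sendto never waits for the server; if nothing is listening the
        # packet is simply dropped, so the application being measured is
        # never put at risk by its own instrumentation.
        sock.sendto(payload, (host, port))
    finally:
        sock.close()

# e.g. send_metric("web.requests", 1)          sends "web.requests:1|c"
#      send_metric("db.query_time", 42, "ms")  sends "db.query_time:42|ms"
```

That is the entire integration surface: one datagram per measurement, in whatever language your application happens to be written in.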

Side note: if your application is written in Python and you want to experiment with this stuff without touching your existing code base, have a look at the Tach application. It monkeypatches your Python application and sends the output to statsd or graphite directly. Pretty cool. Although it was originally written for use with OpenStack, it can work with anything.

But the real insight here is a set of atomic, well focused tools that can be put together to create a monitoring stack. The tool chest of the DevOps team just expanded.

As our experience with statsd and graphite grew within the company, we also saw where the monitoring stack failed. A UDP-based approach won't work for billing or auditing. For these scenarios you need a reliable transport for events. In OpenStack we publish event notifications to AMQP queues for consumption by various other tools. These are important events, often with large payloads. When the StackTach application is unavailable these queues can grow very quickly, and we don't want to drop events. This is manageable for something like OpenStack Compute, but other applications like Storage produce an incredible amount of data across a wide range of servers. Using a notification-based system would be difficult. Instead we needed to look at syslog-based archiving and processing solutions. The new monitoring stack offers tools like Logstash and, in the OpenStack case, Slogging.

Then there is the post-processing. To add value to the raw events we often need to apply other functions to the data, such as time series averaging. This can be tricky: we need to wait for all the collected data to arrive before we can start the post-processing, and we may need to ensure proper ordering. Historically this would be done with cron jobs and batch processing, but the new monitoring stack includes tools like Riemann, which can do this post-processing inline.
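The inline style of post-processing can be pictured as a stream fold rather than a batch job. This is a toy sketch, not how Riemann is actually implemented: a moving average that updates on every sample instead of waiting for a nightly cron run over the raw events.

```python
from collections import deque

class MovingAverage:
    """Inline (streaming) average over the last `window` samples.

    Each incoming sample updates the result immediately -- no batch job
    has to gather and re-read all the collected data first.
    """
    def __init__(self, window=5):
        # deque with maxlen drops the oldest sample automatically
        self.samples = deque(maxlen=window)

    def add(self, value):
        self.samples.append(value)
        return self.average()

    def average(self):
        return sum(self.samples) / len(self.samples)

stream = MovingAverage(window=3)
for latency_ms in (10, 20, 60, 30):
    current = stream.add(latency_ms)  # updated on every event
```

The trade-off the article mentions still applies: an inline fold has to tolerate late or out-of-order events, which batch processing sidesteps by waiting.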

It seems evident that Nagios isn't going anywhere any time soon, but there are other tools offering alternatives, such as Shinken and Sensu. Recently our team has been working on bringing what we've learned with StackTach to the OpenStack-blessed monitoring solution called Ceilometer. Without standing back and looking at the larger monitoring community, it would have been very easy to recreate an entire monitoring stack on our own. But now it's clear that we can focus on the minimal set of missing functionality and augment that from an already powerful set of tools. This is a very attractive proposition for one simple reason: the project has an end in sight. There are lots of fun problems out there to tackle, and knowing you don't have to reinvent the wheel is very compelling.

There is a cost, though. The monitoring stack today consists of a variety of tools, all written in different languages and each with different care and feeding instructions. One could argue that the workload on operations will only increase by mixing and matching. My knee-jerk reaction is to agree, but I know that the greater win is to get familiar with all of these new tools. In production, these monitoring tools need monitoring as well. So we may have to monitor Java, Ruby, Python and C# VMs running bytecode from a variety of languages. If this all seems too daunting, perhaps the hosted offerings are a better choice for you. For nearly every open source offering there is a hosted equivalent: look at Loggly, Papertrail, PagerDuty, Librato, Datadog, Hosted Graphite, Boundary, New Relic, etc.

This brings me back to Monitorama. The Monitorama conference had a format that worked very well for me, for the following reasons:

- It was only two days long.
- Day One focused on hearing about the state of the art from industry leaders.
- Day Two was tactical with the tools, and included a hackathon which let you understand where the real-world pain lived in each of these components.
- It was small enough that you could actually talk to people and have meaningful conversations.

The Day One talks made it clear that Alert Fatigue (a term borrowed from the medical industry) is a big problem: too many alerts hitting our inboxes, some important, most noise. There are people working on it, but it's perhaps the biggest source of angst for operations currently.

Side story: for the hackathon I started work on a tool that allowed members of the company to track external events that might affect production. Things like sales events, big holidays, new customer deployments, or internal events such as new code deployments, hardware upgrades, etc. The idea was to have these events show up against the spikes in the dashboard graphs, so we could say "that spike was due to Foo, and that ravine was due to Blah". I made some good progress for the day, and then one of the other attendees showed me his side project, Anthracite, which does all this and more. The author was sitting in the room next to me. What are the odds?

For a while I was getting disillusioned with this space, because I saw it dominated by commercial solutions, or felt the problem was so big it would be a lifetime of work to build as open source. But now I see there are viable open source components, and enough of the stack is available that we can focus on some of the smaller missing pieces. Also, there is a smart community out there facing the exact same problems and actively working on solutions. There is a light at the end of the tunnel. I may not attend Monitorama EU, but I will definitely attend the US one next year. But for now, I've got some products to learn.
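The annotation idea from that hackathon tool reduces to a nearest-neighbour lookup in time. A toy sketch, with the event names and timestamps invented for illustration:

```python
def annotate_spikes(spikes, events):
    """Label each metric spike with the closest known external event.

    spikes: list of timestamps (e.g. seconds since epoch)
    events: list of (timestamp, label) pairs
    Returns a dict mapping each spike timestamp to the nearest label.
    """
    annotations = {}
    for spike in spikes:
        # pick the event whose timestamp is closest to the spike
        ts, label = min(events, key=lambda ev: abs(ev[0] - spike))
        annotations[spike] = label
    return annotations

# Hypothetical data: a deploy at t=100, a sales event starting at t=500.
events = [(100, "code deploy"), (500, "sales event")]
spikes = [104, 512]
labels = annotate_spikes(spikes, events)
```

A real tool would add a time window (an event from last month should not label today's spike), but the core correlation is this small.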


The benchmark you're reading is probably wrong
RethinkDB Team @RethinkDb

The RethinkDB team is working on a scalable, open-source, distributed document database system that features a pleasant query language, parallelized architecture, and table joins. You can learn more at rethinkdb.com.

Mikeal Rogers wrote a blog post on MongoDB performance and durability. In one of the sections, he writes about the request/response model, and makes the following statement: "MongoDB, by default, doesn't actually have a response for writes." In response, one of the 10gen employees (10gen is the company behind MongoDB) made the following comment on Hacker News: "We did this to make MongoDB look good in stupid benchmarks."

The benchmark in question shows a single graph, which demonstrates that MongoDB is 27 times faster than CouchDB at inserting one million rows. At first glance, the benchmark immediately looks silly if you've ever done serious benchmarking before. The CouchDB people are smart, inserting such a small number of elements is a relatively simple feature, and it's almost certain that either they would have fixed something that simple, or they had a very good reason not to (in which case the benchmark is likely measuring apples and oranges).

Let's do some back-of-the-envelope math. Round-trip latency on a commodity network for a small packet can range from 0.2ms to 0.8ms. A single rotational drive can do 15000 RPM / 60sec = 250 operations per second (resulting in close to 5ms latency in practice), and a single Intel X25-M SSD drive can do about 7000 write operations per second (resulting in close to 0.15ms latency). The benchmark demonstrates that CouchDB takes an average of 0.5ms per document to insert one million documents, while MongoDB does the same in 0.01ms. Clearly the rotational drives are too slow to play a part in the measurement, and the SSD drives are probably too fast to matter for CouchDB and too slow to matter for MongoDB. However, CouchDB appears to be awfully close to commonly encountered network latencies, while MongoDB inserts each document 50 times faster than commodity network latency.

At first observation, it appears likely that the CouchDB client library is configured to wait for the socket to receive a response from the database server before sending the next insert, while the MongoDB client is configured to continue sending insert requests without waiting for a response. If this is true, the benchmark compares apples and oranges and tells you absolutely nothing about which database engine is actually faster at inserting elements. It doesn't measure how fast each engine handles insertion when the dataset fits into memory, when the dataset spills onto disk, or when there are multiple concurrent clients (which is a whole different can of worms).
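The arithmetic above is the whole argument, so it is worth spelling out. A short sanity check, using only the figures quoted in the post:

```python
# Back-of-the-envelope latency budget (all figures from the post above).
rotational_ops_per_sec = 15000 / 60        # 15000 RPM drive -> 250 ops/sec
rotational_latency_ms = 1000 / rotational_ops_per_sec  # 4ms, ~5ms in practice

ssd_ops_per_sec = 7000                     # Intel X25-M write ops/sec
ssd_latency_ms = 1000 / ssd_ops_per_sec    # ~0.15ms

network_rtt_ms = (0.2, 0.8)                # commodity network round trip

couchdb_per_doc_ms = 0.5                   # measured: 1M docs inserted
mongodb_per_doc_ms = 0.01

# CouchDB's per-document time sits inside the network round-trip range:
# consistent with waiting for a response to every insert.
assert network_rtt_ms[0] <= couchdb_per_doc_ms <= network_rtt_ms[1]

# MongoDB's per-document time is far below any possible round trip:
# it cannot be waiting for responses at all.
assert mongodb_per_doc_ms < network_rtt_ms[0] / 10
```

Both assertions hold, which is exactly the mismatch that suggests the two clients were doing different work.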

It doesn't even begin to address the more subtle issues of whether the potential bottlenecks for each database might reside in the virtual memory configuration, the file system, the operating system I/O scheduler, or some other part of the stack, because each database uses each of these components slightly differently. What the benchmark likely measures is something that is never mentioned: the latency of the network stack for CouchDB, and something entirely unrelated for MongoDB.

Unfortunately, most benchmarks published online have similarly crucial flaws in their methodology, and since many people make decisions based on this information, software vendors are forced to modify the default configuration of their products to look good on these benchmarks. There is no easy solution: performing proper benchmarks is very error-prone, time-consuming work. It's good to be very skeptical about benchmarks that show a large performance difference but don't carefully discuss the methodology and potential pitfalls. As Brad Pitt's character says at the end of Inglourious Basterds:

"Long story short, we hear a story too good to be true, it ain't."


This blog post originally appeared on rethinkdb.com/blog in July 2010.

Sign up at DevOpsFriday.com for next weeks issue!
