What to Look for in Load Test Reporting: Six Tips for Getting the Data you Need

Looking at graphs and test reports can be a daunting task: Where should I begin? What should I be looking for? How is this data useful or meaningful? Here are some tips to steer you in the right direction when it comes to managing load test results.

For example, the graph (above) shows how the load times (blue) increase [1] as the service reaches its maximum bandwidth (red) limit [2], and subsequently how the load time increases even more as bandwidth drops [3]. The latter phenomenon occurs due to 100% CPU usage on the app servers.

When analyzing a load test report, here are the types of data to look for:

  • What does the user scenario design look like? How much time is allocated within each scenario? Are the simulated users geographically distributed?

  • Test configuration settings: is it ramp-up only or are there different steps in the configuration?

  • When looking at the test results, do you see an exponentially growing (x²) curve? Or an initial downward trend that plateaus (a linear, straight line) before diving down drastically?

  • What do the bandwidth and requests-per-second curves look like?

  • For custom reporting and post-test management, can you export your test results to CSV format for further data extraction and analysis?

Depending on the layout of your user scenarios, how much time is spent within a particular user scenario across all actions (calculated from the total amount of sleep time), and how the users are geographically distributed, you will likely end up looking at different metrics. However, below are some general tips to ensure you’re getting and correctly interpreting the data you need.

Tip #1: In cases of very long user scenarios, it would be better to look at a single page or object rather than the “user load time” (i.e. the time it takes to load all pages within a user scenario excluding sleep times).

Tip #2: Even though “User Load Time” is a good indicator for identifying problems, it is better to dig in deeper by looking at individual pages or objects (URL) to get a more precise indication of where things have gone wrong. It may also be helpful to filter by geographic location as load times may vary depending on where the traffic is generated from.

Tip #3: If you have a test configuration with a constant ramp-up and during that test the load time suddenly shoots through the roof, this is a likely sign that the system got overloaded a bit earlier than the results show. To gain a better understanding of how your system behaves under a certain amount of load, apply different steps in the test configuration that allow the system to calm down for approximately 15 minutes. By doing so, you will obtain more, higher-quality samples for your statistics.

Tip #4: If you notice load times increasing and then suddenly dropping, your service might be delivering errors with “200 OK” responses, which would indicate that something in your system may have crashed.
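
One way to catch this during the test itself is to inspect response bodies for error markers even when the status code is 200. The Lua sketch below (in the style of Load Impact load scripts, covered later in this post) is only an illustration: the status_code and body fields, the response_body_bytes option and log.error are assumptions about the scripting API, and the URL and error string are made up, so adjust everything to your own application and the API reference.

-- Sketch: flag responses that return 200 OK but carry an error page.
-- Field and function names (status_code, body, log.error) are assumed;
-- verify them against the scripting API documentation.
local responses = http.request_batch({
    -- response_body_bytes asks the tool to capture (part of) the body
    { "GET", "http://loadimpact.com/checkout", response_body_bytes = 4096 }
})

for _, response in ipairs(responses) do
    if response.status_code == 200 and response.body ~= nil and
       response.body:find("Internal Server Error", 1, true) then
        -- Surface the hidden failure so it shows up in the test results
        log.error("Got 200 OK, but the body looks like an error page")
    end
end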

Tip #5: If you get an exponential (x²) curve, you might want to check on the bandwidth or requests-per-second. If it’s decreasing or not increasing as quickly as expected, this would indicate that there are issues on the server side (e.g. front end/app servers are overloaded). Or if it’s increasing to a certain point and then plateaus, you probably ran out of bandwidth.

Tip #6: To easily identify the limiting factor(s) in your system, you can add a Server Metrics Agent, which reports performance metrics from your servers. You can also export or download the complete test data, containing all the requests made during the test along with the aggregated data, and then import it into MySQL (or whichever database you prefer) for further querying.
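
As an illustration of the kind of post-test analysis this enables, here is a small standalone Lua sketch that aggregates an exported CSV file. The file name and the "url,load_time_ms" column layout are hypothetical; adapt the pattern to the columns your export actually contains, or load the same data into a database and run the equivalent GROUP BY query.

-- Standalone Lua sketch: average load time per URL from an exported CSV.
-- The "url,load_time_ms" layout is hypothetical; adjust the pattern to
-- the real export format.
local totals, counts = {}, {}

for line in io.lines("exported_results.csv") do
    local url, load_time = line:match("^([^,]+),([%d%.]+)%s*$")
    if url and load_time then
        totals[url] = (totals[url] or 0) + tonumber(load_time)
        counts[url] = (counts[url] or 0) + 1
    end
end

for url, total in pairs(totals) do
    print(string.format("%-40s avg load time: %.1f ms", url, total / counts[url]))
end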

In a nutshell, the ability to extract information from load test reports allows you to understand and appreciate what is happening within your system. To reiterate, here are some key factors to bear in mind when analyzing load test results:

  • Check Bandwidth

  • Check load time for a single page rather than user load time

  • Check load times for static objects vs. dynamic objects

  • Check the failure rate

  • For Server Metrics – check CPU and Memory usage status


This article was written by Alex Bergvall, Performance Tester and Consultant at Load Impact. Alex is a professional tester with extensive experience in performance testing and load testing. His specialities include automated testing, technical function testing, functional testing, creating test cases, accessibility testing, benchmark testing, manual testing, etc.

Twitter: @AlexBergvall

New Load Script APIs: JSON and XML Parsing, HTML Form Handling, and more!

Load scripts are used to program the behavior of simulated users in a load test. Apart from the native functionality of the Lua language, load script programmers can also use Load Impact’s load script APIs to write their advanced load scripts.

Now you can script your user scenarios in the simple but powerful language Lua, using our programmer friendly IDE and new APIs such as: JSON and XML parsing, HTML form handling, Bit-fiddling, and more.
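
For example, a user scenario might parse a JSON response and reuse a value from it in a follow-up request. The sketch below is only an illustration: json.parse, the response_body_bytes option and the body field are assumed API names to be checked against the documentation, and the URLs and the token field are made up.

-- Sketch: parse a JSON response and reuse a value in the next request.
-- json.parse, response_body_bytes and the body field are assumed API
-- names; the URLs and the "token" field are purely illustrative.
local responses = http.request_batch({
    { "GET", "http://loadimpact.com/api/session", response_body_bytes = 10240 }
})

local data = json.parse(responses[1].body)
if data ~= nil and data.token ~= nil then
    -- Use the extracted token in a follow-up request
    http.request_batch({
        { "GET", "http://loadimpact.com/api/profile?token=" .. data.token }
    })
end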


Automated Acceptance Testing with Load Impact and TeamCity (New Plugin)


As you know, Continuous Integration (CI) is used by software engineers to merge multiple developers’ work several times a day. And load testing is how companies make sure that code performs well under normal or heavy use.

So, naturally, we thought it wise to develop a plugin for one of the most widely used CI servers out there – TeamCity by JetBrains. TeamCity is used by developers at a diverse set of industry leaders around the world – from Apple, Twitter and Intel, to Boeing, Volkswagen and Bank of America. It’s pretty awesome!

The new plugin gives TeamCity users access to multi-source load testing from up to 12 geographically distributed locations worldwide, advanced scripting, a Chrome Extension to easily create scenarios simulating multiple typical users, and Load Impact’s Server Metrics Agent (SMA) for correlating the server-side impact of testing – like CPU, memory, disk space and network usage.

Using our plugin for TeamCity makes it incredibly easy for companies to add regular, automated load tests to their nightly test suites and, as a result, get continuous feedback on how their evolving code base is performing. Any performance degradation or improvement is detected immediately when the code that causes it is checked in, which means developers always know if their recent changes were good or bad for performance – they’re guided to writing code that performs well.

 

Here’s how Load Impact fits in the TeamCity CI workflow:

 


To get started, follow this guide for installing and configuring the Load Impact plugin for TeamCity.

Countdown of the Seven Most Memorable Website Crashes of 2013

Let this be a lesson to all of us in 2014. 

Just like every other year, 2013 had its fair share of website crashes. While there are many reasons why a website might fail, the most likely issue is the site’s inability to handle incoming traffic (i.e. load).

Let’s look at some of the most memorable website crashes of 2013 that were caused by traffic overload.

#7. My Bloody Valentine

On February 2nd, the obviously not-so-alternative shoegaze legends My Bloody Valentine decided to release their first album since 1991, and they decided to do so online. Their site crashed within 30 minutes.

In the end, most of their fans likely got hold of the new album within a day or two, and the band, which clearly has a loyal fanbase, probably didn’t end up losing any sales due to the crash.

#6. Mercedes F1 Team 

The Mercedes F1 team came up with a fairly clever plan to promote their web content. In February, they told fans on Twitter that the faster they retweeted a certain message, the faster the team would reveal sneak preview images of their 2013 Formula One race car.

It worked a little too well. While waiting for the magic number of retweets to happen, F1 fans all over the world kept accessing the Mercedes F1 web page in hopes of being the first to see the new car. Naturally, they brought the website down.

“You guys are LITERALLY killing our website!” Mercedes F1 said via Twitter.

#5. NatWest / Royal Bank of Scotland

Mercedes F1 and My Bloody Valentine likely benefited from the PR created by their respective crashes, but there was certainly nothing positive to come out of the NatWest/RBS bank website crash – a crash which left customers without access to their money!

In December, NatWest/RBS saw the second website crash in a week when a DDOS attack took them down.

It’s not the first DDOS attack aimed at a bank and it’s probably not the last one either.

#4. Sachin Tendulkar crash

One of India’s most popular cricketers, Sachin Tendulkar, also known as the “God of Cricket”, retired in 2013 with a bang! He did so by crashing the local ticketing site, kyazoonga.com.

When tickets for his farewell game at Wankhede in Mumbai became available, kyazoonga.com saw a record breaking 19.7 million hits in the first hour, after which the website was promptly brought down.

Fans were screaming in rage on Twitter and hashtag #KyaZoonga made it to the top of the Twitter trending list.

#3. UN Women – White Ribbon campaign 


It may be unfair to say that this website crash could have been avoided, but it’s definitely memorable.

On November 25th – the International Day for the Elimination of Violence against Women – Google wanted to acknowledge the occasion by linking to the UN Women website from the search giant’s own front page.

As a result, the website started to see a lot more traffic than it had been designed for and started to load slowly, eventually crashing entirely.

Google had given the webmasters at unwomen.org a heads up and the webmasters did take action to beef up their capacity, but it was just too difficult to estimate how much traffic they would actually get.

In the end, the do-no-evil web giant and unwomen.org worked together and managed through the day, partly by redirecting the link to other UN Websites.

Jaya Jiwatram, the web officer for UN Women, called it a win. And frankly, that’s all that really matters when it comes to raising awareness for important matters.

#2. The 13 victims of Super Bowl XLVII

Coca-Cola, Axe, SodaStream and Calvin Klein had their hands full during Super Bowl XLVII – not so much serving online visitors as running around looking for quick fixes for their crashed websites.

As reported by Yottaa.com, no fewer than 13 of the companies that ran ads during the Super Bowl saw their websites crash just as they needed them the most.

If anything in this world is ever going to be predictable, a large spike of traffic when you show your ad to a Super Bowl audience must be one of those things.

#1. healthcare.gov

The winner of this countdown shouldn’t come as a surprise to anyone. Healthcare.gov came crashing down before it was even launched.

It did recover quite nicely in the last weeks of 2013 and is now actually serving customers. If not exactly as intended, at least well enough for a total of 2 million Americans to enroll.

But without hesitation, the technical and political debacle surrounding healthcare.gov makes it the most talked about and memorable website crash in 2013.

Our friends over at PointClick did a great summary of the Healthcare.gov crash. Download their ebook for the full recap: The Six Critical Mistakes Made with Healthcare.gov

There’s really nothing new or surprising about the website crashes of 2013. Websites have been developed this way for years – often with the same results. But there are now new methodologies and tools changing all that.

It isn’t like it used to be; performance testing isn’t hard, time consuming or expensive anymore. One just needs to recognize that load testing needs to be done early and continuously throughout the development process. It’s not optional anymore. Unfortunately, it seems these sites found that out the hard way, and a few of them will likely learn the lesson again in 2014.

Our prediction for 2014 is more of the same. However, mainstream adoption of development methodologies such as Continuous Integration and Delivery, which advocate early and continuous performance testing, is quickly gaining speed.

A Google search trend report for the term “DevOps” clearly shows the trend. If the search trends are any indication of the importance being given to proactive performance testing by major brands, app makers and SaaS companies, we might see only half as many Super Bowl advertiser site crashes in 2014 as we did last year.


Update following Super Bowl XLVIII: According to GeekBeat, the Maserati website crashed after their ad featuring the new Maserati Ghibli aired. And monitoring firm OMREX found that two of the advertiser websites had uptime performance issues during the game – Coca-Cola and Dannon Oikos.

New Pay-Per-Test Credits ($1 = 1 Credit)

We recently made two big changes to our pricing.

1. We released monthly subscriptions

2. We converted our test credits to a $1=1 credit model.

If you already had credits in your account before this change, you will notice that the total amount of credits in your account has increased.

Don’t worry, we aren’t a failed state. Your new credit count buys you just as much load testing as you could do before.

We simply wanted to simplify the process of purchasing a single test. Now you can get an exact dollar price for the test you want to run. Easy-peasy-lemon-squeezy!

The price for a single test is based on two factors – load level and test duration.

Head over to our pricing page to see how it works.


Configuring a load test with multiple user scenarios

We recently had a great question come in from one of our customers that we thought we would share.

Question: Planning to run a test with 10,000 concurrent users spanning 4 or 5 user scenarios. How do I configure a test to run with, say, 35% of the load running user scenario 1, 35% running user scenario 2, 10% running user scenario 3, etc.?

And, when running multiple scenarios, where each scenario consists of 2 or more pages, how can we see the performance (load time) of each page in each scenario?

Answer: Assigning a certain amount of the simulated users to each user scenario is something you do in the “Test configuration” section.

Just scroll down the page to the section called “User scenarios”, then click the “Add scenario” button to add a new user scenario to the test. When you have all the scenarios you want added, you can fiddle with the percentages to get the exact load allocation for each scenario that you want.


The load time of each page in a user scenario can be collected if you use the http.page_start() and http.page_end() functions inside the user scenario script. Read more about that here and here.

Example: page metrics

-- Log page metric
http.page_start("My page")
responses = http.request_batch({
    { "GET", "http://loadimpact.com/" },
    { "GET", "http://loadimpact.com/style1.css" },
    { "GET", "http://loadimpact.com/image1.jpg" },
    { "GET", "http://loadimpact.com/image2.jpg" }
})
http.page_end("My page")

Using the above script as a user scenario would result in a plot-able page load time metric for a page called “My page”. The name of the page can be changed to whatever you want.
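
The same pattern extends to scenarios with two or more pages, as in the question above: wrap each page in its own page_start/page_end pair and each one gets its own plottable metric. In the sketch below, the page names and URLs are made up, and client.sleep() as the think-time call is an assumption to verify against the API documentation.

-- Sketch: a two-page user scenario where each page gets its own metric.
-- Page names and URLs are made up; client.sleep() as the think-time
-- call is an assumption to check against the API docs.
http.page_start("Start page")
http.request_batch({
    { "GET", "http://loadimpact.com/" }
})
http.page_end("Start page")

client.sleep(math.random(5, 15))  -- simulated user think time between pages

http.page_start("Pricing page")
http.request_batch({
    { "GET", "http://loadimpact.com/pricing/" }
})
http.page_end("Pricing page")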

5 Websites That Crashed This Holiday That Could Have Been Avoided

’Twas the season to deliver a seamless online user experience and bring under-two-second response times to shoppers looking for the best pre- and post-Christmas sales. Except that it wasn’t. At least not for the following five companies.

Every Christmas, e-commerce, ticketing, flash-sale and other online businesses prepare themselves to meet the demands of expected visitor traffic. Most fare exceptionally well because they take the necessary precautions and run realistic load and performance tests well in advance.

Yet, as more and more traditionally offline services move online and consumer demand for faster response times increases, the post-mortems on websites that crash during the holiday rush draw ever more media attention.

The increasing media attention is also due in part to the fact that innovation in performance testing has dramatically reduced its cost, and the proliferation of cloud-based tools makes testing services accessible to every website owner within just a few minutes. Basically, there is really no excuse for crashing.

Here’s a recap of some of the websites that crashed on our watch this holiday. We definitely didn’t catch all of them, so please do share your stories in the comment section below. Moreover, as we are a Swedish based company, many examples are from Sweden. Feel free to share examples from your countries.

1. December 4th, Wal-Mart.com:


Wal-Mart went down for a brief period, about an hour, on December 4th. Admittedly, they did claim to have had over 1 billion views between Thanksgiving and Cyber Monday and to have had the best online sales day ever on Cyber Monday.

So, despite the brief downtime, we’ll give it to Wal-Mart. They did have a pretty massive load to bear, and if anyone can take it and recover that quickly, it’s probably them.

Read more 

2. December 16th, Myer.com.au:


On Boxing Day, Australia’s largest department store group, Myer, suffered technical difficulties that prevented online purchases during the biggest shopping day of the season.

According to the media, Myer has pumped tens of millions of dollars into improving its website over the years. Despite boosting its technology, this isn’t their first crash during peak shopping periods. They also crashed in June when heavy customer traffic triggered a website failure half an hour after the start of the annual stocktaking sale.

Although Myer is pushing an omni-channel strategy and hoping to boost its online sales in the long-term, the website is only responsible for about 1% of the company’s business today.

Although online sales may not make up a significant part of the business today, it would be wise not to deny the impact these constant crashes probably have on the successful implementation of an omni-channel strategy. Yet this is how Myer CEO Mr. Brookes seems to be behaving, judging by the odd statement he made about the recent Boxing Day crash.

“There will be no impact at all on our profitability or our overall sales”

Sure Mr. Brookes, if you say so.

Read more

3. December 25th, Siba.se:

The day after Christmas, Siba – one of Sweden’s largest electronics dealers – crashed due to overwhelming visitor traffic. This in turn led to a social media storm of customers complaining that the site was down.

As a courtesy to those who were not able to access the site, Siba directed visitors to its sales catalogue, saying: “Oops, at the moment there is a lot of traffic on the site, but you can still read our latest catalogue and stay up to date through our Facebook page”.

Thanks Siba, reading about the sales I’m missing out on is totally the same as taking advantage of them.

4. December 29th, SF.se 

In the period between Christmas and New Year’s, SF – Sweden’s largest movie theatre chain – suffered continuous day-long crashes and delays. This left many people unable to fill those long cold days, when not much else is going on, with a cozy few hours at the cinema. In fact, these “mellandagarna” (the days between Christmas and New Year’s) are the busiest movie-going days of the entire year.

Needless to say, people were very frustrated. Particularly because SF has a monopoly, and if they go down there is pretty much nowhere else to turn to get your cinema fix.

Read more

5. January 1st, Onlinepizza.se:

For the third New Year’s Day in a row, Onlinepizza.se crashed due to heavy user load. This may seem trivial to some, but to Swedes it’s devastating. That’s because on New Year’s Day, Swedes eat pizza. It’s just what they do.

So, despite the 30,000 or so pizzas sold that day through Onlinepizza.se, many hungry Swedes were forced to brave the cold and wind and buy their pizza the old-fashioned way – in a pizzeria.

Read more

Some of the holiday website crashes described above are bearable; most of us can go without buying another electronic device or pair of shoes for at least a few days. But not being able to cozy up in a warm cinema on days when it’s too cold to go outside and nothing else in the city is open is just disappointing. As is not getting a home-delivered pizza when you simply can’t stuff another leftover Swedish meatball down your throat.

Make Scalability Painless, by First Identifying your Pain Points

This post was originally written for SD Times.

With many, if not most, applications, it is common that a very small part of the code is responsible for nearly all of the application response time. That is, the application will spend almost all of its time executing a very minor part of the code base.

In some cases, this small part of code has been well optimized and the application is as fast as can reasonably be expected. However, this is likely the exception rather than the rule.

It might also be that the real delay happens in external code – in a third-party application you depend on.

Regardless of where a performance bottleneck lies, half of the work in fixing it (or working around it) is usually spent identifying where it’s located.

Step 1: Understand how your backend is being utilized.

One of the first things you must do to identify your pain points is to understand how your backend is being utilized.

For example, if your application backend functionality is exposed through a public API that clients use, you will want to know which API functions are being called, and how often and at what frequency they are being called.

You might also want to use parameter data for the API calls that is similar to what the application sees during real usage.

Step 2: Combine performance testing with performance monitoring to locate bottlenecks. 

The second, and more important, step to take is to combine performance testing with performance monitoring in order to nail down where the problems lie.

When it comes to performance testing, it’s usually a matter of experimenting until you find the point at which things either start to fall apart, often indicated by transaction times suddenly increasing rapidly, or just stop working.

When you run a test and reach the point at which the system is clearly under stress, you can then start looking for the bottleneck(s). In many cases, the mere fact that the system is under stress can make it a lot easier to find the bottlenecks.

If you know or suspect your major bottlenecks to be in your own codebase, you can use performance monitoring tools to find out exactly where the code latency is happening.

By combining these two types of tools – performance testing and performance monitoring – you will be able to optimize the right parts of the code and improve actual scalability.

Let’s use an example to make this point clear.

Let’s say you have a website that is accessed by users using regular web browsers. The site infrastructure consists of a database (SQL) server and a web server. When a user accesses your site, the web server fetches data from the database server, then it performs some fairly demanding calculations on the data before sending information back to the user’s browser.

Now, let’s say you’ve forgotten to set up an important database table index – a pretty common performance problem experienced with SQL databases. In this case, if you only monitor your application components – the physical servers, the SQL server and the web server – while a single user is accessing your site, you might see that the database takes 50 ms to fetch the data and the calculations performed on the web server take 100 ms. This may lead you to start optimizing your web server code because it looks as if that is the major performance bottleneck.

However, if you submit the system to a performance test which simulates a large number of concurrent users with, let’s say, ten of those users loading your web site at exactly the same time, you might see that the database server now takes 500 ms to respond, while the calculations on the web server take 250 ms.

The problem in this example is that your database server has to perform a lot of disk operations because of the missing table index, and those scale linearly (at best) with increased usage because the system has only one disk.

The calculations, on the other hand, are each run on a single CPU core, which means a single user will always experience a calculation time of X (as fast as a single core can perform the calculation), but multiple concurrent users will be able to use separate CPU cores (often 4 or 8 on a standard server) and experience the same calculation time, X.
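
To make the arithmetic concrete, here is a tiny back-of-envelope model in plain Lua (not a load script) using the assumed numbers from this example: disk-bound database time growing linearly with concurrent users, and CPU-bound calculation time staying flat until the users outnumber the cores. It deliberately ignores caching and every other real-world effect.

-- Back-of-envelope model of the example above (assumed numbers, not
-- measurements): a single disk makes DB time grow roughly linearly with
-- users, while calculations stay flat until users outnumber the cores.
local db_time_single   = 50    -- ms for one user (missing index, disk-bound)
local calc_time_single = 100   -- ms for one user (one CPU core)
local cores            = 4

for _, users in ipairs({ 1, 4, 10, 50 }) do
    local db_time   = db_time_single * users
    local calc_time = calc_time_single * math.max(1, users / cores)
    print(string.format("%3d concurrent users: db ~%.0f ms, calculations ~%.0f ms",
                        users, db_time, calc_time))
end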

Another potential scalability factor could be if calculations are cached, which would increase scalability of the calculations. This would allow average transaction times for the calculations to actually decrease with an increased number of users.

The point of this example is that, until you submit a system to real heavy traffic, you have really no idea how it will perform when lots of people are using the system.

Put bluntly, optimizing the parts of the code you identified as performance bottlenecks under monitoring alone may end up being a total waste of time. It’s the combination of monitoring and testing that will deliver the information you need to properly scale.

By: Ragnar Lönn, CEO, Load Impact

The Load Impact Session Recorder – Now Available as a Chrome Extension!

Start a load test with just a few clicks. Record all HTTP traffic and use the recordings to simulate real user traffic under realistic load.

The Load Impact Chrome extension will capture everything – every single thing being loaded into the browser as you click – including ads, images, documents, etc., so you get a far more accurate read of what’s going on.

Just press “record”, start browsing and when complete, the script will automatically upload to your Load Impact account.

Here’s how it works:


With the help of our Chrome extension, you can run up to 10 different user scenarios in each load test and simulate up to 1.2 million concurrent users. You can also run multiple user scenarios simultaneously from up to 10 different geographic regions in a single test (powered by Amazon and Rackspace).

Until now our session recorder required developers to go to our website and manually change the proxy settings in the browser or operating system to perform a recording. That was a bit of a hassle, and the proxy solution sometimes caused problems with SSL certificates.

The extension now automates the entire process, from recording traffic in a specific browser tab, to stopping, saving and sending the script to your Load Impact account for future use.

The Chrome extension is available free of charge from the Google Chrome Web Store and is easily ported to the Safari and Opera browsers.  An extension for the Firefox browser is planned for release early next year.

To use the Chrome extension, you will need to register for a Load Impact account at loadimpact.com.

How did the Obama Administration blow $400M making a website?

By doing software development and testing the way it’s always been done.

There is nothing new in the failure of the Obamacare site. Silicon Valley has been doing it that way for years. However new methodologies and tools are changing all that.

There has been a huge amount of press over the past several weeks about the epic failure of the Obamacare website. The magnitude of this failure is nearly as vast as the righteous indignation laid at the feet of the administration about how this could have been avoided if only they had done this or that. The subtext is that this was some sort of huge deviation from the norm. The fact is, nothing could be further from the truth. In fact, there should be a sense of déjà-vu-all-over-again around this.

The record of large public sector websites is one long case study in epic IT train wrecks.

In 2012 the London Olympic Ticket web site crashed repeatedly and just this year the California Franchise Tax Board’s new on-line tax payment system went down and stayed down – for all of April 15th.

So, this is nothing new.

As the Monday morning quarterbacking continues in the media, one of my favorite items was a CNN segment declaring that had this project been done in the lean, mean tech mecca that is Silicon Valley, it all would have turned out differently because of the efficiency that we who work here are famous for. And as someone who has been making online software platforms in the Bay Area for the past decade, I found that an interesting argument, and one worth considering and examining.

Local civic pride in my community and industry generates a sort of knee-jerk reaction. Of course we would do it better/faster/cheaper here. However, if you take a step back and really look honestly at how online Software as a Service (SaaS) has been built here over most of the past 20 or so years that people have been making websites, you reach a different conclusion. Namely, it’s hard to fault the Obama Administration. They built a website in a way that is completely in accordance with the established ways people have built and tested online software platforms for most of the past decade in The Valley.

The only problem is it doesn’t work.  Never has.

The problem then isn’t that they did anything out of the ordinary. On the contrary. They walked a well-worn path right off a cliff very familiar to the people I work with. However, new methodologies and tools are changing that. So the fault is that they didn’t see the new path and take that instead.

I’d like to point out from the start that I’ve got no special knowledge about the specifics of HealthCare.gov. I didn’t work on this project. All I know is what I’ve read in the newspapers. So, starting with that premise, I took a dive into a recent New York Times article with the goal of comparing how companies in The Valley have faced similar challenges, and how they would be dealt with using the path not taken: modern, flexible (Agile, in industry parlance) software development.

Fact Set:

  • $400 million
  • 55 contractors
  • 500 million lines of code 

$400 million — Let’s consider what that much money might buy you in Silicon Valley. By December of 2007, Facebook had taken in just under $300 million in investment and had over 50 million registered users — around the upper end of the number of users that the HealthCare.gov site would be expected to handle. That’s big. Comparisons between the complexity of a social media site and a site designed to compare and buy health insurance are imperfect at best. Facebook is a going concern and arguably a much more complex bit of technology. But it gives you the sense that spending that much to create a very large scale networking site may not be that extravagant. Similarly, Twitter had raised approximately $400 million by 2010 to handle a similar number of users. On the other hand eBay, a much bigger marketplace than HealthCare.gov will ever be, only ever asked investors for $7 million in funding before it went public in 1998.

55 contractors — If you assume that each contractor has 1,000 technical people on the project, you are talking about a combined development organization about the size of Google (54,000 employees according to their 2013 Q3 statement) for HealthCare.gov. To paraphrase the late Sen. Lloyd Bentsen: ‘I know Google, Google is a friend of mine, and let me tell you… you are no Google.’

500 million lines of code – That is a number of astronomical proportions. It’s like trying to imagine how many matches laid end to end would reach the moon (that number is closer to 15 billion, but 500 million matchsticks will take you around the earth once). Of all the numbers in here, that is the one that is truly mind-boggling. So much code to do something relatively simple. As one source in the article points out, “A large bank’s computer system is typically about one-fifth that size.” Apple’s latest version of the OS X operating system has approximately 80 million lines of code. Looking at it another way, that is a pretty good code-to-dollar ratio. The investors in Facebook probably didn’t get 500 million lines of code for their $400 million. Though, one suspects, they might have been pretty appalled if they had.

So if the numbers are hard to mesh with Silicon Valley, what about the process — the way in which they went about doing this, and the resulting outcome?  Was the experience of those developing this similar, with similar outcomes, to what might have taken place in Silicon Valley over the past decade or so? And, how does the new path compare with this traditional approach?

The platform was ”70 percent of the way toward operating properly.”   

Then – In old-school Silicon Valley, there was a sense among a slew of companies that you should release early, test the market, and let the customers find the bugs.

Now – It’s still the case that companies are encouraged to release early; if your product is perfect, you waited too long to release. The difference is that the last part — let the customers find the bugs — is simply not acceptable, except for the very earliest beta-test software. The mantra with modern developers is: fail early and fail often. Early means while the code is still in the hands of developers, as opposed to the customers. And often means testing repeatedly — ideally using automated testing, as opposed to manual tests that were done reluctantly, if at all.

“Officials modified hardware and software requirements for the exchange seven times… As late as the last week of September, officials were still changing features of the Web site.” 

Then — Nothing new here. Once upon a time there was a thing called the Waterfall Development Method. Imagine a waterfall with different levels, each pouring over into the next. Each level of this cascade represented a different set of requirements, each dependent on the level above it, and the end of the process was a torrent of code and software that would rush out to the customer in all its complex, feature-rich glory, called The Release. The problem was that all these features and all this complexity took time — often many months for a major release, if not longer. And over time the requirements changed. Typically the VP of Sales or Business Development would stand up in a meeting and declare that without some new feature that was not in the Product Requirement Document, some million-dollar deal would be lost. The developers, not wanting to be seen as standing in the way of progress, or being ordered to get out of the way of progress, would dutifully add the feature or change a requirement, thereby making an already long development process even longer. Nothing new here.

Now — The flood of code that was Waterfall has been replaced by something called Agile, which, as the name implies, allows developers to be flexible and to expect that the VP of Sales will rush in and say, “Stop the presses! Change the headline!” The Release is now broken down into discrete and manageable chunks of code, in stages that happen on a regular weekly, if not daily, schedule. Software delivery is now designed to accommodate the frequent and inherently unpredictable demands of markets and customers. More importantly, a problem with the software can be limited in scope to a relatively small bit of code, where it can be quickly found and fixed.

“It went live on Oct. 1 before the government and contractors had fully tested the complete system. Delays by the government in issuing specifications for the system reduced the time available for testing.”

Then — Testing was handled by the Quality Assurance (QA) team. These were often unfairly seen as the least talented of developers, viewed much like the Internal Affairs cops in a police drama: on your team in name only, and out to get you. The QA team’s job was to find mistakes in the code, point them out publicly, and make sure they got fixed. Not surprisingly, many developers saw little value in this. As I heard one typically humble developer say, “Why do you need to test my code? It’s correct.” The result of this mindset was that as the number of features increased, and the time to release remained unchanged, testing got cut. Quality was seen as somebody else’s problem. Developers got paid to write code and push features.

Now — Testing for quality is everybody’s job. Silos of development, operations and QA are being combined into integrated DevOps organizations in which software is continuously delivered and new features and fixes are continuously integrated into live websites. The key to this process — known by the refreshingly straightforward name of Continuous Delivery — is automated testing, which frees highly skilled staff from the rote mechanics of testing and allows them to focus on making a better product, all the while assuring the product is tested early, often and continuously. A Continuous Delivery product named Jenkins is currently one of the most popular and fastest growing open source software packages.

“The response was huge. Insurance companies report much higher traffic on their Web sites and many more callers to their phone lines than predicted.”

Then — The term in The Valley was “victim of your own success.” This was shorthand for not anticipating rapid growth or a positive response, and not testing the software to ensure it had the capacity and performance to handle the projected load and stress that a high volume of users places on software and the underlying systems. The reason for this was most often not ignorance or apathy, but that the software available at the time was expensive and complicated, and the hardware needed to do these performance tests was similarly expensive and hard to spare. Servers dedicated solely to testing were a luxury that was hard to justify and often appropriated for other needs.

Now — Testing software is now often cloud-based, on leased hardware, which means that anybody with a modicum of technical skill and a modest amount of money can access tools that would have been out of reach of all but the largest, most sophisticated software engineering and testing teams with extravagant budgets. Now, not only is there no excuse for not doing it, not doing it is in fact inexcusable. Software is no longer sold as licensed code that comes on a CD. It is now a service that is available on demand — there when you need it. Elastic. As much as you need, and only what you need. And with a low entry barrier: you shouldn’t have to battle your way through a bunch of paperwork and salespeople to get what you need. As one Chief Technical Officer at a well-known Bay Area start-up told me, “If I proposed to our CEO that I spend $50,000 on any software, he’d shoot me in the head.” Software is now bought as a service.

It’s far from clear at this point in the saga what, how and how much it will take to fix the HealthCare.gov site. What is clear is that while the failure should come as no surprise given the history of government, and software development in general, that doesn’t mean the status quo need prevail forever. It’s a fitting corollary to the ineffective processes and systems in the medical industry that HealthCare.gov itself is trying to fix. If an entrenched industry like software development and Silicon Valley can change the way it does business and produce its services faster, better and at a lower cost, then maybe there is hope for the US health care industry doing the same.

By: Charles Stewart (@Stewart_Chas)

About Load Impact

Load Impact is the leading cloud-based load testing software trusted by over 123,000 website, mobile app and API developers worldwide.

Companies like JWT, NASDAQ, The European Space Agency and ServiceNow have used Load Impact to detect, predict, and analyze performance problems.
 
Load Impact requires no download or installation, is completely free to try, and users can start a test with just one click.
 
Test your website, app or API at loadimpact.com
