03Jul / 2014

5 Lessons Learned After Self-Hosting Goes Haywire

When things start to go wrong it can sometimes be impossible to contain the unravelling – as if the problematic situation quickly gains momentum and begins to ‘snowball’ into an even worse situation.

This happened to me recently. And since much of what went wrong could have been prevented with good process controls, I believe I have some valuable lessons to share with you.

At the very least this post will be entertaining, as I assume many of you reading this will think to yourself: “yep, been there done that”.

I’ll start by mentioning how I met the folks at Load Impact and started working with their product and writing for them.

I was doing some shopping for a hosting provider for my personal and business website, and ran across someone else’s blog post that tested the performance of all the major players in the ‘affordable web hosting’ segment. We are talking the $8/month type deals here – the bare bones.

This author used Load Impact to quantify the performance of all these providers and provided great insight into how they faired from a performance and scalability perspective.

My first thought was: awesome! – I’ll use that same tool and test a few out myself, and then compare it to the performance of a self-hosted site. I already had a bunch of VMs running on an ESXI server so adding a turnkey wordpress site would be super easy.

It turns out that my self hosted site was much faster and scaled all I needed (thanks to Load Impact), so in the end I decided to just self host.

I’m not making any money from the sites – no ecommerce or ads – so it doesn’t really matter from a business perspective.It’s also easier to manage backups and control security when you manage the whole environment.

But it’s also much more likely that the whole thing will get screwed up in a major time-consuming way.

I imagine there are many SMBs out there that self host as well, for a variety of reasons. It could be that you like having control of your company assets, it was faster and cheaper, or you just like doing everything yourself.

It’s often very difficult for smart people to avoid doing things they can do but probably shouldn’t do as it might not be the best use of their time.

In this blog post I’ll demonstrate how how quickly this situation can go wrong and then go from bad to worse:

Problem #1: my ISP screwed me!

If you are in business long enough, your ISP will screw you too. I made a change to my service plan (added phone line) which I did the week before we went out of town.

For some reason nothing happened so I decided called my provider while 300 miles away from my house. Of course, this is exactly when things started to unravel.

Instead of provisioning my modem correctly, they removed my internet service and added phone. No internet. To make matter worse, I’m not at home so I can’t troubleshoot.

Lesson #1 – don’t make changes with your ISP unless you can be onsite quickly to troubleshoot.

It was nearly impossible for me to troubleshoot this issue as I couldn’t VPN into my network, there wasn’t a connection at all.

I even had a neighbor come in and manually reboot both my firewall and modem. That didn’t work, so my only recourse was a dreaded call to customer support.

The first time I called it was a total waste of time, the Customer Support agent had no idea what was going on so that call ended.

Call number two the next day was slightly more productive in that it ended 45 minutes later and a level 2 support ticket was opened.

Finally, upon getting a level 2 engineer on the line (I was home at this point), they immediately recognized that my modem was mis-provisioned and was setup for phone only! It only took minutes to properly provision the modem and get it back online.

Lesson #2 – if you are technically savvy, then immediately demand a level 2 support engineer. Time spent with first line support is usually a totally frustrating time suck.

Problem #2: Some things start working again and others mysteriously don’t

After the final problem-resolving phone call was complete I was tired, hot (AC was off while out of town) and irritated. So when the internet connection finally came back up, I wasn’t exactly in a “I’m making great decisions” mindset.

So I go to check my site and one works fine, but this one it not up at all. I reboot the VM but still no response from the server.

I’m not sure what is going on.

Lesson #3 – Don’t start making significant changes to things when tired, hot and irritated. It won’t go well.

This is exactly the point at which I should have made a copy of the VM in it’s current state to make sure I don’t make things worse. But instead I immediately go to my backup server (Veeam) and try to restore the VM in question.

Well guess what? That didn’t work either, some sort of problem with the storage repository for Veeam. Unfortunately, the problem is that some of the backup data is corrupt.

I ended up with a partially restored but completely unusable webserver VM.

Lesson #4 – Test your backups regularly and make sure you have more than one copy of mission critical backups.

At some point in this whole fiasco, I remembered what this package I had on my desk was. It was a replacement hard drive for my ZFS array because one of my 4 drives in the RAIDZ1 array was “failing”.

I figured that now would be the perfect time to swap that drive out and allow the array to heal itself.

Under normal circumstances this is a trivial operation, no big deal. Not this time!

This time, instead of replacing the failing hard drive, I accidentally replace a perfectly good drive!

So now I have a really tenuous situation with a degraded array that includes a failing hard drive and no redundancy whatsoever.

Fortunately there wasn’t any real data loss and eventually I was able to restore the VM from a good backup source.

Finally back online!

Lesson #5 – Be extra diligent when working on your storage systems and refer to Lesson #3.

The overall message here is most, if not all, of these issues could have been easily avoided. But that is the case 99% of the time in IT – people make mistakes, there is a lack of good well documented processes to handle outages, and of course hardware will fail.

It’s also worth noting that in large enterprises mechanisms for change control are usually in place – preventing staff from really messing things up or making changes during business hours.

Unfortunately, many of smaller businesses don’t have those constraints.

So what does this have to do with Load Impact? Nothing directly…but I think it’s important for people to be aware of the impact that load and performance testing can have on the infrastructure that runs your business and plan accordingly when executing test plans.

Just like you wouldn’t do something stupid like changing network configs, ISP settings or Storage without thoroughly thinking it through, you should also not unleash a world-wide load test with 10,000 concurrent users without thinking about when you should execute the test (hint – schedule it) and what the impact will be on the production systems.

Hopefully there is a test/dev or pre-production environment where testing can take place continuously, but don’t forget many times there are shared resources like firewalls and routers that may still be affected even if the web/app tiers may not be.

And always remember Murphy’s law: Anything that can go wrong will go wrong.

———-

This post was written by Peter Cannell. Peter has been a sales and engineering professional in the IT industry for over 15 years. His experience spans multiple disciplines including Networking, Security, Virtualization and Applications. He enjoys writing about technology and offering a practical perspective to new technologies and how they can be deployed. Follow Peter on his blog or connect with him on Linkedin.

Don’t miss Peter’s next post, subscribe to the Load Impact blog by clicking the “follow” button below.

Permalink 2 Comments

22Apr / 2014

Test Driven Development and CI using JavaScript [Part I]

In this tutorial, we will learn how to apply TDD (Test-Driven Development) using JavaScript code. This is the first part of a set of tutorials that includes TDD and CI (Continuous Integration) using JavaScript as the main language.

Some types of testing

There are several approaches for testing code and each come with their own set of challenges. Emily Bache, author of The Coding Dojo Handbook, writes about them in more detail on her blog – “Coding is like cooking“

1. Test Last: in this approach, you code a solution and subsequently create the test cases.

Problem 1: It’s difficult to create test cases after the code is completed.
Problem 2: If test cases find an issue, it’s difficult to refactor the completed code.

2. Test First: you design test cases and then write the code.

Problem 1: You need a good design and formulating test cases increases the design stage, which takes too much time.
Problem 2: Design issues are caught too late in the coding process, which makes refactoring the code more difficult due to specification changes in the design. This issue also leads to scope creep.

3. Test-Driven: You write test cases parallel to new coding modules. In other words, you add a task for unit tests as your developers are assigned different coding tasks during the project development stage.

TDD approach

TDD focuses on writing code at the same time as you write the tests. You write small modules of code, and then write your tests shortly after.

Patterns to apply to the code:

Avoid direct calls over the network or to the database. Use interfaces or abstract classes instead.
Implement a real class that implements the network or database call and a class which simulates the calls and returns quick values (Fakes and Mocks).
Create a constructor that uses Fakes or Mocks as a parameter in its interface or abstract class.

Patterns to apply to unit tests:

Use the setup function to initialize the testing, which initializes common behavior for the rest of the unit test cases.
Use the TearDown function to release resources after a unit test case has finalized.
Use “assert()” to verify the correct behavior and results of the code during the unit test cases.
Avoid dependency between unit test cases.
Test small pieces of code.

Behavior-Driven Development

Behavior-Driven Development (BDD) is a specialized version of TDD focused on behavioral specifications. Since TDD does not specify how the test cases should be done and what needs to be tested, BDD was created in response to these issues.

Test cases are written based on user stories or scenarios. Stories are established during the design phase. Business analysts, managers and project/product managers gather the design specifications, and then users explain the logical functionality for each control. Specifications also include a design flow so test cases can validate proper flow.

This is an example of the language used to create a BDD test story:

Story: Returns go to stock

In order to keep track of stock

As a store owner

I want to add items back to stock when they’re returned

Scenario 1: Refunded items should be returned to stock

Given a customer previously bought a black sweater from me

And I currently have three black sweaters left in stock

When he returns the sweater for a refund

Then I should have four black sweaters in stock

Scenario 2: Replaced items should be returned to stock

Given that a customer buys a blue garment

And I have two blue garments in stock

And three black garments in stock.

When he returns the garment for a replacement in black,

Then I should have three blue garments in stock

And two black garments in stock

Frameworks to Install

1. Jamine

Jasmine is a set of standalone libraries that allow you to test JavaScript based on BDD. These libraries do not require DOM, which make them perfect to test on the client side and the server side. You can download it from http://github.com/pivotal/jasmine

It is divided into suites, specs and expectations.

.Suites define the unit’s story.

.Specs define the scenarios.

.Expectations define desired behaviors and results.

Jasmine has a set of helper libraries that lets you organize tests.

2. RequreJS

RequireJS is a Javascript library that allows you to organize code into modules, which load dynamically on demand.

By dividing code into modules, you can speed up the load-time for application components and have better organization of your code.

You can download RequireJS from http://www.requirejs.org

Part II of this two part tutorial will discuss Behavioral Driven Testing and Software Testing – how to use BDD to test your JavaScipt code. Don’t miss out, subscribe to our blog below.

————-

This post was written by Miguel Dominguez. Miguel is currently Senior Software Developer at digitallabs AB but also works as a freelance developer. His focus is on mobile application (android) development, web front-end development (javascript, css, html5) and back-end (mvc, .net, java). Follow Miguel’s blog.

04Sep / 2013

Different types of website performance testing – Part 3: Spike Testing

This is the third of a series of posts describing the different types of web performance testing. In the first post, we gave an overview of what load testing is about and the different types of load tests available. Our second post gave an introduction to load testing in general, and described what a basic ramp-up schedule would look like.

We now we move on to spike testing. Spike testing is another form of stress testing that helps to determine how a system performs under a rapid increase of workload. This form of load testing helps to see if a system responds well and maintains stability during bursts of concurrent user activity over varying periods of time. This should also help to verify that an application is able to recover between periods of sudden peak activity.

So when should you run a spike test?

The following are some typical example case scenarios where we see users running a spike test, and how your load schedule should be configured in Load Impact to emulate the load.

Advertising campaigns

Advertising campaigns are one of the most common reasons why people run load tests. Why? Well, take a lesson from Coca Cola – With an ad spend of US$3.5 million for a 30 second Superbowl commercial slot (not including customer cost), it probably wasn’t the best impression to leave for customers who flooded to their Facebook app.. and possibly right into Pepsi’s arms. If you’re expecting customers to flood in during the ad campaign, ramping up in 1-3 minutes is probably a good idea. Be sure to hold the load time for at least twice the amount of time it takes users to run through the entire scenario so you get accurate and stable data in the process.

Contests

Some contests require quick response times as part of the challenge issued to users. The problem with this is that you might end up with what is almost akin to a DDOS attack every few minutes. A load schedule comprising of a number of sudden spikes would help to simulate such a situation.

TV screenings/Website launches

If you’re doing a live stream of a very popular TV show (think X Factor), you might want to consider getting a load test done prior to the event. Slow streaming times or a website crash is the fastest way to drive your customers to the next streaming app/online retailer available. Take Click Frenzy as an example – they’re still working to recover their reputation. Streaming servers also tend to be subject to prolonged stress when many users all flock to watch an event or show, so we recommend doing a relatively quick ramp up and ending with a long holding time.

Ticket sales

Remember the 2012 London Olympics? Thousands of frustrated sports fans failed to get tickets to the events they wanted. Not only was it a waste of time for customers, but it also served to be a logistics nightmare for event organizers. Bearing in mind that a number of users would be ‘camping’ out on the website awaiting ticket launch, try doing a two stage quick ramp up followed by a long holding time to simulate this traffic.

TechCrunched… literally!

If you are trying to get yourself featured on TechCrunch (or any similar website that potentially might generate a lot of readership), it’s probably a good idea to load test your site to make sure it can handle the amount of traffic. It wouldn’t do to get so much publicity and then have half of them go away with a bad taste in their mouth! In these cases, traffic tends to come in slightly slower and in more even bouts over longer periods of time. A load schedule like this would probably work better:

Secondary Testing

If your test fails at any one point of time during the initial spike test, one of the things you might want to consider doing is a step test. This would help you to isolate where the breaking points are, which would in turn allow you to identify bottlenecks. This is especially useful after a spike test, which could ramp up too quickly and give you an inaccurate picture of when your test starts to fail.

That being said, not all servers are built to handle huge spikes in activity. Failing to handle a spike test does not mean that these same servers cannot handle that amount of load. Some applications are only required to handle large constant streams of load, and are not expected to face sharp spikes of user activity. It is for this reason that Load Impact automatically increases ramp-up times with user load. These are suggested ramp up timings, but you can of course adjust them to better suit your use case scenario.

Load Impact Blog

Tag Archives: Testing