2016-01-21

"Cutting edge" might cut you up


We love "cutting edge" technology. We need to have the latest gadget, we need to use this new trendy framework, we need to use the latest update of our favorite software. Newer should be better, right?

In reality, each new technology brings a promise of improvement together with a small, hidden risk. Imagine trying an innovative new painkiller that promises to cure your low-intensity, but still disturbing, migraines. You might get a real improvement in your condition, you might get no significant improvement or, with a small probability, you could die, as happened in the recent drug trial in France. Granted, that was an experimental drug, and many others who took it were not affected. But this is exactly the danger: if the really bad effects happen only with low probability, they might not even be caught during the trial period...
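
To put a rough number on that last point, here is a minimal sketch. It assumes that each participant is affected independently with the same probability, which is a simplification, and the 1-in-1000 rate and the trial sizes are made-up values for illustration only. The function name is hypothetical.

```python
def chance_trial_sees_effect(p_effect: float, participants: int) -> float:
    """Probability that at least one participant shows the rare effect.

    Assumes independent participants with the same per-person probability.
    """
    # The trial misses the effect only if every participant is unaffected.
    return 1.0 - (1.0 - p_effect) ** participants


# Hypothetical numbers: an effect that hits 1 person in 1000.
for n in (100, 1_000, 10_000):
    print(n, round(chance_trial_sees_effect(0.001, n), 3))
```

Under these assumptions, a trial with 100 participants has only about a 10% chance of showing the problem even once, so a clean trial says little about rare but severe effects.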

If we look at the past, for each successful technology there are more than 10 promising technologies that failed to live up to expectations. This means there is roughly a 90% probability that a technology you consider promising will turn out to be a failure. It is not realistic to believe that you always have a better "nose" for picking the winners. And among those failures, a small percentage can cause damage far bigger than the actual cost of the technology - think about business losses or data corruption.

We usually count only the expected improvements - the new features - while we tend to seriously underestimate the associated risks. It has happened with all technologies: either they got better after a significant period of usage, or they proved to be really dangerous and were banned after a while. For any new product, hidden flaws may be discovered over time.

The hidden flaw

You get this great synthetic material that is light, shiny and resistant... It turns out that it melts in contact with some other plastic material. I once had a pair of sandals that worked just fine for the first 2 years, but their soles practically disintegrated within the first 50 meters when I tried to use them again after a while. I did not find the cause, but it never happened with more classical materials like plain rubber.

You get a new car with great new features and a great futuristic design. It might turn out that in some conditions the windows get stuck because of ice. It did not happen with the old model, but the new design lets water accumulate in exactly the worst place, and nobody thought of it. As a bonus, there is a small, inessential sensor that breaks easily, and the computer treats this as a major error and refuses to start the car exactly when you need it.

With software it is even worse. Because software can be updated over the Internet, most companies make a practice of releasing software that has significant flaws or is insufficiently tested, in order to achieve a better "time to market". Chances are high that a new major version will be much more unstable than your old software. You might get some nice new features, but old, important features might stop working the way they did. Security updates are less likely to break old features; however, the risk is not zero, especially when you have a non-standard configuration that is unlikely to be found in the test environments.

I once worked at an ISP where we implemented automatic network redundancy over an L2 Ethernet ring, based on the "spanning tree" protocol. Even though the switches came from a trusted brand (Cisco), on 3 occasions the fail-over system itself malfunctioned and took the network down, causing major outages for our clients. During that time, the automatic fail-over did not save us from a single real failure. We decided to administratively cut the ring and to enable the fallback path manually in the unlikely event of an outage.

We also had a mail server. We implemented RAID5 to protect against any single hard disk failure. After a couple of consecutive "disk failures" and hours of filesystem recovery, we discovered that the RAID controller was actually faulty, not the disks. You don't usually take into account the reliability threats brought by the very systems that try to improve reliability.
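
The effect can be illustrated with a very crude model. This is only a sketch: it assumes that disks and controller fail independently within some fixed period, it ignores rebuild windows and recovery effort, and all the probabilities are hypothetical values chosen for illustration, not measurements from our mail server.

```python
def raid5_failure_probability(p_disk: float, disks: int, p_controller: float) -> float:
    """Chance that a RAID5 array becomes unavailable in a given period.

    Crude model: the array survives only if the controller works
    AND at most one of the disks fails.
    """
    p_no_disk_fails = (1.0 - p_disk) ** disks
    p_exactly_one_fails = disks * p_disk * (1.0 - p_disk) ** (disks - 1)
    p_disks_ok = p_no_disk_fails + p_exactly_one_fails
    return 1.0 - (1.0 - p_controller) * p_disks_ok


# Hypothetical numbers: each disk fails with 3% probability, 4 disks in the array.
single_disk = 0.03
reliable_controller = raid5_failure_probability(0.03, 4, 0.005)
flaky_controller = raid5_failure_probability(0.03, 4, 0.05)

print("single disk, no RAID:", single_disk)
print("RAID5, reliable controller:", round(reliable_controller, 4))
print("RAID5, flaky controller:", round(flaky_controller, 4))
```

With these made-up numbers, the array with a reliable controller does better than a single disk, while the same array with a flaky controller does worse than having no RAID at all. The protection is only as good as the component that provides it.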

The situation with software is even worse because marketing often cares only about new features, as people use feature lists to choose products. Whichever product has more features must be better... You might end up discovering that it does have more features, but that many of them are very poorly implemented. There is usually a negative correlation between the number of features and their quality. There are a couple of reasons for this: extra complexity brings extra implementation risks and trade-offs, and when you have many features it is easy to overlook one of them when testing.

So, what can we do?

We cannot avoid progress; we just need to be aware of the risks associated with each decision. If you really need to finish your project today and your computer works just fine, don't try that shiny upgrade that promises to improve something you didn't know you needed. There is a small chance that it might completely break your computer and make you miss your big, important delivery. The chance is small, but it is many times higher than the risk of leaving the system as it is until tomorrow.

The rule of thumb

A rule of thumb is this: it is worth replacing a current, mature technology with a new one only if the new one is expected to bring at least 2-10 times the expected investment. This should compensate for the significant risk that the new technology will turn out to be slightly worse than the existing one, and for the small risk that it will have a catastrophic impact. The minimum acceptable gain may vary depending on the worst-case scenario risks, the migration costs and the novelty of the technology. However, in the long term, it is mathematically a losing strategy to replace a technology that is OK for only marginal gains.
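
The arithmetic behind this rule can be sketched as a simple expected-value calculation. This is an illustration under stated assumptions, not a precise model: the function names are hypothetical, the technology is treated as either delivering the full promised gain or nothing, and every number below (success probability, catastrophe probability, size of the loss) is invented for the example.

```python
def expected_net_gain(investment: float, promised_multiple: float,
                      p_success: float, p_catastrophe: float,
                      catastrophe_loss: float) -> float:
    """Expected gain of adopting a new technology, net of the investment."""
    expected_benefit = p_success * promised_multiple * investment
    expected_disaster = p_catastrophe * catastrophe_loss
    return expected_benefit - investment - expected_disaster


def break_even_multiple(investment: float, p_success: float,
                        p_catastrophe: float, catastrophe_loss: float) -> float:
    """Smallest promised multiple for which the expected net gain is zero."""
    return (investment + p_catastrophe * catastrophe_loss) / (p_success * investment)


# Hypothetical scenario: invest 100, 40% chance the promise is delivered,
# 1% chance of a disaster that costs 2000.
for multiple in (1.2, 2, 5, 10):
    print(multiple, expected_net_gain(100, multiple, 0.4, 0.01, 2000))

print("break-even multiple:", break_even_multiple(100, 0.4, 0.01, 2000))
```

In this made-up scenario the promised gain has to be about 3 times the investment just to break even, so a marginal gain of 1.2 times loses money on average. A lower chance of success or a bigger possible disaster pushes the threshold higher, which is why the rule gives a range rather than a single number.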

Why

This is because the expected gains are usually slightly overestimated, while the hidden traps and flaws are greatly underestimated. Nobody highlights the weaknesses of their own product or technology, just the benefits. Also, it is practically impossible to estimate the probability of rare events such as a catastrophic failure. The only thing you can do is evaluate the worst possible impact and understand that the probability of it happening is, most of the time, higher than your first guess.

Therefore
  • for domains where reliability is more important than features, one might benefit more from being a late adopter than an early one. After a longer test period, hidden flaws may have been discovered and corrected. The more users there already are, the higher the chance that the flaw that would affect you has already been spotted and worked around. Also, it is beneficial to use standard configurations that are likely to have been used when the product was tested.
  • it is beneficial to be an early adopter if the "worst-case scenario" loss is small and the potential benefit of the new features is a couple of times bigger than the expected investment in the new technology. Similarly, it is worth experimenting in uncharted areas, like taking a new drug or a vaccine, only when the problem that you are trying to solve (or prevent) is big enough to compensate for the possible, unexpected side effects. I initially found this idea in a book by Nassim Taleb.

Simplicity 

A good estimator of the risk that a new technology has hidden flaws is complexity. If the technology is complex enough, it will almost surely have flaws that are not visible to early adopters. Also, the risk becomes lower when many people are already using that technology for the same use cases. Any "innovative" usage is likely to increase the chance of encountering issues that were not caught when the product was tested. It is worth departing from the mainstream way of using it only when the benefit is really significant.

Many times, in the long term, the benefits of simplicity far exceed those of a complicated solution. It is just too appealing to believe that this new and complex system will completely solve your problems and will not bring unexpected maintenance costs. Once the decision is made, it is also very human to refuse to acknowledge the evidence that shows the unexpected extra cost brought by the extra complexity. Just look back: it is never as simple as it looks initially.

I say: 
  • do not underestimate the power of old, time-tested solutions
  • try to simplify things as much as possible; it brings reliability
  • choose wisely when it is worth adding complexity!
