Testing in Production
Testing in production can seem a risky and immature concept, traditionally associated with development teams and companies that lacked the technical capability and processes to test code in lower environments before releasing it to end users. To be clear, we still do not advocate running traditional build and functional test activities in production, but there are instances where testing in production is a highly valuable activity.
Understand and monitor how new features are being used
As teams mature in their CI/CD capabilities, releasing features frequently and confidently into production becomes second nature. As more teams within an organisation reach this level of maturity, a new focus on business value becomes possible. Good product owners work with the business to understand how digital features can help achieve strategic goals in both the near and long term; however, finding what works is invariably a process of experimentation and trial and error. The feedback loop on whether a feature is successful and adding business value can be shortened considerably by testing features in production. Implementing feature flags in your code, turning them on for specific users or regions, and monitoring the effect is an excellent way to understand whether changes are delivering the expected value. Once an organisation adopts this approach, historic and real-time application usage data can become an integral part of the application design process, and the guesswork on whether a feature will be as successful as initially anticipated is greatly reduced.
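As an illustration of the feature flag approach described above, here is a minimal Python sketch. It is not taken from any particular feature flag product: the `FeatureFlag` class, its fields and the `is_enabled` helper are hypothetical names chosen for this example, and the hashing scheme is just one common way to keep each user's experience stable between requests.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FeatureFlag:
    """A hypothetical flag definition; the field names are illustrative only."""
    name: str
    enabled_regions: frozenset = field(default_factory=frozenset)
    rollout_percent: int = 0  # 0-100: share of all users who see the feature


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in 0..99 so they always get the same answer."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: FeatureFlag, user_id: str, region: str) -> bool:
    """Enable the feature for targeted regions, or for a percentage of all users."""
    if region in flag.enabled_regions:
        return True
    return bucket(flag.name, user_id) < flag.rollout_percent


# Example: switch a (hypothetical) new checkout on for the UK and 10% of everyone else.
new_checkout = FeatureFlag("new_checkout", frozenset({"uk"}), rollout_percent=10)
uk_user_sees_feature = is_enabled(new_checkout, "user-42", "uk")
```

Because the bucket is derived from the flag name and user ID rather than a random draw, the same user stays in the same cohort for the life of the experiment, which keeps the usage data for each cohort comparable.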
Once you have the ability to monitor the performance of features, you can act accordingly. If features are not delivering value, or are having an adverse effect on your users and business, you can quickly roll back the changes, protecting the organisation’s brand and providing a proactive customer experience that is grounded in monitoring and data.
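To make the idea of acting on monitoring data more concrete, the sketch below compares a cohort that has the new feature with a control cohort and decides whether a rollback is warranted. The `CohortStats` type, the `should_roll_back` function and all of the thresholds are assumptions made for this example; real values would come from your own monitoring platform and business targets.

```python
from dataclasses import dataclass


@dataclass
class CohortStats:
    """Aggregated usage data for one group of users (hypothetical shape)."""
    users: int
    conversions: int
    errors: int

    @property
    def conversion_rate(self) -> float:
        return self.conversions / self.users if self.users else 0.0

    @property
    def error_rate(self) -> float:
        return self.errors / self.users if self.users else 0.0


def should_roll_back(
    control: CohortStats,
    treatment: CohortStats,
    max_error_increase: float = 0.01,   # assumed tolerance, tune per service
    max_conversion_drop: float = 0.02,  # assumed tolerance, tune per feature
    min_users: int = 1000,              # avoid deciding on too little data
) -> bool:
    """Return True when the new feature is measurably hurting users or the business."""
    if treatment.users < min_users:
        return False
    if treatment.error_rate - control.error_rate > max_error_increase:
        return True
    return control.conversion_rate - treatment.conversion_rate > max_conversion_drop
```

A check like this could run on a schedule and switch the relevant feature flag off when it returns `True`, so that a poorly performing feature is withdrawn before most users ever see it.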
Test resilience and recovery time
The benefits of cloud infrastructure are well known and well articulated. The ability to spin up environments on demand, quickly configured to host your applications and services, has changed the way organisations and development teams are structured and operate. Scalability, resilience and security are now easier than ever to configure and implement. However, issues still occur, and systems and applications do go down. Testing an application’s performance and resilience in production should be reserved for the most mature and confident cloud architectures, but if done correctly it is a great way to ensure that your production services can recover and operate even under intense traffic, heavy load or unusual usage. Netflix’s Chaos Monkey was created to ensure that development teams built resilience into their platform rather than treating it as an afterthought. Testing these scenarios can also confirm that your cloud application and monitoring are working as expected and giving you the alerts and forewarning you need to prevent service disruption.
You must task and empower your development teams to build resilience into their applications. This is important because this type of production testing should be a confirmation activity: there should be no doubt that your architecture will hold up, because resilience and performance were considered before the first line of code was written, following a ‘Test First’ paradigm.
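The shape of such a resilience test can be sketched in a few lines of Python. This is not how Chaos Monkey itself is implemented; it is a simplified illustration in which `terminate` and `is_healthy` are hypothetical callbacks that you would wire up to your own cloud provider and health checks.

```python
import random
import time
from typing import Callable, Optional, Sequence


def run_chaos_experiment(
    instances: Sequence[str],
    terminate: Callable[[str], None],  # hypothetical hook into your platform
    is_healthy: Callable[[], bool],    # hypothetical service health probe
    timeout_s: float = 60.0,
    poll_interval_s: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Optional[float]:
    """Terminate one random instance and measure how long recovery takes.

    Returns the recovery time in seconds, or None if the service did not
    report healthy again before the timeout.
    """
    rng = rng or random.Random()
    victim = rng.choice(instances)
    terminate(victim)

    started = time.monotonic()
    while time.monotonic() - started < timeout_s:
        if is_healthy():
            return time.monotonic() - started
        time.sleep(poll_interval_s)
    return None
```

A result of `None` means the service did not recover within the allowed time, which is exactly the kind of finding that should also show up in your alerting.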
Verify complex application changes in a contained environment
There are still instances when development teams fundamentally change application architecture and features; one example is the modernisation of legacy platforms into a cloud environment. For the most part, these changes are not rolled live in one go. In fact, delivering in small batches is an even more important mantra in large-scale programs. Testing in production, especially connectivity and basic sanity checks, can be an important part of the development and release process when architecture and applications have been radically overhauled. It should still take place in a contained area of your architecture and in a controlled manner, so as not to affect end users and your business.
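A connectivity check of the kind mentioned above can be very small. The following Python sketch simply attempts a TCP connection to each dependency and records whether it succeeded. The `check_connectivity` name and the endpoints in the usage note are placeholders for this example, and a real sanity check would usually go further, for instance by calling a health endpoint.

```python
import socket
from typing import Dict, Iterable, Tuple

Endpoint = Tuple[str, int]  # (host, port)


def check_connectivity(
    endpoints: Iterable[Endpoint], timeout_s: float = 2.0
) -> Dict[Endpoint, bool]:
    """Try to open a TCP connection to each endpoint and report the outcome."""
    results: Dict[Endpoint, bool] = {}
    for host, port in endpoints:
        try:
            # The connection is closed again as soon as it has been established.
            with socket.create_connection((host, port), timeout=timeout_s):
                results[(host, port)] = True
        except OSError:
            # Covers refused connections, timeouts and DNS failures.
            results[(host, port)] = False
    return results
```

It might be called with something like `check_connectivity([("db.internal.example", 5432), ("cache.internal.example", 6379)])`, where both hostnames are placeholders. Because it only opens and closes connections, a check like this can run against a newly migrated component without touching end-user traffic.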
Testing in production requires mature architecture and releases
Testing in production is not something that all organisations should undertake. It is vital that the architecture is designed in a manner that is conducive to the activity, and that release automation and monitoring are mature. Teams must have achieved advanced-level CI/CD, and their cloud platforms, architecture and applications must:
- Support releasing and rolling back quickly.
- Have no manual steps in the deployment process (apart from gated or manually triggered releases). The team must therefore leverage IaC tooling and techniques to allow quick provisioning and error-free infrastructure set-up.
- Provide alerting and monitoring that cover what needs to be observed and alert when things are about to go wrong. This needs to be coupled with easy-to-understand dashboards, visualisation, reporting and channel alerts that are suitable for the team and stakeholders.
- Above all, be designed so that end users are never impacted by, or even aware of, the test.
Testing in production is not new; mature tech companies with high-traffic services have been doing it for years. The data, and what you can learn from it, can give organisations an edge when it comes to designing features and ensuring their platforms stand up to even the most testing conditions.
If you would like to learn more about how to start your testing in production journey, then please get in contact!