Visual regression testing is vital in a high-growth tech startup like Canva: it ensures that new code continues to work in our web application and that, visually, the application looks just as good as before. In this blog post, Canva engineer Joscha Feth shares his approach to visual regression testing and how it instills confidence in every product update.
Visual Regression Testing (VRT) can be quite a polarizing activity for the developer community - it’s one of those things that can either make a developer’s eyes sparkle with joy, or darken with rage. Which side of the fence you fall on often depends on how the process of updating the visual baseline and reviewing changes works in your company.
To explain what the baseline update is in simple terms – when, for example, you change the color of a button on a webpage from red to blue, every other developer who works on that page after you has to compare their changes against the blue button. Whilst you are working to change the button from red to blue however, that button on your baseline remains red. Only when you have merged your work back into the mainline does everyone’s baseline for the button become blue. The newly generated images (the ones now containing the blue button) are stored centrally and will be the new point of reference for any developer subsequently working on the project.
At most companies that I’ve worked with previously, these baseline regenerations happen on the developer’s machine. Because no two developer machines are the same, something as intricate as generating images of a website can easily produce a slightly different result every time a new change is introduced. This difference isn’t necessarily visible to the human eye, but it is plainly detectable by a computer. For example, a few subpixels may be rounded differently in a new browser version, shifting the whole image by one pixel: something a computer instantly registers as a difference, but which would go unnoticed in a superficial inspection by human eyes.
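To make the one-pixel shift concrete, here is a minimal sketch in TypeScript of a naive, exact pixel comparison. Everything in it (the `Bitmap` type, `verticalLine`, `countDifferentPixels`) is a hypothetical illustration and not how Percy or any particular tool compares images.

```typescript
// Hypothetical illustration only; this is not how Percy compares images.
// A bitmap is stored as a flat RGBA byte array, four bytes per pixel.
interface Bitmap {
  width: number;
  height: number;
  data: Uint8Array;
}

// Create a white bitmap with a one-pixel-wide black vertical line at column x.
function verticalLine(width: number, height: number, x: number): Bitmap {
  const data = new Uint8Array(width * height * 4).fill(255);
  for (let row = 0; row < height; row++) {
    const offset = (row * width + x) * 4;
    data[offset] = 0;     // red
    data[offset + 1] = 0; // green
    data[offset + 2] = 0; // blue (alpha stays 255)
  }
  return { width, height, data };
}

// Count the pixels whose RGBA values are not byte-for-byte identical.
function countDifferentPixels(a: Bitmap, b: Bitmap): number {
  if (a.width !== b.width || a.height !== b.height) {
    throw new Error("bitmaps must have the same dimensions");
  }
  let different = 0;
  for (let i = 0; i < a.data.length; i += 4) {
    const same =
      a.data[i] === b.data[i] &&
      a.data[i + 1] === b.data[i + 1] &&
      a.data[i + 2] === b.data[i + 2] &&
      a.data[i + 3] === b.data[i + 3];
    if (!same) different++;
  }
  return different;
}

const baseline = verticalLine(8, 8, 3);
const shifted = verticalLine(8, 8, 4); // same line, rendered one pixel to the right
console.log(countDifferentPixels(baseline, shifted)); // 16 of 64 pixels differ
```

A human would call these two images identical, yet an exact comparison flags a quarter of the pixels. That is the kind of noise that differing local tooling introduces.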
Because of this tooling problem, it’s often impossible to know when there are real anomalies, as the noise caused by different tooling can be quite severe. The noise might disappear temporarily when you create a new baseline for yourself, but then shortly afterwards a colleague will say that they can see differences, because their machine will create a different image to yours — and the whole frustrating cycle begins again. This unexpected workload can be very aggravating for a developer who is working on multiple amendments to the source code, and has to keep constantly going back to make further changes to a UI that they hadn’t even realized they’d adversely impacted - not to mention the additional effort of keeping tooling in sync.
This is why automating visual regression testing without heavyweight local tooling is a godsend for developers: it eliminates unreliable and time-consuming manual testing, as well as the need to maintain a possibly unreliable custom stack. There are currently only a few commercially available visual regression testing tools, as the market is still fairly new.
Commercial solutions for visual regression testing
Of the three main commercial services that were available when we looked to implement VRT at Canva, we opted for a tool called Percy.
For a design-centric business such as Canva, Percy is a very valuable tool, which allows us to speed up testing and catch even the small unintended visual changes that are sometimes created as side-effects of code changes.
Canva has a very strong culture of peer reviewing code changes before they go into the mainline, and whilst it is often possible to reason about how code changes affect the system, it can be a time-consuming riddle to reason through everything thoroughly. VRT helps us to speed up code reviews as well, as you can see from a real conversation in a pull request that I came across the other week:
As visual regression testing is yet to be widely adopted, a lot of companies have their own makeshift systems that work for better or for worse. Some companies have been doing it for a few years but the in-house tools they use are commonly very clunky and/or expensive to maintain. When you join a company and they have a visual regression suite, it is usually something that they have written themselves, requires a lot of attention and maintenance, and in the end doesn’t work very well after all.
This is because there has been no clearly defined process to update and regenerate baselines. Until the last couple of years, when commercial VRT tools became available as services, visual regression testing involved setting up a myriad of tools on your local machine, running those tools to generate the baseline, and whilst doing so, crossing your fingers and hoping the visual differences generated were only due to the changes made, and not because one of the dependencies needed to render these images had changed.
It may sound pretty straightforward, but keeping these dependencies stable can be complicated, as screen resolutions, browsers, libraries and machine types change. At Canva, we provide users with templates for a wide range of design projects, from online ads, flyers and logos, to posters, brochures, invitations, business cards, and much more. We already had visual regression testing for our exported designs, but that system uses a well-defined rendering engine and is much easier to tame than common frontend, browser-based visual regression testing, so we decided not to reuse it for the web-development workflow. Percy, the service we are now using, comes with an API that will in fact allow us to replace our custom export regression test suite with a Percy-managed project to which we provide the baseline images coming from our export renderer. We would retain full control of the rendering while leveraging Percy’s GitHub integration, baseline approval, diffing and more, which will greatly improve that part of our testing infrastructure.
Because commercially available visual regression tools are a recent phenomenon, companies have been forced to produce their own out of necessity. Percy is one of the first companies to have developed this technology and made it commercially available and fully supported, removing the headache of the ongoing maintenance of in-house solutions.
Percy was a service that I had wanted to try for a long time, but in its early days it was tightly coupled to the software development platform GitHub, so I wasn’t able to use it, as the company I worked for previously had its own proprietary software development platform that was in direct competition to GitHub and not compatible with Percy.
After joining Canva (which uses GitHub), I knew the time was right. After a few weeks of trialling we knew that with the Percy workflow we could vastly improve testing by preventing unwanted visual regressions, without having to maintain a custom solution. Percy solved the visual baseline update and stability issues, and also provided a number of predefined workflows and integrations with services and tools we use.
There are only a few other commercially available products on the market aside from Percy, and one of these is owned and operated by one person. A company like Canva, which has over a hundred developers, could not seriously consider investing in a product from a company that only has one developer. If that company suddenly closed down or that person could no longer provide the support that we need, we would have put a lot of effort into incorporating a product that is no longer supported, and we would have to start again from scratch with another provider. With a product like Percy however, which is supported by a bigger team of developers, that risk factor is greatly reduced.
Visual regression testing tools are really not something that companies want to own either, due to the fact that so much effort is required for their ongoing maintenance. Automated VRT tools seem easy enough to create at the outset, but it’s the maintenance of these tools that is onerous, not to mention potentially very costly. Commercially available tools such as Percy remove that burden.
Also, once a company develops its own tool, that tool inherits all the traits specific to that particular company and gains features specific to the types of problems that company is typically trying to solve. This makes such tools very difficult to migrate away from, whereas something like Percy can be used for a whole range of visual regression testing requirements, and is therefore a lot more versatile and user-friendly.
How we set it up…
In the beginning, we set up Percy for our main app page, so on every master build Percy would produce a new baseline reference that the rest of the organisation could compare against in their feature branches.
Percy does have the ability to change the page state, but it means using custom APIs to do so. As we had already decided on a few other technologies to mutate browser state and decouple UI states from each other, we were a bit reluctant to add this complexity on top. Hence we started looking at how we could do visual regression testing for our UI library, which is broken down into individual components in a tool called Storybook.
We managed to produce a proof-of-concept that worked, but was bound by runtime (we had hundreds of different stories already at that point in time, and we build them both for left-to-right and right-to-left text direction). Together with the Percy team, we were able to improve on that, essentially making it possible to produce hundreds of snapshots in a few seconds and render them immediately.
That’s when things really kicked off. Suddenly, all of our frontend developers were able to immediately spot (both wanted and unwanted) visual changes in their pull requests.
Status checks on pull requests help to identify differences that will be made to the baseline. This prevents unwanted changes (tests with diffs show up as failed statuses before they are approved) and highlights intended changes once the diff has been approved. When we set out to create new components, often it’s designers, not developers who are approving baseline updates—making sure that the components they drafted in Sketch are actually represented faithfully in the frontend implementation.
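The approval rule described above can be sketched in a few lines of TypeScript. This is a hypothetical model: the `SnapshotComparison` type and `statusFor` function are assumptions made for illustration, not part of Percy's or GitHub's API.

```typescript
// Hypothetical model of a pull request status check; the type and function
// names are assumptions for illustration, not Percy's or GitHub's API.
interface SnapshotComparison {
  name: string;
  hasDiff: boolean;  // the snapshot differs from the baseline on the mainline
  approved: boolean; // a reviewer (developer or designer) accepted the diff
}

type CheckStatus = "success" | "failure";

// The check fails while at least one diff is still waiting for approval.
function statusFor(comparisons: SnapshotComparison[]): CheckStatus {
  const pending = comparisons.filter((c) => c.hasDiff && !c.approved);
  return pending.length === 0 ? "success" : "failure";
}

const button: SnapshotComparison = { name: "Button/primary", hasDiff: true, approved: false };
const header: SnapshotComparison = { name: "Header/default", hasDiff: false, approved: false };

console.log(statusFor([button, header])); // "failure"

button.approved = true; // the designer approves the new button
console.log(statusFor([button, header])); // "success"
```

Snapshots without a diff never block the pull request, so reviewers only have to look at what actually changed.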
Separate components within a website (picture Facebook for example, with shortcuts on the left, the newsfeed in the middle and ads on the right) can be developed in isolation then combined. This separation allows you to ensure that changes made in one component of the site don’t affect all of the other components. Each of these components usually has one or more visual “stories” attached, so if, for example, you changed the colour of a button on the page from red to green, then only the button “story” would change, not the logo story or the header story.
From a developer’s point of view, it can sometimes be very difficult to understand the side-effects of your code changes, particularly with user interface changes and complex code. You may think that a change you’ve made is really small and contained, but when the regression test is run, it can show a knock-on effect on a whole bunch of other components that you didn’t expect to be affected. The ability to develop components in isolation gives developers added certainty that their changes are contained to just the parts that they want to affect.
I remember one time when some CSS for a lightbox/dialog was introduced that conditionally added a margin to the surrounding element. Whilst this was anticipated in the context where it was introduced, the new code didn’t take into account all the scenarios in which the dialog was used, and it didn’t clean up after itself when destroyed. Unfortunately the code was merged into the mainline nonetheless, and for reasons I don’t remember, not all tests had been run. A few hours later people started seeing the diff clearly visible in hundreds of stories, and we were able to track down the offending code before the change was ever shipped to our customers.
For reviewers, it’s often much easier to look at changes in an automated visual regression test, rather than the person who made the changes having to verbalise the changes they made or explain them in an email or text message. This can also drastically reduce the review time.
Automated VRT also takes into account the slightly different views generated by different browsers. One of the most difficult aspects of web development, especially when it comes to smaller viewing formats such as mobile browsers, is that some design aspects are very difficult to standardize across different browsers. Most of our developers use Google Chrome, a handful use Firefox and maybe one or two Safari, but none of our developers are currently using Internet Explorer or Microsoft Edge. With VRT, you can run tests in two web browsers simultaneously, so we can pick up any anomalies between the developer’s browser and another common browser on the spot.
It is essential that all of our components look the same in all of the browsers we support, and VRT enables us to do that (Caveat: Percy currently only supports Firefox and Chrome, with more browsers to come).
…and how widely it has been used
Percy has become part of our standard frontend developer workflow. We currently have around 600 different stories for UI components that we perform visual regression testing on and each app contains between one and ten different screens. All of these are also run in a second text direction and many in additional screen widths based on mobile breakpoints defined by our designers, bringing our snapshot count to well over a thousand. Definitely not something that is still possible to do manually in a reasonable timeframe!
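To show how quickly the snapshot count grows, here is a minimal sketch of expanding stories into snapshot jobs. The types, the `planSnapshots` function, the default width and the breakpoint values are all assumptions for illustration; they are not taken from Percy's or Storybook's APIs or from our actual configuration.

```typescript
// Hypothetical sketch: the types, names and default width are assumptions
// for illustration and are not taken from Percy's or Storybook's APIs.
type Direction = "ltr" | "rtl";

interface Story {
  id: string;
  widths?: number[]; // optional extra breakpoints chosen by designers
}

interface SnapshotJob {
  storyId: string;
  direction: Direction;
  width: number;
}

const DEFAULT_WIDTH = 1280; // assumed desktop width
const DIRECTIONS: Direction[] = ["ltr", "rtl"];

// Expand every story into one snapshot per text direction and width.
function planSnapshots(stories: Story[]): SnapshotJob[] {
  const jobs: SnapshotJob[] = [];
  for (const story of stories) {
    const widths = story.widths ?? [DEFAULT_WIDTH];
    for (const direction of DIRECTIONS) {
      for (const width of widths) {
        jobs.push({ storyId: story.id, direction, width });
      }
    }
  }
  return jobs;
}

const stories: Story[] = [
  { id: "button--primary" },
  { id: "dialog--default", widths: [375, 768, 1280] },
];
console.log(planSnapshots(stories).length); // 1 * 2 + 3 * 2 = 8
```

With around 600 stories and two text directions, this kind of expansion yields 1,200 snapshots before any additional widths are counted.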
For our apps, we typically use the first screen as a sanity check across browsers (we run visual regression tests on these screens in Firefox and Chrome), then for some language-specific pages, we use German to test how text length affects the UI. German is usually a good indicator of capacity, firstly because German sentences are on average around 20% longer than their equivalents in many other languages, and secondly because individual words themselves can be much longer. (Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft is one rather extreme example of the way German chains nouns together, roughly translating as “Association for Subordinate Officials of the Main Maintenance Building of the Danube Steam Shipping Electrical Services.”)
Of course, it is highly unlikely that we would ever use this particular word, but you’d be surprised how many UI components fail on words like these in visual regression testing when unit tests and integration tests still pass.
We use Thai and Burmese to test text height (as they have a lot of characters with ascenders and descenders, so text is easily cut off if developers use a fixed line-height), and Arabic as a proxy for all right-to-left languages. We also use English in both LTR and RTL as a proxy, because it is easy to reason about for engineers and provides good feedback:
VRT tools such as Percy can test in right-to-left text direction as well as left-to-right, so for languages that read right to left, such as Arabic, the testing is carried out automatically. This means we can be confident that the reversed text direction will have the expected effect on component layout and text in an app.
Previously, we may have ended up delivering something to the end user that didn’t look exactly as we wanted or intended it to, whereas with automated VRT, we now have the surety and security of not sending anything out that contains a visual error. We’re informed by the automated testing any time something is broken and needs to be fixed prior to that going out to the customer, rather than being hostage to human error in testing.
Automated VRT also ensures that developers have a much easier time identifying and accepting or rejecting changes to the baseline. Previously we would have to wait until users came back to us and said “this is broken, can you fix it for us?”, which is clearly less than ideal. Now we have the tools in place to prevent these incidents occurring in the first place.
Because our testing is now automated, it can also scale with our business. As we add more developers to our team, it can perform all the testing we require with little additional effort on our part, which is essential for the future viability of the tool and our organization as a whole.
Domain-specific benefits of using VRT
A side benefit of using a platform like Percy, where adding additional regression tests has little cost in terms of developer time, is that whole new categories of tests may become feasible. The additional tests we have developed are somewhat domain-specific to Canva, but I assume that each organization has a loosely related set of tests where they could leverage VRT to make things easier.
Let me show you 4 examples:
The tests I am talking about are, at their core, mathematical problems (transformations, packing, dynamic alignment) and their conditions can be mathematically expressed. However the result of that math is quite abstract and hard to understand when written down. In Canva’s case the solution to these mathematical problems can have an actual visual representation attached, something which can’t easily be expressed as part of a unit test whilst at the same time being easily updatable.
Since adopting Percy we’ve started using VRT to detect regressions on these groups of problems (in the examples you can see a dynamic pie chart generation with labels, image rotation based on EXIF data in the browser, a dynamic endlessly scrolling masonry layout component, and image clipping and filtering via paths, with all permutations of filters and clipping in between). We’ve noticed that VRT is not only much more manageable in terms of expressing fixtures for these tests and updating them, but also that detecting intended and accidental changes has become a lot easier ever since.
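To illustrate why the results of such math are hard to read when written down, here is a minimal sketch of a pie chart label layout. It is hypothetical: `layoutPie` and the `Slice` fields are made up for this post and are not Canva's actual chart code.

```typescript
// Hypothetical pie chart layout, for illustration only; not Canva's code.
interface Slice {
  startAngle: number; // radians
  endAngle: number;   // radians
  labelX: number;     // label anchor on the circle, relative to the centre
  labelY: number;
}

// Turn raw values into slice angles and a label anchor at each slice's midpoint.
function layoutPie(values: number[], radius: number): Slice[] {
  const total = values.reduce((sum, v) => sum + v, 0);
  if (total <= 0) {
    throw new Error("values must add up to a positive number");
  }
  let angle = 0;
  return values.map((value) => {
    const startAngle = angle;
    const endAngle = angle + (value / total) * 2 * Math.PI;
    angle = endAngle;
    const mid = (startAngle + endAngle) / 2;
    return {
      startAngle,
      endAngle,
      labelX: radius * Math.cos(mid),
      labelY: radius * Math.sin(mid),
    };
  });
}

const slices = layoutPie([1, 1, 2], 100);
// A unit test has to assert on numbers like these:
console.log(slices[0].labelX.toFixed(2), slices[0].labelY.toFixed(2)); // 70.71 70.71
```

A reviewer cannot tell from `70.71` whether a label sits in the right place, and every intended layout change means recalculating such fixtures by hand. A snapshot of the rendered chart answers the same question at a glance and is updated by approving the diff.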
Visual regression testing has been a great asset to Canva in the last year and a half or so. It consistently delivers value to our designers and developers, especially our many new starters who may be unsure about how changes they make can affect the system as a whole.
Canva is a very visual product, which probably benefits more from VRT than most other companies, but even just considering our core UI components in Storybook, we’re getting great value for money out of Percy. The cost of maintaining the integration with Percy is marginal and Percy’s response time to incidents and their support is outstanding. It will definitely help us to scale the company successfully into the future and - as our CEO Melanie Perkins likes to put it - build the rest of the 99% of Canva that is yet to be built. The more surety we have in the end product we deliver to users, the more confident we can be in adding new features and bringing on the new developers that we need to take Canva to the next level.
So if you’re standing at the crossroads deciding whether to go down the path of VRT, I would suggest finding answers to the following questions:
- Do you often roll out new versions where you only find out months later, by accident, that the layout shifted or components are broken?
- Do you only find out after a customer complains that some infrequently used page has a layout problem?
- Do you need confidence that changes look good in multiple screen sizes and for multiple breakpoints?
- Are you migrating your stack from legacy to a new one and worried about introducing inconsistencies in your design whilst doing so?
- Do you have tests that are very visual at their core and hard to express in standard unit tests?
- Do you want to make sure that designers have a chance to be the gatekeepers of design changes by your engineering organization?
- Are you currently developing and/or maintaining your custom visual regression tool suite?
If you can answer any of the questions above with a yes and you are not too concerned about throwing a bit of money at the problem to solve it, an automated VRT solution might be for you!