“I have invented a great visualisation that will help you to compare the system’s performance between two configurations”, I said. And I was wrong three times in this single sentence. First of all, I had not invented it but had probably seen it somewhere. Secondly, my visualisation was most definitely not great. And finally, it was not very helpful either. What is worse, I repeated this sentence more than once during my journey of trial and error, while presenting new versions of my visualisation to the team, each resulting in something that was, perhaps, just a bit closer to the truth. But let’s start from the beginning…
I was informed that one of the teams was struggling with the problem of having too much data. They were trying to optimise the performance of a system that, when deployed, would be used by 1200 concurrent users. To decide whether a given optimisation change resulted in a performance improvement, they needed an easy way to compare two systems: before and after the change. To measure system performance, they had prepared a set of performance test scenarios that simulated the expected interactions of future users with the system. As a result, they had 23 test scenarios that could be run simultaneously, multiple times, to simulate the future system load. During such a simulation (which usually took 3 hours, with the scenarios running repeatedly), they used the JMeter tool to collect the response times of all the users’ actions. And this is where they faced the problem: the 23 test scenarios contained 278 different actions, and each action could be executed a couple of million times. A single test run produced a great amount of data, and comparing two runs was difficult.
I realised, due to my sixth sense, that this was exactly the kind of problem that the Operations Research department could help solve. I put on my white coat and rushed to the upper floor. On the scene I met two guys debating the performance runs that had just finished. After some introductions, the conversation went as follows:
First team member: “For most of the actions, you can see that their average response times in the system with the latest optimisation are shorter than the same averages in the system without the optimisation. So, we can say that the optimisation improves the system’s performance.”
Me: “You are right.”
Second team member: “But if you take the longest actions, which are the ones that need improving the most, you can see that their average response time is worse with the latest optimisation implemented. So, we can say that it does not improve the system’s performance.”
Me: “You are right.”
Third team member: “But we should focus on the actions from the most important test scenario. This will tell us if the optimisation is improving the system performance.”
Me: “You are right.”
All the team members: “But we cannot all be right at the same time.”
Me: “You are right.”
After losing most of their respect I could, without further delay, focus on the task of creating the proper visualisation. (I lost the rest of their respect by suggesting that the best and easiest optimisation would be to prevent these confounded users from logging in.) The ideal visualisation would help them to easily understand which actions were gaining performance, which were losing – and to what extent. And so the journey began.
Trial and error
My first attempt was pathetic. See for yourself:
Each graph shows a comparison made for one optimisation change. On the horizontal axis we have the two system configurations being compared. On the vertical axis we have the average response time in seconds. Each action is represented by a line between two points. As there are multiple actions covering one another, you cannot see the comparison for each of them, but you can try to gain a high-level understanding of the comparison of the two runs – if most of the lines are rising, then the first run had better performance. To sum up – this was poor and I hated it. I had to start from scratch AGAIN.
The major concept of the new visualisation was to use a scatter graph, where each action would be represented by a point. On the horizontal axis you have the action’s response time from the first run and, on the vertical axis, from the second run. All the points above the 45-degree diagonal mean that the first run showed better performance for those actions, and vice versa. Additionally, you can see which actions – shorter or longer – are better in which run.
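As a quick sketch of the idea (the action names and times here are invented for illustration, not the team’s data): each action becomes a point (t1, t2), and the side of the diagonal it falls on tells you which run was faster for that action.

```python
# Hypothetical 90th-centile response times (seconds) per action, for two runs
run1 = {"login": 1.2, "search": 0.8, "checkout": 3.5}
run2 = {"login": 1.5, "search": 0.6, "checkout": 3.1}

# Points above the 45-degree diagonal (t2 > t1): the first run was faster there
slower_in_run2 = [a for a in run1 if run2[a] > run1[a]]
# Points below the diagonal (t2 < t1): the second run was faster there
faster_in_run2 = [a for a in run1 if run2[a] < run1[a]]
```

Plotting these points with equal axis limits and a reference line y = x gives exactly the chart described above.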
(One note about the statistics used: instead of analysing the average response time of actions, we started to focus on the 90th centiles. This means that each action is represented by a time that is longer than 90% of the response times of all its executions. The reason was to make sure that most of the users’ interactions with the system perform properly.)
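For illustration, the 90th centile of an action’s raw samples can be computed like this – a minimal sketch with invented sample data, using linear interpolation between the two nearest ranks (one common convention; JMeter and other tools may use slightly different ones):

```python
def percentile(values, q):
    """Return the q-th percentile using linear interpolation between ranks."""
    s = sorted(values)
    rank = (q / 100) * (len(s) - 1)
    lo = int(rank)
    frac = rank - lo
    if lo + 1 < len(s):
        return s[lo] + frac * (s[lo + 1] - s[lo])
    return s[lo]

# Hypothetical response times (seconds) for one action across ten executions
response_times = [0.8, 1.1, 0.9, 4.2, 1.0, 1.3, 0.7, 2.5, 1.2, 0.95]
p90 = percentile(response_times, 90)
```

Note how the single slow outlier (4.2 s) pulls the average up, while the 90th centile stays close to what most users actually experienced.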
The problem with this visualisation is that most of the points are gathered in a small area in the bottom left corner, with only a few of the longest ones in the top right corner. The second problem is that you cannot easily read how large (relatively, e.g. in %) the difference between the runs is for a given action. So, AGAIN.
As you can see, the axes are now in logarithmic scale, which makes better use of the chart area. The lines marking +20%, +50% and +100% now help to assess how big the differences between the runs are. The problem with this visualisation is that most of the chart area is still unused, as even a difference as big as 100% is still very close to the diagonal. AGAIN.
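The reason those ratio lines work on logarithmic axes: a constant relative difference such as +100% (i.e. y = 2x) becomes a straight line parallel to the diagonal, offset by log(2). A quick numerical check (my own illustration, not the original chart code):

```python
import math

# On log axes, the vertical distance between y = 2x and y = x is constant:
# log10(2x) - log10(x) = log10(2), regardless of x
offsets = [math.log10(2 * x) - math.log10(x) for x in (0.1, 1.0, 10.0, 100.0)]
# every offset equals log10(2), about 0.301, so the +100% line
# runs parallel to the diagonal across the whole chart
```

The same holds for the +20% and +50% lines, with offsets log10(1.2) and log10(1.5).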
The concept was to rotate the diagonal so that it became the horizontal axis. We keep the action’s response time from the first run on that axis, still in logarithmic scale. On the vertical axis we can see how much the response time differed in the second run.
Actions that are faster in the second run (after applying the optimisation change) are above the horizontal axis, and the slower ones are below. The whole chart area is now used, and it is much easier to read precise values. However, there is a problem with the vertical axis: it is not symmetrical. If the second run is faster, the points have values from 0 to +infinity. If it is slower, the points have values from 0 to –100%. The visualisation deceives the viewer in favour of the second run. AGAIN.
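The asymmetry is easy to see numerically. Taking the relative change as (t1 − t2) / t2 – my reading of the axis described above, since the post does not give the formula – a halved response time scores +100%, while a doubled one scores only −50%, even though both are equally large changes:

```python
def relative_change(t1, t2):
    """Improvement of run 2 over run 1, as a fraction of run 2's time (assumed formula)."""
    return (t1 - t2) / t2

# A 2x speedup and a 2x slowdown are equally large changes,
# but this measure scores them very differently:
speedup = relative_change(2.0, 1.0)   # second run twice as fast  -> +1.0 (+100%)
slowdown = relative_change(1.0, 2.0)  # second run twice as slow  -> -0.5 (-50%)
```

Improvements stretch towards +infinity while regressions are squeezed into a band of at most 100%, which is exactly the deception described above.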
This is my latest visualisation. The major change here is that the values on the vertical axis are symmetrical.
Additionally, the actions from the most important test scenario were marked in red. As a result, we have a way to show which actions gain and which lose by applying an optimisation. We can also assess the level of the change. And, hopefully, we can make better decisions on the system development.
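The post does not say which symmetric measure was finally chosen; one common option that fits the description is the logarithm of the ratio of the two times, which treats a 2x speedup and a 2x slowdown as equal and opposite. A sketch under that assumption:

```python
import math

def log_ratio(t1, t2):
    """Symmetric comparison: positive if run 2 is faster, negative if slower."""
    return math.log2(t1 / t2)

# A 2x speedup and a 2x slowdown now get equal and opposite scores:
# log_ratio(2.0, 1.0) -> +1.0 and log_ratio(1.0, 2.0) -> -1.0
```

With such a measure, halvings and doublings of the response time sit at the same distance from the horizontal axis, so neither run is visually favoured.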
This was my way of learning the lesson stated by Edward Tufte in his classic The Visual Display of Quantitative Information: repeated revision of a graphical design is crucial to achieving a satisfactory result. In this process, it is important to get constructive feedback after each iteration, which in my case was provided in strong words by the team members (for which I am grateful). Thanks to this, through the repeated application of major and minor changes, I was able to improve my visualisation significantly. That does not mean the visualisation has no drawbacks. First of all, the logarithmic scale is not very intuitive and can be deceptive. Secondly, we are only looking at the 90th centile, not the whole distribution. There are probably numerous other problems about which I have not yet received any feedback.
If you enjoyed this, keep an eye on my blog ExtraordinaryMeasures.net for more stuff like this.