During June and July 2015, Police Professional magazine ran a series of articles about performance management, in which I was invited to provide perspectives on the use of league tables, numerical targets and binary comparisons. If you are a subscriber you can read the articles first hand, alongside perspectives from another author, Malcolm Hibberd. Our respective articles fitted together well, even though we held contrasting views at times, and in the case of the article on numerical targets, neither of us had even seen the other’s piece when it was written.
Anyway, as my articles simply pulled together some of my pre-existing thoughts on league tables, numerical targets and binary comparisons, I thought it made sense to return to my original submissions and subject them to a slight rework so that the content can be accessed by interested parties seeking something between the light-hearted antics of the #StickPeople and the dry logic of my published academic papers.
Although some of the content is drawn from my ongoing PhD research, the style is relaxed and conversational – if you want the academic stuff, links to my journal articles can be found on my ‘About Me’ page. What follows, therefore, are re-edited versions of the three articles I submitted to Police Professional. I hope you find them useful. I have also attached a pdf link for each of the re-edited versions below. Feel free to share the content with others.
Article 1: League tables pdf
Article 2: Numerical targets pdf
Article 3: Binary comparisons pdf
“ABOUT HALF OF LEAGUE TABLES ARE BELOW AVERAGE”
Truly understanding comparative peer performance is important, as it can help us identify opportunities to learn and improve. (Indeed, if this isn’t the primary reason for doing it, we might as well just pack up and go home).
For this reason, I’d argue that methods for understanding comparative peer performance must be robust, accurate and transparent. Unfortunately, league tables tend to be none of the above; they are methodologically unsound, notoriously unstable, typically opaque and generally misleading. Not a good combination.
Some people acknowledge that league tables carry limitations and risks, yet believe they can still be a valuable tool for understanding performance, if used with caution. I, on the other hand, argue the inherent risks and limitations are so fundamentally toxic that league tables should be avoided altogether. (Or at least be accompanied by a sleep-inducing list of caveats in bold neon font to provide the user with enough information to assess just how flimsy a particular league table is. And nobody wants that. Zzzzzzzz).
My concerns about league tables are twofold:
- They convey information in a format that is largely meaningless and prone to triggering unwarranted assumptions about relative performance.
- These assumptions predictably drive dysfunctional behaviour.
I’m going to address these points in turn, but first let’s get one thing clear – I’m not against comparing peers, or performance management in general, as long as it’s done properly. If we don’t measure the right things in the right way, how do we know if performance is improving? Meaningful performance information is essential for informing decision making, and better decisions lead to better service delivery. Therefore, if performance information misinforms decision makers because of the way it’s presented, we’re in real trouble.
A Bit of History
Ever since 18th Century social reformer Jeremy Bentham called for the ‘tabular comparison principle’, managers, politicians and an ever-demanding public have sought data on how institutions, departments, teams and even individuals perform against each other.
During the 1950s, Leon Festinger’s Social Comparison Theory proposed there is a fundamental urge within the human condition that drives a search for information about status amongst peers. Who is best? Who is worst? Am I better than others? This drive manifested itself most strongly during the New Public Management reforms of the 1980s and 1990s, where the use of league tables exploded onto the scene, particularly in the realms of healthcare, education and policing.
And they’re still here. League tables are such a common way of describing apparent differences between peers that they are regularly used for praise and criticism, sanction and reward, and so on. Unfortunately they’re so well-established in the dominant performance management psyche that they go widely unchallenged.
Well that needs to change.
Why League Tables Should be Avoided: Part One
League tables are almost always unsound – a harsh criticism indeed, especially as sometime this week you’ll probably hear someone on the news or at work citing a league table to claim performance is ‘good’ or ‘bad’ / ‘improving’ or ‘deteriorating’. If so, here’s a free tip – in the absence of that long list of boring caveats I mentioned, just ignore such statements.
So why the harsh words? Well, for starters, if data fall within a ‘normal’ range and are part of a random pattern of variation (don’t worry – I’m not going to inflict the maths on you), then there is absolutely no merit whatsoever in ascribing meaning to any apparent ‘differences’, and therefore no basis for attempting to rank the work units concerned.
The clue is in the name – random variation is random. You get zig-zags on charts. Numbers go up and down. It’s normal. Wild fluctuations are common within league tables and therefore work units swap positions regularly. It’s mostly meaningless. Only if a genuine trend emerges, or a work unit is so significantly different from others in the group, should there be an attempt made to understand difference.
Have a look at the two diagrams below to see what I mean. In configuration ‘A’, although the six police forces are apparently performing at slightly different levels, they are all within the bounds of what is statistically ‘normal’ for that group. In configuration ‘B’, they are listed in the same order, but one force is significantly different from the peer group – that’s the time to delve deeper!
Figure 1. Peer Comparisons: Configuration ‘A’
Figure 2. Peer Comparisons: Configuration ‘B’
Okay, so it’s deliberately over-simplified, but you get the picture, right?
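If you fancy seeing configuration ‘A’ for yourself, here’s a rough Python sketch (all the numbers are invented purely for illustration). Six forces with an identical underlying burglary rate still produce a different ‘league table’ almost every month, simply because of random variation:

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Six hypothetical forces with IDENTICAL underlying performance --
# every force 'truly' averages 100 burglaries a month.
forces = ["A", "B", "C", "D", "E", "F"]

rankings = []
for month in range(12):
    # Random month-to-month noise around the common true rate
    counts = {f: 100 + random.randint(-15, 15) for f in forces}
    ranking = sorted(forces, key=lambda f: counts[f])
    rankings.append(ranking)
    print(f"Month {month + 1:2d} 'league table': {ranking}")

# Despite identical true rates, the table order keeps shuffling
distinct_orders = {tuple(r) for r in rankings}
print(f"{len(distinct_orders)} different ranking orders in 12 months")
```

Nothing about the forces changed from month to month, yet the ‘table’ keeps reshuffling – which is exactly why ranking work units inside normal variation is meaningless.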
Next, there are confidence intervals. If caveats alongside a league table don’t report confidence intervals then it’s impossible to tell whether the ranking order means anything. League tables appear to neatly rank from ‘top’ to ‘bottom’, yet if there’s no information about confidence intervals you don’t know if any of the data pertaining to individual work units involve large ‘overlaps’. If this vital information is missing, or if such overlaps exist, you simply can’t clinically separate work units from each other.
Goldstein and Spiegelhalter of the Royal Statistical Society demonstrated this in 1996 by untangling a nice neat league table of health authorities, ranked 1 to 15. By examining the data properly, they found that all that could be said was that one of the health authorities was in the lower quarter, one was in the lower half, and four were in the top half. Not a nice, neatly-ranked 1 to 15. Why not? Because it simply couldn’t be done. Others, such as Jacobs and Goddard (2007) and Bird and colleagues (2005), have issued similar warnings.
The chart below illustrates how this applies to police forces:
Figure 3. Peer comparisons with confidence intervals
As you can see, all forces except Force ‘C’ exhibit overlapping confidence intervals and the mean average of each data set (the thick bar in the middle of each line) also falls within the other forces’ boundaries. This means we can only say that Force ‘C’ is significantly different to the other four; therefore it is totally inappropriate to rank the other forces.
Furthermore, as Force ‘C’ has a genuinely different crime rate to the others, there may be opportunities to share good practice. This is what comparative peer performance should be about, not a mechanism for ‘naming and shaming’, based on dodgy maths.
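For the statistically curious, here’s a minimal Python sketch of the overlap idea, using made-up detection figures and the standard normal approximation for a proportion’s 95% confidence interval. Force ‘C’ is the only one whose interval separates cleanly from the pack:

```python
from math import sqrt

# Made-up illustration: (detected, total recorded) burglaries per force
data = {"A": (210, 1000), "B": (188, 1000), "C": (270, 1000),
        "D": (205, 1000), "E": (195, 1000)}

def ci95(detected, total):
    """95% confidence interval for a proportion (normal approximation)."""
    p = detected / total
    half_width = 1.96 * sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width

intervals = {force: ci95(*counts) for force, counts in data.items()}

def overlap(f1, f2):
    (lo1, hi1), (lo2, hi2) = intervals[f1], intervals[f2]
    return lo1 <= hi2 and lo2 <= hi1

for force, (lo, hi) in sorted(intervals.items()):
    print(f"Force {force}: {lo:.1%} to {hi:.1%}")

# Only Force 'C' clears everyone else; the rest all overlap
print("C overlaps A?", overlap("C", "A"))
print("A overlaps B?", overlap("A", "B"))
```

With these numbers, Forces ‘A’, ‘B’, ‘D’ and ‘E’ all overlap each other, so ranking them against one another tells you nothing; only ‘C’ is genuinely different.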
However, if we simplistically build a league table using only the averages, it would look like this:
Figure 4. Force league table
You can only imagine the conversations in Forces ‘B’ and ‘E’ as they desperately try to move up the table. Some forces even set aspirational targets for league table positions. The sad fact is that there is no significant difference between the forces ranked 2nd to 5th here, so ordering them violates responsible maths and is totally inappropriate.
Conversely, if this were a positive measure (i.e. the higher the figure, the better the thing – let’s say detected offences), then Force ‘B’ would be at the top of the table and everyone would be scrambling to see what they do differently. If you know anyone who’s ever gone to another force to find out their secret, but can’t find one, that’s why – Force ‘B’ is not actually ‘top’ of the group.
League tables are so unstable that they’re not even a ‘starting point for asking questions’. Such questions are usually the wrong questions, asked of the wrong people, and often about the wrong things. Let’s face it, about half of those in a league table are ‘below average’ and someone is always bottom. This occurs regardless of how tightly clustered the work units are, or how well the whole group is performing. Put the ten most brilliant leaders / musicians / sportspeople into a league table and someone will always be ‘worst’, because neat ranking forces them into a meaningless hierarchy.
Like I said, I won’t bore you with the maths, but there’s a wealth of material out there explaining why confidence intervals, standard deviations, error rates, sample sizes, and more, should be transparently reported alongside the data in league tables. This ensures it’s possible to see if any apparent differences are significant. If such information is absent, do not proceed any further.
Why League Tables Should be Avoided: Part Two
This bit’s easy. The unfounded assumptions, false conclusions and misguided questions triggered by the use of league tables cause unwarranted behaviour by managers, aimed at addressing perceived deficiencies (such as demanding explanations for ‘poor’ performance, changing operational tactics, moving resources around, and so on). Even when this is well-intentioned, it’s baseless and counterproductive.
Then of course, managers’ actions influence others, and where performance pressures are brought to bear in order to move up a league table, guess what – otherwise good and decent people engage in gaming and dysfunctional behaviour. You know that already though, right? There’s so much evidence of this in healthcare, education, policing and beyond that I’d grow old listing it.
Okay, so that was a whistle-stop tour of why league tables should be avoided. I see well-meaning people using and endorsing them and it genuinely concerns me – if the limitations are known and consequent behavioural dysfunction is so likely, then I argue there’s a real need to openly acknowledge this and assess comparative peer performance in a more contextualised and mature way.
Incidentally, during my ongoing PhD research I used experimental psychometrics to test whether league table use predictably triggers disproportionate or unwarranted behavioural responses; the results confirmed such adverse reactions are indeed highly predictable.
So the argument, “It’s not the league tables that are the problem – it’s the way they’re used”, doesn’t hold water for me. Yes, aggressive application is likely to exacerbate dysfunction, but there’s something innately illegitimate about the format itself that makes it highly likely adversity will ensue. Therefore, rejecting league tables isn’t a case of throwing the baby out with the bathwater, because there is no baby in this bathwater.
Neither do I subscribe to the view that there’s a dichotomous choice between using league tables with caution, or nothing at all. I’ve demonstrated how straightforward diagrams can clearly illustrate comparative peer performance; accurate narrative could do the same, thereby dispensing with league tables completely. I believe it’s better to present comparative performance information in a legitimate, easy-to-interpret format that truly informs decision-making, rather than one which routinely impairs it and distorts behaviour.
Bird, S. M., Cox, D., Farewell, V. T., Goldstein, H., Holt, T. and Smith, P. C. (2005) ‘Performance Indicators: Good, Bad and Ugly’. Journal of the Royal Statistical Society (A). 168 (1): 1-27
Goldstein, H. and Spiegelhalter, D. (1996) ‘League Tables and Their Limitations: Statistical Issues in Comparisons of Institutional Performance.’ Journal of the Royal Statistical Society (A). 159 (3): 385-443
Jacobs, R. and Goddard, M. (2007) ‘How Do Performance Indicators Add Up? An Examination of Composite Indicators in Public Services.’ Public Money & Management. 27 (2): 103-110
“MISSING THE TARGET; HITTING THE POINT”
Mention numerical targets and you tend to get one of three reactions:
- “Targets are necessary to drive performance”.
- “Targets are bad because they cause dysfunctional behaviour”.
- “Targets are risky, but okay if used with caution”.
I’ll nail my colours to the mast from the outset – I’m firmly in Camp Two. The first statement is just plain wrong; the third appears to be stranded in no-man’s land.
My case against numerical targets is based on experience, research and evidence. In contrast, most of the defences of targets lack credible supporting evidence and are either characterised by stoic denial of well-documented consequences, or based on a simple misunderstanding of what numerical targets actually are.
If you’re in the third category, then we can get this over with quickly and painlessly, because as Belinda Carlisle sang in the 1980s, “We want the same thing.” It’s probably just a terminology issue and we can agree about effective performance management principles, knowing we’re not that far apart in our thinking after all.
However, if you’re part of the ‘targets are necessary despite the evidence’ brigade, then this might hurt a bit.
Measures Not Targets
My position on targets can be summed up by two simple points:
- All numerical targets are arbitrary.
- No numerical target is immune from causing dysfunctional behaviour.
We’ll explore these two points soon, but first I want to resolve the use of language. ‘Numerical targets’, ‘measures’ and ‘priorities’ are three different things, but the words and concepts are often conflated. This causes people to assume targets are necessary, when they are actually thinking about one of the other two things.
Priorities are essential, because we need to ensure activity is focused. Priorities set direction and tell people what’s important. So, if tackling house burglaries is a force priority, that’s great because everyone knows what’s expected. BUT they’re not numerical targets! The target is the arbitrary bit bolted on at the end, invented in someone’s head, that states “…by 23%”, or whatever.
Despite this, I still hear people saying things like, “We need targets for house burglaries because targets set direction and ambition”. No, the clearly-articulated priority of ‘tackling house burglaries’ sets direction and ambition – the target is irrelevant.
Next, we come to perhaps the greatest terminology mix-up in the history of performance management – ‘targets’ vs ‘measures’. One of the most common phrases I hear trotted out is, “We need targets so we can measure performance”. No, you don’t! MEASURES measure performance – the clue is in the name. Targets don’t measure anything – they are just those random aspirational numbers, invented in someone’s head, that state “…by 23%”, remember?
Measures are absolutely critical, because without measuring the right things in the right way, it’s impossible to understand performance. Measures are just a source of information that can help us to make informed decisions. So for house burglaries, for instance, we could measure:
- The burglary rate.
- Detected burglaries.
- Response times to burglaries.
- Factors that led to burglaries being detected (e.g. forensic hit / caught in the act / CCTV / house-to-house enquiries).
These are just some examples of measures that are directly linked to the priority of tackling house burglaries. In addition, we can utilise hotspot mapping, predictive analysis, intelligence submissions, and so on. Taken together, this wealth of information provides a starting point for understanding the burglary picture.
So once again, numerical targets are irrelevant. Measures tell us about performance perfectly well – targets are random numbers invented in people’s heads. Therefore, I’d argue effective performance management systems require priorities and measures, but not numerical targets. If you already understand this concept and can differentiate between the three, then you probably don’t need to read the rest of this article.
Numerical Targets are Arbitrary
Right, let’s get straight to the point. Numerical targets are arbitrary because there is no known scientific method for setting them. No matter how in-depth the data analysis that establishes the range, trajectory, or average rate of prior performance, the actual adjustment to produce the target is always arbitrary.
Consequently, you see targets for policing activity, such as these:
- Reduce crime by 3%.
- Detect 18% of burglaries.
- Conduct 243,206.3 stop and searches. (The decimal point is not a typo).
These targets are contrived by taking a baseline (usually simplistically derived from the previous year’s data), then metaphorically sticking one’s finger in the air and designating a random number as the target. Alternatively, some people apply an arbitrary percentage to the baseline to produce the target. Other approaches involve ‘consultation’, which simply means a group of people concocting the targets, rather than just one or two individuals.
These approaches are fundamentally flawed, because they ignore important statistical considerations. Without getting into the heavy stuff, every data set exhibits normal (and totally random) variation within a predictable range. Look at the chart below.
Figure 1: Example of a statistical process control (SPC) chart
The numbers and date increments have been omitted from this example to keep it simple, plus I don’t have enough space here to explain how the lines are determined – just trust me that the zig-zags are indicative of random variation. This means it is unlikely there is an underlying identifiable cause for the differences observed. When the data points stay within the dashed lines (‘control limits’) and do not exhibit specified patterns, then the process can be described as ‘stable’. Therefore, unless there is a sudden shock or change in system conditions, the data will continue to populate anywhere within this range.
Let’s say the chart related to response times, with the lower control limit at the 5 minute mark and the upper control limit at 15 minutes. This tells us that if the type and frequency of demand remains constant, and we deploy the same resources from the same location, then officers will predictably arrive at emergencies at any point between 5 and 15 minutes.
Consequently, there is no merit in setting a target (e.g. 10 minutes), because officers will continue to arrive between 5 and 15 minutes. This is because setting a target anywhere in a range of data ignores variation, meaning sometimes it will be hit and other times it won’t, purely due to randomness. Similarly, if a target is set outside the expected range (e.g. at 4 minutes), then it cannot routinely be achieved under current system conditions.
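If you want to see just how futile that 10-minute target is, here’s a small Python simulation. It assumes, purely for illustration, that response times vary randomly around a 10-minute mean under unchanged system conditions, and uses the conventional three-sigma control limits:

```python
import random

random.seed(1)

# Hypothetical stable process: response times drawn from the same
# distribution every day (mean ~10 minutes, typical spread ~1.7),
# i.e. no change in demand, resourcing or tactics -- just variation.
times = [random.gauss(10, 1.7) for _ in range(1000)]

mean = sum(times) / len(times)
sd = (sum((t - mean) ** 2 for t in times) / len(times)) ** 0.5
lcl, ucl = mean - 3 * sd, mean + 3 * sd  # conventional 3-sigma limits
print(f"Natural process range: {lcl:.1f} to {ucl:.1f} minutes")

# Now impose an arbitrary 10-minute target inside that range...
TARGET = 10
hit_rate = sum(t <= TARGET for t in times) / len(times)
print(f"Target 'hit' {hit_rate:.0%} of the time -- by pure chance")
```

The target gets hit roughly half the time without anyone doing anything differently; the only thing that changes outcomes is changing the system itself.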
This latter point illustrates why ‘stretch targets’ are particularly inappropriate; targets do not provide a method or capacity for achieving the objective of quicker response times. If demand and resources remain constant, then system conditions dictate the range within which officers will arrive. Response time targets are therefore arbitrary and irrelevant; when responding to emergencies, the ambition should be to deploy the appropriate resource to attend as quickly and safely as possible, then resolve the issue upon arrival.
Reviewing SPC data helps leaders understand demand and make informed decisions about how to improve. Altering system conditions (e.g. amount or location of resources on duty at particular times) is what influences performance. Numerical targets, on the other hand, are incapable of improving the system; they are unnecessary and impotent in this context.
The desire for faster response times should therefore result in careful analysis and evidence-based improvements, not simply expecting officers to arrive more quickly. Furthermore, the argument that failure to hit a target acts as a useful trigger for initiating remedial action is misguided because:
- The judgment is made against an arbitrary ‘good’ / ‘bad’ dividing line.
- The target disregards variation and imparts nothing about the capability of the system.
- If leaders use contextualised data to understand performance, then informed decisions can be made based on actual evidence instead.
Additionally, whilst it may be feasible to predict future performance within a range, it is impossible to state precisely where performance will be at some future point. For example, if the detection rate was steadily increasing it may be possible to predict there will be between approximately 2,000 and 2,600 offences detected in a year’s time, but it would be impossible to state exactly how many (e.g. 2,550 detected offences).
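Here’s a quick Python illustration of that point, using its own invented quarterly detection counts (separate from the figures above). An ordinary least-squares trend plus the scatter of the residuals gives a defensible range – a rough two-sigma band, not a formal prediction interval – but never a precise point:

```python
# Made-up quarterly detection counts showing a steady rise plus noise
y = [2000, 2080, 2130, 2210, 2260, 2350, 2390, 2480]
n = len(y)
x = list(range(n))

# Ordinary least-squares trend line, fitted by hand
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Scatter around the trend (residual standard deviation)
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
residual_sd = (sse / (n - 2)) ** 0.5

# Project four quarters ahead: a RANGE is defensible, a point is not
future = intercept + slope * (n + 3)
lo, hi = future - 2 * residual_sd, future + 2 * residual_sd
print(f"A year ahead: roughly {lo:.0f} to {hi:.0f} detections")
```

Even with an unusually clean, steady trend like this one, the honest answer is a band of plausible values – so a target pinned to one exact number is a guess dressed up as science.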
For these reasons, targets such as ‘detect 18% of offences’ violate established statistical principles. Managers would require a crystal ball to know if this precise amount was achievable. Furthermore, such targets inadvertently suggest there is no ambition to detect the other 82% of offences – surely this isn’t the case, but it calls into question why the stated ambition is not to detect as many offences as possible.
The bottom line is this – numerical targets are fundamentally incompatible with variation; there is no way round this and therefore all numerical targets are arbitrary.
Targets and Behavioural Change
The basic assumption underpinning targets is that they change behaviour. I agree – targets are explicitly intended to exert influence, so they’re certainly not neutral. Proponents believe targets encourage pro-organisational behaviour, whereas I warn of highly predictable gaming and dysfunction. Whilst not claiming that every single person subject to a target will always engage in dysfunctional behaviour, I’d suggest it’d be naïve to ignore the risks, or deny that targets are consistently responsible for triggering adverse consequences.
The evidence is overwhelming – introduce numerical targets into performance frameworks and people will engage in gaming, cheating and other subterfuge in order to hit the targets. They are not necessarily ‘bad apples’ either, as otherwise good people also engage in these behaviours.
Look at the Public Administration Select Committee’s 2014 report into misreporting of crime statistics by police forces; it categorically warns targets, “…tend to affect attitudes, erode data quality and to distort individual and institutional behaviour and priorities”. (p.31) Consequently, the Committee issued the following strongly-worded recommendation:
“The Home Office, which claims credit for abolishing national numerical targets, should make clear in its guidance to PCCs that they should not set performance targets based on Police Recorded Crime data as this tends to distort recording practices and to create perverse incentives to misrecord crime. The evidence for this is incontrovertible. In the meantime, we deprecate such target setting in the strongest possible terms”. (p.52)
Given this and other high profile warnings, along with the vast array of cross-sectoral evidence regarding the highly predictable consequences of target-setting, I find it astounding that these very real dangers are still ignored by some. For me, there is no ‘use with caution’ when it comes to numerical targets.
Then there’s something about motivation. Target-setters believe targets are necessary to make people work hard, which I think is pretty insulting, particularly in the public services context. I, for one, joined the police to help people, catch criminals and protect the vulnerable. If an officer was investigating five burglaries, what is the point of a 20% detection target? That’s like saying, “Don’t bother with four of those offences”. How is that supposedly better than aiming to solve as many as you possibly can?
Similarly, who puts 13% effort into pursuing a stolen car because the detection target for vehicle crime happens to be 13%? What about targets for conducting, say, 456 patrols? Why 456? Why is 455 not enough and 457 too many? What about quality? Then, if targets are really needed to drive performance, what prompts the 457th patrol? Also, what motivates officers to do a good job in areas of policing that are too complex to set simplistic targets for (e.g. Public Protection)?
Even the world’s foremost proponents of numerical targets, Professors Edwin Locke and Gary Latham, who have spent decades developing Goal-Setting Theory, acknowledge there are serious limitations. Whilst experiments have shown individuals subject to targets increase output for simple, repetitive tasks, the consensus is numerical targets are unsuitable in complex systems (such as policing).
This is partly because targets encourage unhealthy internalised competition, resulting in a debilitating condition known as sub-optimisation. This occurs where individuals, teams or departments focus on targets at the expense of each other, the overall system and / or other important activities not subject to targets. Or to put it another way, targets do indeed set direction – in the direction of the targets.
Furthermore, Locke and Latham accept that targets can and do cause dysfunctional behaviour, damage morale, and impair performance; plus they warn targets are particularly inappropriate for situations where individuals don’t have total control over their performance. (Crime reduction targets spring to mind here). If these guys acknowledge the dangers of target-driven performance management, I think we should all listen.
Time to Decide
Target-setters insist unintended consequences aren’t actually due to the targets, but the way they’re implemented. I disagree. As with league tables, if target-driven performance management is aggressively enforced, then it stands to reason that adverse consequences will be magnified. However, simply exonerating the targets is like saying, “It’s not the nails that caused the tyre to go flat – it’s the way the nails were inserted into it”. (Or they were the ‘wrong sort’ of nails, or there were ‘too many’ / ‘too few’ nails etc.)
And if you’re one of those people who believes some targets must be okay, then ask yourself ‘which ones?’ and ‘why?’ What makes your preferred targets unlikely to cause dysfunctional behaviour, like those other targets do? And aren’t they arbitrary anyway – why not just aim for 100% instead? If you could solve five burglaries, you would.
There is something innately toxic about the targets themselves that significantly raises the risk of dysfunction. Again, during my ongoing PhD research, I conducted experimental psychometric tests with over 4,000 officers and found consistently that the use of numerical targets leads to unfounded assumptions about performance and drives adverse behavioural responses.
In the absence of cultural or organisational pressures, this strongly indicates the use of numerical targets leads to predictable adversity; therefore I would suggest any perceived benefits are outweighed by the dangers. So why gamble? For me, there is no such thing as ‘responsible’ or ‘positive’ use of numerical targets in police performance frameworks.
- Numerical targets are arbitrary and therefore inherently illegitimate.
- Numerical targets do not provide a method or capacity to improve performance.
- Dysfunctional behaviour is a highly predictable consequence of target-driven performance management.
Therefore, I argue there is a strong case to abandon targets and use contextualised data instead. Being clear about priorities and using the right measures in the right way informs decision making, promotes learning, and provides the insight necessary to use performance information intelligently. What’s not to like about that, seriously?
Having no targets does not mean no performance management. Having no targets does not mean no accountability. Having no targets does not mean no ambition.
Let numerical targets go.
“TODAY’S BINARY COMPARISONS COMPARED TO LAST YEAR’S BINARY COMPARISONS”
I’ve already written a lot about binary comparisons in my blog; chiefly, that comparing any two isolated numeric values is completely meaningless, as by definition, the data have no context. Whether someone attempts to compare a figure with last week, last month, last year, the same period last year, year-to-date, the average, and so on, the practice is utterly misguided: it ignores variation, meaning the current figure is effectively being compared to a random historical value.
Therefore, depending on whether your random historical value was relatively high, low, or somewhere in the middle, you will get different results every time. The use of binary comparisons causes people to assume that any apparent ‘differences’ are meaningful; they then perceive a trajectory (‘up’ / ‘down’ / ‘improving’ / ‘deteriorating’ etc), hypothesise about what may have caused the change, then start asking questions, moving resources, convening meetings, writing plans, changing tactics and so on – all in the hope that these actions will address the perceived problem.
Sadly, this ‘perceived problem’ might not actually exist. It may be the case that only random variation was present, so all those nice plans with green and red ‘up’ and ‘down’ arrows were unnecessary at best, and counterproductive at worst. You see, whilst the boss was knee-jerking to something that essentially wasn’t there, he or she missed the opportunity to deploy resources to tackle a genuine problem elsewhere.
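Here’s how easily this happens, sketched in Python with made-up monthly burglary counts from a perfectly stable process. The same current month is simultaneously ‘up’ and ‘down’, depending entirely on which random historical value you compare it to:

```python
# Made-up monthly burglary counts from a stable, unchanged process:
# the 'current' month and three equally arbitrary baselines.
current = 205
baselines = {
    "last month": 190,
    "same month last year": 215,
    "monthly average": 201,
}

results = {}
for label, value in baselines.items():
    change = (current - value) / value * 100
    results[label] = change
    verdict = "UP" if change > 0 else "DOWN"
    print(f"vs {label}: {change:+.1f}% -> crime is '{verdict}'")
```

Same month, same underlying reality – yet you can truthfully announce that crime is up 7.9%, or down 4.7%, just by choosing a different baseline. That’s not performance information; that’s a lottery.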
Binary comparisons tend to be used because of their simplicity, as they appear to convey headlines about ‘direction of travel’. The problem is, however, they are too simplistic. Bird et al (2005) of the Royal Statistical Society warn against their use:
“Very particularly the practice of concentrating on a comparison with the most recent value, this year’s results compared with last year’s, may be very misleading for reasons including regression to the mean”. (2005, p.14)
If you were to plot a binary comparison on a chart, it would look like this:
Figure 1. Binary comparison chart
Pretty worthless, huh? What’s happened to all the other data points that would have been in those massive gaps? Are either of the values particularly high or low? Is there a trend or not? It’s impossible to tell. No one in their right mind would attempt to draw any conclusions from a chart that looked like this, yet some managers seem content to accept the narrative without question; e.g. “Crime is 5.4% higher than it was during the same period last year”.
Binary comparisons are so useless that they aren’t even a valid starting point for asking questions.
However, there’s just one thing I want to explore further – some would argue that in certain circumstances, there can be exceptions to my claim that binary comparisons should be avoided completely. In particular, it can sometimes be possible to test for statistical significance in differences between two sets of numbers.
Indeed, as long as certain conditions apply and statistical details are clearly reported, there are various ways of doing this, depending on the type of data. I’d suggest that the sort of statistical considerations I alluded to in my league tables article would also be appropriate for consideration when making such comparisons; these would include transparently reporting information about sample sizes, confidence intervals, standard deviations and error rates, for example.
Focusing on confidence intervals then, if we return to a similar sort of diagram to one I used for comparing peers in my league tables article, you’ll see what I mean.
Figure 2. Satisfaction rates comparison: Version ‘A’
Figure 3. Satisfaction rates comparison: Version ‘B’
Using the example of satisfaction rates, here are two versions of a diagram comparing rates for 2013-14 against 2014-15. In each case, the value for 2013-14 is 83.3%, whilst the value for 2014-15 is 82.1%. “A decrease of 1.2%!” I hear you cry. Yet, by contextualising the data, it’s clear that only in Version ‘B’ can the difference be said to be significant.
The overlaps evident in Version ‘A’ make it inappropriate to draw any inference about there being a ‘decrease’. It follows that taking action to enquire further into this apparent ‘difference’, or initiating activity to address a perceived deficiency, would be misguided and unnecessary. Conversely, in Version ‘B’, it can be seen that the difference between the two data sets is indeed enough to warrant further inquiry.
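The role that sample size plays here can be sketched numerically. Using a simple normal-approximation 95% confidence interval for each year’s rate (the two sample sizes below are hypothetical, chosen purely to mimic the two versions of the diagram):

```python
from math import sqrt

def ci95(p, n):
    """95% confidence interval for a proportion (normal approximation)."""
    half = 1.96 * sqrt(p * (1 - p) / n)
    return p - half, p + half

# Same headline rates, two hypothetical sample sizes.
for n in (200, 20_000):
    lo1, hi1 = ci95(0.833, n)   # 2013-14
    lo2, hi2 = ci95(0.821, n)   # 2014-15
    overlap = lo1 <= hi2 and lo2 <= hi1
    print(f"n={n}: 2013-14 [{lo1:.3f}, {hi1:.3f}], "
          f"2014-15 [{lo2:.3f}, {hi2:.3f}], overlap={overlap}")
```

With 200 respondents per year the intervals overlap heavily (nothing can be inferred); with 20,000 they separate. The headline percentages are identical in both scenarios, which is precisely why presenting them alone is so misleading.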
The diagrams make the difference between these two scenarios explicit. Contrast this with a situation where managers are expected to assess performance using binary comparisons presented as raw numbers alongside a percentage that describes nothing other than the ‘difference’ between two isolated numeric values. How on earth can they be expected to know which ones are significant and which ones aren’t?
It’s simply impossible, which is why using such binary comparisons even as a starting point for asking questions is a mistake. Without additional supporting statistics that communicate the essential background information necessary to understand whether differences are significant or not, you might as well read tea leaves, toss a coin, or draw random numbers out of a hat.
Yet still I see judgments being made about whether performance is ‘improving’ or ‘deteriorating’ based on binary comparisons involving simple percentages or raw data. These are the most common type of binary comparison found within police performance management documents, resulting in unwieldy multi-coloured numeric tables stuffed full of ‘month vs month’ or ‘year vs year’ comparisons.
The predictable consequences are that managers erroneously ascribe meaning to apparent differences, causing them to envision a ‘direction of travel’, then ask unnecessary questions about the wrong things and react by initiating unwarranted or disproportionate actions intended to address perceived issues of concern. It’s worrying stuff and so easily avoidable.
And this isn’t just me speculating or generalising. During my ongoing PhD research, which used a series of psychometric tests to assess the interpretation of different types of performance information, I found that almost 90% of participants misinterpreted an experimental binary comparison stimulus in the expected direction (i.e. they assumed it meant a deterioration); they then went on to enact all manner of unwarranted and disproportionate behavioural responses to address this perceived deficiency.
So unless you’re the 1 in 10 of the population who seems particularly resilient to being led astray by performance information presented in binary comparison format, you’re highly likely to fall into the trap as well. It’s just easier not to go there in the first place. (By the way, the papers listed at the end of this article report on these tests – go on, have a read).
Keep it Simple
Another way to avoid this malaise would be to simply plot the data in a control chart (see my previous articles for more on these). Although setting the control limits at the ‘industry standard’ level of 95% may rightly be considered arbitrary, it simply means we choose to view the data with 95% confidence, so that significant events can be identified easily. It just depends what degree of clarity you want to have before you act.
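For the curious, here’s a minimal sketch of the arithmetic behind such a chart (Python; the monthly counts are invented, and a real control chart would typically estimate variation from moving ranges rather than the overall standard deviation, so treat this as illustrative only):

```python
def control_limits(data, z=1.96):
    """Mean plus upper/lower control limits (normal approximation).

    z=1.96 corresponds to the 95% level discussed above; the common
    SPC convention is 3-sigma limits (z=3).
    """
    n = len(data)
    mean = sum(data) / n
    sd = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5
    return mean, mean - z * sd, mean + z * sd

# Illustrative monthly counts containing one genuine spike.
counts = [48, 52, 50, 47, 53, 49, 51, 50, 71, 48, 52, 49]
mean, lcl, ucl = control_limits(counts)
flagged = [i for i, x in enumerate(counts) if x < lcl or x > ucl]
print(f"mean={mean:.1f}, limits=({lcl:.1f}, {ucl:.1f}), flagged: {flagged}")
```

Only the spike falls outside the limits; the month-to-month wobble among the other eleven values is just noise, however dramatic the individual ‘month vs month’ percentages might look.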
Here’s an example. (I’ve left the axes unlabelled so you can imagine whether the chart relates to crime data, response times, or anything else you particularly care about, at any temporal interval you prefer).
Figure 4. Control chart
Go on, hazard a guess at which of the data points is significant…
Now imagine trying to work that out using nothing but a massive table of binary comparisons and raw data. Say no more.
Conclusion and a Simple Choice
So yes – in exceptional circumstances (and subject to all manner of caveats and reporting requirements) you could technically do a binary comparison (although it’s not really a binary comparison, as you need access to each individual datum comprising both of the full data sets you wish to compare). Confused? You will be.
Alternatively, you could just look at the full data set in a control chart (or other time series) to see quickly and easily where it is appropriate to dig deeper.
References
Bird, S. M., Cox, D., Farewell, V. T., Goldstein, H., Holt, T. and Smith, P. C. (2005) ‘Performance Indicators: Good, Bad and Ugly’. Journal of the Royal Statistical Society (A). 168 (1): 1-27
Guilfoyle, S. J. (2015) ‘Binary Comparisons and Police Performance Measurement: Good or Bad?’ Policing: A Journal of Policy and Practice. 9 (2): 195-209
Guilfoyle, S. J. (2015) ‘Getting Police Performance Measurement under Control’. Policing: A Journal of Policy and Practice. doi:10.1093/police/pav027