Digital Innovation & Entrepreneurship
Artificial intelligence tools are big business. They reach into every aspect of life: allocating social housing, hiring top talent, diagnosing medical conditions, predicting traffic jams, forecasting stock prices, and generating sales leads. The list goes on. Everyone appears to be discussing how AI will transform the world, fuelling expectations that the global market for this technology will exceed $1 trillion by the end of the decade.

Managers are adopting AI because they believe it will enable their organisations to complete tasks and make decisions quicker, more accurately, and at a lower cost. The pressure to embrace AI comes from all directions: vendors, boards, competitive markets, even the media. It comes with constant reassurances that these tools will deliver on their promises, backed by seemingly credible third-party performance claims.

But all that glitters is not gold. All too often, managers risk a huge gap between expectation and reality. Far from improving performance, AI tools may lower the accuracy and quality of decisions and undermine the knowledge capital the organisation has built up over decades. If this sounds shocking, it should, because as my award-winning research with co-authors Sarah Lebovitz and Natalia Levina shows, it is a risk that thousands of organisations are taking by failing to do the appropriate due diligence when adopting AI solutions. That risk threatens to damage organisations and governments. In some situations, it could even cost lives.

We were able to study, at extremely close quarters, how five AI tools were evaluated in a renowned US hospital that employed leading experts in their fields. The application for the AI was in diagnostic radiology. For more than 11 months, we observed managers testing and evaluating the AI tools at research conferences, workshops, symposia, vendor presentations, and 31 detailed evaluation meetings, as well as in 22 interviews and many informal conversations. We also had access to a wide range of associated data.

What we discovered was highly concerning. It wasn't that the medical professionals involved didn't want to evaluate the AI tools thoroughly. They simply didn't know the critical questions to ask to make a reliable assessment. As they probed further and looked beyond the surface-level metrics, they discovered fundamental flaws in the way the AI had been trained and validated.

At the heart of any AI is its 'ground truth' – the labelled data that represents (and is used to verify) the correct answer to the question the software is trying to solve. For example, the ground truth dataset for a cat-identification AI might consist of labelled images of various breeds. A picture of a cat would be checked against this dataset to see what type of cat it is. However, the AI tools used by organisations are far more complex, whether they are deciding if a radiology image shows a malignant tumour, a candidate is a suitable hire, or a start-up is investable. Far from being clear, what constitutes the ground truth is often up for discussion. In these cases, users need to be certain that the ground truth is a sufficiently verifiable version of the truth, one they can rely on when making decisions.

Wherever possible, the ground truth dataset should be based on objective information. A radiological image used to detect malignant tumours should be checked against subsequent biopsy results, for example. Often, though, the nature of the prediction problem means the labelling of the ground truth data is not necessarily objective. In such cases, people with sufficient expertise should perform and check the labelling, applying the relevant professional standards for that information.
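To make this concrete, the short Python sketch below shows the kind of basic check an adopting organisation can run on a supplier's ground truth before going anywhere near a pilot. All of the names and numbers are hypothetical, invented purely for illustration: it compares the supplier's labels against an objective outcome (such as subsequent biopsy results) where one exists, and measures whether independent experts even agree with one another where one does not.

```python
# Hypothetical sketch: sanity-checking a supplier's 'ground truth' labels.
# All names and values are invented for illustration, not taken from the study.
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Labels the supplier trained and evaluated the tool on (1 = malignant, 0 = benign)
supplier_labels = [1, 0, 1, 1, 0, 0, 1, 0]

# Objective outcome for the same cases, where a biopsy result exists
biopsy_results = [1, 0, 0, 1, 0, 1, 1, 0]

# Independent labels from two in-house experts, applying their professional standards
expert_a = [1, 0, 0, 1, 0, 1, 1, 0]
expert_b = [1, 0, 0, 1, 1, 1, 1, 0]

# 1) How often do the supplier's labels match the objective outcome?
matches = sum(s == b for s, b in zip(supplier_labels, biopsy_results))
print(f"Supplier labels vs. biopsy results: {matches}/{len(biopsy_results)} agree")
print("Confusion matrix (rows = biopsy, columns = supplier label):")
print(confusion_matrix(biopsy_results, supplier_labels))

# 2) Where no objective outcome exists, do qualified experts agree with each other?
print(f"Inter-expert agreement (Cohen's kappa): {cohen_kappa_score(expert_a, expert_b):.2f}")
```

Low agreement on either check is a warning that the 'truth' the tool has learned may not be a truth the organisation can rely on.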
What to avoid when adopting AI

In practice, this means the tool's developers should interact with the relevant expert practitioners, tapping into their accumulated knowledge and experience to better understand the practices and processes involved, in order to codify as much of that know-how as possible.

Unfortunately, as with many other organisations making AI purchasing decisions, the managers in our study relied too heavily on a metric commonly used to assess AI performance: the AUC (area under the receiver operating characteristic curve). The problem is that the AUC says little about the tool's capabilities compared with the performance of the experts who will be using it. Instead, it measures how likely it is that the tool delivers a correct response, based on whatever ground truth labels have been selected by the AI designers. In other words, the developers measure performance on their own terms.

Once the medical professionals looked beyond the AUC metric and began to put the AI tools under the spotlight, problems emerged. In a series of pilot studies, these medical experts used their know-how to develop their own ground truth datasets and test the AI against them. In many cases, their results conflicted with the accuracy measures claimed for the tools. On closer examination, it became clear that the ground truth used by the models had not been generated in a way that reflected how the experts reached their decisions in real life.
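To see why the headline number can mislead, here is a minimal sketch in Python. The scores and labels are toy values invented purely for illustration, not figures from our study: the same model outputs earn an AUC of 1.00 when judged against the vendor's own labels, but only around 0.76 when judged against expert-verified labels.

```python
# Hypothetical sketch: one set of model scores, two competing 'ground truths'.
# All values are illustrative toy numbers, not data from the study.
from sklearn.metrics import roc_auc_score

# Probability scores produced by the AI tool for ten cases
model_scores = [0.95, 0.80, 0.75, 0.60, 0.55, 0.40, 0.35, 0.30, 0.20, 0.10]

# Labels the vendor used when quoting the tool's performance
vendor_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Labels assigned by in-house experts (checked, where possible, against biopsy results)
expert_labels = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]

print(f"AUC against the vendor's labels:    {roc_auc_score(vendor_labels, model_scores):.2f}")
print(f"AUC against expert-verified labels: {roc_auc_score(expert_labels, model_scores):.2f}")
```

Nothing about the model changes between the two lines; only the answer key does. The metric is only as meaningful as the ground truth behind it.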
Most organisations adopting AI don't go through such a rigorous process of evaluation. But failure to examine an AI tool properly risks a damaging fall-out from its poor performance. The behaviour of lower-level managers, who are often reluctant AI adopters, can aggravate the problem. They may back their own know-how against the AI tool and continue to work as usual. The risk is that senior managers credit the AI tool with the good results and make staff redundant. By the time an organisation recognises the hit to its performance and expertise, the damage can be very costly or impossible to rectify.

The best strategy is to evaluate AI tools thoroughly before acquiring, implementing, and embedding them in the organisation. This means putting some key questions to whoever is pushing for the AI's adoption, be it the designer, the vendor, or the organisation's own data scientists.

What managers need to ask

Find out exactly how the AI tool has been trained. Can it be objectively validated? How was the data labelled? Who did the labelling, and who validated it? How expert were they in their field? Was the labelling done to the professional standards expected in that area? What evidence was used? Where is the data sourced from? How applicable is it to your exact use case?

Don't be deterred, and don't accept jargon-filled responses that obscure the truth. Only proceed to piloting the tool if you are satisfied with the answers.

AI is going to be unimaginably transformative, in many cases for the better. Eventually, the way these tools are constructed will become more transparent, stakeholders will establish best practice, and the market for AI tools will be better regulated. Until then, our study shows that 'caveat emptor' – the principle that the buyer alone is responsible for checking the quality and suitability of goods before a purchase – has to be the watchword for organisations thinking of adopting AI tools. The burden of checking the merits of these tools falls solely on the purchaser, and thorough due diligence is needed if organisations want to avoid a bad case of buyer's remorse.
TO THE CORE
1. Managers are under pressure to adopt AI from all directions, but many don't know the critical questions to ask to make a reliable assessment.
2. Be certain that the 'ground truth' at the heart of any AI software is sufficiently verifiable and can be relied on when making decisions.
3. Don't rely on metrics that allow developers to measure success on their own terms.
4. Beware of giving AI tools credit for good results achieved by sceptical, lower-level managers who may be continuing to work as usual.