If Brighton and Hove Albion FC were a stud farm, they would not be Coolmore. They would not be Darley. They certainly wouldn’t be Juddmonte. And yet Roberto De Zerbi’s team are proving that you don’t need bottomless pockets or the footballing equivalent of Frankel in your starting line-up in order to compete with the big boys.

In a banner season, the south coast team reached the FA Cup semi-finals and even clinched a lucrative European spot for the first time in the club’s history, and all with a budget a fraction of the size of that of rivals such as Manchester City.

The key is owner-chairman Tony Bloom and his embrace of big data. The 53-year-old sports betting entrepreneur has his own software that filters the whole of the transfer market. The club’s scouts are sent a list of names to watch, and then compile reports on players who have passed the data checklist. The exact algorithm is a closely guarded secret, kept even from those inside the club. But whatever is in Brighton’s secret sauce appears to be working.

Big data. Analytics. Moneyball. It goes by many names, but the use of data in sports is nothing new. It was brought to popular attention by Michael Lewis in his 2003 book Moneyball and by the 2011 film of the same name starring Brad Pitt, both of which charted the fortunes of the Oakland Athletics baseball team. With a smaller budget than rivals such as the New York Yankees, the A’s applied an analytical, evidence-based approach to identify players who were undervalued by the market, then snapped them up.

Two decades on from the release of Lewis’s book, the term ‘Moneyball’ has become a metonym for the chase for efficiency and perfection in sport. All big-money leagues now employ legions of data nerds to crunch the numbers on all aspects of their players’ performance.

Premier League club Arsenal, for example, use the STATSports system to gather physical data on all their players, from the under-12s through to the men’s and women’s first teams. They record some 250 separate metrics, including accelerations and decelerations, average heart rate, calories burned, distance per minute, high-speed running, high-intensity distance, max speed, sprints and strain. The statistics are available live during training sessions so coaches can make real-time adjustments where necessary.

Horseracing – a sport where milliseconds, mere pixels in a photo finish, mark the line between success and failure – seems like a perfect candidate for the Moneyball treatment. After all, horses are data-generating animals. A standard racecard is packed with reams of data. If a punter wants more information, there are periodicals devoted entirely to providing historical data on every aspect of a horse’s pedigree and performance. And yet it is apparent that racing trails the field behind other professional sports in terms of how it measures its athletes.

Nevertheless, there are signs that a big data revolution could be on the horizon: one that will not just change the face of horseracing, but could help ensure its very survival as an industry.

Perhaps it is no surprise that the first applications of data in horseracing were in gambling. Harvard dropout Andrew Beyer was one of the first to hypothesise that a horse’s performance could be empirically quantified, creating a metric that collapses all the variables that can affect a horse’s performance (surface, going, distance, field size) into a single number: the Beyer Speed Figure. (In the UK, data provider Timeform had an almost identical genesis.)

While Beyer’s mathematical approach has survived and thrived in its modern iteration of Computer Assisted Wagering (or CAW; that is, the use of algorithms to analyse statistics, race history and other relevant data to develop a prediction model for a race’s outcome), outside of betting, the industry itself has seemed slow to embrace the potential of big data.

Despite that, there are indications the industry is waking up to its potential. A number of new systems have recently hit the market that use the latest technologies to provide detailed insights into the horse’s performance. One of these systems is StrideSafe. Using a combination of GPS and motion capture technology, the system is capable of detecting minute variations in the horse’s stride that are effectively invisible to the human eye.

“From an observational point of view, humans can’t detect these sorts of changes that we’re picking up. It’s simply happening too fast,” explains David Hawke, Managing Director of StrideMaster, the company which produces StrideSafe. The sample rate in StrideSafe’s sensors is 800 hertz, or 800 frames per second. The human eye, by contrast, cannot directly perceive more than about 60 frames per second. Yet despite promising trials in America, take-up in Europe has been less than enthusiastic. “We’ve had some inquiries, but we haven’t had anything concrete in terms of take-up outside of Australia and the US,” says Hawke.
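
To give a flavour of why such a high sample rate matters, here is a toy sketch that estimates stride frequency from a simulated 800 hertz accelerometer trace. Everything in it, from the signal to the numbers, is invented for illustration and bears no relation to StrideSafe’s actual algorithms.

```python
# Toy example (not StrideSafe's method): estimate stride frequency from a
# simulated 800 Hz accelerometer trace. All values are made up.
import numpy as np

fs = 800                                   # samples per second
t = np.arange(0, 10, 1 / fs)               # ten seconds of galloping
stride_hz = 2.3                            # roughly 2.3 strides per second

# Simulated vertical acceleration: main stride rhythm, a small secondary
# component and some sensor noise
signal = (np.sin(2 * np.pi * stride_hz * t)
          + 0.05 * np.sin(2 * np.pi * 2 * stride_hz * t + 0.4)
          + 0.1 * np.random.randn(t.size))

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(t.size, 1 / fs)
peak = freqs[np.argmax(spectrum[1:]) + 1]  # ignore the zero-frequency bin
print(f"estimated stride frequency: {peak:.2f} Hz")
```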

Indeed, Australian racing appears to be something of a pioneer when it comes to the use of big data. In a suburb north of Ballarat, Victoria, trainer Ciaron Maher collects and analyses performance and health data on his 700 equine athletes. Josh Kadlec-Cavanagh is Head of Data and Performance at Ciaron Maher Racing. “I would say I’m pretty similar to the Jonah Hill character in Moneyball,” he explains. “I’m the guy sitting behind the database cranking out numbers.”

Maher’s journey into big data began with the appointment of Katrina Anderson as Head of Sport Science in 2020. Kadlec-Cavanagh came on board two years later. “I was hired as a data scientist to come and look after the database and do some of the modelling in terms of the predictive analytics of looking at their trainings, looking at their races and the correlations between the two, and what you can possibly correlate between how they recover in training and what they can produce on a track,” he says.

Maher’s yard uses trackers made by French company Arioneo to ensure no margin is left ungained in his quest to optimise performance. “Athlete feedback is key,” says Kadlec-Cavanagh. “It just seems like a no-brainer to get some sort of feedback that we can’t get through communication [with the horse]. We want to implement technology to be able to monitor things that are happening in the human sports performance world, such as sleep patterns, heart rate variability and recovery rates, all that sort of analytics that’s going on in the human space. We want to be able to do that for horses, too.”

Arioneo and StrideSafe are just two of the technologies that are helping to bring racing into the 21st century. But the industry has some way to go before it catches up to other professional sports. “If I walked into a major football club and said, ‘Who here’s got expertise in biometric sensor analysis?’, half the football department would put their hand up,” says Hawke. “They’ve been doing it for 20 years. But the information can be used in so many different ways in terms of performance, breeding and training techniques. We’re just scratching the surface.”

Unlike human athletes, horses can’t talk, so any attempt to interpret how they ‘feel’ is, to some extent, an act of ventriloquy. “Any data is giving us more context than just eyeing up the horse. Horseracing is an art but now there’s a little bit of science and evidence to back that up,” adds Kadlec-Cavanagh. Indeed, the most persuasive argument in favour of big data is surely the welfare argument. With racing already sweltering through a summer of discontent, equine safety is in the spotlight like never before. If more information about horses’ movements, bodies and behaviour could be collected and analysed, it could be used to handicap and place horses in races, as well as to reduce injuries and improve outcomes.

That is the thinking behind the BHA’s Jump Racing Risk Model (JRRM). Forged out of the ashes of the 2018 Cheltenham Festival Review, the JRRM is a powerful data and epidemiology hub created to identify risk factors in jump racing. Its data set includes 41,438 horses, 45,235 races and 384,418 race starts.

Dr Sarah Allen is the lead researcher on the project, which is operated in collaboration with the Royal Veterinary College. “In terms of our current understanding of risk factors in jump racing, there hasn’t really been any work done on data from Great Britain since 2013, but that study only used data up until the end of 2009,” she explains.

The JRRM will analyse all race starts from 2010 through to the end of the 2023 core jump season to produce six risk factor models: one each for fatalities, long-term injuries (defined as an injury sustained on raceday that requires at least three months’ recovery time) and falls, with each further split into separate models for steeplechase and hurdle races.

Dr Allen’s team will initially use classical statistical methods such as multi-level logistic regression modelling (essentially, using data to estimate the odds that an event will occur – the yes/no outcome – while accounting for the fact that observations are not independent, for example because the same horse appears in many race starts) in order to determine the independent effects of the different risk factors. “From there, we can look to identify opportunities to make changes – whether that’s changes in race distances, the age at which horses can compete, and so on – but also look to identify particular individuals who are at higher risk, and then determining how we manage this higher risk population,” she explains.
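
For the curious, a minimal sketch of what such a model can look like in code is below, assuming an illustrative data set with one row per race start and a random effect per horse to capture that dependency. The file name, column names and choice of library are assumptions for illustration, not the JRRM’s actual implementation.

```python
# Minimal sketch (not the BHA's actual model): a multi-level logistic
# regression estimating the odds of a raceday injury, with a random effect
# per horse to account for repeated starts by the same animal.
# The file and column names below are illustrative, not the JRRM's schema.
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

starts = pd.read_csv("race_starts.csv")  # hypothetical: one row per race start

model = BinomialBayesMixedGLM.from_formula(
    "injured ~ race_distance + horse_age + going + field_size",  # fixed effects
    {"horse": "0 + C(horse_id)"},                                # random effect
    data=starts,
)
result = model.fit_vb()   # variational Bayes fit
print(result.summary())   # fixed-effect coefficients on the log-odds scale
```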

Machine learning is also likely to play a role: “If we’re looking for individual horses who are at increased risk of injury, that’s where the machine learning will come in,” she says. “Even with the traditional models there is still an element of uncertainty around our ability to predict risk. How do we then improve our ability to detect these higher-risk horses? It’s a case of identifying which are the most important factors that we should be looking at. And that’s where the machine learning is going to help us, so we can increase the predictability of being able to correctly identify horses at increased risk before they race,” she says.
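
As a flavour of what that could look like, the sketch below trains a gradient-boosted classifier on a hypothetical set of race starts and then ranks the factors the model relies on most heavily. The feature names and data file are invented for illustration and are not drawn from the JRRM.

```python
# Hedged sketch: flag higher-risk horses with a gradient-boosted classifier
# and inspect which factors carry the most weight. Data and features are
# hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

starts = pd.read_csv("race_starts.csv")        # hypothetical data set
features = ["race_distance", "horse_age", "field_size", "days_since_last_run"]
X, y = starts[features], starts["injured"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
clf = GradientBoostingClassifier().fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.3f}")

# Rank the factors the model leans on most heavily
for name, importance in sorted(zip(features, clf.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {importance:.3f}")
```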

“The classical methods give us a much broader overview of all the factors associated with risk,” she continues, “whereas with the machine learning, we can really drill down and distil which are the key factors we need to be considering.” So far, she explains, the role of additional factors such as a horse’s biomechanics in its risk profile has been little studied, owing to the relatively small amount of data available:

“That’s something that’s only really done in quite small numbers at the moment,” explains Dr Allen. “It’s going to be quite difficult to scale that up to every single race start. But in time, as the technology advances, that may become more and more easy to do,” she says.

That’s where technologies such as StrideSafe come in: in a trial in New York, StrideSafe was placed on every runner in the summer of 2021. A retrospective analysis of the data it captured showed that the system had correctly predicted 90 per cent of fatal injuries that occurred on the track by ‘red flagging’ the horses in running during their final race. The value of such information, which could enable jockeys to pull up at the first sign of trouble or even help trainers decide whether or not to run their horses in the first place, cannot be overstated.

Yet part of the battle has been trying to persuade owners, breeders and trainers to embrace the technology. “We just don’t have the information from training sessions in this country that would allow us to get to that next level in terms of predictability,” says Stephen Wensley, Project Lead (Welfare Data) of British Horseracing’s Horse Welfare Board.

“Here in Europe, it’s much more traditional,” he continues. “The attitude is, it’s about the horseman’s sense and feel. And what we’re trying to do is find ways to help them prevent injuries in the first place by providing them with the education that will help them to know when to intervene in training. There’s a lot of research out there but it’s just not getting through to the right people in the right way to be able to act upon it.”

However, there are indications that attitudes are changing: “The younger trainers in particular are following along on that kind of route,” Wensley continues. “We’re getting people coming in from outside the industry without that horsemanship background, so they’re going to have to rely on these data sources because the level of horsemanship just isn’t there,” he says.

“Everybody within the industry wants to reduce injuries,” adds Dr Allen. “The main thing is getting the research out there. But we can’t do that without the support of the industry. That’s the strength of [the JRRM]. We bring our research expertise to deliver that impact and reach the right people.”

The JRRM is due to publish its findings later this year. While its downstream effects might not be felt for some time, the ghost in the machine could yet be the saviour of horseracing.


What is big data?

Big data refers to data sets that are too large or complex to be dealt with by traditional data-processing application software. These may include, for example, messages, images, readings from sensors and GPS signals, as well as other metrics. Generally speaking, the ‘bigger’ the data, the greater the statistical power.

Big Data Analytics (BDA), meanwhile, refers to the use of processes and technologies, including machine learning and deep learning, to combine and analyse these massive data sets with the goal of identifying patterns and developing insights.

Machine learning and deep learning are both types of artificial intelligence (AI). ‘Classical’ machine learning is AI that can automatically adapt with minimal human intervention. Deep learning is a form of machine learning that uses artificial neural networks, layered structures loosely modelled on the neurons of the human brain, to recognise patterns in data.

Deep learning is much more computationally demanding than traditional machine learning. Because it models patterns in data as sophisticated, multi-layered networks, it can often produce more accurate models than other methods, particularly when large amounts of data are available.
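
As a toy illustration of the ‘multi-layered’ idea, the snippet below fits a classical logistic regression and a small two-hidden-layer neural network to the same synthetic data. Everything in it is illustrative and unrelated to any racing system.

```python
# Toy comparison: a classical model versus a small multi-layered network,
# both trained on synthetic data. Purely illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classical = LogisticRegression(max_iter=1000).fit(X_train, y_train)
deep = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000,
                     random_state=0).fit(X_train, y_train)

print("logistic regression accuracy:", classical.score(X_test, y_test))
print("two-layer network accuracy:  ", deep.score(X_test, y_test))
```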

Chances are you’ve already encountered a deep neural network. In 2016, Google Translate transitioned from its old, phrase-based statistical machine translation algorithm to a deep neural network, with the result that its output improved dramatically, from churning out often comical non sequiturs to producing sentences that are close to indistinguishable from those of a professional human translator.

Natural language processing (NLP) refers to the branch of computer science – and more specifically, the branch of artificial intelligence or AI – concerned with giving computers the ability to understand text and spoken words in much the same way human beings can. NLP combines computational linguistics – rule-based modelling of human language – with statistical, machine learning, and deep learning models.

Together, these technologies enable computers to process human language in the form of text or voice data and to ‘understand’ its full meaning, including nuances such as intention and sentiment.