How much TfL can learn about us from mobile data (James O’Malley)

It’s a well-known maxim in the tech industry that if you’re not paying for the product, then you are the product. We get to use incredible services like Gmail, Facebook and Twitter for free – and in return, the big tech firms sell access to our eyeballs to advertisers.

But this isn’t always the case. Sometimes, even when we pay for a service, we’re also the product being sold.

For example, something that EE, O2 and Vodafone all do, but don’t really love to shout about, is sell anonymised, aggregated data on our physical movements to local authorities, transit agencies and any other companies with a chequebook large enough.

And that’s why today I’m going to tell you about some of the really mad things that Transport for London (TfL) can figure out about us by using our location data, provided by the O2 mobile network.

Using the Freedom of Information Act, I’ve managed to obtain the Data Protection Impact Assessment and the Statement of Work for TfL’s Project EDMOND – which stands for “Estimating Demand from Mobile Network Data”.

That’s right, this week’s newsletter is dangerously close to being actual reporting instead of just my usual bloviating. And having now fallen down the rabbit-hole digging into it, I’m amazed by the quality of information it gives transport planners and policy makers. And honestly, I’m a little freaked out.

So let’s dive in and explore it together.

Careful now

The way EDMOND works is very clever. TfL isn’t actually monitoring all of our phones all of the time, presumably because it knows that to do so would be hugely controversial.

So instead, it contracts with O2 to license data over shorter periods of time. For example, in 2023, it took data from ‘up to’ 40 normal weekdays between the start of April and end of June, when nothing weird like school holidays or bank holidays was happening.

This is an enormous dataset, with potentially up to 25 million phones included in it, but it still doesn’t include everyone in London because some people use other networks like EE, Vodafone, and so on.

So it’s crucial to understand that EDMOND isn’t just a pile of data – it is a model, where TfL has taken the data from O2 and done some clever maths to scale it up to estimate the movements of everyone in London over the age of 12.
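To make the “scale it up” step a bit more concrete, here’s a minimal sketch of the simplest possible expansion. To be clear, this is not TfL’s or O2’s actual method, which is surely far more sophisticated: the function name, the market-share figure and the counts are all made up by me for illustration, and the sketch assumes O2 customers are an even sample of everyone in each area.

```python
def expand_to_population(observed_counts, market_share):
    """Scale observed device counts per area up to a whole-population estimate.

    Hypothetical simplification: treats the network's customers as a
    uniform sample of everyone in each area, which a real model would
    almost certainly correct for.
    """
    if not 0 < market_share <= 1:
        raise ValueError("market_share must be in (0, 1]")
    return {area: count / market_share for area, count in observed_counts.items()}


# Made-up numbers: 250 devices seen in one area and 100 in another,
# with the network assumed to hold a quarter of the market.
observed = {"area_A": 250, "area_B": 100}
estimated = expand_to_population(observed, market_share=0.25)
```

With those invented inputs, each observed phone stands in for four people, so the estimate for each area is four times the observed count.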

There is also the elephant in the room. Though it might be surprising to learn that O2 is selling data insights on its users, it is not selling personal data. What’s being sold by O2 and licensed by TfL is aggregated, anonymised data.

This means TfL can’t see the movements of individual people, and of course everything is fully GDPR-compliant and above board – as you’d expect for a major corporation and a transport agency.

In fact, according to the 2018 Travel in London report, any time the data suggested there were fewer than ten phones in a given statistical area, the data was automatically excluded so as to avoid inadvertently unmasking people based on their metadata.
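That “fewer than ten” rule is a standard trick known as small-count suppression, and it’s easy to sketch. The threshold of ten comes from the report; the function name, area labels and counts below are my own invention, and the real pipeline may well handle suppressed cells differently.

```python
MIN_DEVICES = 10  # threshold from the report; everything else here is illustrative


def suppress_small_counts(counts, threshold=MIN_DEVICES):
    """Drop any area whose device count is below the threshold."""
    return {area: n for area, n in counts.items() if n >= threshold}


# Made-up counts: area_B has fewer than ten phones, so it is excluded.
raw = {"area_A": 143, "area_B": 9, "area_C": 10}
published = suppress_small_counts(raw)
```

Note that an area with exactly ten phones survives, because the rule only excludes counts below ten.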

So to be absolutely clear, there’s no big scandal here. In fact, using this sort of data is increasingly routine for local authorities and others, to the extent that O2 even has a brand name for this line of its business – “O2 Motion”.

But that doesn’t mean what’s happening isn’t interesting. In fact, I’m willing to bet that most people outside of the mobile industry are completely unaware their movement data is being used in this way.

What TfL knows

Now let’s get to the good stuff. What does all of this data do for TfL, and what data do they have to play with?

Because of the aforementioned privacy restrictions, they don’t simply get dots on the map showing them where everyone was. Instead, the data is broken down into hundreds of “Middle Layer Super Output Areas (MSOAs)” – this is a statistical standard that divides up the country into groups of between 2,000 and 6,000 homes.

Here’s a map showing London’s MSOAs:

Looking at this, you can see why data on this level might be useful.

Using the aggregated data from O2, TfL can see which areas of London people are travelling from and where they are travelling to – which is exactly the sort of information you might need if you were, for example, planning where to run buses or impose an Ultra-Low Emissions Zone that disincentivises car use.
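Transport planners usually call this sort of from-and-to table an origin-destination matrix. Here’s a minimal sketch of how one could be tallied up, assuming each trip has already been reduced to a pair of areas. The area labels and trips are made up, and I’m not claiming this is how O2 or TfL actually build theirs.

```python
from collections import Counter


def build_od_matrix(trips):
    """Count trips for each (origin, destination) pair of areas."""
    return Counter((origin, destination) for origin, destination in trips)


# Made-up trips between three hypothetical areas.
trips = [
    ("area_A", "area_B"),
    ("area_A", "area_B"),
    ("area_A", "area_C"),
    ("area_B", "area_A"),
]
od_matrix = build_od_matrix(trips)
busiest_flow, busiest_count = od_matrix.most_common(1)[0]
```

Here the busiest flow is from area_A to area_B, which is the kind of answer you’d want before deciding where a new bus route should go.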

It goes deeper. As you can see above, it’s possible to work out which parts of London are hosting the most international visitors, by looking at which MSOAs have the most phones using international roaming mode inside their boundaries. (Unsurprisingly it appears the busiest areas for international visitors are the West End, and Heathrow.)

But here’s the other crazy thing. Whether your SIM card is roaming is not the only thing that O2 knows about its users. In fact, because it has demographic data on its contract customers, it’s possible to break down the demographics of people in each MSOA by gender and age – as well as the time of the day they were there.

Here’s some made-up example data showing just that, from one of the documents I got:

Arguably the creepiest column above is the one labelled “type”, which sorts people into “Resident”, “Worker” and “Visitor”. Because O2 doesn’t just know where you are, it knows why you’re there too.

None of this information is secret – this slide was talked about at the 2017 European Transport Conference.

How does it do this? By making some smart assumptions.

For example, it determines your home by looking at the place where you spend most evenings and nights during the prior month. It also figures out where you work based on where you spend working hours during weekdays. And according to the documents I’ve obtained, it appears that the latest 2023 modelling will also be figuring out when people are specifically travelling to education institutions (i.e. schools and universities) too.
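To show how little data those assumptions need, here’s a rough sketch of the home-and-work guess for a single phone. The exact hours, the input format, the function names and the rule for labelling someone a “Resident”, “Worker” or “Visitor” are all my assumptions, not anything from O2’s documentation.

```python
from collections import Counter

NIGHT_HOURS = set(range(20, 24)) | set(range(0, 6))  # assumed: 8pm to 6am
WORK_HOURS = set(range(9, 17))                       # assumed: 9am to 5pm


def most_common_area(areas):
    """Return the area seen most often, or None if there are no sightings."""
    counts = Counter(areas)
    return counts.most_common(1)[0][0] if counts else None


def infer_home_and_work(sightings):
    """Guess home and work areas from (hour, is_weekday, area) sightings."""
    sightings = list(sightings)  # allow generators, since we scan twice
    night = [area for hour, _, area in sightings if hour in NIGHT_HOURS]
    work = [area for hour, is_weekday, area in sightings
            if is_weekday and hour in WORK_HOURS]
    return most_common_area(night), most_common_area(work)


def classify_presence(area, home, work):
    """Hypothetical labelling rule based on the inferred home and work areas."""
    if area == home:
        return "Resident"
    if area == work:
        return "Worker"
    return "Visitor"


# Made-up sightings for one phone.
sightings = [
    (22, True, "area_A"), (2, True, "area_A"), (23, False, "area_C"),
    (10, True, "area_B"), (14, True, "area_B"), (11, False, "area_C"),
]
home, work = infer_home_and_work(sightings)
label = classify_presence("area_C", home, work)
```

In this toy example the phone spends its nights mostly in area_A and its weekday working hours in area_B, so when it turns up in area_C it gets counted as a visitor.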

So TfL isn’t just able to figure out where people are travelling to and from, but why they are travelling. But amazingly, the model gets even smarter than this.

Mashing up datasets

To my mind, the most impressive thing about the EDMOND model is that it can apparently accurately predict by what means you’re travelling – whether by foot, bike, car, train, bus or even lorry.
