A surge of interest has been noted in the use of mobility data from mobile phones to monitor physical distancing and model the spread of severe acute respiratory syndrome coronavirus 2, the virus that causes COVID-19. Despite several years of research in this area, standard frameworks for aggregating and making use of different data streams from mobile phones are scarce and difficult to generalise across data providers. Here, we examine aggregation principles and procedures for different mobile phone data streams and describe a common syntax for how aggregated data are used in research and policy. We argue that the principles of privacy and data protection are vital in assessing more technical aspects of aggregation and should be an important central feature to guide partnerships with governments who make use of research products.
When anonymised and aggregated, these data do not reveal information about individuals but provide epidemiologically relevant estimates about population mobility—ie, the extent to which people are sheltering in place, congregating at parks, grocery stores and transit hubs, and generally moving less (or more) than usual.
These data also provide vital insights into travel patterns to help better understand the effect of travel restrictions and the risk of importation from other locations and to inform spatial epidemiological models.
These analyses can be used to identify neighbourhoods or communities that could become hotspots for community transmission or that might need additional support to practise physical distancing, or as part of surveillance more generally.
Although one data stream might be more representative of a younger and more affluent population, another data stream could under-represent those living in rural areas. Popular analytical reports from large data providers show physical distancing (mobility) metrics, for example; however, the underlying data they represent and the aggregation methods used are typically not readily available.
This scant transparency makes it hard to know the representativeness and limitations of these data before using them for modelling. A common framework is needed to analyse the characteristics of these disparate data and their outputs, to allow for better comparison across mobility metrics and easier interpretation.
In this Viewpoint, we outline considerations for analysing aggregated data from mobile phones, including representativeness, situational context, and methods of aggregation. We then define the analytical pipelines used to construct nine metrics that can be applied towards measuring physical distancing interventions and modelling the spread of COVID-19 and other infectious diseases.
Data pipelines and processing
CDR data are generally more representative of the underlying population than are GPS traces (which are dependent on smartphones) because of the near-universal penetration of standard mobile phones. This notion generally remains true even in highly developed settings, because GPS data are typically only captured for a subset of the population that uses a particular application and provides consent to share location services. Ownership of more than one SIM and limited granularity of data do restrict the application of CDR for mobility analysis.
Prerequisites to data sharing
To date, epidemiological or clinical justification has not been satisfactorily shown to override privacy and ethical considerations in several settings where individuals have been deidentified by using mobile phone data.
Privacy should be preserved through statistical thresholds, differential privacy, and appropriate security controls with all parties in agreement on the principle of privacy protection.
Representativeness of data
Data providers must be clear about the representativeness of their data for epidemiological research. There are at least three important considerations. First is market share: what fraction of the population are represented in these data? Second is demographic representativeness: who are the people generating the data, with respect to age groups, sex, race and ethnicity, and socioeconomic status, compared with the overall population? Third is geographical representativeness: generally, these data will be most representative of urban populations; understanding how well they represent rural communities is important to communicate to users of aggregated data.
Outliers should be discarded. Devices with an implausible number of calls, short duration between calls, or implausibly fast travel patterns are probably machines and do not represent human behaviour. The exact parameters for discarding outliers will depend on the population and the operator. Iterative communication between providers and data consumers is vital on these issues.
Summary measures of representativeness (or imbalance) can be used to compare data across different providers but are almost never shared currently. Providers could compare their internal data about user characteristics to a shared, public gold standard (such as the US Census). These measures would allow researchers to formally compare the representativeness of different providers. Importantly, this process preserves the privacy of users and providers by ensuring individual-level data are never shared.
For example, in the COVID-19 context, mobility data are most useful when compared against a baseline from before physical distancing began or, now, when post-relaxation mobility is compared with lockdown or prelockdown averages. Baselines can be established by making sure that the data analysis reaches back in time before the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) outbreak started and physical distancing interventions were put in place. Comparing data with the same time window in previous years accounts for important seasonal mobility patterns; however, most companies do not store data for such a long time because of data retention policies.
Clear communication of the baseline is important, including the uncertainty associated with it, so that decision makers can make sense of the changes they see in mobility.
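As a minimal sketch of such a baseline comparison, the snippet below reports the percentage change of a current value relative to the mean of a set of baseline observations, together with the baseline standard deviation as a crude indication of uncertainty. The function name, the input format, and the choice of mean and standard deviation are illustrative assumptions, not a method prescribed here.

```python
from statistics import mean, stdev


def percent_change_from_baseline(baseline_values, current_value):
    """Return (percent change, baseline mean, baseline standard deviation).

    baseline_values: metric values from comparable pre-intervention periods
    (an assumption: e.g. the same weekday in earlier weeks).
    """
    if len(baseline_values) < 2:
        raise ValueError("need at least two baseline observations")
    base_mean = mean(baseline_values)
    if base_mean == 0:
        raise ValueError("baseline mean is zero; percent change is undefined")
    change = 100.0 * (current_value - base_mean) / base_mean
    return change, base_mean, stdev(baseline_values)


# Hypothetical trip counts for four baseline days and one lockdown day.
baseline = [1000, 1100, 900, 1000]
change, base_mean, base_sd = percent_change_from_baseline(baseline, 600)
```

Reporting the baseline spread alongside the change lets a reader judge whether a shift is large relative to ordinary day-to-day variation.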
Spatial and temporal aggregation
The spatial scale needed will also depend on the policy or research objective. For example, city administrators might be interested in identifying hotspots of congregation or patterns in specific types of activity (eg, visits to grocery stores, transit hubs, and schools), whereas state governments might be more concerned with travel networks across administrative boundaries. If aggregators can dedicate resources in response to a crisis, they might consider generating analyses at varying spatial resolutions specific to each use case.
In general, grid cells containing fewer users than a minimum threshold, typically set somewhere between five and 20 users, should not be shared with external parties.
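A minimal sketch of this kind of small-cell suppression is shown below. The dictionary input and the default threshold of 10 are assumptions chosen for illustration; the appropriate threshold depends on the provider and the setting.

```python
def suppress_small_cells(cell_counts, min_users=10):
    """Drop grid cells whose unique-user count is below min_users.

    cell_counts: {cell_id: number of unique users} (hypothetical format).
    min_users: illustrative threshold, not a recommended value.
    """
    return {cell: n for cell, n in cell_counts.items() if n >= min_users}


# Hypothetical counts per grid cell.
counts = {"cell_1": 250, "cell_2": 4, "cell_3": 10}
shareable = suppress_small_cells(counts, min_users=10)
```

Suppressed cells are removed rather than set to zero, so that a missing cell is not mistaken for an empty one.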
Resetting the cohort of the analysis daily also protects privacy.
Selecting appropriate parameters is implementation-dependent but must be undertaken with care.
Mobility metrics and their relation to physical distancing and COVID-19 response
Six baseline metrics are defined first.
Population: a description of the unit of measurement that contributes data to the analysis such as unique users or unique devices. These are designated as i and j.
Spatial resolution of categorisation: a description of the dimensions of the user locations that are used for categorisation. For example, unique locations that are visited by a user might be defined by tile grids, tower catchment areas, or GPS radii from points of interest for internal analysis. These are designated as a and b.
Spatial resolution of aggregation: a description of the size of the regions of interest for which data are aggregated before being shared. For example, the metrics calculated for the population can be aggregated across all users in a region of interest such as neighbourhood or county. These are designated as A and B.
Temporal resolution of categorisation: a description of the time bins that are used to categorise every user’s location. For example, a provider might record the modal location a in which a user i logs data in every hour. These are designated as t.
Temporal resolution of aggregation: a description of the time window for which data are aggregated for all users i in region A. For example, we might be interested in the average number of locations a (defined as the top location per t min bin) over all space visited by users i in region A over the course of an 8 h time window. These are designated as T.
Temporal thresholds: rather than calculating locations by top location a in every time bin t, providers may decide to calculate locations a as those where the user i spends at least a certain amount of time. This threshold is then designated as T*.
for every 30 min segment (t) over the course of 24 h (T). Every Bing Tile is also mapped onto a county A for which all data are aggregated. We define this set of time-specific locations as
$$M_{iT} = \{a_{it_1}, a_{it_2}, \ldots, a_{it_n}\}$$
for which $a_{it_n}$ refers to the location of user i in a specific bin t in time window T. Not all users will provide enough information to generate a full set $M_{iT}$; therefore, the provider should be transparent about any interpolation or imputation steps taken to make the set more robust.
Because of differences in spatiotemporal resolution, the definition of a stay location a will differ between CDR and GPS. For CDR, a stay location represents the tower, grid, or administrative region where a user’s mobile phone was located. For GPS, depending on the spatiotemporal resolution, some preprocessing might take place. Typically, we can define a stay location as an area of size a within which all movement occurred for at least T* time units (eg, a circle with a radius of 25 m within which all movement remained for at least 30 min). Different thresholds T* can be used (eg, 5 min, 1 h, etc); the proper threshold will depend on both technological and computational constraints and the relevant question.
These key values can change depending on the metric, because of privacy-preserving objectives, or owing to computational limitations. However, the rationale of these values should be clearly communicated.
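To make these definitions concrete, the following sketch builds a per-user set of time-specific locations by taking the modal location a in every time bin t. The input format (user, minute of day, location identifier) and the function name are assumptions for illustration; a provider's real pipeline would also need to handle missing bins, interpolation, and the threshold T*.

```python
from collections import Counter, defaultdict


def modal_locations(pings, bin_minutes=30):
    """Return {user: {bin_index: modal location}} for one time window T.

    pings: iterable of (user, minute_of_day, location_id) tuples
    (a hypothetical format). bin_minutes plays the role of t.
    Ties within a bin are broken arbitrarily (first location seen).
    """
    per_bin = defaultdict(lambda: defaultdict(Counter))
    for user, minute, location in pings:
        per_bin[user][minute // bin_minutes][location] += 1
    return {
        user: {
            b: counts.most_common(1)[0][0]
            for b, counts in sorted(bins.items())
        }
        for user, bins in per_bin.items()
    }


# Hypothetical pings for one user: two in tile "a", then two in tile "b".
pings = [("u1", 5, "a"), ("u1", 10, "a"), ("u1", 35, "b"), ("u1", 40, "b")]
location_sets = modal_locations(pings, bin_minutes=30)
```

Bins in which a user logged no data are simply absent from the result, which keeps any later imputation step explicit.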
Metrics applicable to both CDR and GPS traces
Population distribution and dynamics
For most epidemiological analyses and modelling work, an estimate of the population residing in a specific region at a particular time provides an estimate of the denominator, which is the number of unique users i that spend most of their time in a given area. To estimate this quantity, we assign every user to a home location, which is typically either the night-time location of the user or the area where the user spends the most time in the set of locations $M_{iT}$. In the case of continuous GPS data collection, one can also use location at midnight local time, or a range around that time, if data availability is not continuous. The sum of unique users in every region A is then used as the population estimate for that time window T. This value is denoted as
$$N_{AT} = \sum_{i} x_{iAT}$$
for which $x_{iAT} = 1$ if the mode of $M_{iT}$ for time T is in A, and 0 otherwise.
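The population estimate can be sketched as follows, assuming every user's location set is available as a list of location identifiers and that a lookup table maps every location a to its region A. Both data structures and the function names are hypothetical, and the sketch uses the modal location over the whole window rather than a night-time rule.

```python
from collections import Counter


def home_region(location_set, region_of):
    """Region containing the modal location of one user's location set."""
    modal_location = Counter(location_set).most_common(1)[0][0]
    return region_of[modal_location]


def population_by_region(location_sets, region_of):
    """Count users per home region: one unit per user, as an indicator sum."""
    return Counter(
        home_region(locations, region_of)
        for locations in location_sets.values()
    )


# Hypothetical location sets and location-to-region lookup.
location_sets = {
    "u1": ["a1", "a1", "a2"],
    "u2": ["a3", "a3", "a3"],
    "u3": ["a2", "a2", "a1"],
}
region_of = {"a1": "A", "a2": "B", "a3": "A"}
population = population_by_region(location_sets, region_of)
```

Every user contributes exactly once, so the counts sum to the number of users with a usable location set.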
Number of significant locations
The average number of significant locations provides an indication of how many distinct places users spend a substantial amount of time. Normal human mobility entails very few significant locations: usually, home, work, or school. However, varying shelter-in-place orders can result in different types of behaviours. For example, strict never-leave-home orders would result in a reduction of significant locations to 1, whereas less strict orders might result in increased numbers of local significant locations as individuals attempt to leave their homes more frequently but briefly.
To calculate the average number of significant locations for a specific population, we use the set of time-varying locations for every user i (MiT) and create a subset of unique locations. Once the set of significant locations for every user has been estimated, the average across a region, grid, or other area of interest can be estimated as the sum of total unique locations visited by user i whose home location is in region A divided by the total number of users whose home location is in region A.
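A minimal sketch of this average is given below. It assumes that location sets and home regions have already been computed and are passed in as dictionaries keyed by user; these structures and the function name are illustrative assumptions.

```python
def mean_significant_locations(location_sets, home_regions):
    """Average number of unique locations per user, by home region.

    location_sets: {user: [location per time bin]} (hypothetical format).
    home_regions: {user: home region A} (hypothetical format).
    """
    total_locations, total_users = {}, {}
    for user, locations in location_sets.items():
        region = home_regions[user]
        total_locations[region] = total_locations.get(region, 0) + len(set(locations))
        total_users[region] = total_users.get(region, 0) + 1
    return {r: total_locations[r] / total_users[r] for r in total_locations}


# Hypothetical users: u1 visits two places, u2 stays in one, u3 visits three.
location_sets = {
    "u1": ["home", "work", "home"],
    "u2": ["home", "home", "home"],
    "u3": ["home", "shop", "park"],
}
home_regions = {"u1": "A", "u2": "A", "u3": "B"}
averages = mean_significant_locations(location_sets, home_regions)
```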
Transition between regions
This value provides an estimate of mobility between locations, which can be used in models to estimate the spatial spread of SARS-CoV-2. Transition matrices should include number, index, or proportion of unique users i who move from region A to region B. Users should contribute only once to the transition matrices within every time window to ensure that the numbers represent unique users and not trips between regions. As the time window considered decreases, researchers will be better able to understand within-day heterogeneity in movement between regions. These values can be used to calculate the percentage change in the total number of trips that occur between and within regions of interest, providing metrics of intraregional and inter-regional mobility.
It is important to note that this metric will vary with spatial scale and the time window considered. We recommend that T is, at most, 4 h for assessment of local travel networks, particularly those that might not cross time zones. Smaller time windows are unable to capture long-distance travel that takes more time than the length of the time window; however, they do allow for better understanding of within-day fluctuations of mobility. Larger time windows could be warranted for assessment of long-distance travel networks across time zones (eg, long-distance interstate travel) because the degree of displacement will typically not be clearly captured within shorter time windows.
This metric can be calculated by splitting a time window T into two halves (T1 and T2) then simply calculating the mode for each set (MiT1 and MiT2), resulting in aiT1 and aiT2. We assume that, for some time window T, the user transitioned from the modal location in T1 to the modal location in T2. If the user did not transition between locations, the matrix will include these counts on the diagonal. These vectors of transition are then summed across all users. Vectors with transitions that do not meet a minimum threshold are dropped. To aggregate to a larger spatial resolution, we map a to its corresponding A and sum the transition values for unique pairs of A and B.
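The steps above can be sketched as follows, under the assumption that every user's location set is a time-ordered list of location identifiers and that a lookup table maps locations to regions. The names and the default threshold are hypothetical.

```python
from collections import Counter


def mode(values):
    """Most frequent value; ties are broken by first occurrence."""
    return Counter(values).most_common(1)[0][0]


def transition_counts(location_sets, region_of, min_users=1):
    """Count unique users moving between regions within one window T.

    Each user's list is split into two halves (T1, T2); the user contributes
    one transition from the modal location in T1 to the modal location in T2.
    Location pairs with fewer than min_users users are dropped before
    aggregation to regions. Users who stay put land on the diagonal.
    """
    location_pairs = Counter()
    for locations in location_sets.values():
        half = len(locations) // 2
        if half == 0:
            continue  # not enough data to form two halves
        location_pairs[(mode(locations[:half]), mode(locations[half:]))] += 1

    regional = Counter()
    for (origin, destination), n in location_pairs.items():
        if n >= min_users:
            regional[(region_of[origin], region_of[destination])] += n
    return regional


# Hypothetical data: u1 moves a1 to a2, u2 stays in a1, u3 moves a3 to a2.
location_sets = {
    "u1": ["a1", "a1", "a2", "a2"],
    "u2": ["a1", "a1", "a1", "a1"],
    "u3": ["a3", "a3", "a2", "a2"],
}
region_of = {"a1": "A", "a2": "B", "a3": "A"}
matrix = transition_counts(location_sets, region_of)
```

Applying the threshold before mapping locations to regions follows the order described in the text; a provider might also apply a second threshold after regional aggregation.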
Average and total distances of these vectors are then weighted by the number of users who made the transition, providing a total or average distance moved by users in a given region.
In this calculation, every term is the distance between location $a_{il}$ and the preceding location $a_{i,l-1}$. Another option would be to directly provide the distances between these locations as a value for each pair of locations, allowing researchers to aggregate and calculate as they see fit.
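One way to compute distances between consecutive locations is sketched below. It assumes every location is represented by the latitude and longitude of its centroid and uses the haversine great-circle distance with a mean Earth radius of 6371 km; both choices are assumptions, and a provider might instead use projected coordinates or road distances.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def total_distance_km(points):
    """Sum of distances between each location and the one before it."""
    return sum(
        haversine_km(points[l - 1], points[l]) for l in range(1, len(points))
    )


# Hypothetical centroids visited in order by one user.
trace = [(0.0, 0.0), (0.0, 1.0), (0.0, 0.0)]
distance = total_distance_km(trace)
```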
Radius of gyration
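To calculate the radius of gyration for user i, first calculate the root mean squared distance of the user’s locations over a given time window from their centre of gravity, with every location $a_{it_k}$ represented by its coordinates:
$$r_{iT} = \sqrt{\frac{1}{n}\sum_{k=1}^{n} \lVert a_{it_k} - \bar{a}_{iT} \rVert^{2}}$$
for which n is the number of time bins in $M_{iT}$ and the centre of gravity is
$$\bar{a}_{iT} = \frac{1}{n}\sum_{k=1}^{n} a_{it_k}$$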
Then for every user, generate their home region A as the region in which they spend the most time in their location set MiT. Then, aggregate this value across a population in a given region and provide an average and percentiles.
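A minimal sketch of the per-user calculation follows. It assumes locations have been projected to planar coordinates (eg, metres east and north of a reference point) so that Euclidean distance is meaningful; with raw latitude and longitude, a geodesic distance would be needed instead. The function name is hypothetical.

```python
import math


def radius_of_gyration(points):
    """Root mean squared distance of points from their centre of gravity.

    points: list of (x, y) planar coordinates, one per time bin
    (an assumed representation of a user's location set).
    """
    n = len(points)
    if n == 0:
        raise ValueError("need at least one location")
    centre_x = sum(x for x, _ in points) / n
    centre_y = sum(y for _, y in points) / n
    mean_squared = sum(
        (x - centre_x) ** 2 + (y - centre_y) ** 2 for x, y in points
    ) / n
    return math.sqrt(mean_squared)


# Hypothetical user who splits time equally between two places 2 km apart.
r = radius_of_gyration([(0.0, 0.0), (2000.0, 0.0)])
```

Because every time bin contributes one point, places where a user spends more time pull the centre of gravity towards them.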
Regularity of movement
The predictability of human movement is important for urban planning, traffic forecasting, and public health. A formal measure of (un)predictability is location entropy. A low location entropy means an individual’s time spent at their significant locations is highly predictable. Conversely, high location entropy suggests that predicting an individual’s location is difficult. Therefore, the lowest location entropy would be achieved by a user who spends the exact same amount of time in the same places in every time window.
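Using the set of locations defined above ($M_{iT}$), the Shannon entropy of user i can be calculated as
$$S_i^{\mathrm{rand}} = \log_2 L_i$$
for which $L_i$ is the number of distinct locations in $M_{iT}$. This measure assumes a person’s location is uniformly distributed among all $L_i$ observed distinct locations in $M_{iT}$. The uncorrelated Shannon entropy of the user i is
$$S_i^{\mathrm{unc}} = -\sum_{l=1}^{L_i} p_{il} \log_2 p_{il}$$
for which $p_{il}$ is the proportion of time bins in $M_{iT}$ that user i spends in location l.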
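The two entropy measures in this section can be sketched as below, following the definitions commonly used in the mobility predictability literature: an entropy that treats all distinct locations as equally likely, and an uncorrelated entropy that weights every location by the share of time bins spent there. The function names and the list-based input are assumptions.

```python
import math
from collections import Counter


def random_entropy(location_set):
    """log2 of the number of distinct locations (uniform assumption)."""
    return math.log2(len(set(location_set)))


def uncorrelated_entropy(location_set):
    """Shannon entropy of the share of time bins spent in each location."""
    n = len(location_set)
    return -sum(
        (count / n) * math.log2(count / n)
        for count in Counter(location_set).values()
    )


# Hypothetical user who spends three of four time bins at home.
bins = ["home", "home", "home", "work"]
s_rand = random_entropy(bins)
s_unc = uncorrelated_entropy(bins)
```

The uncorrelated entropy never exceeds the uniform version, and the two coincide only when time is split evenly across locations.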
GPS trace metrics
Average co-location with individuals in other regions
This value provides an indication of how much contact individuals from one region have with individuals from other regions. This analysis is restricted to GPS trace data but provides the most direct measure of contact between different populations. It is important that users who form the population for this metric meet a minimum threshold for data contributed during time T. First, for every user, calculate their set of locations for every time segment t in time window T.
Depending on the completeness of the GPS traces, interpolation or imputation of user locations might need to be considered. Second, calculate every user’s home region A as the region in which the user spends most of their time at night in a time window T. For situations when comparing two different regions A and B, calculate the probability of co-location as
$$P_{ABT} = \frac{\sum_{i \in A}\sum_{j \in B} C_{ijT}}{N_{AT} \times N_{BT} \times (T/t)}$$
for which $C_{ijT}$ describes the total number of co-locations in a region of size a that users i and j have over the course of T, $N_{AT}$ is the total population of users whose home location is in region A for time T, $N_{BT}$ is the total population of users whose home location is in region B for time T, and T/t is the total number of time bins t that exist in time window T. For situations when comparing the same region, calculate the probability of co-location analogously, with both users i and j (i ≠ j) drawn from region A.
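A minimal sketch of the between-region calculation is given below. It assumes that a co-location means two users share the same location identifier in the same time bin, that every user has one entry per bin (with None for missing data), and that the count is normalised by the product of the two populations and the number of bins. These are illustrative assumptions, and the function name is hypothetical.

```python
def colocation_probability(sets_a, sets_b):
    """Share of (user in A, user in B, time bin) triples that are co-located.

    sets_a, sets_b: {user: [location per time bin]} for users whose home
    region is A or B respectively; all lists must have the same length.
    Bins with a None entry never count as a co-location.
    """
    n_bins = len(next(iter(sets_a.values())))
    co_locations = sum(
        1
        for locations_i in sets_a.values()
        for locations_j in sets_b.values()
        for a_i, a_j in zip(locations_i, locations_j)
        if a_i is not None and a_i == a_j
    )
    return co_locations / (len(sets_a) * len(sets_b) * n_bins)


# Hypothetical data: one user from A, two users from B, two time bins.
sets_a = {"u1": ["p", "q"]}
sets_b = {"v1": ["p", "r"], "v2": ["s", "q"]}
probability = colocation_probability(sets_a, sets_b)
```

The pairwise loop is quadratic in the number of users; a production pipeline would more plausibly group users by location and time bin first.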
Measures of staying put
This metric is a direct measure of how much time people are spending in one location versus moving around and is relevant to measuring the effect of shelter-in-place policies and other strict lockdowns. This metric should be inversely related to the measures of mobility above (average distance travelled and radius of gyration). To calculate this measure, generate the full set of unique locations MiT for every user i for a given time window T. Then assign every user to a home region A as the region aiT where the user spends the most time in a time window T. Finally, count the number of unique locations for every user i in region A and calculate the proportion of all users i in region A who reported only one unique location during the time window T.
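The final step can be sketched as follows, assuming location sets and home regions are already available as dictionaries keyed by user. The structures and the function name are hypothetical.

```python
def proportion_staying_put(location_sets, home_regions):
    """Share of users per home region with exactly one unique location.

    location_sets: {user: [location per time bin]} (hypothetical format).
    home_regions: {user: home region A} (hypothetical format).
    """
    stayed, total = {}, {}
    for user, locations in location_sets.items():
        region = home_regions[user]
        total[region] = total.get(region, 0) + 1
        if len(set(locations)) == 1:
            stayed[region] = stayed.get(region, 0) + 1
    return {region: stayed.get(region, 0) / total[region] for region in total}


# Hypothetical users: u1 never leaves home, u2 and u3 move around.
location_sets = {
    "u1": ["home", "home", "home"],
    "u2": ["home", "shop", "home"],
    "u3": ["home", "work", "work"],
}
home_regions = {"u1": "A", "u2": "A", "u3": "B"}
staying = proportion_staying_put(location_sets, home_regions)
```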
Measures of travel to points of interest (geofenced locations)
To begin, define a set of locations of size a in region A that are categorised as being locations of interest. These might include (but are not restricted to) parks, commercial areas, or grocery stores. These locations could either be grouped together in categories or treated as individual points of interest. Three further steps are needed to calculate this metric. First, generate the full set of unique locations MiT for every user i in a time window T. Second, assign every user to a region A based on the location ait in which the user i spends the most time. Third, for all users i in region A, calculate the proportion who visited a geofenced location in a time window T.
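The third step can be sketched as below, assuming the geofenced locations are given as a set of location identifiers and that location sets and home regions are dictionaries keyed by user. All names are hypothetical.

```python
def proportion_visiting_poi(location_sets, home_regions, poi_locations):
    """Share of users per home region who visited any geofenced location.

    poi_locations: set of location identifiers flagged as points of
    interest (an assumed representation of the geofence).
    """
    visited, total = {}, {}
    for user, locations in location_sets.items():
        region = home_regions[user]
        total[region] = total.get(region, 0) + 1
        if not poi_locations.isdisjoint(locations):
            visited[region] = visited.get(region, 0) + 1
    return {region: visited.get(region, 0) / total[region] for region in total}


# Hypothetical data: only u2 visits a geofenced grocery store.
location_sets = {
    "u1": ["home", "home"],
    "u2": ["home", "grocery_1"],
    "u3": ["home", "work"],
}
home_regions = {"u1": "A", "u2": "A", "u3": "B"}
visits = proportion_visiting_poi(location_sets, home_regions, {"grocery_1", "park_7"})
```

Passing a different set per category (eg, parks only) gives category-specific proportions from the same function.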
The COVID-19 pandemic has accelerated the use of aggregated mobility data from mobile devices, although without a universal governing framework for its application. Such data provide valuable insights, but without expertise and diligence it is easy to misinterpret these data, or cause harm, even if inadvertent.
As the COVID-19 pandemic continues, the metrics of interest and how they are used will also change. For example, our threshold of an optimal change in radius of gyration in response to a non-pharmaceutical intervention will be different now than when monitoring the same region for spikes in mobility 3 months from now.
We share this framework to advance a common language of comparison across these vast datasets. A shared language will allow us to synchronise future analysis with the limitations of every metric. Together, these considerations provide insights for policy makers and could inform epidemiological models about physical distancing and the spatial spread of COVID-19. Combined with clinical and public health data, these metrics will have an important role in planning rollbacks of distancing because they help estimate the effect of various rollbacks on actual mobility patterns on the ground and, as a result, on epidemic spread.
Contributors
NK and COB had the idea for this Viewpoint, created all figures, and contributed to writing the manuscript. MVK, KE-M, NV, AS, and SB provided substantial subject matter expertise and contributed to writing the manuscript. COB and SB provided supervision and guidance.
Declaration of interests
NV is an employee of Camber Systems, a data analysis firm that produces statistical analyses of mobility data; future work by this company might use these metrics or similar metrics for commercial products. All other authors declare no competing interests.
Acknowledgments
Research reported in this Viewpoint was supported by the National Institute on Drug Abuse of the National Institutes of Health (K99DA051534 to MVK) and a National Institute of General Medical Sciences Maximizing Investigator’s Research Award (R35GM124715-02 to COB). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Published: September 01, 2020
© 2020 The Author(s). Published by Elsevier Ltd.