API Reference
DpMobilityReport
- class dp_mobility_report.dpmreport.DpMobilityReport(df: DataFrame, tessellation: GeoDataFrame | None = None, privacy_budget: int | float | None = None, user_privacy: bool = True, max_trips_per_user: int | None = None, analysis_selection: List[str] | None = None, analysis_exclusion: List[str] | None = None, budget_split: dict = {}, timewindows: List[int] | ndarray = [2, 6, 10, 14, 18, 22], max_travel_time: int | None = None, bin_range_travel_time: int | None = None, max_jump_length: int | float | None = None, bin_range_jump_length: int | float | None = None, max_radius_of_gyration: int | float | None = None, bin_range_radius_of_gyration: int | float | None = None, max_user_tile_count: int | None = None, bin_range_user_tile_count: int | None = None, max_user_time_delta: int | float | None = None, bin_range_user_time_delta: int | float | None = None, subtitle: str | None = None, disable_progress_bar: bool = False, seed_sampling: int | None = None, evalu: bool = False)[source]
Generate a (differentially private) mobility report from a mobility dataset. The report is written to an HTML file with the .to_file() method.
- Parameters:
df – DataFrame containing the mobility data. Expected columns: user ID uid, trip ID tid, timestamp datetime (or int to indicate sequence position, if the dataset only consists of sequences without timestamps), latitude lat and longitude lng in CRS EPSG:4326.
tessellation – Geopandas GeoDataFrame containing the tessellation for spatial aggregations. Expected columns: tile_id. If the tessellation is not provided in the expected default CRS EPSG:4326, it will automatically be transformed. If no tessellation is provided, all analyses based on the tessellation will automatically be removed.
privacy_budget – Privacy budget for the differentially private report. Defaults to None, i.e., no privacy guarantee is provided.
user_privacy – Whether item-level or user-level privacy is applied. Defaults to True (user-level privacy).
max_trips_per_user – Maximum number of trips a user is allowed to contribute to the data. Dataset will be sampled accordingly. Defaults to None, i.e., all trips are used. This implies that the actual maximum number of trips per user will be used according to the data, though this violates user-level Differential Privacy.
analysis_selection – Select only needed analyses. A selection reduces computation time and leaves more privacy budget for higher accuracy of other analyses. analysis_selection takes a list of all analyses to be included. Alternatively, a list of analyses to be excluded can be set with analysis_exclusion. Either entire segments can be included: const.OVERVIEW, const.PLACE_ANALYSIS, const.OD_ANALYSIS, const.USER_ANALYSIS or any single analysis can be included: const.DS_STATISTICS, const.MISSING_VALUES, const.TRIPS_OVER_TIME, const.TRIPS_PER_WEEKDAY, const.TRIPS_PER_HOUR, const.VISITS_PER_TILE, const.VISITS_PER_TIME_TILE, const.OD_FLOWS, const.TRAVEL_TIME, const.JUMP_LENGTH, const.TRIPS_PER_USER, const.USER_TIME_DELTA, const.RADIUS_OF_GYRATION, const.USER_TILE_COUNT, const.MOBILITY_ENTROPY. Default is None, i.e., all analyses are included.
analysis_exclusion – Ignored, if analysis_selection is set! analysis_exclusion takes a list of all analyses to be excluded. Either entire segments can be excluded: const.OVERVIEW, const.PLACE_ANALYSIS, const.OD_ANALYSIS, const.USER_ANALYSIS or any single analysis can be excluded: const.DS_STATISTICS, const.MISSING_VALUES, const.TRIPS_OVER_TIME, const.TRIPS_PER_WEEKDAY, const.TRIPS_PER_HOUR, const.VISITS_PER_TILE, const.VISITS_PER_TIME_TILE, const.OD_FLOWS, const.TRAVEL_TIME, const.JUMP_LENGTH, const.TRIPS_PER_USER, const.USER_TIME_DELTA, const.RADIUS_OF_GYRATION, const.USER_TILE_COUNT, const.MOBILITY_ENTROPY.
budget_split – dict to customize how much privacy budget is assigned to which analysis. Each key needs to be named according to an analysis and each value needs to be an integer indicating the weight for the privacy budget. If no weight is assigned, a default weight of 1 is set. For example, if budget_split = {const.VISITS_PER_TILE: 10}, then the privacy budget for visits_per_tile is 10 times higher than for every other analysis, which all get a default weight of 1. Possible dict keys (all analyses): const.DS_STATISTICS, const.MISSING_VALUES, const.TRIPS_OVER_TIME, const.TRIPS_PER_WEEKDAY, const.TRIPS_PER_HOUR, const.VISITS_PER_TILE, const.VISITS_PER_TIME_TILE, const.OD_FLOWS, const.TRAVEL_TIME, const.JUMP_LENGTH, const.TRIPS_PER_USER, const.USER_TIME_DELTA, const.RADIUS_OF_GYRATION, const.USER_TILE_COUNT, const.MOBILITY_ENTROPY.
timewindows – List of hours as int that define the time windows for the spatial analysis for single time windows. Defaults to [2, 6, 10, 14, 18, 22].
max_travel_time – Upper bound for travel time histogram. If None is given, no upper bound is set. Defaults to None.
bin_range_travel_time – The range a single histogram bin spans for travel time (e.g., 5 for 5 min bins). If None is given, the histogram bins will be determined automatically. Defaults to None.
max_jump_length – Upper bound for jump length histogram. If None is given, no upper bound is set. Defaults to None.
bin_range_jump_length – The range a single histogram bin spans for jump length (e.g., 1 for 1 km bins). If None is given, the histogram bins will be determined automatically. Defaults to None.
max_radius_of_gyration – Upper bound for radius of gyration histogram. If None is given, no upper bound is set. Defaults to None.
bin_range_radius_of_gyration – The range a single histogram bin spans for the radius of gyration (e.g., 1 for 1 km bins). If None is given, the histogram bins will be determined automatically. Defaults to None.
max_user_tile_count – Upper bound for distinct tiles per user histogram. If None is given, no upper bound is set. Defaults to None.
bin_range_user_tile_count – The range a single histogram bin spans for the distinct tiles per user histogram. If None is given, the histogram bins will be determined automatically. Defaults to None.
max_user_time_delta – Upper bound for user time delta histogram. If None is given, no upper bound is set. Defaults to None.
bin_range_user_time_delta – The range a single histogram bin spans for user time delta (e.g., 1 for 1 hour bins). If None is given, the histogram bins will be determined automatically. Defaults to None.
subtitle – Custom subtitle that appears at the top of the HTML report. Defaults to None.
disable_progress_bar – Whether progress bars should be hidden. Defaults to False, i.e., progress bars are shown.
seed_sampling – Provide seed for down-sampling of dataset (according to max_trips_per_user) so that the sampling is reproducible. Defaults to None, i.e., no seed.
evalu – Parameter only needed for development and evaluation purposes. Defaults to False.
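The column requirements listed for df can be checked before a report is built. The following is a minimal sketch using plain dict records instead of a DataFrame; check_records is a hypothetical helper written for illustration and is not part of dp_mobility_report. Only the column names and the EPSG:4326 coordinate ranges come from the parameter description.

```python
from datetime import datetime

# Column names taken from the description of the df parameter.
EXPECTED_COLUMNS = ("uid", "tid", "datetime", "lat", "lng")


def check_records(records):
    """Hypothetical pre-check: return a list of problems found in dict records."""
    problems = []
    for i, row in enumerate(records):
        missing = [c for c in EXPECTED_COLUMNS if c not in row]
        if missing:
            problems.append(f"row {i}: missing {missing}")
            continue
        # EPSG:4326 is in degrees: latitude in [-90, 90], longitude in [-180, 180].
        if not -90 <= row["lat"] <= 90 or not -180 <= row["lng"] <= 180:
            problems.append(f"row {i}: coordinates outside EPSG:4326 range")
        # Either a timestamp or an int sequence position is accepted.
        if not isinstance(row["datetime"], (datetime, int)):
            problems.append(f"row {i}: datetime must be a timestamp or int")
    return problems
```

An empty result means every record carries the five expected columns with plausible values; it says nothing about how the library itself validates its input.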
- property analysis_exclusion: list
List of analyses that have been excluded from the report and similarity measures. If analysis_selection was provided as a parameter, it is inverted into this analysis_exclusion list.
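The inversion of a selection into an exclusion list can be pictured as a set difference against the full list of analyses. This is only a sketch: the lowercase strings are assumed stand-ins for the const.* values, invert_selection is a hypothetical name, and segment constants such as const.OVERVIEW are not expanded here.

```python
# Analysis names from the parameter documentation. The lowercase string
# values are assumptions standing in for the const.* constants.
ALL_ANALYSES = [
    "ds_statistics", "missing_values", "trips_over_time", "trips_per_weekday",
    "trips_per_hour", "visits_per_tile", "visits_per_time_tile", "od_flows",
    "travel_time", "jump_length", "trips_per_user", "user_time_delta",
    "radius_of_gyration", "user_tile_count", "mobility_entropy",
]


def invert_selection(analysis_selection):
    """Return every analysis that is not selected, in the order of ALL_ANALYSES."""
    selected = set(analysis_selection)
    unknown = selected - set(ALL_ANALYSES)
    if unknown:
        raise ValueError(f"unknown analyses: {sorted(unknown)}")
    return [name for name in ALL_ANALYSES if name not in selected]
```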
- property budget_split: dict
Budget split as specified in the parameters.
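To make the effect of the weights concrete, the sketch below distributes a total privacy budget proportionally to the weights, with a default weight of 1 for every analysis that is not listed. The proportional rule is an assumption that matches the "10 times higher" example in the parameter description; split_budget is a hypothetical helper, not the library's code.

```python
def split_budget(privacy_budget, analyses, budget_split=None):
    """Assumed proportional split: share_i = privacy_budget * weight_i / sum(weights)."""
    weights = {name: 1 for name in analyses}  # default weight of 1
    for name, weight in (budget_split or {}).items():
        if name not in weights:
            raise KeyError(f"{name!r} is not one of the analyses")
        weights[name] = weight
    total = sum(weights.values())
    return {name: privacy_budget * w / total for name, w in weights.items()}


# Three analyses, one of them weighted 10: the weights sum to 12.
shares = split_budget(
    12.0,
    ["visits_per_tile", "od_flows", "travel_time"],
    {"visits_per_tile": 10},
)
```

Under this assumption visits_per_tile receives ten times the share of each of the other two analyses, and the shares add up to the total budget.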
- property df: DataFrame
DataFrame containing the processed input mobility data of the report.
- property max_trips_per_user: int
Maximum number of trips per user as specified in the parameters. If None was given, this equals the actual maximum according to the data.
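The relationship between max_trips_per_user, seed_sampling and the fallback to the actual maximum can be sketched with the standard library. Both functions are hypothetical illustrations operating on (uid, tid) pairs; how the library samples internally is not documented here, so the uniform sampling per user is an assumption.

```python
import random
from collections import Counter, defaultdict


def effective_max_trips(trips, max_trips_per_user=None):
    """If no limit is given, fall back to the largest trip count of any user."""
    if max_trips_per_user is not None:
        return max_trips_per_user
    counts = Counter(uid for uid, _ in set(trips))  # count distinct trips per user
    return max(counts.values(), default=0)


def sample_trips(trips, max_trips_per_user, seed=None):
    """Keep at most max_trips_per_user trips per user (assumed uniform sampling)."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    by_user = defaultdict(set)
    for uid, tid in trips:
        by_user[uid].add(tid)
    kept = []
    for uid in sorted(by_user):
        tids = sorted(by_user[uid])
        if len(tids) > max_trips_per_user:
            tids = sorted(rng.sample(tids, max_trips_per_user))
        kept.extend((uid, tid) for tid in tids)
    return kept
```

Calling sample_trips twice with the same seed returns the same trips, which is the behaviour seed_sampling is described as providing.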
- property privacy_budget: int | float
Privacy budget as specified in the parameters.
- property report: dict
A dictionary with all report elements (i.e., analyses).
- property tessellation: GeoDataFrame
Processed tessellation.
- to_file(output_file: str | Path, disable_progress_bar: bool | None = None, top_n_flows: int = 100) None [source]
Write the report to a file. By default a name is generated.
- Parameters:
output_file – The name or the path of the file to store the HTML output.
disable_progress_bar – If True, no progress bar is shown.
top_n_flows – Determines how many of the top n origin-destination flows are displayed. Defaults to 100.
BenchmarkReport
- class dp_mobility_report.benchmark.benchmarkreport.BenchmarkReport(df_base: DataFrame, tessellation: GeoDataFrame | None = None, df_alternative: DataFrame | None = None, privacy_budget_base: int | float | None = None, privacy_budget_alternative: int | float | None = None, user_privacy_base: bool = True, user_privacy_alternative: bool = True, max_trips_per_user_base: int | None = None, max_trips_per_user_alternative: int | None = None, analysis_selection: List[str] | None = None, analysis_exclusion: List[str] | None = None, budget_split_base: dict = {}, budget_split_alternative: dict = {}, timewindows: List[int] | ndarray = [2, 6, 10, 14, 18, 22], max_travel_time: int = 120, bin_range_travel_time: int = 5, max_jump_length: int | float = 10, bin_range_jump_length: int | float = 1, max_radius_of_gyration: int | float = 5, bin_range_radius_of_gyration: int | float = 0.5, max_user_tile_count: int = 10, bin_range_user_tile_count: int = 1, max_user_time_delta: int | float = 48, bin_range_user_time_delta: int | float = 4, top_n_ranking: List[int] = [10, 50, 100], measure_selection: dict | None = None, subtitle: str | None = None, disable_progress_bar: bool = False, seed_sampling: int | None = None, evalu: bool = False)[source]
Evaluate the similarity of two (differentially private) mobility reports from one or two mobility datasets.
This can be based on two datasets (df_base and df_alternative) or one dataset (df_base) with different privacy settings. The arguments df, privacy_budget, user_privacy, max_trips_per_user and budget_split can differ between the two reports and are set with the corresponding suffix _base or _alternative. The other arguments are the same for both reports. For the evaluation, similarity measures (namely the symmetric mean absolute percentage error (SMAPE), Jensen-Shannon divergence (JSD), Kullback-Leibler divergence (KLD), the earth mover’s distance (EMD), the Kendall correlation coefficient (KT) and the top n coverage (TOP_N_COV)) are computed to quantify the statistical similarity for each analysis. The evaluation, i.e., the benchmark report, is written to an HTML file with the .to_file() method.
- Parameters:
df_base – DataFrame containing the baseline mobility data, see argument df of DpMobilityReport.
tessellation – Geopandas GeoDataFrame containing the tessellation for spatial aggregations. Expected columns: tile_id. If the tessellation is not provided in the expected default CRS EPSG:4326, it will automatically be transformed. If no tessellation is provided, all analyses based on the tessellation will automatically be removed.
df_alternative – DataFrame containing the alternative mobility data to be compared against the baseline dataset, see argument df of DpMobilityReport. If None, df_base is used for both reports.
privacy_budget_base – Privacy budget for the differentially private base report. Defaults to None, i.e., no privacy guarantee is provided.
privacy_budget_alternative – Privacy budget for the differentially private alternative report. Defaults to None, i.e., no privacy guarantee is provided.
user_privacy_base – Whether item-level or user-level privacy is applied for the base report. Defaults to True (user-level privacy).
user_privacy_alternative – Whether item-level or user-level privacy is applied for the alternative report. Defaults to True (user-level privacy).
max_trips_per_user_base – Maximum number of trips a user shall contribute to the data. Dataset will be sampled accordingly. Defaults to None, i.e., all trips included.
max_trips_per_user_alternative – Maximum number of trips a user shall contribute to the data. Dataset will be sampled accordingly. Defaults to None, i.e., all trips included.
analysis_selection – Select only needed analyses, see argument analysis_selection of DpMobilityReport.
analysis_exclusion – Ignored, if analysis_selection is set! Exclude analyses that are not needed, see argument analysis_exclusion of DpMobilityReport.
budget_split_base – dict to customize how much privacy budget is assigned to which analysis. See argument budget_split of DpMobilityReport.
budget_split_alternative – dict to customize how much privacy budget is assigned to which analysis. See argument budget_split of DpMobilityReport.
timewindows – List of hours as int that define the time windows for the spatial analysis for single time windows. Defaults to [2, 6, 10, 14, 18, 22].
max_travel_time – Upper bound for travel time histogram. Defaults to 120 (min).
bin_range_travel_time – The range a single histogram bin spans for travel time (e.g., 5 for 5 min bins). Defaults to 5 (min).
max_jump_length – Upper bound for jump length histogram. Defaults to 10 (km).
bin_range_jump_length – The range a single histogram bin spans for jump length (e.g., 1 for 1 km bins). Defaults to 1 (km).
max_radius_of_gyration – Upper bound for radius of gyration histogram. Defaults to 5 (km).
bin_range_radius_of_gyration – The range a single histogram bin spans for the radius of gyration (e.g., 1 for 1 km bins). Defaults to 0.5 (km).
max_user_tile_count – Upper bound for distinct tiles per user histogram. Defaults to 10.
bin_range_user_tile_count – The range a single histogram bin spans for the distinct tiles per user histogram. Defaults to 1.
max_user_time_delta – Upper bound for user time delta histogram. Defaults to 48 (hours).
bin_range_user_time_delta – The range a single histogram bin spans for user time delta (e.g., 1 for 1 hour bins). Defaults to 4 (hours).
top_n_ranking – List of ‘top n’ values that are used to compute the Kendall correlation coefficient and the top n coverage for ranking similarity measures. Values need to be integers > 0. Defaults to [10, 50, 100].
measure_selection – Select the similarity measure for each analysis that is used for the similarity_measures property of the BenchmarkReport. If None, the default from default_measure_selection() will be used.
subtitle – Custom subtitle that appears at the top of the HTML report. Defaults to None.
disable_progress_bar – Whether progress bars should be hidden. Defaults to False, i.e., progress bars are shown.
seed_sampling – Provide seed for down-sampling of dataset (according to max_trips_per_user) so that the sampling is reproducible. Defaults to None, i.e., no seed.
evalu (bool, optional) – Parameter only needed for development and evaluation purposes. Defaults to False.
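Each max_* and bin_range_* pair describes a fixed histogram layout that both reports share. A rough sketch of the resulting bin edges follows, assuming that bins start at 0 and have equal width; bin_edges is a hypothetical helper, and how values above the upper bound are treated is not covered.

```python
import math


def bin_edges(max_value, bin_range):
    """Edges of equal-width bins from 0 up to max_value (assumed layout)."""
    n_bins = math.ceil(max_value / bin_range)
    # If max_value is not a multiple of bin_range, the last edge exceeds it.
    return [i * bin_range for i in range(n_bins + 1)]


# Documented BenchmarkReport defaults.
travel_time_edges = bin_edges(120, 5)        # minutes
radius_of_gyration_edges = bin_edges(5, 0.5)  # kilometres
user_time_delta_edges = bin_edges(48, 4)      # hours
```

With the defaults this gives 24 travel time bins, 10 radius of gyration bins and 12 user time delta bins.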
- property emd: dict
The earth mover’s distance between base and alternative of all selected analyses, where applicable.
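For one-dimensional histograms with equal-width bins, the earth mover's distance reduces to the summed absolute difference of the two cumulative distributions. The sketch below illustrates only that special case; emd_1d is a hypothetical name, and spatial analyses would need a distance between tiles, which is not shown.

```python
def emd_1d(base_counts, alt_counts, bin_width=1.0):
    """Earth mover's distance between two histograms over the same equal-width bins."""
    if len(base_counts) != len(alt_counts):
        raise ValueError("histograms need the same number of bins")
    base_total, alt_total = sum(base_counts), sum(alt_counts)
    if base_total == 0 or alt_total == 0:
        raise ValueError("histograms must not be empty")
    cumulative_diff = 0.0
    distance = 0.0
    for b, a in zip(base_counts, alt_counts):
        # Difference of the two cumulative distributions after this bin.
        cumulative_diff += b / base_total - a / alt_total
        distance += abs(cumulative_diff) * bin_width
    return distance
```

Moving all mass by two bins yields a distance of two bin widths, and identical histograms yield 0.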
- property jsd: dict
The Jensen-Shannon divergence between base and alternative of all selected analyses, where applicable.
- property kld: dict
The Kullback-Leibler divergence between base and alternative of all selected analyses, where applicable.
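The two divergences above are closely related: the Jensen-Shannon divergence is the average Kullback-Leibler divergence of each distribution from their mixture. The following sketch uses the textbook definitions with base-2 logarithms; the logarithm base, the handling of zero counts and the function names are assumptions and may differ from what the library computes.

```python
import math


def normalize(counts):
    total = sum(counts)
    return [c / total for c in counts]


def kld(base_counts, alt_counts):
    """Kullback-Leibler divergence D(base || alt) in bits (textbook definition)."""
    p, q = normalize(base_counts), normalize(alt_counts)
    result = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 * log(0 / q) is taken as 0
        if qi == 0:
            return math.inf  # base has mass where alt has none
        result += pi * math.log2(pi / qi)
    return result


def jsd(base_counts, alt_counts):
    """Jensen-Shannon divergence: mean KLD of both inputs from their mixture."""
    p, q = normalize(base_counts), normalize(alt_counts)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)
```

Unlike the KLD, the JSD is symmetric and stays finite even when one histogram has empty bins; with base-2 logarithms it lies between 0 and 1.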
- property measure_selection: dict
The similarity measure selected for each analysis.
- property report_alternative: DpMobilityReport
The alternative DpMobilityReport.
- property report_base: DpMobilityReport
The base DpMobilityReport.
- property similarity_measures: dict
Similarity measures according to measure_selection.
- property smape: dict
The symmetric (mean absolute) percentage error, based on the relative error, between base and alternative of all selected analyses, where applicable.
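One common definition of the symmetric mean absolute percentage error divides the absolute difference of each pair of values by the mean of their absolute values and averages the results. The sketch below follows that definition and treats a pair of zeros as zero error; both choices are assumptions, and the library may scale or handle edge cases differently.

```python
def smape(base_values, alt_values):
    """Symmetric mean absolute percentage error over paired values (range 0 to 2)."""
    if len(base_values) != len(alt_values) or not base_values:
        raise ValueError("need two non-empty sequences of equal length")
    errors = []
    for b, a in zip(base_values, alt_values):
        denominator = (abs(b) + abs(a)) / 2
        # Two zeros agree perfectly, so the error is defined as 0 here.
        errors.append(0.0 if denominator == 0 else abs(a - b) / denominator)
    return sum(errors) / len(errors)
```

Under this definition identical values give 0 and a value compared against 0 gives the maximum of 2.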