Introduction
The following report describes the workflow for calculation of indicator 11.2.1 within the framework of the GEOSTAT 3 project, work package 2, by Statistics Sweden.
Data status
The following data sources have been used:
- National data on officially recognized public transportation stops, including coordinates and traffic frequency for each stop. Data is available under an open data license, follows the GTFS data model (Google General Transit Feed Specification) and is provided jointly by the public transportation service providers (www.trafiklab.se). Data for the year 2015 has been used.
- Population data from the Population register geocoded to address point location. In Sweden, the Population register is based on administrative data from the national population registry, which can be geo-enabled by use of geocoded authoritative address and/or building data (INSPIRE conformant) from the NMCA (Lantmäteriet). Population data is available on an annual basis, hence geocoded population can be obtained for any reference year. Data for the year 2015 has been used.
- Geographic delimitation of urban areas (localities) following national methodology, produced by Statistics Sweden. Data on localities is national authoritative data available under open data licenses. Data for the year 2015 has been used.
- Urban and High-density cluster grid from Eurostat based on the GEOSTAT 2011 population grid. Data for the year 2011 has been used.
Processes
The steps to calculate the indicator comprised five different phases described in detail below:
a) Geocoding population data
The national population register at Statistics Sweden includes references to address, dwelling ID and Real property ID at unit record level (i.e. at the level of each individual). Data is collected by the Tax administration and transferred on a daily basis to Statistics Sweden. Hence, geo-enabling population to address location is quite straightforward and can be done for any point in time using the authoritative address register from the NMCA. A copy of the address register is kept at Statistics Sweden and the location of each address is stored as point geometries in the “Geobase”, a set of SQL databases used for geocoding unit record data at Statistics Sweden. Unit record data with population linked to address location is served internally as a data warehouse, for use in SQL-server or in desktop GIS software. The data warehouse comprises the address location (geometry), age and sex of each individual.
Some 99.7 percent of the population can be directly geocoded to the level of address location. For various reasons the remaining 0.3 percent cannot. By using references to Real Property location instead of address location, another 0.1 percent of the population can be properly geocoded. The remaining 0.2 percent represents individuals without a permanent place of residence (homeless people, prisoners, elderly people in special care centres etc.) who cannot be geo-located more accurately than to the municipality in which they are registered. In order to obtain a fully geocoded record, different location objects are used. Metadata at the level of each individual record describes the matching type and quality according to a fixed coding system (see Table 1). When calculating access to public transportation stops, only population accurately assigned to address location is considered.
Table 1: Metadata describing geocoding quality at unit record level
Quality Code | Number of people geocoded | % |
1 – Direct match on physical address | 9 820 305 | 99.7 |
2 – Direct match on Real Property ID | 13 899 | 0.1 |
3 – Match on key code area centroid | 64 | 0.0 |
4 – Match on municipality centroid | 16 749 | 0.2 |
Total population 2015 | 9 851 017 | 100 |
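The cascade behind the quality codes in Table 1 can be sketched as follows. This is a minimal illustration, not the Geobase implementation: the lookup tables, identifiers and coordinates are hypothetical, and the only logic taken from the report is the order of the location objects and the rule that only address matches are used in the access calculation.

```python
# Hypothetical lookup tables: identifier -> (easting, northing) in metres.
ADDRESS_POINTS = {"addr-1": (674032.0, 6580822.0)}
PROPERTY_POINTS = {"prop-7": (674500.0, 6581000.0)}
KEY_CODE_CENTROIDS = {"kc-0180": (675000.0, 6580000.0)}
MUNICIPALITY_CENTROIDS = {"0180": (674000.0, 6580000.0)}


def geocode(address_id, property_id, key_code_area, municipality):
    """Return (quality_code, coordinate) from the first location object that matches."""
    candidates = [
        (1, ADDRESS_POINTS, address_id),         # direct match on physical address
        (2, PROPERTY_POINTS, property_id),       # direct match on Real Property ID
        (3, KEY_CODE_CENTROIDS, key_code_area),  # key code area centroid
        (4, MUNICIPALITY_CENTROIDS, municipality),  # municipality centroid
    ]
    for code, table, key in candidates:
        if key is not None and key in table:
            return code, table[key]
    raise LookupError("record could not be geocoded at any level")


def usable_for_access_analysis(quality_code):
    """Only population assigned to address location enters the access calculation."""
    return quality_code == 1
```

Storing the quality code next to the coordinate lets later steps filter records by the level of accuracy they require.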
b) Delimitation of urban agglomerations:
In principle, this step has already been completed prior to the indicator analysis. Two different concepts/data sources have been tested:
- Classification of urban areas based on national data (Swedish “localities”).
- Classification of urban areas on European data on high-density and urban clusters (using data from Eurostat based on the grid cluster method).
Statistics Sweden has recurrently delineated the geographical extent of urban areas (“localities”) as part of the production of urban official statistics every five years since 1960. Digital boundaries exist from 1980. A locality consists of a group of buildings normally not more than 200 metres apart, and must fulfil a minimum criterion of having at least 200 inhabitants. Thus, localities include the largest cities as well as small settlements with 200 inhabitants as the lower threshold. The delimitation is conducted as an automated workflow involving high quality authoritative geospatial data from the NSDI in combination with point-based population data geocoded to the level of address location. The result is a national polygon dataset representing the urban extent of each locality (some 2 000 in Sweden). Data is available under open data license agreements (http://www.scb.se/hitta-statistik/regional-statistik-och-kartor/geodata/oppna-geodata/tatorter/).
In order to enable comparison between national and European data, a cut-off has been applied to the national data, taking into account only those urban clusters in national data having 5 000 inhabitants or more.
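The two numeric rules mentioned above (buildings not more than 200 metres apart, and a minimum population) can be illustrated with a small clustering sketch. It is a strong simplification: the national workflow also uses contextual information that is not modelled here, and the function names and sample buildings are made up for the example.

```python
from collections import defaultdict
from math import hypot


def cluster_buildings(buildings, max_gap=200.0):
    """Group buildings linked by chains of neighbours at most max_gap metres apart.

    buildings: list of (x, y, inhabitants). Returns lists of building indices.
    """
    parent = list(range(len(buildings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Brute-force pairwise comparison; adequate for a small illustration only.
    for i in range(len(buildings)):
        for j in range(i + 1, len(buildings)):
            xi, yi, _ = buildings[i]
            xj, yj, _ = buildings[j]
            if hypot(xi - xj, yi - yj) <= max_gap:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(len(buildings)):
        groups[find(i)].append(i)
    return list(groups.values())


def localities(buildings, min_population=200):
    """Keep only clusters that reach the minimum population."""
    result = []
    for members in cluster_buildings(buildings):
        population = sum(buildings[i][2] for i in members)
        if population >= min_population:
            result.append({"members": members, "population": population})
    return result


# Made-up sample: (x, y, inhabitants), coordinates in metres.
buildings = [
    (0, 0, 120), (150, 0, 90),          # 210 inhabitants, 150 m apart
    (1000, 0, 3000), (1100, 0, 2500),   # 5 500 inhabitants, 100 m apart
    (5000, 5000, 40),                   # isolated and below the threshold
]
national = localities(buildings)                              # >= 200 inhabitants
comparable = [loc for loc in national if loc["population"] >= 5000]  # cut-off for comparison
```

In this sample, two localities pass the national threshold and only one remains after the 5 000 cut-off.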
For testing the harmonised European concept of urban, grid data on high-density and urban clusters were downloaded from Eurostat’s website. Data was prepared by creating a national subset of the European grid, converting the grid data to vector format and finally merging the high-density clusters and the urban clusters into the same layer, with a code separating the two urban categories. The vector features representing high-density clusters and urban clusters were loaded into SQL-server for further processing.
Despite differences in the underlying methodology and the varying granularity of the boundaries, when comparing population figures calculated according to the two different concepts of urban, they produce a surprisingly coherent result, given that the threshold of 5 000 inhabitants is applied. The table below shows the outcome of the calculations. For all calculations, population data geocoded to address-location has been used.
Table 2: Calculation of urban population using national vs European urban concept. Reference time 31 December 2010
Data source | Urban population |
Population in urban areas according to national localities without threshold (>= 200 inhab.) | 8 015 797 |
Population in urban areas according to national localities with threshold (>= 5 000 inhab.) | 6 326 524 |
Population in urban areas according to grid cluster data (High-density clusters & urban clusters) | 6 289 663 |
The map below illustrates the spatial differences between national and European data. The white boundaries represent the grid-based boundaries of an urban cluster and the yellow boundaries represent the national boundaries. The main difference is that the national data is contextual: besides population density, it takes into account a number of spatial metrics such as connectivity, barriers and land use, and hence offers a more precise representation of the urban outline than the grid-based clustering. The national data separates clusters 1, 2, 3 and 4 because the distance between the buildings of the clusters exceeds 200 metres.
Figure 1: the differences between national and European data.
c) Selection and preparation of public transportation stops:
A complete national database (covering the whole country and all modes of transportation) on officially recognized public transportation stops is maintained by a consortium of all public transportation service providers in the country (www.trafiklab.se). The data is not authoritative in a strict sense, but as the information serves a wide range of timetable services, the quality is good and the information is reliable. In addition, the GTFS format offers an open de facto standard for serving public transportation data.
The database includes coordinates along with extensive information about routes, trips and traffic frequency for each stop. Data is provided through an API under open data license in GTFS format (Google General Transit Feed Specification). Data can be accessed in real time through the API but can also be downloaded for specific years. The first complete version of national data was released in 2012.
The GTFS data is structured in a number of different related files. Not all data providers use the full model and the model allows a certain flexibility and can be applied in different ways. In case of the Swedish data, information is served using the files described in the table below.
Table 3: Content of the national data on public transportation (GTFS data model)
agency.txt | One or more transit agencies that provide the data in this feed. |
stops.txt | Individual locations where vehicles pick up or drop off passengers. |
routes.txt | Transit routes. A route is a group of trips that are displayed to riders as a single service. |
trips.txt | Trips for each route. A trip is a sequence of two or more stops that occur at a specific time. |
stop_times.txt | Times that a vehicle arrives at and departs from individual stops for each trip. |
calendar.txt | Dates for service IDs using a weekly schedule. Specify when service starts and ends, as well as days of the week where service is available. |
calendar_dates.txt | Exceptions for the service IDs defined in the calendar.txt file. If calendar_dates.txt includes ALL dates of service, this file may be specified instead of calendar.txt. |
transfers.txt | Rules for making connections at transfer points between routes. |
feed_info.txt | Additional information about the feed itself, including publisher, version, and expiration information. |
Data for the year 2015 was downloaded and loaded into SQL-server. Lat/long coordinate values were transformed to the national planar coordinate system and point geometries were created for each public transportation stop. A filter was created to select only those stops that were regularly trafficked during business hours (06:00-20:00), with at least one departure per hour. Applying this rule required several of the files. In “stop_times”, all departures for a given stop_id can be identified in “hh:mm:ss”. These departure times are typical values, however, and exceptions may occur on certain days. All exceptions are listed by service_id in “calendar_dates” in “yyyy:mm:dd”. By transforming the date values to weekdays, exceptions for services occurring only during weekends could be identified and excluded. By using “trips”, exceptions from typical departures could be linked to “stops”, and the number of departures during 06:00-20:00 could finally be calculated for each stop. The stops fulfilling the criteria were then selected for further processing.
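The selection rule can be sketched in a few lines of Python. The sketch rests on two assumptions that the report does not spell out: “at least one departure per hour” is read as at least one departure in every clock hour from 06 to 19, and the set of weekday service_ids is taken as already derived from the calendar files. It also pools all weekday services together rather than checking each calendar day separately. The function name and sample rows are illustrative only.

```python
from collections import defaultdict

BUSINESS_HOURS = range(6, 20)  # clock hours 06..19, i.e. 06:00-20:00


def regularly_served_stops(stop_times, trips, weekday_services):
    """Return the stop_ids that have a departure in every business hour.

    stop_times: iterable of dicts with trip_id, stop_id, departure_time ("hh:mm:ss")
    trips: iterable of dicts with trip_id, service_id
    weekday_services: set of service_ids assumed to run on weekdays
    """
    service_of_trip = {t["trip_id"]: t["service_id"] for t in trips}
    hours_served = defaultdict(set)
    for row in stop_times:
        if service_of_trip.get(row["trip_id"]) not in weekday_services:
            continue  # weekend-only or unknown service
        hour = int(row["departure_time"].split(":")[0])
        if hour in BUSINESS_HOURS:
            hours_served[row["stop_id"]].add(hour)
    return {s for s, hours in hours_served.items() if len(hours) == len(BUSINESS_HOURS)}


# Made-up sample feed: stop A is served hourly on weekdays, stop B once a day,
# stop C hourly but only on Saturdays.
trips = [{"trip_id": f"wk{h}", "service_id": "weekday"} for h in range(6, 20)]
trips += [{"trip_id": f"sat{h}", "service_id": "saturday"} for h in range(6, 20)]
trips.append({"trip_id": "wk_extra", "service_id": "weekday"})

stop_times = [
    {"trip_id": f"wk{h}", "stop_id": "A", "departure_time": f"{h:02d}:15:00"}
    for h in range(6, 20)
]
stop_times += [
    {"trip_id": f"sat{h}", "stop_id": "C", "departure_time": f"{h:02d}:00:00"}
    for h in range(6, 20)
]
stop_times.append({"trip_id": "wk_extra", "stop_id": "B", "departure_time": "08:00:00"})

selected = regularly_served_stops(stop_times, trips, weekday_services={"weekday"})
```

Only stop A is selected in this sample, since B has a single departure and C is served on Saturdays only.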
d) Computation of service areas
Service areas were computed using a Euclidian distance buffering operation (0.5 km) based on the public transportation stops that were selected in the previous step. Buffering was undertaken in FME and the resulting buffers were loaded into SQL-server for further computation.
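For point data, testing whether a location falls inside a 0.5 km Euclidian buffer is equivalent to testing whether any selected stop lies within 500 metres. The sketch below uses this shortcut with a simple grid index instead of constructing buffer polygons as was done in FME. Coordinates are assumed to be planar and in metres, and all names and sample values are illustrative.

```python
from collections import defaultdict
from math import floor, hypot

RADIUS = 500.0  # buffer distance in metres


def build_index(stops, cell=RADIUS):
    """Bucket stop coordinates into square grid cells of side `cell`."""
    index = defaultdict(list)
    for x, y in stops:
        index[(floor(x / cell), floor(y / cell))].append((x, y))
    return index


def within_service_area(point, index, cell=RADIUS, radius=RADIUS):
    """True if any stop lies within `radius` of `point`.

    Only the 3 x 3 block of cells around the point is searched, which is
    sufficient as long as cell >= radius.
    """
    px, py = point
    cx, cy = floor(px / cell), floor(py / cell)
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for sx, sy in index.get((cx + dx, cy + dy), ()):
                if hypot(px - sx, py - sy) <= radius:
                    return True
    return False


# Made-up sample: two stops and three address points.
index = build_index([(0.0, 0.0), (2000.0, 0.0)])
flags = [
    within_service_area((300.0, 400.0), index),    # exactly 500 m from the first stop
    within_service_area((400.0, 400.0), index),    # about 566 m away
    within_service_area((2000.0, -499.0), index),  # 499 m from the second stop
]
```

The grid index keeps the number of distance computations per address point small, which matters when several million points are tested against all stops in the country.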
A test was also conducted on network distance using the national road network. The outcome of this test is described further below.
e) Calculation of the population within service areas:
Once all the previous steps were completed and spatial features for national urban areas, European urban areas, service area buffers and administrative geographies for counties/NUTS3 areas had been loaded into SQL-server, the share of population within service areas could be calculated.
By using the ST_geometry[1] functions in SQL-server, each geocoded individual record was intersected with service areas, urban areas and administrative geographies for counties/NUTS3. The result was written to a new, temporary SQL table containing all the spatial relationships between population (points) and area features (polygons). From this master table, containing the whole population and its relation to service areas etc., any combination of variables could be calculated.
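The intersection step can be illustrated outside the database with a plain ray-casting point-in-polygon test. This is only a sketch of the principle: the report used the spatial functions of SQL-server, whereas the code below handles simple polygons without holes, and the layer names and geometries are invented for the example.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices without holes."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def relate(person, layers):
    """Return the person record extended with one 0/1 flag per polygon layer."""
    row = dict(person)
    for name, polygons in layers.items():
        row[name] = int(any(point_in_polygon(person["xy"], p) for p in polygons))
    return row


# Made-up layers: one service area square and one urban area square.
layers = {
    "in_service_area": [[(0, 0), (1000, 0), (1000, 1000), (0, 1000)]],
    "in_urban_national": [[(500, 500), (1500, 500), (1500, 1500), (500, 1500)]],
}
row = relate({"sex": "Female", "age": 15, "xy": (250, 250)}, layers)
```

Each record ends up with one flag per layer, which is the structure of the master table.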
Table 4: Concept of the master table from which final calculations are retrieved
Individual | Sex | Age | In service area | In urban area (national) | In urban area (European) | NUTS3 |
Pop A | Male | 42 | 1 | 1 | 1 | SE212 |
Pop B | Female | 15 | 1 | 1 | 1 | SE212 |
Pop C | Female | 67 | 1 | 1 | 0 | SE212 |
Pop D | Male | 24 | 0 | 0 | 0 | SE213 |
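Once the flags are in place, every disaggregation is a simple grouped share. The sketch below uses the four illustrative rows of Table 4, so the resulting percentages are not real results; the field names and the helper functions are assumptions made for the example, and the age groups follow those used in the results section.

```python
from collections import defaultdict

# The four illustrative rows of Table 4 (not real data).
master = [
    {"id": "Pop A", "sex": "Male", "age": 42, "service": 1, "urban_nat": 1, "urban_eu": 1, "nuts3": "SE212"},
    {"id": "Pop B", "sex": "Female", "age": 15, "service": 1, "urban_nat": 1, "urban_eu": 1, "nuts3": "SE212"},
    {"id": "Pop C", "sex": "Female", "age": 67, "service": 1, "urban_nat": 1, "urban_eu": 0, "nuts3": "SE212"},
    {"id": "Pop D", "sex": "Male", "age": 24, "service": 0, "urban_nat": 0, "urban_eu": 0, "nuts3": "SE213"},
]


def age_group(age):
    """Broad age groups: 0-14, 15-24, 25-64 and 65-."""
    if age < 15:
        return "0-14"
    if age < 25:
        return "15-24"
    if age < 65:
        return "25-64"
    return "65-"


def share_with_access(rows, key=lambda r: "Total"):
    """Percentage of rows inside a service area, per group returned by `key`."""
    totals = defaultdict(int)
    served = defaultdict(int)
    for r in rows:
        k = key(r)
        totals[k] += 1
        served[k] += r["service"]
    return {k: 100.0 * served[k] / totals[k] for k in totals}


overall = share_with_access(master)
by_sex = share_with_access(master, key=lambda r: r["sex"])
by_age = share_with_access(master, key=lambda r: age_group(r["age"]))
urban_national = share_with_access([r for r in master if r["urban_nat"] == 1])
```

Restricting the rows to one of the urban flags before calling the function gives the urban version of the indicator, and grouping by the NUTS3 code gives the regional breakdown.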
Network distance
A small test was conducted on using network distance instead of Euclidian distance buffers. The test comprised only a small subset of the country (the county of Gotland). The same input data were used as described above, and in addition a subset of the authoritative national road network from the Swedish Transportation Administration was used. The calculations were conducted in QGIS 2.18.
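The principle of the network-based alternative can be sketched with a shortest-path search that stops at 500 metres. The test itself was run in QGIS; the code below is only a generic Dijkstra search over a toy graph, with node names and edge lengths made up for the example.

```python
import heapq


def reachable_within(graph, source, limit=500.0):
    """Dijkstra from `source`; returns {node: distance} for nodes within `limit` metres.

    graph: dict mapping node -> list of (neighbour, edge_length_in_metres).
    """
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, length in graph.get(node, ()):
            candidate = dist + length
            if candidate <= limit and candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return best


# Toy road network (undirected, so every edge is listed in both directions).
graph = {
    "stop": [("junction", 300.0)],
    "junction": [("stop", 300.0), ("home_a", 150.0), ("home_b", 250.0)],
    "home_a": [("junction", 150.0)],
    "home_b": [("junction", 250.0)],
}
served = reachable_within(graph, "stop")
```

Here home_a is 450 metres from the stop along the network and counts as served, while home_b at 550 metres does not, even though it could still fall inside a 500 metre Euclidian buffer.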
Results
At national level, 7 910 189 people, or 80.3 percent of the total population, had convenient access to public transportation stops in 2015. There was a minor difference between the sexes: a slightly greater share of women, 81.1 percent, had convenient access, compared with 79.5 percent of men.
Table 5: Share of population with convenient access to public transportation, disaggregation by sex
Share (%) | Men | Women | Total |
Convenient access | 79.5 | 81.1 | 80.3 |
No convenient access | 20.5 | 18.9 | 19.7 |
Disaggregation by broad age group reveals some interesting differences. From a national perspective, the population aged 15-24 had the best access to public transportation, whereas the population aged 65 and over had the lowest share with convenient access.
Table 6: Share of population with convenient access to public transportation, disaggregation by age
Share (%) | Age 0-14 | Age 15-24 | Age 25-64 | Age 65- | Total |
Convenient access | 79.9 | 83.1 | 80.3 | 79.0 | 80.3 |
No convenient access | 20.1 | 16.9 | 19.7 | 21.0 | 19.7 |
As expected, the share of the urban population with convenient access to public transportation is far higher than the national average. Regardless of what data sources and definitions of urban were used, the share amounts to over 90 percent.
The table below shows the urban population with convenient access to public transportation, disaggregated by urban concept. In the national concept, no distinction is made between different urban typologies: any individual located within the urban zone (localities with at least 5 000 inhabitants) is considered “urban”. In the European concept, there is a distinction between high-density clusters[2] and urban clusters[3]. The last column presents the combined figure for the total urban population (both high-density clusters and urban clusters). This combined figure according to the European urban concept is very close to the figure retrieved using national data sources.
Table 6: Share of urban population with convenient access to public transportation, national vs European concept of urban
Share (%) | National urban* | European High-density clusters | European Urban cluster | European total urban |
Convenient access | 93 | 98 | 91 | 94 |
No convenient access | 7 | 2 | 9 | 6 |
* Using a cut-off at 5 000 inhabitants.
When data is disaggregated at NUTS3 level, the share of urban population with convenient access to public transportation shows similar coherence between the national and European concepts of urban. In all NUTS3 areas, the share calculated using the European concept is slightly higher than the share calculated using the national concept.
Graph 1: Share of urban population with convenient access to public transportation, national vs European (high-density clusters and urban clusters), by NUTS3 area
Result from the network distance calculations
In total, the county of Gotland has 32 767 inhabitants. Calculations based on Euclidian distance buffers (as described above) return a figure of 20 081 inhabitants (or 61.3 percent) with convenient access to public transportation. Using road network distance based on the national authoritative road network data produces a significantly lower figure, 15 757 inhabitants (or 48.1 percent).
From this brief test, it is difficult to conclude which figure is the most “true”. Both approaches have their strengths and caveats, and both may underestimate or overestimate the population with access. Below are a few illustrations of results that differ between the two methods.
Figure 2: Underestimation due to missing link
In figure 2, the resident population within the black circle does not have access to public transport according to method 2 (network distance). The reason is that the road network used for the calculations does not contain bicycle paths or walkways. The missing path, which would represent the closest network distance, is clearly visible in the aerial imagery.
Figure 3: Euclidian distance overestimation
In figure 3, method 2 returns a truer estimation compared to method 1. There are no paths or walkways missing and the area in between the roads is covered by forest. It is unlikely that the population within the black circle would use any other path than along the road network on their way to the public transport stop. In this example method 1 overestimates the population with convenient access to the public transportation.
Table 7: A summary of the pros and cons for the two different approaches
Method | Pros | Cons |
Method 1: Euclidian distance buffering | Easy to use, robust and fast. | Does not take barriers into account (e.g. buffers crossing water, railways etc.), resulting in overestimation of the population with convenient access to public transportation. |
Method 2: Network distance measurement | If the street network is complete and includes walkways and bicycle lanes, distance calculations are very accurate and close to truth. | If the street network is not complete, the calculations will most likely underestimate the population with convenient access to public transportation. Very demanding and complex calculations. |
Remarks
There is a slight temporal gap between the European data on high-density clusters and urban clusters and the national data on localities. The European data is based on the GEOSTAT 2011 population grid whereas the national data depicts the urban extent of 2015. Until 2015, national data on urban extent were produced every five years. However, Statistics Sweden has decided to step up the frequency to every three years. The production of a pan-European dataset on high-density clusters and urban clusters is restricted to the official Census years, which means that the next pan-European grid will have the reference year 2021. In theory, however, an urban grid with national coverage following the European methodology could be created annually, as annual gridded population data is mandatory for Sweden according to INSPIRE.
Evaluation
The general conclusion is that following the requirements and recommendations provided by the European implementation guide for the GSGF will create a robust and efficient setting for calculations of indicator 11.2.1. This indicator clearly demonstrates the great potential of geospatial-statistical integration through the use of point-based geocoding, as described by the GEOSTAT 2 project and repeated in the recommendations of GEOSTAT 3.
In this case, mainly principles 1, 2, 3 and partially 4 have been possible to evaluate. As dissemination of the result has not really been part of the task, principle 5 is out of scope.
In the case of Statistics Sweden, many of the crucial elements suggested by GEOSTAT 3, with relevance for the calculation of this indicator, have already been put in place.
The most significant strengths recognised are:
- Availability of authoritative, point-based location data for geocoding
- Availability of population data from administrative sources, enabling easy, annual updates of the indicator without having to use population estimations
- Use of point-of-entry validation of address information in population registry providing very good conditions for geocoding and few non-matching observations
- Availability of traffic data with national coverage from a trusted provider
Contact information:
jerker.mostrom@scb.se
[1] ST_Geometry is a spatial data type used to perform spatial operations in databases following OGC/ISO standards.
[2] High-density clusters are defined as groups of contiguous grid cells of 1 km2, with a population density of at least 1 500 inhabitants per km2 and a total population of at least 50 000.
[3] An urban cluster is a cluster of contiguous grid cells of 1 km2 with a density of at least 300 inhabitants per km2 and a minimum population of 5 000.