Development of Guidelines for Publishing Georeferenced Statistical Data Using Linked Open Data Technologies

12.03.2019
NTTS 2019 Conference / Brussels / Belgium
Mirosław Migacz
GIS Consultant
Statistics Poland
Merging statistics and geospatial information grant series
Publishing georeferenced statistical data using linked open data technologies

The project
Title: “Development of guidelines for publishing statistical data as linked open data”
“Merging statistics and geospatial information” grant series, 2016-2017
Main goal: prepare a background for LOD implementation in official statistics

Before
powiat łobeski (LAU 1), identified variously as:
· 3218
· 4.4.32.64.18
· lobeski
· 4326418

After
powiat łobeski: http://nts.stat.gov.pl/4/4/32/64/18

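The step from several competing identifiers to one URI can be sketched in a few lines of Python. This is only an illustration, assuming that the URI path mirrors the dotted code from the "Before" slide (4.4.32.64.18 becomes /4/4/32/64/18). The function name `nts_code_to_uri` is hypothetical and does not come from the project.

```python
BASE_URI = "http://nts.stat.gov.pl"  # host as shown on the "After" slide


def nts_code_to_uri(dotted_code: str, base: str = BASE_URI) -> str:
    """Turn a dotted code such as '4.4.32.64.18' into a URI.

    Assumption: every dot-separated segment becomes one path segment.
    """
    segments = dotted_code.strip().split(".")
    if not all(segment.isdigit() for segment in segments):
        raise ValueError(f"not a dotted numeric code: {dotted_code!r}")
    return base.rstrip("/") + "/" + "/".join(segments)


print(nts_code_to_uri("4.4.32.64.18"))
```

For the example unit this reproduces the URI shown on the slide.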
Specific objectives
· identify data sources
· identify statistical units
· harmonize, generalize and build URIs for statistical units
· transform statistical data, geospatial data and metadata into RDF (pilot)
· conclude the pilot transformation and formulate recommendations for a full-scale implementation

Primary data sources
· Local Data Bank
· Demography Database
· Development monitoring system STRATEG

Identification of data sources
Other data sources:
· publications
· tables
· communiques
· announcements
· articles

Data sources - inventory
Metadata:
· thematic category,
· format (PDF, DOC, XLS, CSV),
· spatial reference (country, NUTS, LAU, functional areas, urban areas),
· temporal reference (years),
· presence of identifiers (TERYT, NTS, NUTS),
· update cycle.
Preliminary analysis of data sources:
· openness,
· redundancy of information,
· popularity (based on view / download stats).

Statistical units inventory
Administrative boundaries:
· administrative units
· NUTS
Non-standard statistical units:
· functional areas / urban areas
· groups of administrative / statistical units
· derived mostly from strategic documents

Statistical units harmonization - KTS
· KTS – classification combining administrative and statistical units
· introduced last year to comply with NUTS 2016
· 14-digit code

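The way a 14-digit KTS symbol encodes the hierarchy can be illustrated with a short sketch. The segment widths below are inferred only from the seven sample symbols in this presentation (10000000000000 for Poland down to 10023216418053 for a gmina). They are an assumption, not the official KTS specification, and the function names are hypothetical.

```python
# Segment widths inferred from the sample symbols in the presentation
# (an assumption, not the official KTS definition).
KTS_LEVELS = [
    ("country", 2),
    ("macroregion", 2),
    ("voivodship", 2),
    ("region", 1),
    ("subregion", 2),
    ("powiat", 2),
    ("gmina", 3),
]


def split_kts(code: str) -> dict:
    """Split a 14-digit symbol into named segments."""
    if len(code) != 14 or not code.isdigit():
        raise ValueError("a KTS symbol is expected to have 14 digits")
    parts, pos = {}, 0
    for name, width in KTS_LEVELS:
        parts[name] = code[pos:pos + width]
        pos += width
    return parts


def kts_level(code: str) -> str:
    """Return the deepest level whose segment is not all zeros."""
    level = KTS_LEVELS[0][0]
    for name, segment in split_kts(code).items():
        if segment.strip("0"):
            level = name
    return level


print(kts_level("10023216418000"))
```

Under these assumed widths, the sample symbols resolve to the levels named in the presentation. A unit whose own segment is legitimately all zeros would be misclassified by this heuristic.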
Geometry harmonization/generalization
Input data:
· administrative boundaries since 2002 for LAU 2 (gmina), excluding 2007
Harmonization process:
· structure standardization
· standardization of identifiers (creating KTS identifiers)
· aggregation to higher level units (LAU 1 -> NUTS 1)
Generalization:
· several generalization scenarios tested in order to choose an optimal one
· datasets with generalized and non-generalized geometries prepared for 2002-2016

Linked open data pilot

LOD pilot - statistical data
Data:
· demographic data for 2016 from three major databases (Local Data Bank, Demography Database, STRATEG system).
Ontologies for classifications:
· age codelist defined using SKOS (skos) & Dublin Core (dct),
· sex codelist re-used from SDMX, with a Polish translation added.
Defining metadata for statistical values (observations):
· based primarily on SDMX ontologies (attribute, code, measure, dimension),
· qb:Observation class from Data Cube.

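As an illustration of what an age codelist expressed with SKOS and Dublin Core terms might look like, the sketch below writes a small concept scheme in Turtle. The pilot's actual codelist is not reproduced in the slides, so the `http://example.org/` namespace, the notations and the labels are all hypothetical. Plain string formatting is used here instead of RDFlib to keep the example dependency-free.

```python
PREFIXES = """@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix age: <http://example.org/codelist/age/> .
"""


def age_codelist_turtle(groups):
    """Serialize (notation, label) pairs as a SKOS concept scheme in Turtle."""
    blocks = [
        PREFIXES,
        'age:scheme a skos:ConceptScheme ;\n'
        '    dct:title "Age groups"@en .\n',
    ]
    for notation, label in groups:
        blocks.append(
            f"age:{notation} a skos:Concept ;\n"
            f"    skos:inScheme age:scheme ;\n"
            f'    skos:notation "{notation}" ;\n'
            f'    skos:prefLabel "{label}"@en .\n'
        )
    return "\n".join(blocks)


# Hypothetical age groups, for illustration only.
turtle = age_codelist_turtle([("Y0-4", "0 to 4 years"), ("Y5-9", "5 to 9 years")])
print(turtle)
```

Each age group becomes a `skos:Concept` with an explicit label, which gives a place to state exactly what a code such as "0-4" means.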
LOD pilot - geospatial data
Input geometries:
· voivodship geometries for 2016.
Ontologies:
· ontology for the KTS classification defined using RDF Schema (rdfs) & GeoSPARQL (geo) vocabularies.
Geometry encoding:
· separate geo:Geometry entities with geometry encoded in WKT (Well-Known Text) format (geo:wktLiteral).

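The "separate geo:Geometry entity" pattern can be sketched as three triples: the unit points to a geometry resource, and that resource carries the WKT literal. The example below emits them as N-Triples using only the standard library. The unit and geometry URIs under `http://example.org/` are hypothetical, and the coordinates are a made-up rectangle, not a real voivodship boundary.

```python
GEO = "http://www.opengis.net/ont/geosparql#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def wkt_polygon(ring):
    """Format a ring of (x, y) pairs as a WKT POLYGON, closing it if needed."""
    ring = list(ring)
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # WKT rings must end where they start
    coords = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON(({coords}))"


def geometry_triples(unit_uri, geom_uri, wkt):
    """Return N-Triples linking a unit to a separate geometry resource."""
    return [
        f"<{unit_uri}> <{GEO}hasGeometry> <{geom_uri}> .",
        f"<{geom_uri}> <{RDF_TYPE}> <{GEO}Geometry> .",
        f'<{geom_uri}> <{GEO}asWKT> "{wkt}"^^<{GEO}wktLiteral> .',
    ]


# Hypothetical URIs and an invented rectangle, for illustration only.
wkt = wkt_polygon([(14, 53), (16, 53), (16, 54), (14, 54)])
for line in geometry_triples(
    "http://example.org/kts/10023200000000",
    "http://example.org/geometry/10023200000000-2016",
    wkt,
):
    print(line)
```

Keeping the geometry as its own resource means one unit can point to several geometries, for example one per year or one per generalization level.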
LOD pilot - data sources catalogue
· DCAT-AP (dcat) application profile for data portals in Europe,
· data sources described as instances of the dcat:Dataset class,
· links to other vocabularies:
  · EuroVoc (for thematic categories),
  · EU Publications Office continent / country codelist (for spatial reference),
  · Internet Media Type (MIME).

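A catalogue entry of this kind might look roughly like the Turtle produced below. This is a sketch rather than the pilot's actual catalogue: the dataset URI and title are hypothetical, and the EuroVoc concept URI is a placeholder because the slides do not say which concepts were used. The country URI follows the EU Publications Office authority-table pattern and the media type URI follows the IANA registry pattern.

```python
def dataset_turtle(dataset_uri, title, theme_uri, spatial_uri, media_type_uri):
    """Describe one data source as a dcat:Dataset in Turtle."""
    return (
        "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n"
        "@prefix dct: <http://purl.org/dc/terms/> .\n\n"
        f"<{dataset_uri}> a dcat:Dataset ;\n"
        f'    dct:title "{title}"@en ;\n'
        f"    dcat:theme <{theme_uri}> ;\n"
        f"    dct:spatial <{spatial_uri}> ;\n"
        "    dcat:distribution [\n"
        "        a dcat:Distribution ;\n"
        f"        dcat:mediaType <{media_type_uri}>\n"
        "    ] .\n"
    )


entry = dataset_turtle(
    dataset_uri="http://example.org/catalogue/local-data-bank",  # hypothetical
    title="Local Data Bank",
    theme_uri="http://eurovoc.europa.eu/CONCEPT_ID",  # placeholder, not a real concept
    spatial_uri="http://publications.europa.eu/resource/authority/country/POL",
    media_type_uri="https://www.iana.org/assignments/media-types/text/csv",
)
print(entry)
```

Pointing `dcat:theme` and `dct:spatial` at shared vocabularies, rather than free-text values, is what lets the catalogue be cross-queried with other European portals.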
LOD pilot - linking
· geometries for observations
· spatial domain for datasets
· dataset definitions for statistical data

Data transformation into RDF
1. Source files in CSV
2. Python script using RDFlib module for transformation:
3a. Results in any desired format (RDF/XML):
3b. Results in any desired format (Turtle):
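
The script and its output appear only as screenshots on the original slides and are not reproduced here. As a rough stand-in, the sketch below walks through the same steps (CSV in, RDF out) using only the Python standard library and emitting N-Triples by hand. It is not the RDFlib-based script from the presentation. The column names, the `http://example.org/stat/` base URI and the sample values are invented for illustration. The Data Cube and SDMX property URIs are the published ones.

```python
import csv
import io

QB = "http://purl.org/linked-data/cube#"
DIM = "http://purl.org/linked-data/sdmx/2009/dimension#"
MEASURE = "http://purl.org/linked-data/sdmx/2009/measure#"
CODE = "http://purl.org/linked-data/sdmx/2009/code#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD = "http://www.w3.org/2001/XMLSchema#"
BASE = "http://example.org/stat/"  # hypothetical base URI

# Invented sample rows; the column names are assumptions.
SAMPLE_CSV = """unit,sex,year,value
10023200000000,F,2016,100
10023200000000,M,2016,95
"""


def csv_to_ntriples(text):
    """Turn each CSV row into one qb:Observation described by five triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        obs = f"<{BASE}obs/{row['unit']}-{row['sex']}-{row['year']}>"
        triples += [
            f"{obs} <{RDF_TYPE}> <{QB}Observation> .",
            f"{obs} <{DIM}refArea> <{BASE}kts/{row['unit']}> .",
            f"{obs} <{DIM}sex> <{CODE}sex-{row['sex']}> .",
            f'{obs} <{DIM}refPeriod> "{row["year"]}"^^<{XSD}gYear> .',
            f'{obs} <{MEASURE}obsValue> "{int(row["value"])}"^^<{XSD}integer> .',
        ]
    return triples


for triple in csv_to_ntriples(SAMPLE_CSV):
    print(triple)
```

With RDFlib, the same triples would be added to a `Graph` and written out with `Graph.serialize`, which is what makes it possible to produce RDF/XML and Turtle from one script.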

LOD pilot - triple store
· Apache Jena Fuseki used as a SPARQL server,
· 71717 triples loaded,
· single Fuseki dataset (STAT_LOD) to allow cross-querying and cross-browsing of data created initially in separate files,
· SPARQL endpoint for querying.

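Querying such an endpoint follows the SPARQL protocol: the query travels as a `query` parameter over HTTP. The sketch below only builds the request and does not send it. The endpoint URL is an assumption (a local Fuseki on its default port 3030 with the STAT_LOD dataset), since the pilot's real address is not given in the slides.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed address of a local Fuseki instance; not taken from the presentation.
ENDPOINT = "http://localhost:3030/STAT_LOD/sparql"

COUNT_TRIPLES = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"


def build_request(query, endpoint=ENDPOINT):
    """Build (but do not send) a SPARQL-protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


req = build_request(COUNT_TRIPLES)
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` against a running server would return the result as JSON. Because everything sits in one dataset, a single query like this can span the catalogue, the statistical data and the geometries.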
LOD pilot - SPARQL endpoint

LOD pilot - Pubby frontend (catalogue)

LOD pilot - Pubby frontend (dataset)

LOD pilot - Pubby frontend (value)

LOD pilot - Pubby frontend (geometry)

LOD pilot - conclusions
No reference implementation for statistical linked open data:
· lack of integrity between RDF metadata sets published by one authority,
· links to non-existing entities,
· lack of maintenance.
Lack of pan-European guidelines for statistical linked open data:
· common vocabularies,
· recommended or dedicated software components,
· DIGICOM ESSNet LOD project.

LOD pilot - conclusions
Some software / programming components are no longer being developed:
· implementations might become unstable,
· Python-based implementations seem sustainable at this point.
Semantic harmonization of statistical classifications:
· different meanings for supposedly the same classification elements, e.g. 0-5 can be “0 to 5” or “0 to less than five”,
· not only a pan-European issue, may also exist at country level.

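The "0-5" ambiguity is easy to make concrete. In the sketch below the same label is given two different definitions, and the two disagree about a person aged exactly five. The `AgeInterval` class is a hypothetical illustration, not part of any statistical vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgeInterval:
    """An age class with an explicit statement about its upper bound."""
    lower: int
    upper: int
    upper_inclusive: bool

    def contains(self, age: float) -> bool:
        if age < self.lower:
            return False
        if self.upper_inclusive:
            return age <= self.upper
        return age < self.upper


zero_to_five = AgeInterval(0, 5, upper_inclusive=True)   # "0 to 5"
under_five = AgeInterval(0, 5, upper_inclusive=False)    # "0 to less than five"

print(zero_to_five.contains(5), under_five.contains(5))  # True False
```

Two codelists that both use the label "0-5" can therefore count different people, which is why the definition has to travel with the code.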
LOD pilot - conclusions
Methodology for publishing spatial data as linked open data:
· single entity per single geometry:
  · inventory of boundary changes,
  · geometry instances with non-meaningful identifiers (UUIDs),
· separate geometries for respective years:
  · a complete set of geometries each year, regardless of changes,
  · geometry instances with meaningful identifiers (KTS + year).

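The two identifier schemes can be contrasted in a short sketch. The base URI and the exact "KTS-year" pattern with a hyphen separator are assumptions made for illustration. The slides only say that the identifier combines the KTS symbol and the year.

```python
import uuid

BASE = "http://example.org/geometry/"  # hypothetical base URI


def uuid_geometry_uri() -> str:
    """Option 1: one entity per distinct geometry, with an opaque identifier."""
    return BASE + str(uuid.uuid4())


def kts_year_geometry_uri(kts: str, year: int) -> str:
    """Option 2: one geometry per unit and year, with a readable identifier."""
    return f"{BASE}{kts}-{year}"


# Option 2 produces a URI for every year, whether or not the boundary changed.
yearly = [kts_year_geometry_uri("10023200000000", year) for year in range(2014, 2017)]
print(yearly)
print(uuid_geometry_uri())
```

With option 1, a geometry that stays the same across years is stored once, but an inventory of boundary changes is needed to know which years it covers. Option 2 repeats geometries and in exchange gives URIs that can be guessed from the unit and the year.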
LOD pilot - conclusions
Most linked open data implementations are technically correct:
· it is nearly impossible to produce syntactically incorrect RDF metadata files,
· you can put anything in the RDF graph, but does it make sense semantically?
Linked open data implementations based on Python scripts are easy to amend in the future.
RDF vocabulary specifications are easier to interpret with a UML model provided (Thank you, Captain Obvious).

Mirosław Migacz
GIS Consultant
Statistics Poland
www.linkedin.com/in/migacz
m.migacz@stat.gov.pl

The project developed guidelines for publishing statistical data as linked open data, as part of the "Merging statistics and geospatial information" grant series, with a primary focus on preparing a background for LOD implementation in official statistics. It aimed to identify data sources, harmonize statistical units, transform data into RDF, and provide recommendations for a full implementation. Primary data sources include the Local Data Bank, the Demography Database and the STRATEG system. Other sources include publications, communiques, and articles. The work involved inventorying metadata, analyzing data sources for openness and popularity, and integrating statistical and geospatial information.

  • Georeferenced data
  • Linked open data
  • Statistics
  • Geospatial information
  • LOD implementation




Presentation Transcript


  1. Publishing georeferenced statistical data using linked open data technologies. Merging statistics and geospatial information grant series. Mirosław Migacz, GIS Consultant, Statistics Poland. 12.03.2019, NTTS 2019 Conference / Brussels / Belgium.

  2. The project. Title: “Development of guidelines for publishing statistical data as linked open data”, “Merging statistics and geospatial information” grant series, 2016-2017. Main goal: prepare a background for LOD implementation in official statistics.

  3. Before: powiat łobeski (LAU 1), identified variously as 3218, 4.4.32.64.18, lobeski, 4326418.

  4. After: powiat łobeski, http://nts.stat.gov.pl/4/4/32/64/18

  5. Specific objectives: identify data sources; identify statistical units; harmonize, generalize and build URIs for statistical units; transform statistical data, geospatial data and metadata into RDF (pilot); conclude the pilot transformation and formulate recommendations for a full-scale implementation.

  6. Primary data sources: Local Data Bank (biggest set of statistical information, available for a wide range of years, updated monthly); Demography Database (integrated data source for the state and structure of population, vital statistics and migrations); Development monitoring system STRATEG (a system for facilitating and monitoring development policy, with key measures to monitor execution of strategies at local, regional, transregional and EU level).

  7. Identification of data sources. Other data sources: publications, tables, communiques, announcements, articles.

  8. Data sources - inventory. Metadata: thematic category; format (PDF, DOC, XLS, CSV); spatial reference (country, NUTS, LAU, functional areas, urban areas); temporal reference (years); presence of identifiers (TERYT, NTS, NUTS); update cycle. Preliminary analysis of data sources: openness; redundancy of information; popularity (based on view / download stats).

  9. Statistical units inventory. Administrative boundaries: administrative units, NUTS. Non-standard statistical units: functional areas / urban areas; groups of administrative / statistical units; derived mostly from strategic documents. Hierarchy of NUTS and administrative units: macroregion (NUTS 1), voivodship, region (NUTS 2), subregion (NUTS 3), powiat (LAU 1), gmina (LAU 2).

  10. Statistical units harmonization - KTS. KTS: classification combining administrative and statistical units, introduced last year to comply with NUTS 2016; 14-digit code.
      symbol          name
      10000000000000  Poland
      10020000000000  macroregion
      10023200000000  voivodship
      10023210000000  region
      10023216400000  subregion
      10023216418000  powiat
      10023216418053  gmina

  11. Geometry harmonization/generalization. Input data: administrative boundaries since 2002 for LAU 2 (gmina), excluding 2007. Harmonization process: structure standardization; standardization of identifiers (creating KTS identifiers); aggregation to higher level units (LAU 1 -> NUTS 1). Generalization: several generalization scenarios tested in order to choose an optimal one; datasets with generalized and non-generalized geometries prepared for 2002-2016.

  12. Linked open data pilot: geospatial data (statistical unit geometries); statistical data (demographic data, classifications); data sources catalogue (metadata).

  13. LOD pilot - statistical data. Data: demographic data for 2016 from three major databases (Local Data Bank, Demography Database, STRATEG system). Ontologies for classifications: age codelist defined using SKOS (skos) & Dublin Core (dct); sex codelist re-used from SDMX, with a Polish translation added. Defining metadata for statistical values (observations): based primarily on SDMX ontologies (attribute, code, measure, dimension); qb:Observation class from Data Cube.

  14. LOD pilot - geospatial data. Input geometries: voivodship geometries for 2016. Ontologies: ontology for the KTS classification defined using RDF Schema (rdfs) & GeoSPARQL (geo) vocabularies. Geometry encoding: separate geo:Geometry entities with geometry encoded in WKT (Well-Known Text) format (geo:wktLiteral).

  15. LOD pilot - data sources catalogue. DCAT-AP (dcat) application profile for data portals in Europe; data sources described as instances of the dcat:Dataset class; links to other vocabularies: EuroVoc (for thematic categories), EU Publications Office continent / country codelist (for spatial reference), Internet Media Type (MIME).

  16. LOD pilot - linking. The dataset catalogue, geospatial data and statistical data are linked through: the spatial domain for datasets; dataset definitions for statistical data; geometries for observations.

  17. Data transformation into RDF. 1. Source files in CSV.

  18. Data transformation into RDF. 2. Python script using RDFlib module for transformation.

  19. Data transformation into RDF. 3a. Results in any desired format (RDF/XML).

  20. Data transformation into RDF. 3b. Results in any desired format (Turtle).

  21. LOD pilot - triple store. Apache Jena Fuseki used as a SPARQL server; 71717 triples loaded; single Fuseki dataset (STAT_LOD) to allow cross-querying and cross-browsing of data created initially in separate files; SPARQL endpoint for querying.

  22. LOD pilot - SPARQL endpoint.

  23. LOD pilot - Pubby frontend (catalogue).

  24. LOD pilot - Pubby frontend (dataset).

  25. LOD pilot - Pubby frontend (value).

  26. LOD pilot - Pubby frontend (geometry).

  27. LOD pilot - conclusions. No reference implementation for statistical linked open data: lack of integrity between RDF metadata sets published by one authority; links to non-existing entities; lack of maintenance. Lack of pan-European guidelines for statistical linked open data: common vocabularies; recommended or dedicated software components; DIGICOM ESSNet LOD project.

  28. LOD pilot - conclusions. Some software / programming components are no longer being developed: implementations might become unstable; Python-based implementations seem sustainable at this point. Semantic harmonization of statistical classifications: different meanings for supposedly the same classification elements, e.g. 0-5 can be “0 to 5” or “0 to less than five”; not only a pan-European issue, may also exist at country level.

  29. LOD pilot - conclusions. Methodology for publishing spatial data as linked open data: single entity per single geometry (inventory of boundary changes; geometry instances with non-meaningful identifiers, i.e. UUIDs); separate geometries for respective years (a complete set of geometries each year, regardless of changes; geometry instances with meaningful identifiers, i.e. KTS + year).

  30. LOD pilot - conclusions. Most linked open data implementations are technically correct: it is nearly impossible to produce syntactically incorrect RDF metadata files; you can put anything in the RDF graph, but does it make sense semantically? Linked open data implementations based on Python scripts are easy to amend in the future. RDF vocabulary specifications are easier to interpret with a UML model provided (Thank you, Captain Obvious).

  31. Publishing georeferenced statistical data using linked open data technologies. Merging statistics and geospatial information grant series. Mirosław Migacz, GIS Consultant, Statistics Poland. www.linkedin.com/in/migacz, m.migacz@stat.gov.pl. 12.03.2019, NTTS 2019 Conference / Brussels / Belgium.
