Project

Research Information Linked Open Data Store
euroCRIS members meeting, Bonn, May 2013
Overview
• Needs & Drivers
• Information and data sources
  • Structured
  • Unstructured
• Architecture
  • Planned
  • Realised
• Tools
Vlaamse overheid | Departement Economie, Wetenschap en Innovatie
Project
• Partners
  • Knowledge Management unit, EWI
  • IBM Belgium
• Goals
  • Merge all sources into one open environment
  • Apply entity resolution techniques to remove data silos
  • Crawling and content analysis of full text elements
  • Build and test the proposed Pilot Architecture
  • Integrate information from structured and unstructured data in one container
  • Build a number of visualisations of the information
  • Develop a roadmap towards the Operational Architecture
• Timing: 4 months, starting from January 2013
• Cost: 124k euro
Needs & drivers
• Better information: correct, current, complete
• Open FRIS data for services and application development
• Flemish government Open Data policy
• Maximum reuse of components
• Increase strategic intelligence
• Maximum reuse of data
• Policy monitoring: efficient & effective
• More information services
• Connect data silos
• Reduce system costs
FRIS structured information
FRIS Unstructured Data
• Project homepage
• Publication abstract
• Person homepage
• Organisation activity descriptions
• Publication full text
• Project abstracts
Information and Data sources
• Structured Data
  • FRIS research portal database
    • Format: CERIF2006 database
    • Coverage: all universities, 1 university college
  • 4 university OARs
    • Format: MODS records
    • Coverage: X publication records, X full text resources
  • VABB-SSH: publication monitoring data set on Social Sciences and Humanities
    • Format: MODS records
    • Coverage: all universities
• Semantics and information model
  • Business Semantics Glossary
    • FRIS model: CERIF2006
    • Semantics: Entity Classifications
Information and Data sources
• Unstructured Data
  • All textual information from the structured data
    • Project abstracts
    • Publication abstracts
    • Organisation activity descriptions
  • Full text of publications
  • Websites
    • Project
    • Researcher
    • Organisation
Links and Locators
• Access to unstructured data
  • Textual elements in CERIF model
    • Project abstracts
    • Publication abstracts
    • Organisation activity descriptions
  • Websites
    • URI fields in CERIF entities
  • Links to full text
    • Resource links in MODS records
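
To illustrate how full text locations might be read from a harvested MODS record, here is a minimal sketch. It assumes the repositories publish the link in the standard MODS `location/url` element; the sample record and its URL are hypothetical and not taken from the project data.

```python
import xml.etree.ElementTree as ET

# Namespace of MODS version 3 records.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}


def extract_resource_links(mods_xml):
    """Return the URLs found in mods:location/mods:url elements."""
    root = ET.fromstring(mods_xml)
    links = []
    for url_el in root.iterfind(".//mods:location/mods:url", MODS_NS):
        if url_el.text and url_el.text.strip():
            links.append(url_el.text.strip())
    return links


# Hypothetical record, for illustration only.
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Example publication</title></titleInfo>
  <location>
    <url>http://repository.example.org/record/1/fulltext.pdf</url>
  </location>
</mods>"""

print(extract_resource_links(SAMPLE))
```

The resulting list could then be handed to a crawler that fetches the full text for content analysis.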
Scope
Some numbers
• CERIF records
  • Person: 22.006 (FRIS) + 1.454.208 (OAI, without resolution)
  • Project: 24.634 (FRIS)
  • Organisation: 1.398 (OAI) + 2.022 (FRIS)
  • Publications: 3.596 (FRIS)
• MODS records
  • OARs: 598.035 (OAI) + VABB database
  • Publication full text: 45.294 (OAI)
Planned Architecture
Diagram components:
• Structured data input
• Content analysis
• Concept extraction
• Operational store
• Triple store
• Semantic control
• Identifiers & entity resolution
• Visualisation
RELOD Structured Data Architecture
OAR Harvesting Architecture
Diagram components:
• OAI-PMH sources: UGent, UHasselt, …
• VABB (XML)
• Crawler
• Crawler management
• MODS to CERIF conversion
• CERIF database
• D2R transformation
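
The harvesting step can be sketched as a loop over OAI-PMH `ListRecords` requests. This is only an illustration of the protocol, not the project's crawler: the base URL is hypothetical, and the `mods` metadata prefix is an assumption about how the repositories expose their MODS records.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="mods", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # In OAI-PMH 2.0 the resumptionToken is an exclusive argument:
        # it replaces metadataPrefix on follow-up requests.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def next_token(response_xml):
    """Return the resumptionToken of a response, or None when the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(".//oai:resumptionToken", OAI_NS)
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


# Hypothetical, heavily shortened response for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

BASE = "http://repository.example.org/oai"  # hypothetical repository
print(list_records_url(BASE))
print(list_records_url(BASE, resumption_token=next_token(SAMPLE_RESPONSE)))
```

A crawler would fetch each URL in turn, pass the records on to the MODS to CERIF conversion, and stop once `next_token` returns `None`.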
RELOD Architecture
Architecture – Tools & Standards
Tools and standards named in the diagram:
BSG, SBVR, D2R, Jena TDB, Java, HTTP, REST, SPARQL, OWL, SKOS, RDFS, Web 2.0, Fuseki, Oracle, RDF, CERIF, Apache Tomcat, Silk, R2R, Harvester, OAI-PMH, MODS, Sieve, ICA, ICC, UIMA, LDIF
Some numbers
• Entities
  • Projects: 24.634 (FRIS)
  • Persons: 22.006 (FRIS) + 1.454.208 (OAI, without resolution!)
  • Publications: 598.035 (OAI) + 3.596 (FRIS)
    • With full text: 45.294 (OAI)
  • OrgUnit: 1.398 (OAI) + 2.022 (FRIS)
  • Recognised author affiliations from full text: 55.662
• Triple Store
  • Triples FRIS + OAI: 57M
  • Triples text mining (author recognition + lemmas): 144M
  • Still without inference (no inferred triples)
Analysis - Visualisation
Visualisations
• Two test visualisations built so far:
  • Word cloud for person
    • http://ewisclod3.vlaanderen.be/words/
  • Persons related to Concepts
    • http://ewisclod3.vlaanderen.be/persons/
• New visualisations will be built on well-defined use cases
  • Tuning the content analytics to the case
  • Supervised learning for specific domains
  • Give a contextual overview of research from the last 10 years on social security issues in Belgium
  • Annual report on research in the domain of renewable energy
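
As a rough idea of what sits behind a word cloud for a person, the sketch below counts terms across a person's abstracts and returns the most frequent ones as weights. It is an assumption-laden simplification: the slides mention lemmas produced by text mining, whereas this example uses raw lowercase tokens, a tiny illustrative stopword list, and made-up abstracts.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "of", "and", "in", "a", "to", "on", "for"}


def word_weights(texts, top_n=5):
    """Count non-stopword tokens over all texts and return the top_n (word, count) pairs."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(top_n)


# Hypothetical abstracts linked to one person.
abstracts = [
    "Renewable energy policy in Flanders",
    "Wind energy and solar energy storage",
]

print(word_weights(abstracts, top_n=1))  # [('energy', 3)]
```

The counts can be mapped directly to font sizes in the visualisation.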
Entity resolution
A few tools tested
Silk Link Discovery Framework
• Used to map authors from the OAR harvest onto Persons from the CERIF sources.
• Experimented with
  • Manual construction of matching rules via the Silk workbench
  • Active learning combined with the Silk generic algorithms
  • Several metrics on the text dimensions (Levenshtein, tf-idf, Jaro, Jaccard) in combination with numerical and temporal dimensions
  • Results still have to be validated in detail.
Tests with OKKAM are planned
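
Two of the text metrics named above can be sketched in plain Python to show how a name-matching rule might combine them. This is not Silk's implementation or the rules used in the project; the threshold of 0.8 and the sample names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Normalise the edit distance to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard(a, b):
    """Jaccard similarity of the lowercase token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def is_match(name_a, name_b, threshold=0.8):
    """Hypothetical rule: accept when either metric reaches the threshold."""
    score = max(levenshtein_similarity(name_a.lower(), name_b.lower()),
                jaccard(name_a, name_b))
    return score >= threshold


print(levenshtein("kitten", "sitting"))          # 3
print(is_match("Jan Peeters", "Peeters Jan"))    # True: same tokens, different order
```

Taking the maximum lets the token-based metric catch reordered names that a pure edit distance would penalise; combining with numerical or temporal dimensions, as in the experiments, would add further terms to the score.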
Architecture Roadmap Elements
• Full-CERIF automatic D2R template generation
• (Optional) Replace D2R with standard: R2RML
• Support incremental CERIF/RDF loading
• Integration of Data Governance Center via the API
• Complete modelling of CERIF and Semantics in Data Governance Center
• Full-CERIF automatic ontology template generation
Legend: manual, automated
D2R Views
• FRIS: http://ewisclod3.vlaanderen.be/d2rq/fris/
• OAI-PMH: http://ewisclod3.vlaanderen.be/d2rq/oai/
• Text Mining: http://ewisclod3.vlaanderen.be/d2rq/tm/
SPARQL
• Test page: http://ewisclod3.vlaanderen.be/ewilod/html/sparql-test.html
• Endpoint (query only): http://ewisclod3.vlaanderen.be/ewilod/sparql
RESTful API (GET)
• Resource base URL: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/resource/
• Ontology base URL: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/ontology
Triplestore graph URIs
• FRIS: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/graphs#fris
• OAI-PMH: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/graphs#oai
• Text Mining: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/graphs#tm
• Mappings: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/graphs#ld
LDIF
• Status monitor: http://ewisclod3.vlaanderen.be/ldif/status/
Silk
• Workbench: http://localhost:8080 (via SSH tunnel)
Visualisations
• Index page: http://ewisclod3.vlaanderen.be/ewilod/html/vis/index.html
• The visualisations:
  • http://ewisclod3.vlaanderen.be/persons/
  • http://ewisclod3.vlaanderen.be/words/
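
The endpoint and graph URIs listed above could be combined as in the following sketch, which builds a GET request that counts the triples in one named graph. It assumes the endpoint follows the standard SPARQL protocol (query passed in a `query` parameter) and only constructs the URL; whether the server is still reachable is not something this example checks.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://ewisclod3.vlaanderen.be/ewilod/sparql"
FRIS_GRAPH = "http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/graphs#fris"


def count_triples_query(graph_uri):
    """SPARQL query counting all triples in a named graph."""
    return (
        "SELECT (COUNT(*) AS ?triples) "
        f"WHERE {{ GRAPH <{graph_uri}> {{ ?s ?p ?o }} }}"
    )


def sparql_get_url(endpoint, query):
    """Encode the query as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, count_triples_query(FRIS_GRAPH))
print(url)

# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == count_triples_query(FRIS_GRAPH))
```

Swapping in the OAI-PMH or Text Mining graph URI gives the corresponding counts per source.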