Research Information Linked Open Data Store
euroCRIS members meeting, Bonn, May 2013

Overview
• Needs & Drivers
• Information and data sources
  • Structured
  • Unstructured
• Architecture
  • Planned
  • Realised
• Tools

Vlaamse overheid | Departement Economie, Wetenschap en Innovatie

Project
• Partners
  • Knowledge Management unit, EWI
  • IBM Belgium
• Goals
  • Merge all sources into one open environment
  • Apply entity resolution techniques to remove data silos
  • Crawling and content analysis of full-text elements
  • Build and test the proposed Pilot Architecture
  • Integrate information from structured and unstructured data in one container
  • Build a number of visualisations of the information
  • Develop a roadmap towards the Operational Architecture
• Timing: 4 months starting from January 2013
• Cost: 124k euro

Needs & drivers
• Better information: correct, current, complete
• Open FRIS data for services and application development
• Flemish government Open Data policy
• Maximum reuse of components
• Increase strategic intelligence
• Maximum reuse of data
• Policy monitoring: efficient & effective
• More information services
• Connect data silos
• Reduce system costs

FRIS structured information

FRIS Unstructured Data
• Project homepage
• Publication abstract
• Person homepage
• Organisation activity descriptions
• Publication full text
• Project abstracts

Information and Data sources
• Structured Data
  • FRIS research portal database
    • Format: CERIF2006 database
    • Coverage: all universities, 1 university college
  • 4 university OARs
    • Format: MODS records
    • Coverage: X publication records, X full-text resources
  • VABB-SSH: publication monitoring data set on Social Sciences and Humanities
    • Format: MODS records
    • Coverage: all universities
• Semantics and information model
  • Business Semantics Glossary
    • FRIS model: CERIF2006
    • Semantics: Entity Classifications

Information and Data sources
• Unstructured Data
  • All textual information from the structured data
    • Project abstracts
    • Publication abstracts
    • Organisation activity descriptions
  • Full text of publications
  • Websites
    • Project
    • Researcher
    • Organisation

Links and Locators
• Access to unstructured data
  • Textual elements in the CERIF model
    • Project abstracts
    • Publication abstracts
    • Organisation activity descriptions
  • Websites
    • URI fields in CERIF entities
  • Links to full text
    • Resource links in MODS records

Scope

Some numbers
• CERIF records
  • Person: 22.006 (FRIS) + 1.454.208 (OAI without resolution)
  • Project: 24.634 (FRIS)
  • Organisation: 1.398 (OAI) + 2.022 (FRIS)
  • Publications: 3.596 (FRIS)
• MODS records
  • OARs: 598.035 (OAI) + VABB database
  • Publication full text: 45.294 (OAI)

Planned Architecture
• Structured Data input
• Content Analysis
• Concept Extraction
• Operational Store
• Triple Store
• Semantic control
• Identifiers & Entity Resolution
• Visualisation

RELOD Structured Data Architecture

OAR Harvesting Architecture
• Crawler management
• Crawler
• OAI-PMH: UHasselt, UGent, …
• VABB
• XML
• MODS to CERIF conversion
• CERIF database
• D2R transformation

RELOD Architecture

Architecture – Tools & Standards
BSG, SBVR, D2R, Jena TDB, Java, HTTP REST, SPARQL, OWL, SKOS, RDFS, WEB 2.0, FUSEKI, Oracle RDF, CERIF, APACHE TOMCAT, SILK, R2R, HARVESTER, OAI-PMH, MODS, SIEVE, ICA, ICC, UIMA, LDIF

Some numbers
• Entities
  • Projects: 24.634 (FRIS)
  • Persons: 22.006 (FRIS) + 1.454.208 (OAI without resolution!)
  • Publications: 598.035 (OAI) + 3.596 (FRIS)
    • With full text: 45.294 (OAI)
  • OrgUnit: 1.398 (OAI) + 2.022 (FRIS)
  • Recognised author affiliations from full text: 55.662
• Triple Store
  • Triples FRIS + OAI: 57M
  • Triples text mining (author recognition + lemmas): 144M
  • Still without inference (no triples derived by inference)

Analysis - Visualisation

Visualisations
• Two test visualisations built so far:
  • Word cloud for a person
    • http://ewisclod3.vlaanderen.be/words/
  • Persons related to Concepts
    • http://ewisclod3.vlaanderen.be/persons/
• New visualisations will be built on well-defined use cases
  • Tuning the content analytics to the case
  • Supervised learning for specific domains
  • Give a contextual overview of research from the last 10 years on social security issues in Belgium
  • Annual report on research in the domain of renewable energy

Entity resolution
A few tools tested
Silk Link Discovery Framework
• Used to map authors from the OAR harvest onto Persons from the CERIF sources
• Experimented with
  • Manual construction of matching rules via the Silk Workbench
  • Active learning combined with the Silk generic algorithms
  • Several metrics on the text dimensions (Levenshtein, tf-idf, Jaro, Jaccard) in combination with numerical and temporal dimensions
• Results still have to be validated in detail
Tests with OKKAM are planned

Architecture Roadmap Elements
• Full-CERIF automatic D2R template generation (optional)
• Replace D2R with standard: R2RML
• Support incremental CERIF/RDF loading
• Integration of Data Governance Center via the API
• Complete modelling of CERIF and Semantics in Data Governance Center
• Full-CERIF automatic ontology template generation
Legend: manual, automated

D2R Views
• FRIS: http://ewisclod3.vlaanderen.be/d2rq/fris/
• OAI-PMH: http://ewisclod3.vlaanderen.be/d2rq/oai/
• Text Mining: http://ewisclod3.vlaanderen.be/d2rq/tm/

SPARQL
• Test page: http://ewisclod3.vlaanderen.be/ewilod/html/sparql-test.html
• Endpoint (query only): http://ewisclod3.vlaanderen.be/ewilod/sparql

RESTful API (GET)
• Resource base URL: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/resource/
• Ontology base URL: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/ontology

Triple store graph URIs
• FRIS: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/graphs#fris
• OAI-PMH: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/graphs#oai
• Text Mining: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/graphs#tm
• Mappings: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/graphs#ld

LDIF
• Status monitor: http://ewisclod3.vlaanderen.be/ldif/status/

Silk Workbench
• http://localhost:8080 (via SSH tunnel)

Visualisations
• Index page: http://ewisclod3.vlaanderen.be/ewilod/html/vis/index.html
• The visualisations:
  • http://ewisclod3.vlaanderen.be/persons/
  • http://ewisclod3.vlaanderen.be/words/
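
Example: querying one named graph
The SPARQL endpoint and the triple store graph URIs listed in this deck can be combined in a single query. The sketch below is not part of the original slides. It assumes that the endpoint follows the standard SPARQL 1.1 Protocol (a GET request with a `query` parameter) and can return SPARQL JSON results, and the pilot server may no longer be reachable. The function names (`build_query`, `build_request_url`, `parse_bindings`, `run_query`) are illustrative only.

```python
import json
import urllib.parse
import urllib.request

# URLs taken from the slides; their current availability is not guaranteed.
ENDPOINT = "http://ewisclod3.vlaanderen.be/ewilod/sparql"
FRIS_GRAPH = "http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/graphs#fris"


def build_query(graph_uri, limit=10):
    """Return a SPARQL query that lists triples from one named graph."""
    return (
        "SELECT ?s ?p ?o\n"
        f"WHERE {{ GRAPH <{graph_uri}> {{ ?s ?p ?o }} }}\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(endpoint, query):
    """Encode the query as a GET URL (assumes the SPARQL 1.1 Protocol)."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def parse_bindings(payload):
    """Flatten a SPARQL JSON result document into a list of plain dicts."""
    doc = json.loads(payload)
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in doc["results"]["bindings"]
    ]


def run_query(endpoint, query, timeout=30):
    """Send the query over HTTP and return the parsed rows (needs network)."""
    request = urllib.request.Request(
        build_request_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_bindings(response.read().decode("utf-8"))
```

A call such as `run_query(ENDPOINT, build_query(FRIS_GRAPH, limit=5))` would return up to five rows from the FRIS graph if the server responds. Replacing `FRIS_GRAPH` with the OAI-PMH, Text Mining or Mappings graph URI restricts the query to that source instead.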