Sample trial (search with keyword "mouse AND phenotype")

--- Summary ---
BioSPARQL automates the building of complex SPARQL queries for users who do not know the data structures of RDF data repositories.


Broadly Integrated Ontological SPARQL Protocol and RDF Query Language, BioSPARQL for short, allows users to build a SPARQL query and manage versions of dispersed LOD data sets without knowing their data structures or traversing several LOD data repositories on the Web. BioSPARQL provides the following functions for executing SPARQL queries in practical bio-medical research:

  1. Executes the query only over the necessary subsets of data broadly distributed on the web, by classifying huge amounts of data into small subsets by ontology.
  2. Assists users in interactively and automatically building a SPARQL query template by logically analysing RDF/OWL datasets, even when the user does not know the schema and structure of the datasets.
  3. Executes the SPARQL query by automatically linking both open data and the user’s private data, downloading a “snapshot” for local processing as needed, with precise RDF data version control.

The BioSPARQL tool implements unifying logic to construct a SPARQL query over RDF/OWL datasets and to integrate various public bio-medical databases in the user's local environment. These advanced features are realised by two core software components: the query builder and the data file manager.

BioSPARQL query builder
Provides an intuitive graphical user interface that displays a network of RDF Linked Open Data. The user designates a concept of interest as the starting point and finds another concept of interest in the network; BioSPARQL then automatically builds a SPARQL query template that searches for a semantic path between the two concepts by logically analysing RDF/OWL datasets.

BioSPARQL data file manager
Manages a local private SPARQL endpoint for
  1. Precise RDF data version control
  2. Linking both Linked Open Data and Linked Private Data
  3. Avoiding timeouts on a SPARQL endpoint over HTTP during large-scale querying

In our implementation, the query builder is developed as a web application that helps users generate SPARQL queries in their web browser through an intuitive graphical user interface covering BioLOD.org data sets. To aid query writing, BioSPARQL analyses the corresponding ontological BioLOD data structure and suggests possible data path schemas and filters in its graphical user interface.

The data file manager runs as a program installed on the user’s computer. It analyses and evaluates generated queries by accessing the user’s SPARQL endpoint, automatically downloading and updating local copies of biological LOD data files as a snapshot of the LOD data sets needed to evaluate the query.

Although users generate a SPARQL query via the BioSPARQL query builder web service, the generated query is designed to run in the user’s local environment against the corresponding downloaded BioLOD data files. This lets users control the influence of data updates on query results.

Example scenario.

  1. open Query Builder with RIKEN ENU-induced allele in mouse
  2. browse How to use.
  3. BioSPARQL generated SPARQL



BioSPARQL Tutorial

Version 19 November, 2011

Bioinformatics And Systems Engineering (BASE) division, RIKEN
http://www.base.riken.jp/

Creative Commons License

Index

  1. Introduction
  2. Architecture
  3. RDF data and SPARQL queries supported by BioSPARQL
  4. Relationship of BioSPARQL with existing semantic web technologies
  5. Downloading and Installation
  6. How to use BioSPARQL
  7. Sample applications for biological integrated RDF data sets
  8. Limitations of the BioSPARQL

1. Introduction

The BioSPARQL framework generates suitable SPARQL queries by analysing RDF/OWL data structures. Unlike existing systems that generate queries based on SPARQL syntax, BioSPARQL analyses RDF/OWL datasets logically and presents users with a template for the corresponding queries. Users can then easily complete a SPARQL query by typing keywords into the template.

BioSPARQL stands for Broadly Integrated Ontological SPARQL Protocol and RDF Query Language. It generates and executes SPARQL queries by performing advanced ontology-based knowledge processing on RDF/OWL datasets, which it collects and integrates on the basis of ontological notions such as category, class and data concept. More concretely, BioSPARQL realises effective data search through the following three functions:

BioLOD.org is a database service that provides RDF/OWL datasets categorised by ontology. The BioLOD.org database collects and integrates data from selected major bio-medical databases published worldwide, converting the original data into RDF/OWL format. In BioLOD.org, data are gathered into ontology classes, and the data of each class can be downloaded as files in various RDF formats. In this tutorial, we explain the usage of BioSPARQL using BioLOD.org as the target database.

Note: The data domain supported by BioSPARQL is not limited to the life sciences. BioSPARQL is applicable to other fields since it is designed to work with general RDF/OWL data.

2. Architecture

The Core BioSPARQL Software Components

To implement the three search functions introduced in Section 1, the BioSPARQL software consists of the following components:
  1. the BioSPARQL Web Service with web server including the graph searcher that finds RDF graphs of interest to the user and the Semantic-JSON interface that provides fragments of RDF data needed to build a SPARQL query,
  2. the Query Builder that implements function 1 and
  3. the Data File Manager that implements functions 2 and 3.

2.1. BioSPARQL Web Service

The BioSPARQL Web Service manages RDF data sets provided by RDF file repositories and mashes up these data sets to generate a full-text search index for the graph search function and a data source for the Semantic-JSON service. Using the mashed-up data, the BioSPARQL Web Server provides a keyword search service that discovers RDF graphs of interest to the user, and a Semantic-JSON service that the Query Builder uses to build a SPARQL query.

2.2. BioSPARQL Query Builder

The BioSPARQL Query Builder is a JavaScript program running in the user's web browser. The Query Builder accesses the Semantic-JSON service to obtain the fragments of RDF data necessary for building a SPARQL query. To make query building intuitive, the Query Builder draws a network of the RDF graphs that can be reached from a graph Gs of interest to the user via several steps of semantic links (properties). In this network the user selects the end graph Ge, and a query for an RDF path from Gs to Ge is built (see Section 3). Queries generated by the BioSPARQL Query Builder are sent to the BioSPARQL Data File Manager and executed there.

2.3. BioSPARQL Data File Manager

The BioSPARQL Data File Manager software component manages the data files necessary to execute queries generated by the BioSPARQL Query Builder. It also sends the queries and data to a SPARQL endpoint for evaluation. The Data File Manager keeps these data files under version control as needed and downloads the latest RDF file from an RDF file repository if necessary. This Java program runs on the user’s local computer and requires that a SPARQL endpoint be deployed on the same computer.

3. RDF data and SPARQL queries supported by BioSPARQL

This section introduces RDF data structure and SPARQL queries supported by BioSPARQL.

3.1. RDF data sets and RDF graph

BioSPARQL supports RDF/OWL data sets categorised on the basis of concepts and ontologies described in OWL and RDFS, and each data set forms an RDF graph. The RDF graphs are further categorised into the following two types:
A) Ontology graph (database)
An ontology graph is a set of RDF/OWL classes denoting ontological terms such as concepts and vocabularies, so that the graph corresponds to an ontology such as Gene Ontology or Mammalian Phenotype. Each class is related to other classes in the graph, and the relationships are typical ontological relationships, for instance the ‘subClassOf’ and ‘part of’ relationships defined by OWL and OBO. Since a set of concepts forms a database, we also call an ontology graph a database.

B) Data class graph
A data class graph is a set of instances included in an RDF/OWL class. These instances have semantic links using the properties described as a class restriction of the RDF/OWL class.

3.2. SPARQL query

BioSPARQL generates a SPARQL query with a SELECT clause that discovers a sequential path indirectly connecting two resources r1 and rn via other resources r2, …, rn-1. The path is specified by its data schema G1 ← P1 → G2 ← P2 → ... ← Pn-1 → Gn, where r1, r2, ..., rn are resources belonging to graphs G1, G2, ..., Gn respectively, and Pi (1 <= i <= n-1) is a set of properties connecting ri and ri+1 in the forward and/or reverse direction. The following figure shows a sample data schema that includes both a data class graph and an ontology graph, connected by properties in the forward and reverse directions.

Further, BioSPARQL allows users to specify keywords for each graph, so that only the paths whose resources in that graph have the keywords are returned.
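
As a rough illustration of the query shape described in this section, the following Python sketch assembles a SELECT query from a path data schema. It is not the Query Builder's actual output: the Step structure, the variable names (?r1, ?label1, ...), the choice to match keywords against rdfs:label with CONTAINS, and the placement of each link triple in the graph of its subject are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


@dataclass
class Step:
    """One link between ri and ri+1 (hypothetical helper structure)."""
    prop: str             # property URI taken from Pi
    forward: bool = True  # True: ri -> ri+1, False: ri+1 -> ri


def _escape(text: str) -> str:
    """Escape a keyword for use inside a double-quoted SPARQL literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_path_query(graphs: List[str], steps: List[Step],
                     keywords: Optional[Dict[int, str]] = None) -> str:
    """Build a SELECT query for the schema G1 - P1 - G2 - ... - Gn.

    keywords maps a 1-based graph position to a keyword (assumed layout).
    """
    if len(steps) != len(graphs) - 1:
        raise ValueError("need exactly one step between consecutive graphs")
    keywords = keywords or {}
    select_vars = " ".join(f"?r{i}" for i in range(1, len(graphs) + 1))
    body = []
    for i, graph in enumerate(graphs, start=1):
        # Assumption: every resource carries a label inside its own graph.
        body.append(f"  GRAPH <{graph}> {{ ?r{i} <{RDFS_LABEL}> ?label{i} . }}")
        keyword = keywords.get(i)
        if keyword:
            body.append(
                f'  FILTER CONTAINS(LCASE(STR(?label{i})), "{_escape(keyword.lower())}")'
            )
    for i, step in enumerate(steps, start=1):
        subj, obj = (i, i + 1) if step.forward else (i + 1, i)
        # Assumption: a link triple is stored in the graph of its subject.
        body.append(
            f"  GRAPH <{graphs[subj - 1]}> {{ ?r{subj} <{step.prop}> ?r{obj} . }}"
        )
    return "\n".join([f"SELECT {select_vars}", "WHERE {", *body, "}"])
```

Calling build_path_query with two graph URIs, one reverse Step and a keyword for the first graph yields a query with two label patterns, one FILTER clause and one link pattern whose subject is ?r2.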

4. Relationship of BioSPARQL with existing semantic web technologies

RIKEN uses an information infrastructure based on semantic web technologies for integrating and publishing various life-sciences data. The software components for each function comprising this infrastructure are shown in the following figure.


5. Downloading and Installation

As mentioned previously, BioSPARQL consists of software components called the BioSPARQL Query Builder and BioSPARQL Data File Manager. BioSPARQL Query Builder is provided by the BioSPARQL.org web site, and the program runs on a user's web browser. Firefox and Chrome web browsers are supported. BioSPARQL Data File Manager is a Java application that runs on a user’s local computer.

The program can be downloaded from here after reading section 8 on limitations and licensing of this Tutorial and must be installed by following the instructions shown below.

5.1. Installation of BioSPARQL Data File Manager

  1. First prepare a Windows PC or a Mac to run the BioSPARQL Data File Manager and a SPARQL endpoint.
  2. Please confirm that Java Runtime Environment (JRE) version 6 is installed on the computer. If it is not installed, install the latest version 6 JRE on the computer.
  3. Next install the Virtuoso Open-Source Edition. For Windows PCs, binary programs are available and installation instructions are shown on the website (http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSUsageWindows).
  4. Deploy the BioSPARQL data file manager JAR file to a directory in which the JAR file is executable.
  5. Determine a directory on the user’s PC for storing configuration files and downloaded RDF files. We call this directory the BioSPARQL directory and here we assume that /biosparql is the BioSPARQL directory.
  6. Set up a configuration where Virtuoso can access the BioSPARQL directory. Specifically, edit the database/virtuoso.ini file located under the directory where the Virtuoso Open-Source Edition program is installed.
    1. Add the BioSPARQL directory to the DirsAllowed entry in this file as follows:

      Before editing
      DirsAllowed = ., ../vad

      After editing
      DirsAllowed = ., ../vad, /biosparql

    2. Further, evaluating a SPARQL query may require a large amount of memory. To assign suitable memory to Virtuoso, set the variables NumberOfBuffers and MaxDirtyBuffers. For instance, in order to assign 4 GB of memory, set the variables as follows:

      NumberOfBuffers = 340000

      MaxDirtyBuffers = 250000

  7. Launch Virtuoso service on the PC.

A document with more detailed installation instructions is included in the binary program package (see Section 8).
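
The DirsAllowed edit in step 6 can also be scripted. The following Python sketch is one possible way to do it and is not part of the BioSPARQL distribution; the function name allow_directory is hypothetical, and the sketch assumes that the file contains an uncommented DirsAllowed line in the "key = value" form shown above.

```python
def allow_directory(ini_text: str, directory: str) -> str:
    """Return ini_text with `directory` appended to the DirsAllowed entry.

    All other lines, including comments, are passed through unchanged.
    """
    out = []
    found = False
    for line in ini_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "DirsAllowed":
            found = True
            dirs = [d.strip() for d in value.split(",") if d.strip()]
            if directory not in dirs:  # keep the edit idempotent
                dirs.append(directory)
            line = f"DirsAllowed = {', '.join(dirs)}"
        out.append(line)
    if not found:
        raise ValueError("no DirsAllowed entry found")
    return "\n".join(out) + "\n"
```

To apply it, read database/virtuoso.ini into a string, pass it through allow_directory together with the BioSPARQL directory (here /biosparql), write the result back, and restart Virtuoso. Keeping a backup copy of the original file is advisable.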

6. How to use BioSPARQL

Overview of the workflow

We start with an overview of the BioSPARQL workflow. The following figure shows the workflow, which consists of 3 steps, each handled by an individual service provided by BioSPARQL.

Each step is described in detail in the following sections.

6.1. Step 1: Graph Search

This service receives a keyword typed by a user and discovers the graphs that include at least one resource having the keyword in its labels.

The search result is a list of graphs (RDF/OWL classes and ontology databases) containing the keyword. The user selects a graph from the list as the starting point of the network used to select a semantic path in the query building service.

Sample Operation

  1. Launch the Firefox or Chrome web browser.
  2. The home page of BioSPARQL.org (this page) has a graph search interface (go to the graph search interface). Type a keyword and press the search button.
  3. By default, a list of the data class graphs matching the keyword is shown.

    1. By clicking the database tab of the Class/Database select tabs, a list of hit ontology graphs (databases) is displayed.

    2. The search results are filtered when a character in the initial character selector is clicked. In this example, the initial character 'A' is clicked and only the search results whose names start with 'A' are displayed.

    3. By clicking a name of a hit class graph, a query builder is launched in another window of the web browser.
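
The behaviour of the graph search and the initial character selector can be pictured with the small Python sketch below. It is only a toy model: the real service uses a full-text search index on the BioSPARQL Web Service, whereas this sketch assumes an in-memory mapping from graph names to resource labels and a case-insensitive substring match. The function names and sample data are hypothetical.

```python
from typing import Dict, List


def search_graphs(graph_labels: Dict[str, List[str]], keyword: str) -> List[str]:
    """Return the names of graphs with at least one label containing keyword."""
    needle = keyword.lower()
    return sorted(
        name
        for name, labels in graph_labels.items()
        if any(needle in label.lower() for label in labels)
    )


def filter_by_initial(names: List[str], initial: str) -> List[str]:
    """Keep only the graph names that start with the given initial character."""
    return [name for name in names if name.upper().startswith(initial.upper())]


# Hypothetical sample data: graph name -> labels of its resources.
SAMPLE = {
    "Allele": ["mouse allele Il6ra"],
    "Seedling": ["albino seedling"],
    "Mouse line": ["ENU-induced mouse"],
}
hits = search_graphs(SAMPLE, "mouse")
```

With the sample data, hits contains the two graphs that have a label mentioning "mouse", and filter_by_initial(hits, "A") narrows the list to the names starting with 'A'.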

6.2. Step 2: Building a Query

This service generates a SPARQL query that discovers a semantic path from the starting graph selected in the graph search service to the graph that the user is interested in.

To make the selection of a path data schema intuitive, a network is displayed that includes all graphs connected sequentially by semantic links (properties) in the forward and reverse directions. Nodes in the network correspond to graphs, and edges correspond to the forward/reverse links connecting two graphs. The network allows a user to select a node of interest as the end node of the path data schema.

Even after the user selects an end graph node, more than one path from the starting graph node to the end graph node may exist. The query building service allows the user to choose one of the paths on the path list.
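
The path list can be thought of as the result of enumerating simple paths in the network while following links in both directions. The Python sketch below shows one way to do this with a depth-limited search; the edge representation, the max_steps limit and the function name are assumptions for illustration, not a description of how the Query Builder is implemented.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]      # (source graph, property, target graph)
Hop = Tuple[str, bool, str]      # (property, forward?, graph reached)


def enumerate_paths(edges: List[Edge], start: str, end: str,
                    max_steps: int = 3) -> List[List[Hop]]:
    """List all simple paths from start to end, following links both ways."""
    neighbours: Dict[str, List[Hop]] = {}
    for src, prop, dst in edges:
        neighbours.setdefault(src, []).append((prop, True, dst))   # forward
        neighbours.setdefault(dst, []).append((prop, False, src))  # reverse

    paths: List[List[Hop]] = []

    def walk(node: str, visited: frozenset, hops: List[Hop]) -> None:
        if node == end and hops:
            paths.append(list(hops))
            return
        if len(hops) == max_steps:
            return
        for prop, forward, nxt in neighbours.get(node, []):
            if nxt in visited:  # never revisit a graph on the same path
                continue
            hops.append((prop, forward, nxt))
            walk(nxt, visited | {nxt}, hops)
            hops.pop()

    walk(start, frozenset({start}), [])
    return paths
```

Each returned path is a list of hops, which carries the same information as a path data schema: the property used, its direction, and the graph reached at each step.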

Once the steps above are done, the path data schema is fully specified, and a SPARQL query that discovers all linked data satisfying the path data schema is generated. To narrow the query result, keywords can be specified for each graph included in the path data schema. More concretely, for each graph the user can specify keywords to be matched against the labels of its instances and also of its linked instances and literals. These keywords are expressed as FILTER clauses in the SPARQL query.

The generated SPARQL query is posted to the data file management service described below.

Sample Operation

  1. BioSPARQL Query Builder opens in another window when a user clicks a hit class/database name included in the search results in the graph searcher shown previously in Section 6.1. In this example, the following page is displayed. (URL: http://biosparql.org/class/RIKEN_ENU-induced_allele_in_mouse/crib190u2i)

  2. A semantic graph is displayed with the starting class/database in the centre. BioSPARQL generates a suitable SPARQL query to find a path from this starting class/database to the class/database of interest to the user. The user selects the class/database at the end of the search path by clicking the corresponding class/database icon or class name in the graph. Immediately, all possible semantic paths from the starting class/database to the selected class/database are shown below the graph.

  3. Select a search path from the list of all possible semantic paths by clicking the “Choose” button.

  4. For each graph included in the selected semantic path, a text box is displayed for each property so that the user can enter keywords. To search only the linked data of interest that satisfies the semantic path, type keywords in the text boxes as necessary. As keywords are typed, the SPARQL query displayed at the bottom of the web browser is revised continuously. Further, at the bottom of the keyword panel, a result term selector is displayed for each graph, which allows the user to include:
    1. both URIs and names of hit resources,
    2. only URIs of hit resources,
    3. only names of hit resources or
    4. nothing in the SPARQL results.

  5. The generated SPARQL query is posted to the Data File Manager by clicking the "Post" button.

6.3. Step 3: Evaluate the SPARQL query using Data File Manager

This service first analyses the posted query and obtains a list of graph URIs. By accessing the RDF file download service, it downloads the corresponding data files for the graphs.
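
As a minimal sketch of the first part of this step, the Python function below collects graph URIs from a query text. It assumes that the query names its graphs in FROM, FROM NAMED or GRAPH clauses with full URIs in angle brackets; how the Data File Manager actually analyses a query is not described here, and a regular expression like this one does not skip comments or string literals.

```python
import re
from typing import List

# Matches FROM <uri>, FROM NAMED <uri> and GRAPH <uri> (assumed query style).
_GRAPH_REF = re.compile(
    r"\b(?:FROM(?:\s+NAMED)?|GRAPH)\s*<([^<>\s]+)>", re.IGNORECASE
)


def extract_graph_uris(query: str) -> List[str]:
    """Return the distinct graph URIs in a query, in order of appearance."""
    seen: List[str] = []
    for uri in _GRAPH_REF.findall(query):
        if uri not in seen:
            seen.append(uri)
    return seen
```

The resulting list is what a download step would iterate over to fetch one data file per graph.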

The downloaded files are managed under precise data version control, and suitable versions of the graph data sets are uploaded to a SPARQL endpoint running on the user’s local PC. There, the data are mashed up to form a network, that is, an integrated RDF data set. The SPARQL query is evaluated over this network on the SPARQL endpoint and the result is obtained in a standardised format.

For SPARQL experts, the SPARQL query can be modified not only to search over public data obtained by BioSPARQL but also over the user’s private data, and the corresponding private data sets are added to the network on the SPARQL endpoint.

The Data File Manager is provided in two versions: a servlet version operated via a web browser and a command line version operated from a command line terminal.

6.3.1. Usage of the servlet version

In the servlet version, clicking the "Post" button of the Query Builder opens a new window in the user's web browser, and queries are performed in that window. To use the servlet version, launch the servlet process in advance by executing the following command.

C:\biosparql>java -jar BioSPARQLServlet.jar /biosparql/BioLOD/config/BioSPARQLWin.conf

  1. When the "Post" button of the Query Builder is clicked, a new window opens in the user's web browser. Confirm that the query generated in the Query Builder is correctly displayed in the query box, and then click "Extract resources from the query".

  2. A list of RDF files needed to perform the query is displayed. To download the RDF files and upload them into the SPARQL endpoint, click the "Download Files and setup SPARQL endpoint" button. Downloading and uploading take time when the files are large.

  3. Data preparation for the query is now finished and the query can be performed immediately. An expert SPARQL user can edit the SPARQL query in the query editor, for example to introduce the user's private data sets. First, select the data format of the query result (JSON, HTML, TURTLE, XML, RDF/XML or TTL). Then click the "Evaluate the query" button to perform the query.

  4. The result is displayed. Queries can be edited and the revised queries performed repeatedly. However, a revised query cannot be evaluated over the previously prepared datasets if it needs private data sets that the user has not provided. If the user selects HTML as the result format, each URI is displayed as a hyperlink. Clicking a hyperlink displays the web page of the URI, where the user can read a description of the data identified by the URI at the original data source site.

  5. The web page of the URI is displayed.

  6. If the data is not deleted, preparation for a later run of the same query or for an additional query finishes quickly. When you have finished with the Data File Manager, the prepared data can be removed using the button displayed at the bottom of the window. This removes the data loaded on the SPARQL endpoint; if "Completely remove downloaded local RDF data files" is checked, the downloaded RDF files are deleted as well. To delete the data, click the "Remove RDF resources from SPARQL endpoint" button.

6.3.2. Usage of command line version

The command line version is used to handle large data sets. By typing the following command in a command line terminal, the Data File Manager is executed.

C:\biosparql> java -jar BioSPARQL.jar command=eval config=/biosparql/config/BioLOD/BioSPARQLWin.conf query=/tmp/BioLOD/query/sparql.query localMode=false outputFile=/tmp/result.txt outputFormat=JSON

Here, the file specified as the query option includes the SPARQL query generated by the Query Builder. After query execution, in order to remove RDF data sources used for the query execution, type the following command in a command line terminal.

C:\biosparql> java -jar BioSPARQL.jar command=clean config=/biosparql/BioLOD/config/BioSPARQLconfig.conf query=/biosparql/query/query.txt outputFormat=JSON
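
When the command line version is driven from a script, the argument list can be assembled programmatically. The Python sketch below only reproduces the options that appear in the two commands above (command, config, query, localMode, outputFile, outputFormat); the function name is hypothetical, and passing localMode as true/false is an assumption based on the single localMode=false example.

```python
from typing import List, Optional


def data_file_manager_argv(command: str, config: str, query: str,
                           output_format: str = "JSON",
                           output_file: Optional[str] = None,
                           local_mode: Optional[bool] = None,
                           jar: str = "BioSPARQL.jar") -> List[str]:
    """Build the argument list for one Data File Manager invocation."""
    argv = ["java", "-jar", jar,
            f"command={command}", f"config={config}", f"query={query}"]
    if local_mode is not None:
        # Assumption: the option takes the literal values true and false.
        argv.append(f"localMode={'true' if local_mode else 'false'}")
    if output_file:
        argv.append(f"outputFile={output_file}")
    argv.append(f"outputFormat={output_format}")
    return argv


# The evaluation command shown above, rebuilt as an argument list.
eval_argv = data_file_manager_argv(
    "eval",
    "/biosparql/config/BioLOD/BioSPARQLWin.conf",
    "/tmp/BioLOD/query/sparql.query",
    output_file="/tmp/result.txt",
    local_mode=False,
)
```

The list can be handed to subprocess.run(eval_argv, check=True) from the directory that contains the JAR file; a second call with command "clean" builds the clean-up invocation in the same way.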

7. Sample applications for biological integrated RDF data sets

In this section, we present practical biological examples over integrated RDF data sets for mammals and plants.

7.1. Example of mammalian integrated data sets

This example obtains mouse resources (RIKEN ENU-induced mouse line (in BRC)) associated with the MGI Allele Il6ra<tm1.1Jcbr>.
The Query Builder URL is http://biosparql.org/class/RIKEN_ENU-induced_mouse_line_%28in_BRC%29/crib190u1i, and the keyword 'Il6ra<tm1.1Jcbr>' is specified as shown in the following figure.

The generated query can be downloaded from here. This example yields one solution, as shown in the following figure.

By clicking the resulting URI (http://biolod.org/M101156/crib190u1rib190s88i), the file repository redirects to the original source page (http://www2.brc.riken.jp/lab/animal/detail.php?brc_no=RBRC-GSC0298) and the original source page is displayed. The file size is displayed in red when it is larger than 10MB.

7.2. Example of the plant integrated data sets

This example obtains Arabidopsis and Rice genes semantically linked from the 'Seedling-albino' phenome data record (URI: http://scinets.org/item/crib32s99rib32s1i) of the RIKEN Arabidopsis Phenome Information Database (RAPID). As shown in the following figure, the Query Builder URL is http://biosparql.org/class/Seedling/crib32s99i, and the specified keyword is 'albino'.

The generated query can be downloaded from here. This example yields 13 solutions, as shown in the following figure.

By clicking the first resultant URI of ins4 (http://biolod.org/At-Os_Ortholog3924/crib127u1rib127u3924i), the file repository redirects to the original source page (https://database.riken.jp/sw/en/At-Os_Ortholog3924/crib127u1rib127u3924i/) and the original source page is displayed.

According to this original page, the resultant Arabidopsis and Rice genes are AT5G008400 (TAIR Locus) and Os02g58120 (O.sativa TIGR) respectively.

8. Limitations of the BioSPARQL

The BioSPARQL Data File Manager is distributed as JAR archives together with its source code as an open source program. The program is licensed under the Apache License, Version 2.0.

Downloading the programs

This software was originally developed and is copyrighted by the BioSPARQL project owned by Norio Kobayashi and Tetsuro Toyoda, RIKEN, Japan.
Copyright (c) 2011, RIKEN.

The current version supports only Virtuoso Open-Source Edition (http://www.openlinksw.com/wiki/main/Main) as a SPARQL endpoint. The authors of BioSPARQL have tested it on Windows and Mac OS X.

BioSPARQL Data File Manager and a SPARQL endpoint must be deployed on the same PC.

BioSPARQL cannot handle data sources and/or queries that exceed the processing capacity of the SPARQL endpoint. BioSPARQL Query Builder may create such queries. When the SPARQL endpoint exceeds its capacity to handle data sources and/or queries, an error code is returned from BioSPARQL.