Query classes and ontology databases intended to generate a SPARQL query
--- Summary ---
BioSPARQL automates the building of complex SPARQL queries for users who do not know the data structures of RDF data repositories.
Broadly Integrated Ontological SPARQL Protocol and RDF Query Language, BioSPARQL for short, allows users to build a SPARQL query and manage dispersed LOD data set versions even without knowing their data structures or traversing several LOD data repositories on the Web.
BioSPARQL provides the following functionality for executing SPARQL queries in practical bio-medical science research:
The BioSPARQL tool implements unifying logic to construct a SPARQL query built upon RDF/OWL datasets and integrating various public bio-medical databases in the user's local environment. In order to realise the advanced features of BioSPARQL, the core software components including query builder and data file manager are implemented.
- BioSPARQL query builder
- Provides an intuitive graphic user interface including a network of RDF Linked Open Data that designates a user's interesting concept as its starting point to find another interesting concept, and automatically builds a SPARQL query template to search a semantic path between these interesting concepts by logically analysing RDF/OWL datasets.
- BioSPARQL data file manager
- Management of a local private SPARQL endpoint for
- Precise RDF data version control
- Linking both Linked Open Data and Linked Private Data
- Ignores timeout on a SPARQL endpoint over HTTP protocol during large scale querying
Data file manager operates as a program installed in a user’s computer. It analyses and evaluates generated queries by accessing a user’s SPARQL endpoint and automatically downloading and updating local copies of Biological LOD data files as a snapshot of LOD data sets necessary to evaluate the query.
Though BioSPARQL allows users to generate a SPARQL query via our BioSPARQL query builder web service, the SPARQL query generated is designed to perform in the user’s local environment with its corresponding downloaded BioLOD data files. This controls the influence of data updates on query results.
Example scenario.
- open Query Builder with RIKEN ENU-induced allele in mouse
- browse How to use.
- BioSPARQL generated SPARQL
BioSPARQL Tutorial
Version 19 November, 2011
Bioinformatics And Systems Engineering (BASE) division, RIKEN
http://www.base.riken.jp/
Index
- Introduction
- Architecture
- RDF data and SPARQL queries supported by BioSPARQL
- Relationship of BioSPARQL with existing semantic web technologies
- Downloading and Installation
- How to use BioSPARQL
- Sample applications for biological integrated RDF data sets
- Limitations of the BioSPARQL
1. Introduction
The BioSPARQL framework generates suitable SPARQL queries by analysing RDF/OWL data structure. Unlike existing systems that generate queries based on SPARQL syntax, BioSPARQL analyses RDF/OWL datasets logically and presents users with a template for the most suitable corresponding query. Users can then easily complete the SPARQL query by typing keywords into the template.
BioSPARQL stands for Broadly Integrated Ontological SPARQL Protocol and RDF Query Language. It generates and executes SPARQL queries by performing advanced ontology-based knowledge processing on RDF/OWL datasets that it collects and integrates based on ontological units such as category, class and data concept. More concretely, BioSPARQL realises effective data search through the three functions listed below:
- Function 1:
BioSPARQL helps users generate a SPARQL query whose form is difficult for humans to understand. It does this by presenting the matched-up connection relationships among a number of databases to the user in advance, and then automatically generating a complex SPARQL query template presented in the user’s web browser.
- Function 2:
By analysing the SPARQL query generated by function 1, BioSPARQL discovers the database sets that hold the target search data, automatically downloads these distributed databases to the user’s local computer, and joins them together as an integrated database.
- Function 3:
BioSPARQL evaluates the query over the integrated database and presents the connected semantic web links it has mined.
Note) The data domain supported by BioSPARQL is not limited to the life-sciences. BioSPARQL is applicable for other fields since it is designed to work with general RDF/OWL data.
2. Architecture
The Core BioSPARQL Software Components
To implement the three search functions introduced in Section 1, the BioSPARQL software consists of the following components:
- the BioSPARQL Web Service, a web server that includes the graph searcher, which finds RDF graphs of interest to the user, and the Semantic-JSON interface, which provides the fragments of RDF data needed to build a SPARQL query,
- the Query Builder that implements function 1 and
- the Data File Manager that implements functions 2 and 3.
2.1. BioSPARQL Web Service
2.2. BioSPARQL Query Builder
2.3. BioSPARQL Data file manager
3. RDF data and SPARQL queries supported by BioSPARQL
This section introduces RDF data structure and SPARQL queries supported by BioSPARQL.
3.1. RDF data sets and RDF graph
BioSPARQL supports RDF/OWL data sets categorised on the basis of concepts and ontologies described in OWL and RDFS, and each data set forms an RDF graph. The RDF graphs are further categorised into the following two types:
- A) Ontology graph (database)
- An ontology graph is a set of RDF/OWL classes denoting ontological terms such as concepts and vocabularies, so this graph corresponds to an ontology such as Gene Ontology or Mammalian Phenotype. Each class is related to other classes in the graph by typical ontological relationships, for instance the ‘subclassOf’ and ‘part of’ relationships defined by OWL and OBO. Since a set of concepts forms a database, we call the set of concepts a database as well as an ontology graph.
- B) Data class graph
- A data class graph is a set of instances included in an RDF/OWL class. These instances have semantic links using the properties described as a class restriction of the RDF/OWL class.
3.2. SPARQL query
BioSPARQL generates a SPARQL query with a SELECT clause that discovers a sequential path indirectly connecting two resources.
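As a rough illustration of such a path query, the following Python sketch assembles a SELECT query that follows a chain of properties through instances of a sequence of classes. The function name, the variable naming scheme and the example URIs are assumptions made for this sketch; the queries actually produced by BioSPARQL are not reproduced in this tutorial and may differ in form. Only forward links are handled here.

```python
def build_path_query(class_uris, property_uris):
    """Build a SELECT query following a chain of properties.

    class_uris:    [C0, C1, ..., Cn]  classes along the path
    property_uris: [p1, ..., pn]      p(i) links an instance of C(i-1) to one of C(i)
    """
    if len(class_uris) != len(property_uris) + 1:
        raise ValueError("need exactly one property between each pair of classes")
    # One result variable per resource on the path.
    variables = [f"?r{i}" for i in range(len(class_uris))]
    lines = [f"SELECT {' '.join(variables)}", "WHERE {"]
    # Each resource on the path is an instance of its class.
    for var, cls in zip(variables, class_uris):
        lines.append(f"  {var} a <{cls}> .")
    # Consecutive resources are joined by the connecting property.
    for i, prop in enumerate(property_uris):
        lines.append(f"  {variables[i]} <{prop}> {variables[i + 1]} .")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical three-class path, used only for illustration.
print(build_path_query(
    ["http://example.org/Allele", "http://example.org/Gene", "http://example.org/Phenotype"],
    ["http://example.org/hasGene", "http://example.org/hasPhenotype"],
))
```

The two end variables (?r0 and the last ?r) are the resources that are only indirectly connected; the intermediate variables make the connecting path explicit in the result.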
4. Relationship of BioSPARQL with existing semantic web technologies
RIKEN uses an information infrastructure based on semantic web technologies for integrating and publishing various life-sciences data. The software components for each function comprising this infrastructure are shown in the following figure.
- Data interchange: RDF layer
This layer manages posting and publishing Linked Open Data, and converts the data from its former existing data format into RDF. The LinkedData.org web service using this technique is implemented and published on the web.
- RDFS layer
This layer realizes data integration by adding standardized and controlled semantics to the data using classes and properties based on RDF Schema (RDFS), through describing data sets belonging to data classes and semantic relationship between classes. SciNets.org is a web service published by RIKEN that implements this layer.
- Ontology OWL layer
In the ontology description language named OWL, this layer organizes educible datasets that can be elicited for each class, based on standardized vocabulary and concept classes. BioLOD.org is a web-based database that provides the data sets organized for each class in various RDF data formats.
- OWL-based query builder layer
This layer implements a data access interface for intelligent processing among the data in OWL. The Semantic-JSON.org web service provides a light-weight application programming interface (API) necessary for advanced intelligent data processing that accesses RDF/OWL data provided by BioLOD.org.
- Logic for ontology layer
This layer implements the logic to generate advanced and optimal suitable SPARQL queries over ontology based data such as OWL data. BioSPARQL belongs to this layer and is implemented using the Semantic-JSON.org interface.
5. Downloading and Installation
As mentioned previously, BioSPARQL consists of software components called the BioSPARQL Query Builder and BioSPARQL Data File Manager. BioSPARQL Query Builder is provided by the BioSPARQL.org web site, and the program runs on a user's web browser. Firefox and Chrome web browsers are supported. BioSPARQL Data File Manager is a Java application that runs on a user’s local computer.
The program can be downloaded from here after reading section 8 on limitations and licensing of this Tutorial and must be installed by following the instructions shown below.
5.1. Installation of BioSPARQL Data File Manager
- First prepare a Windows PC or Mac to run the BioSPARQL data file manager and a SPARQL endpoint.
- Please confirm that Java Runtime Environment (JRE) version 6 is installed on the computer. If it is not installed, install the latest version 6 JRE on the computer.
- Next install the Virtuoso Open-Source Edition. For Windows PCs, binary programs are available and installation instructions are shown on the website ( http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSUsageWindows)
- Deploy the BioSPARQL data file manager JAR file to a directory in which the JAR file is executable.
- Determine a directory on the user’s PC for storing configuration files and downloaded RDF files. We call this directory the BioSPARQL directory and here we assume that /biosparql is the BioSPARQL directory.
- Set up a configuration where Virtuoso can access the BioSPARQL directory. Specifically, edit the database/virtuoso.ini file located under the directory where the Virtuoso Open-Source Edition program is installed.
- Add the BioSPARQL directory to the DirsAllowed entry in this file as follows:
Before editing
DirsAllowed = ., ../vad
After editing
DirsAllowed = ., ../vad, /biosparql
- Further, evaluating a SPARQL query may require a large amount of memory. To assign suitable memory to Virtuoso, set the variables NumberOfBuffers and MaxDirtyBuffers. For instance, to assign 4GB of memory, set the variables as follows:
NumberOfBuffers = 340000
MaxDirtyBuffers = 250000
A document describing more detailed installation instructions is included in the binary program package (see Section 8).
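The DirsAllowed edit described in the installation steps can also be scripted. The following Python sketch works on the text of virtuoso.ini and appends a directory to the DirsAllowed list if it is not already present. The function name is an assumption of this sketch, and it deliberately treats the file as plain lines rather than assuming anything about the rest of the file's layout.

```python
def allow_directory(ini_text, directory):
    """Return ini_text with `directory` appended to the DirsAllowed list."""
    out = []
    for line in ini_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "DirsAllowed":
            # Split the comma-separated list and drop empty entries.
            dirs = [d.strip() for d in value.split(",") if d.strip()]
            if directory not in dirs:
                dirs.append(directory)
            line = f"DirsAllowed = {', '.join(dirs)}"
        out.append(line)
    return "\n".join(out)


before = "DirsAllowed = ., ../vad"
print(allow_directory(before, "/biosparql"))
```

Because the directory is only added when missing, running the function twice leaves the entry unchanged. Keep a backup copy of virtuoso.ini before writing any modified text back to it.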
6. How to use BioSPARQL
Overview of the workflow
We start with viewing an overview of the workflow of how to use BioSPARQL. The following figure shows a workflow including 3 steps of the individual services provided by BioSPARQL.
6.1. Step 1: Graph Search

This service receives a keyword typed by a user and discovers the graphs that include at least one resource having the keyword in its labels.
The search result is a list of graphs of RDF/OWL classes and ontology databases having the keyword and allows users to select a graph from the list as the starting point of a network for selecting a semantic path in the query build service.
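The behaviour described above can be pictured with a small in-memory sketch in Python. It is not the implementation of the graph searcher: the dictionary layout, the function names, the case-insensitive substring matching and the toy labels are all assumptions made for illustration. The sketch groups hits into data class graphs and ontology graphs (databases), and filters names by their initial character.

```python
def search_graphs(graphs, keyword):
    """Return names of graphs with at least one label containing the keyword.

    graphs maps a graph name to {"type": "class" or "database", "labels": [...]}.
    Hits are grouped by graph type and sorted by name.
    """
    kw = keyword.lower()
    hits = {"class": [], "database": []}
    for name, info in graphs.items():
        if any(kw in label.lower() for label in info["labels"]):
            hits[info["type"]].append(name)
    for names in hits.values():
        names.sort()
    return hits


def filter_by_initial(names, initial):
    """Keep only the names that start with the given initial character."""
    return [n for n in names if n.upper().startswith(initial.upper())]


# Toy data; the labels below are invented for this example.
toy_graphs = {
    "RIKEN ENU-induced allele in mouse": {"type": "class", "labels": ["albino coat allele"]},
    "Mammalian Phenotype": {"type": "database", "labels": ["albino coat", "abnormal gait"]},
    "Gene Ontology": {"type": "database", "labels": ["cell cycle"]},
}
print(search_graphs(toy_graphs, "albino"))
```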
Sample Operation
- Launch the Firefox or Chrome web browser.
- The home page of BioSPARQL.org (this page) has a graph search interface (go to the graph search interface). Type a keyword and press the search button.
- By default, a list of the data class graphs that contain the keyword is shown.
- By clicking the database tab of the Class/Database select tabs, a list of hit ontology graphs (databases) is displayed.
- The search results are filtered when a character of the initial character selector is clicked. In this example, the initial character 'A' is clicked and then only the search results whose names start with 'A' are displayed.
- By clicking a name of a hit class graph, a query builder is launched in another window of the web browser.




6.2. Step 2: Building a Query

This service generates a SPARQL query that discovers a semantic path from the starting graph selected in the graph search service to the graph that the user is interested in.
To let the user select a path data schema intuitively, a network is displayed that includes all graphs reachable through sequentially connected semantic links (properties) in the forward and reverse directions. Nodes in the network correspond to graphs, and edges correspond to the forward/reverse links connecting two graphs. The network allows a user to select a node of interest as the end node of the path data schema.
Even if the user selects an end graph node, one or more paths from the starting graph node to the end graph node may exist. The query building service allows the user to select between paths on the path list.
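One way to picture how several candidate paths arise is a depth-first enumeration over the network of graphs. The Python sketch below is an illustration under assumptions, not the algorithm used by the query building service: the edge representation, the "^" prefix marking a link followed in the reverse direction, the hop limit and the toy graph names are all invented for this example.

```python
def find_paths(edges, start, end, max_hops=4):
    """List simple paths from start to end, following links in both directions.

    edges is a list of (source_graph, property, target_graph) triples.
    Each path is a list of (property, next_graph) steps.
    """
    # Index every link in both directions; "^" marks a reverse traversal.
    neighbours = {}
    for src, prop, dst in edges:
        neighbours.setdefault(src, []).append((prop, dst))
        neighbours.setdefault(dst, []).append(("^" + prop, src))

    paths = []

    def walk(node, path, visited):
        if node == end:
            paths.append(path)
            return
        if len(path) >= max_hops:
            return
        for prop, nxt in neighbours.get(node, []):
            if nxt not in visited:  # never revisit a graph on the same path
                walk(nxt, path + [(prop, nxt)], visited | {nxt})

    walk(start, [], {start})
    return paths


# Toy network of three graphs with a direct and an indirect route.
toy_edges = [
    ("Allele", "hasGene", "Gene"),
    ("Gene", "hasPhenotype", "Phenotype"),
    ("Allele", "hasPhenotype", "Phenotype"),
]
for p in find_paths(toy_edges, "Allele", "Phenotype"):
    print(p)
```

In this toy network the end node can be reached both directly and via the intermediate graph, which is the situation in which the user is asked to choose from the path list.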
When the steps listed above are done, a path data schema is already specified. Here, a SPARQL query that discovers all linked data satisfying the path data schema is generated. In order to narrow the query result, keywords for each graph included in the path data schema can be specified. More concretely, for each graph, keywords can be specified by a user against labels for its instances and also of its linked instances and literals. These keywords are described as filter clauses in the SPARQL query.
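The idea of expressing keywords as filter clauses can be sketched as follows in Python. The exact filter form used by BioSPARQL is not shown in this tutorial, so the choice of a case-insensitive CONTAINS test on an rdfs:label value, as well as the function names, are assumptions of this sketch.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def sparql_string(text):
    """Quote text as a SPARQL string literal (escapes backslash and double quote)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def keyword_filter(label_var, keyword):
    """FILTER clause keeping rows whose label contains the keyword, ignoring case."""
    return f"FILTER(CONTAINS(LCASE(STR({label_var})), {sparql_string(keyword.lower())}))"


def add_keyword(where_lines, resource_var, label_var, keyword):
    """Return the WHERE lines plus a label pattern and keyword filter for one resource."""
    return where_lines + [
        f"  {resource_var} <{RDFS_LABEL}> {label_var} .",
        f"  {keyword_filter(label_var, keyword)}",
    ]


print("\n".join(add_keyword([], "?r0", "?label0", "albino")))
```

A substring test is used here instead of a regular expression so that keywords containing characters such as '.' or '<' do not need regular-expression escaping.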
The generated SPARQL query is posted to the data file management service described below.
Sample Operation
- BioSPARQL Query Builder opens in another window when a user clicks a hit class/database name included in the search results in the graph searcher shown previously in Section 6.1. In this example, the following page is displayed. (URL: http://biosparql.org/class/RIKEN_ENU-induced_allele_in_mouse/crib190u2i)
- A semantic graph is displayed starting from the class/database shown in the centre. BioSPARQL generates a suitable SPARQL query to find a path from this starting class/database to the class/database of interest to the user. The user selects the class/database at the end of the search path by clicking the corresponding class/database icon or class name in the graph. Immediately, all possible semantic paths from the starting class/database to the selected class/database are shown below the graph.
- By clicking the “Choose” button, select a search path from the list of all possible semantic paths.
- For each graph included in the selected semantic path, a text box is displayed to input a user’s keyword for each property. To search among linked data of interest that satisfies the corresponding semantic path, type keywords in each text box as necessary.
When a keyword is typed, the SPARQL query displayed on the bottom of the web browser is revised continuously.
Further, on the bottom of the keyword panel, a result term selector for each graph is displayed, which allows a user to select to include:
- both URIs and names of hit resources,
- only URIs of hit resources,
- only names of hit resources or
- nothing in the SPARQL results.
- The generated SPARQL query is posted to the Data File Manager by clicking the "Post" button.





6.3. Step 3: Evaluate the SPARQL query using Data File Manager

This service first analyses the posted query and obtains a list of graph URIs. By accessing the RDF file download service, it downloads the corresponding data files for the graphs.
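How the Data File Manager analyses a query internally is not described here. As a simplified illustration of the idea, the Python sketch below collects the graph URIs named in FROM, FROM NAMED and GRAPH clauses of a query using a regular expression. The assumption that graph URIs appear in these clauses, and the function name, belong to this sketch only; a regular expression can also be fooled by URIs inside string literals or comments, so a real implementation would use a SPARQL parser.

```python
import re

# Matches  FROM <uri>,  FROM NAMED <uri>  and  GRAPH <uri>  (case-insensitive).
GRAPH_REF = re.compile(r"\b(?:FROM(?:\s+NAMED)?|GRAPH)\s*<([^<>\s]+)>", re.IGNORECASE)


def extract_graph_uris(query):
    """Return the distinct graph URIs referenced by the query, in order of appearance."""
    seen = []
    for uri in GRAPH_REF.findall(query):
        if uri not in seen:
            seen.append(uri)
    return seen


example = """
SELECT ?s
FROM <http://example.org/g1>
FROM NAMED <http://example.org/g2>
WHERE { GRAPH <http://example.org/g1> { ?s ?p ?o } }
"""
print(extract_graph_uris(example))
```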
The downloaded files are managed under precise data version control, and suitable versions of graph data sets are uploaded to a SPARQL endpoint running in the user’s local PC. Here, these data are mashed-up to form a network as an integrated RDF data set. The SPARQL query is evaluated over the network on the SPARQL endpoint and the result is obtained in a standardised format.
For SPARQL experts, the SPARQL query can be modified not only to search over public data obtained by BioSPARQL but also over the user’s private data, and the corresponding private data sets are added to the network on the SPARQL endpoint.
Data File Manager is implemented as a servlet version operated via a web browser and a command line version operated from a command line terminal.
6.3.1. Usage of the servlet version
In the servlet version, clicking the "Post" button of the Query Builder opens a new window in the user's web browser, and queries are performed in that window. To use the servlet version, launch a servlet process in advance by executing the following command.
C:\biosparql>java -jar BioSPARQLServlet.jar /biosparql/BioLOD/config/BioSPARQLWin.conf
- Clicking the "Post" button of the Query Builder opens a new window in the user's web browser. Confirm that the query generated in the Query Builder is correctly displayed in the query box, and then click "Extract resources from the query".
- A list of RDF files needed to perform the query is displayed. In the next steps of downloading RDF files and uploading these files into the SPARQL endpoint, it takes time when the file size is large. In order to download RDF files and upload these downloaded files into the SPARQL endpoint, click "Download Files and setup SPARQL endpoint" button.
- At this point, the data preparation needed to perform the query is finished and the query can be performed immediately. An expert SPARQL user can edit the SPARQL query in the query editor to introduce the user's private data sets. First, select the data format of the query result (JSON, HTML, TURTLE, XML, RDF/XML or TTL). Then click the "Evaluate the query" button to perform the query.
- The result is displayed. Queries can be edited and the revised queries performed repeatedly. However, a revised query cannot be evaluated over the previously prepared data sets unless any private data sets it needs are also provided. If the user selects HTML as the result format, each URI is displayed as a hyperlink. By clicking a hyperlink, the web page of the URI is displayed and the user can read a description of the data specified by the URI at the original data source site.
- The web page of the URI is displayed.
- If the data is not deleted, preparation for performing the same query later finishes quickly. When you have finished with the Data File Manager, the prepared data can be removed. The button to delete the data is displayed at the bottom of the window. This process deletes the data loaded on the SPARQL endpoint; if "Completely remove downloaded local RDF data files" is checked, the downloaded RDF files are also deleted. To delete the data, click the "Remove RDF resources from SPARQL endpoint" button.
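For readers who want to send a query to the local SPARQL endpoint themselves, the following Python sketch builds a standard SPARQL protocol request (an HTTP POST with a form-encoded query parameter) and selects the result format through the Accept header. The endpoint URL http://localhost:8890/sparql is an assumption based on a default Virtuoso installation, and the function names and the small format table are invented for this sketch; this is not how the Data File Manager itself is implemented.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Standard media types for a few of the result formats.
RESULT_TYPES = {
    "JSON": "application/sparql-results+json",
    "XML": "application/sparql-results+xml",
    "HTML": "text/html",
}


def build_request(query, result_format="JSON", endpoint="http://localhost:8890/sparql"):
    """Build (but do not send) a SPARQL protocol POST request."""
    body = urlencode({"query": query}).encode("ascii")
    headers = {
        "Accept": RESULT_TYPES[result_format],
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return Request(endpoint, data=body, headers=headers)


def evaluate(query, result_format="JSON", timeout=None, **kwargs):
    """Send the query and return the response text; timeout=None waits indefinitely."""
    request = build_request(query, result_format, **kwargs)
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


request = build_request("SELECT * WHERE { ?s ?p ?o } LIMIT 1")
print(request.get_method(), request.full_url)
```

Calling evaluate requires a running endpoint at the given URL; build_request can be inspected without one.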






6.3.2. Usage of command line version
The command line version is used to handle large data sets. By typing the following command in a command line terminal, the Data File Manager is executed.
C:\biosparql> java -jar BioSPARQL.jar command=eval config=/biosparql/config/BioLOD/BioSPARQLWin.conf query=/tmp/BioLOD/query/sparql.query localMode=false outputFile=/tmp/result.txt outputFormat=JSON
Here, the file specified as the query option includes the SPARQL query generated by the Query Builder. After query execution, in order to remove RDF data sources used for the query execution, type the following command in a command line terminal.
C:\biosparql> java -jar BioSPARQL.jar command=clean config=/biosparql/BioLOD/config/BioSPARQLconfig.conf query=/biosparql/query/query.txt outputFormat=JSON
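When the command line version is driven from a script, the two commands above can be assembled programmatically. The Python sketch below builds the argument lists using only the options that appear in the commands shown in this section; the function names are assumptions of this sketch, and "true" as the alternative value of localMode is assumed rather than documented here.

```python
import subprocess


def eval_command(config, query, output_file, output_format="JSON",
                 local_mode=False, jar="BioSPARQL.jar"):
    """Argument list for evaluating a query file with the Data File Manager."""
    return [
        "java", "-jar", jar,
        "command=eval",
        f"config={config}",
        f"query={query}",
        # Only localMode=false appears in the tutorial; "true" is an assumed value.
        f"localMode={'true' if local_mode else 'false'}",
        f"outputFile={output_file}",
        f"outputFormat={output_format}",
    ]


def clean_command(config, query, output_format="JSON", jar="BioSPARQL.jar"):
    """Argument list for removing the RDF data sources used by a query."""
    return [
        "java", "-jar", jar,
        "command=clean",
        f"config={config}",
        f"query={query}",
        f"outputFormat={output_format}",
    ]


def run(argv):
    """Run the command and raise an error if it exits with a non-zero status."""
    return subprocess.run(argv, check=True)


print(" ".join(eval_command(
    "/biosparql/config/BioLOD/BioSPARQLWin.conf",
    "/tmp/BioLOD/query/sparql.query",
    "/tmp/result.txt",
)))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with paths that contain spaces.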
7. Sample applications for biological integrated RDF data sets
In this section, we view several practical biological examples over integrated RDF data sets for mammals and plants.
7.1. Example of mammalian integrated data sets
This example obtains mouse resources (RIKEN ENU-induced mouse line (in BRC)) associated with MGI Allele Il6ra<tm1.1Jcbr>. The Query Manager URL is http://biosparql.org/class/RIKEN_ENU-induced_mouse_line_%28in_BRC%29/crib190u1i, and the specified keyword is 'Il6ra<tm1.1Jcbr>', as shown in the following figure.



7.2. Example of the plant integrated data sets
This example obtains Arabidopsis and Rice genes semantically linked from the 'Seedling-albino' phenome data record (URI: http://scinets.org/item/crib32s99rib32s1i) of the RIKEN Arabidopsis Phenome Information Database (RAPID). As shown in the following figure, the Query Manager URL is http://biosparql.org/class/Seedling/crib32s99i, and the specified keyword is 'albino'.


According to this original page, the resultant Arabidopsis and Rice genes are AT5G008400 (TAIR Locus) and Os02g58120 (O.sativa TIGR) respectively.
8. Limitations of the BioSPARQL
The BioSPARQL Data File Manager is distributed as JAR archives together with its source code as an open source program. The program is licensed under the Apache License, Version 2.0.
Downloading the programs
This software was originally developed and is copyrighted by the BioSPARQL project owned by Norio Kobayashi and Tetsuro Toyoda, RIKEN, Japan.
Copyright (c) 2011, RIKEN.
- Binary Programs (This binary package is customised to be extracted to C:\ on Windows, and /User/base/ on Mac OS X.)
- Source programs
BioSPARQL Data File Manager and a SPARQL endpoint must be deployed on the same PC.
BioSPARQL cannot handle data sources and/or queries that exceed the processing capacity of the SPARQL endpoint. BioSPARQL Query Builder may create such queries. When the SPARQL endpoint exceeds its capacity to handle data sources and/or queries, an error code is returned from BioSPARQL.
