Description of IBWS, IST Bioinformatics Web Services
Biological data are made available through heterogeneous information systems that are distributed over the Internet.
A system is therefore needed that improves the accessibility of this information, in particular by making
access to it automatic.
Among current ICT technologies, workflow management systems (WMS), in connection with Web Services (WS), seem
to be the most promising ones.
Current limitations of WMS mainly reside in the immaturity of the tools, which still lack some basic features and
have not yet proven their efficiency with high volumes of data, and in the limited number of databases that are
currently available through standard programmatic interfaces.
While new network information sources are often developed by taking these new technologies into account, many
existing databases do not yet offer programmatic access. Although it is likely that some of them
will eventually be restructured or included in new systems, it would be extremely useful if they were made
available as WS now.
IST Bioinformatics Web Services have been deployed at the National Cancer Research Institute of Genoa to support
and directly contribute to this effort.
IST Bioinformatics Web Services can be accessed through any WSDL-SOAP compliant software, including the well known
Developed Web Services currently include three main groups, respectively referring to CABRI catalogues,
data sets of the IARC TP53 Mutation Database and public SRS sites.
The Common Access to Biological Resources and Information (CABRI)
project was funded by the European Union from 1996 to 1999 within the V Framework programme.
Among other achievements, it led to the setting up of the so-called CABRI Network Services, which offer access to 28
collections of biological resources from some of the best-known and most respected European Biological Resource Centers,
comprising more than 120,000 resources.
These catalogues underwent a careful analysis and comparison that led to the definition of common data sets and formats.
All collections then submitted their catalogues by using the common formats and these were implemented in a common
SRS site for data and search integration.
The CABRI site is one of the most used information sources on
biological resources, but it does not offer a programmatic interface.
For this reason, we developed and deployed the CABRI Web Services.
CABRI WS allow for the execution of a search either by name, by identifier or free text to CABRI catalogues through the
Integration of these data with other sources was planned by using either unique IDs or common terms.
To support this, two types of services were implemented: one searches for a specific feature
(usually name or free text) and returns IDs, the other searches for an ID and returns the full record.
Such services have been developed for each of the main types of biological resources in the CABRI system,
i.e., human and animal cell lines, bacteria and archaea strains, filamentous fungi strains, yeast strains,
plasmids, phages, and for all resources together.
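The two service types can be combined in a simple two-step lookup. The following is only a minimal sketch that uses an in-memory stand-in for a catalogue: the function names `search_by_name` and `get_by_id`, the record fields and the sample entries are hypothetical and do not reflect the actual names or data of the CABRI WS.

```python
# Hypothetical in-memory stand-in for one catalogue; these are not real CABRI records.
CATALOGUE = {
    "X-001": {"name": "Example strain alpha", "type": "yeast"},
    "X-002": {"name": "Example strain beta", "type": "yeast"},
    "X-003": {"name": "Sample plasmid alpha", "type": "plasmid"},
}


def search_by_name(term):
    """First service type: search on a feature and return matching IDs only."""
    term = term.lower()
    return sorted(rid for rid, rec in CATALOGUE.items() if term in rec["name"].lower())


def get_by_id(record_id):
    """Second service type: look up one ID and return the full record."""
    return CATALOGUE[record_id]


def search_full_records(term):
    """Chain the two steps: feature search first, then one ID lookup per hit."""
    return [get_by_id(rid) for rid in search_by_name(term)]
```

Because the first step returns only IDs, the same identifiers can also be passed to services of other sources before any full record is retrieved.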
IARC TP53 Mutation Database
The TP53 Mutation Database of the International Agency for Research on Cancer (IARC)
is the largest and most detailed database of mutations of the human TP53 gene and its related protein described
in the literature.
It includes somatic mutations (mutation type with references, mutation prevalence, mutation and prognosis),
germline mutations (both data and references), polymorphisms, mutant functions and cell line status.
Release 14, issued in November 2009, includes 26,597 somatic mutations whose descriptions were derived from
2,198 papers indexed in Medline.
Information on somatic mutations includes data on the mutation, the sample, the patient and the patient's lifestyle.
Reference vocabularies and standardized annotations are used extensively for the description of the mutation,
tumour site, type and origin and for literature references.
Examples of reference vocabularies are the ICD-O (International Classification of Diseases, Oncology) and SNOMED nomenclatures.
On the IARC web site, queries can only be executed online and require human interaction.
Moreover, some data sets cannot be searched through online queries.
In order to improve accessibility and support programmatic access, we implemented the TP53 Mutation
Database in an SRS site, then developed and deployed a set of Web Services.
TP53 WS allow for the retrieval of database identifiers or of complete records through the SRS implementation.
Again, integration of these data with other sources can be implemented by using either unique IDs or common terms.
Two types of services were implemented for the various data sets: one searches for a specific feature and returns
IDs, the other searches for an ID and returns the full record.
The mutation data set can be searched by many different features, such as exon or intron, effect of the mutation
on the protein, mutation type, tumour origin, localization of metastasis, and occurrence of the mutation in a splice
site or in a CpG island.
SRS by Web Services
A list of public SRS sites is maintained by BioWisdom.
It includes more than 1,000 databanks available in about 40 SRS sites.
Each of these sites can be queried through its web interface, but their contents cannot be made available to workflows.
SWS (SRS by WS) is a suite of WS that queries the biological databases available in this list and returns
results in a simple text-only format.
It can query selected systems and retrieve essential information on the sites, such as their actual availability
and the lists of databases and tools they include, and on the available databases, such as the sites where they
are implemented and the related sizes and versions.
SWS can be invoked by specifying the name of the databank to be queried and query terms.
It then automatically chooses the best site, performs the query and returns the complete results.
Users can also specify the following information:
the SRS site to be queried, the fields where the information must be searched, the desired output fields.
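As an illustration of how these inputs fit together, the sketch below assembles a request with two mandatory values and three optional ones. It is only a sketch under stated assumptions: the function `build_query_request` and all key names are hypothetical and are not the parameter names of the actual service.

```python
def build_query_request(databank, query, site=None, search_fields=None, output_fields=None):
    """Assemble a request dictionary; all key names are hypothetical.

    The databank name and the query terms are required. The site, the fields
    to search and the fields to return are optional and are left out of the
    request when not given, so that the service can apply its own defaults.
    """
    if not databank:
        raise ValueError("the name of the databank is required")
    if not query:
        raise ValueError("the query terms are required")

    request = {"databank": databank, "query": query}
    if site:
        request["site"] = site
    if search_fields:
        request["search_fields"] = list(search_fields)
    if output_fields:
        request["output_fields"] = list(output_fields)
    return request
```

Omitting the optional values rather than sending empty ones keeps the simplest call down to a databank name and the query terms.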
SWS currently includes four Web Services.
getDBs retrieves acronyms of all libraries (databases) that are available in a specified site.
getSites retrieves acronyms of all SRS sites that include a specified library.
getImplementations retrieves all implementations of a specified library.
These three services do not actually query any SRS site.
querySWS performs actual queries on a specified library.
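Since these lookups only need locally stored information, they can be pictured as simple filters over a table of records. The sketch below assumes a hypothetical record layout (site, library, number of entries, SRS version) and invented sample values; the content returned by the real getImplementations service may differ.

```python
from collections import namedtuple

# Hypothetical layout of one locally stored record: a library as implemented at one site.
LibraryRecord = namedtuple("LibraryRecord", "site library entries srs_version")

# Invented sample values, for illustration only.
RECORDS = [
    LibraryRecord("SITE_A", "LIB1", 1000, "7"),
    LibraryRecord("SITE_A", "LIB2", 50, "7"),
    LibraryRecord("SITE_B", "LIB1", 1200, "6"),
]


def get_dbs(site):
    """Acronyms of all libraries available at the given site."""
    return sorted({r.library for r in RECORDS if r.site == site})


def get_sites(library):
    """Acronyms of all sites that include the given library."""
    return sorted({r.site for r in RECORDS if r.library == library})


def get_implementations(library):
    """All stored records (site, size, version) for the given library."""
    return [r for r in RECORDS if r.library == library]
```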
The query (i.e. terms that must be searched in the database) is a mandatory parameter.
The site, instead, can be omitted, in which case SWS identifies the best one by selecting, among the active sites,
the one where that specific library has the greatest number of entries and, when several sites have the same
number, the one running the most recent version of SRS.
This function is limited to SRS versions 6 and 7.
Further parameters of this WS determine which parts (fields) of the library must be queried,
and which parts of the entries (records) must be returned.
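The site selection rule described above (active sites only, greatest number of entries, most recent SRS version as tie-breaker) can be sketched as follows. The `Implementation` class, its field names and the representation of the SRS version as a tuple of integers are assumptions made for this example, not the actual SWS data model.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Implementation:
    """One library as implemented at one SRS site (hypothetical structure)."""
    site: str                     # acronym of the SRS site
    active: bool                  # whether the site is currently reachable
    entries: int                  # number of entries of the library at this site
    srs_version: Tuple[int, ...]  # SRS version as a tuple of integers, e.g. (7, 1)


def choose_best_site(candidates: Iterable[Implementation]) -> Optional[Implementation]:
    """Return the preferred implementation, or None if no site is active."""
    active = [c for c in candidates if c.active]
    if not active:
        return None
    # Compare by number of entries first; the SRS version only breaks ties.
    return max(active, key=lambda c: (c.entries, c.srs_version))
```

Sorting on the pair (entries, version) expresses both criteria in a single comparison, so the version is consulted only when the entry counts are equal.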
CABRI and TP53 data sources
CABRI catalogues and the IARC TP53 Mutation Database were previously implemented in SRS sites.
Web Services are implemented by using the Soaplab tool that, in this case, was queried by using
either remote links or local executions.
Public SRS sites
BioWisdom publishes the list of SRS sites as a simple HTML page that is updated daily.
Our system checks the list daily and extracts data, which are stored in a local database.
Web Services were then implemented by means of Perl scripts associated with Soaplab.
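The daily extraction step could look like the following sketch, which parses an HTML table and stores its rows in a local database. Several assumptions are made here: the layout of the page (one table row per site, with name, URL and SRS version), the table name `srs_sites` and the sample HTML are invented, SQLite stands in for the local database, and Python is used for illustration although the actual implementation relies on Perl scripts.

```python
import sqlite3
from html.parser import HTMLParser


class SiteListParser(HTMLParser):
    """Collect the <td> cells of every table row as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            if self._row:  # header rows contain no <td> cells and are skipped
                self.rows.append([cell.strip() for cell in self._row])
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data


def store_sites(html, conn):
    """Parse the page and insert or refresh one row per site; return the row count."""
    parser = SiteListParser()
    parser.feed(html)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS srs_sites (name TEXT PRIMARY KEY, url TEXT, version TEXT)"
    )
    rows = [tuple(r) for r in parser.rows if len(r) == 3]
    conn.executemany("INSERT OR REPLACE INTO srs_sites VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Invented sample page, for illustration only.
SAMPLE_HTML = """
<table>
<tr><th>Site</th><th>URL</th><th>SRS version</th></tr>
<tr><td>SITE_A</td><td>http://srs.example.org/a</td><td>7.1</td></tr>
<tr><td>SITE_B</td><td>http://srs.example.org/b</td><td>6.0</td></tr>
</table>
"""
```

Using `INSERT OR REPLACE` keyed on the site name lets the same function be run every day without creating duplicate rows.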
Web Services deployment
Web Services have been deployed by using Soaplab, a SOAP-based Analysis Web Service that provides programmatic
access to local command-line applications, such as the EMBOSS software, and to the contents of ordinary web pages.
The only requirements of Soaplab are the Apache Tomcat servlet engine with the Axis SOAP toolkit,
a Java Virtual Machine and, optionally, Perl and MySQL.
Once the server has been installed, new Web Services are deployed (added to the system) by defining simple
descriptions of related execution commands.
Definitions are written in the AJAX Command Definition (ACD) language and are then converted into XML before
they can be used by remote users.
This work was partially supported by the Italian Ministry of Education, University and Research (MIUR), projects
"Oncology over Internet (O2I)",
"Laboratory of Interdisciplinary Technologies in Bioinformatics (LITBIO)",
and "Italian Network for oncology Bioinformatics (RNBIO)".
Our system is partially based on open source software.