|
1. Carrara GE, Stella A, Pinciroli F, Alcalay M, Masseroli M. Automatic extraction of gene annotations from data-rich HTML pages. Meeting: BITS 2004 - Year: 2004 Topic: Unspecified Abstract: High-throughput technologies create the need to integrate the resulting gene expression data with information mined from the large body of gene annotations held in several different biomolecular databanks. Most of these databanks can be queried only via the web, one gene at a time, and query results are generally returned in HTML format. Although some databanks provide batch retrieval of data via FTP, this requires expertise and resources to re-implement the databank locally. Web wrappers can automate the extraction of information on numerous genes from different web-based databanks. As the content of a dynamic web page can change from one query to another (e.g. tables with extra rows or missing fields), such wrappers should be able to locate and extract the data of interest inside different HTML pages. Unfortunately, HTML tags describe the visual formatting of data, not their semantics. Thus, human-readability and machine-readability are often not equivalent. Wrapper generation tools help create a wrapper for a specific source, i.e. a web-based biomolecular databank with its own HTML layout. First, the user is invited, via a graphical user interface, to select the data of interest inside one or more sample HTML pages. Then, the system saves this information as an extraction template for that specific source. The long-term goal is to generate wrappers that scale well with the number of processed web pages. |
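
The idea of an extraction template applied to pages whose layout varies between queries can be illustrated with a minimal sketch. This is not the authors' system: the `RowCollector` class, the `extract` function, the template format (field name mapped to the label shown in the first cell of a table row) and the sample pages are all assumptions made for illustration, using only Python's standard `html.parser` module. Locating values by their label rather than by row position is one simple way to tolerate extra rows or missing fields.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect every table row as a list of whitespace-normalised cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract(html, template):
    """Apply a hypothetical template {field: row label} to one HTML page.

    Fields whose label does not occur on the page are returned as None.
    """
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    # Index rows by their label (first cell), ignoring case and a trailing colon.
    labelled = {
        row[0].rstrip(":").lower(): row[1]
        for row in parser.rows
        if len(row) >= 2
    }
    return {field: labelled.get(label.lower()) for field, label in template.items()}


# Hypothetical template, as it might be saved after a user marks a sample page.
template = {"symbol": "Gene symbol", "chromosome": "Chromosome"}

# Made-up sample pages: the second has an extra row and lacks the chromosome row.
page_a = ("<table><tr><td>Gene symbol:</td><td>TP53</td></tr>"
          "<tr><td>Chromosome</td><td>17</td></tr></table>")
page_b = ("<table><tr><td>Status</td><td>Reviewed</td></tr>"
          "<tr><td>Gene symbol:</td><td>BRCA1</td></tr></table>")

for page in (page_a, page_b):
    print(extract(page, template))
```

A template keyed on labels survives rows being added or dropped, but it would still break if a databank renamed its labels or stopped using tables, which is the kind of layout dependence the abstract attributes to HTML describing formatting rather than semantics.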