Introduction

Mass spectrometry (MS) is an analytical technique used in proteomics to study protein structure and posttranslational modifications. Raw data produced by a mass spectrometer are analyzed in a multistep process that yields a list of identified entities and their quantification. The protocol followed at the Complex Carbohydrate Research Center (CCRC) for protein identification from MS data is typical of proteomics research. This high-throughput process may generate more than 500 data files from a single sample. The analysis was originally conducted manually, by transferring data across distributed systems and then invoking software tools. The scientists, who were responsible for keeping track of each result file across multiple projects, often spent long, frustrating hours searching for a previous result or trying to correlate results using handwritten notes. We fully automated this analytical process as a scientific workflow using semantic Web services (Web services annotated with ontological concepts) orchestrated by the Taverna workflow engine [http://taverna.sourceforge.net/]. Many prior efforts have automated scientific protocols, and workflow automation in itself is not novel; what is new is the support for semantic provenance. To help the scientists manage the large volumes of data using provenance information, we next developed the ProPreO proteomics provenance ontology (described in the next section). We then implemented a set of semantic provenance creation services that are plugged in at each intermediate step of the workflow. This infrastructure is called the Semantic Provenance Annotation of Data in protEomics (SPADE).


  • Entity extraction: Descriptions relevant to the creation of provenance information, such as parameter details, project descriptions, and identified biological entities (e.g., protein groups), are extracted either from Web forms filled out by users at the start of the workflow or from data files generated during the sample run. These entities are categorized as instances of ProPreO ontology classes using class membership relations based on a set of heuristic rules. Entity extraction and classification at each step of the workflow yield an aggregated list of ProPreO ontology class instances by the end of the workflow.
  • Assertions of named relationships to link extracted entities: Using the ProPreO ontology schema as reference, named relationships that apply between two entities (categorized as instances of ProPreO classes in the previous step) are asserted. We use Jena to traverse the ontology schema and identify the correct relationship between two entities.
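
The two steps above can be illustrated with a minimal sketch. It is not the SPADE implementation, which uses Jena over the ProPreO schema. Here the schema is approximated by a small in-memory list of (domain, property, range) entries, and the class names, property names, and heuristic rules are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical schema fragment: (domain class, property name, range class).
# These names are illustrative and are NOT taken from the ProPreO ontology.
SCHEMA = [
    ("SampleRun", "produces", "RawDataFile"),
    ("RawDataFile", "has_parameter", "ParameterSet"),
    ("RawDataFile", "identifies", "ProteinGroup"),
]

# Hypothetical heuristic rules: a predicate over the extracted string,
# paired with the class the value is assigned to when the predicate holds.
RULES = [
    (lambda v: v.lower().endswith(".raw"), "RawDataFile"),
    (lambda v: v.startswith("run:"), "SampleRun"),
    (lambda v: v.startswith("protein_group:"), "ProteinGroup"),
]


@dataclass(frozen=True)
class Instance:
    """An extracted entity categorized as an instance of a schema class."""
    value: str
    cls: str


def classify(value):
    """Return an Instance for the first matching rule, or None."""
    for predicate, cls in RULES:
        if predicate(value):
            return Instance(value, cls)
    return None


def extract_entities(values):
    """Step 1: keep only the values that some heuristic rule can classify."""
    return [inst for inst in map(classify, values) if inst is not None]


def assert_relationships(instances):
    """Step 2: link each ordered pair of instances whose classes match the
    domain and range of a property in the schema."""
    triples = []
    for subj in instances:
        for obj in instances:
            if subj is obj:
                continue
            for domain, prop, rng in SCHEMA:
                if subj.cls == domain and obj.cls == rng:
                    triples.append((subj.value, prop, obj.value))
    return triples


if __name__ == "__main__":
    extracted = extract_entities(
        ["run:42", "sample01.raw", "protein_group:PG7", "notes.txt"]
    )
    for triple in assert_relationships(extracted):
        print(triple)
```

In this sketch the schema lookup is a linear scan over a list; a real ontology would also require handling of subclass hierarchies and of properties inherited through them, which is where a toolkit such as Jena becomes useful.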