Knowledge Extraction from Community-Generated Content

Overview

This project extracts entities and relationships from text. It has two major parts:

  1. Domain model generation - extract a hierarchy of terms/instances and topics/classes from the Wikipedia corpus
  2. Pattern-based relationship extraction - extract binary relationships between entities using pattern-vectors

Domain model generation

Doozer is an application that generates a domain model from Wikipedia or other similarly structured knowledge sources. It takes as input an incomplete description of a domain, such as a query or a list of seed concepts. Doozer expands these seeds into related concepts, then evaluates each candidate for how indicative it is of the domain. The output is an extended model that remains focused on the intended domain.
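The expand-then-evaluate idea described above can be illustrated with a minimal sketch. This is not Doozer's implementation: the function expand_domain, its parameters, the single expansion step, and the toy links and scores are all assumptions made up for illustration.

```python
from typing import Callable, Iterable, Set


def expand_domain(
    seeds: Iterable[str],
    related: Callable[[str], Iterable[str]],
    domain_score: Callable[[str], float],
    min_score: float,
) -> Set[str]:
    """Expand the seeds by one step and keep only domain-indicative candidates.

    `related` and `domain_score` are hypothetical stand-ins for whatever
    link structure and scoring a real system would use.
    """
    model = set(seeds)

    # Expansion: collect every concept related to at least one seed.
    candidates: Set[str] = set()
    for seed in model:
        candidates.update(related(seed))

    # Evaluation: keep a new candidate only if it scores high enough.
    for concept in candidates - model:
        if domain_score(concept) >= min_score:
            model.add(concept)
    return model


# Toy data, invented for this example (not taken from Wikipedia).
links = {
    "Melanoma": ["Skin cancer", "Sunscreen"],
    "Lymphoma": ["Hodgkin lymphoma", "Lymph node"],
}
scores = {
    "Skin cancer": 0.9,
    "Sunscreen": 0.2,
    "Hodgkin lymphoma": 0.8,
    "Lymph node": 0.3,
}

result = expand_domain(
    ["Melanoma", "Lymphoma"],
    related=lambda c: links.get(c, []),
    domain_score=lambda c: scores.get(c, 0.0),
    min_score=0.5,
)
print(sorted(result))
```

In this toy run the low-scoring candidates ("Sunscreen", "Lymph node") are dropped, so the extended model stays focused on the seed domain.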

Creating a model for the Neoplasms domain

To evaluate the resulting models against a gold standard taxonomy, we ran several trials with different parameters on the following domain description:
Seed query: Adenoma Carcinoma Vipoma Fibroma Glucagonoma Glioblastoma Leukemia Lymphoma Melanoma Myoma Neoplasm Papilloma
Broader focus domain: oncology, medicine
World view: biology

The MeSH-Neoplasms comparison models used the following settings:

Model                                # initial search results   expansion threshold   min p(Domain|Article)
MeSH-Neoplasms comparison model 1    40                         0.5                   0.1
MeSH-Neoplasms comparison model 2    40                         0.5                   0.4
MeSH-Neoplasms comparison model 3    25                         0.8                   0.5
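As a rough illustration of how the three parameter sets above differ, the sketch below records them in a small data structure and applies only the last parameter. It assumes that min p(Domain|Article) acts as a lower bound on an article's domain probability; the Settings class, the keep_articles helper, and the toy probabilities are hypothetical and not part of Doozer.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Settings:
    initial_search_results: int        # "# initial search results"
    expansion_threshold: float         # "expansion threshold"
    min_p_domain_given_article: float  # "min p(Domain|Article)"


# The three parameter sets listed above.
RUNS: Dict[str, Settings] = {
    "model 1": Settings(40, 0.5, 0.1),
    "model 2": Settings(40, 0.5, 0.4),
    "model 3": Settings(25, 0.8, 0.5),
}


def keep_articles(p_domain: Dict[str, float], settings: Settings) -> List[str]:
    """Keep articles whose (assumed) p(Domain|Article) meets the minimum."""
    return sorted(
        article
        for article, p in p_domain.items()
        if p >= settings.min_p_domain_given_article
    )


# Toy probabilities, invented for this example.
toy_p = {"article A": 0.05, "article B": 0.3, "article C": 0.45, "article D": 0.9}

kept = {name: keep_articles(toy_p, s) for name, s in RUNS.items()}
for name, articles in kept.items():
    print(name, articles)
```

Under this assumption, raising the minimum from model 1 to model 3 admits fewer articles, trading coverage for fewer off-domain entries.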

Evaluation


Evaluation of the above Neoplasm models.

The following model was manually refined from model 3, which took about 5 minutes. The example shows that, although it is almost impossible to create a model automatically without any false positives, Doozer greatly reduces the manual labor involved in building such models.
MeSH-Neoplasms comparison model 3 - manually modified


Pattern-based relationship extraction