Knowledge Extraction from Community-Generated Content
Overview
This project extracts entities and relationships from text. It has two major parts:
- Domain model generation - extract a hierarchy of terms/instances and topics/classes from the Wikipedia corpus
- Pattern-based relationship extraction - extract binary relationships between entities using pattern-vectors
Domain model generation
Doozer is an application that generates a domain model from Wikipedia or other similarly structured knowledge sources. It takes as input an incomplete description of a domain, such as a query or a list of seed concepts. Doozer expands these seeds to find related concepts, then evaluates each related concept for how indicative it is of the domain. The output is an extended model that remains focused on the intended domain.
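The expand-then-evaluate loop described above can be sketched roughly as follows. This is not Doozer's implementation: the function `expand_domain`, the `related` link map, the `domain_score` callback, and all toy scores are hypothetical names and values chosen only to illustrate the idea of growing a seed set while discarding concepts that are not indicative of the domain.

```python
from typing import Callable, Dict, Iterable, List, Set


def expand_domain(
    seeds: Iterable[str],
    related: Dict[str, List[str]],
    domain_score: Callable[[str], float],
    min_score: float,
    max_rounds: int = 2,
) -> Set[str]:
    """Grow a seed set with related concepts that score at least min_score.

    Hypothetical sketch: `related` stands in for whatever link structure
    the knowledge source provides, and `domain_score` stands in for the
    measure of how indicative a concept is of the domain.
    """
    accepted = set(seeds)
    frontier = set(seeds)
    for _ in range(max_rounds):
        # Collect concepts related to the current frontier.
        candidates: Set[str] = set()
        for concept in frontier:
            candidates.update(related.get(concept, ()))
        candidates -= accepted
        # Keep only candidates that are indicative enough of the domain.
        frontier = {c for c in candidates if domain_score(c) >= min_score}
        if not frontier:
            break
        accepted |= frontier
    return accepted


# Toy data, invented for illustration only.
related = {
    "Melanoma": ["Skin cancer", "Sunburn"],
    "Lymphoma": ["Hodgkin lymphoma", "Lymph node"],
    "Skin cancer": ["Basal-cell carcinoma", "Sunscreen"],
}
scores = {
    "Skin cancer": 0.9,
    "Sunburn": 0.2,
    "Hodgkin lymphoma": 0.95,
    "Lymph node": 0.3,
    "Basal-cell carcinoma": 0.85,
    "Sunscreen": 0.05,
}

model = expand_domain(
    ["Melanoma", "Lymphoma"],
    related,
    lambda concept: scores.get(concept, 0.0),
    min_score=0.5,
)
print(sorted(model))
```

In this toy run, off-topic neighbours such as "Sunburn" and "Sunscreen" are dropped because their scores fall below the cutoff, while "Basal-cell carcinoma" is reached in the second round through "Skin cancer".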
Creating a model for the Neoplasms domain
To evaluate the resulting models against a gold-standard taxonomy, we ran several trials with different parameters on the following domain description:
seed query: Adenoma Carcinoma Vipoma Fibroma Glucagonoma Glioblastoma Leukemia Lymphoma Melanoma Myoma Neoplasm Papilloma
The Broader Focus Domain is: oncology, medicine
The World View taken is: biology
MeSH-Neoplasms comparison model 1 used the following settings:
| # initial search results | expansion threshold | min p(Domain|Article) |
| 40 | 0.5 | 0.1 |
MeSH-Neoplasms comparison model 2 used the following settings:
| # initial search results | expansion threshold | min p(Domain|Article) |
| 40 | 0.5 | 0.4 |
MeSH-Neoplasms comparison model 3 used the following settings:
| # initial search results | expansion threshold | min p(Domain|Article) |
| 25 | 0.8 | 0.5 |
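To make the three settings easier to compare, the sketch below records them as data and applies the min p(Domain|Article) value as a simple cutoff. It assumes, based only on the parameter's name, that articles scoring below this value are excluded. The class `RunSettings`, its field names, the helper `keep_articles`, and the toy probabilities are all hypothetical and are not taken from Doozer.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RunSettings:
    """Hypothetical container for the three parameters listed above."""
    initial_search_results: int
    expansion_threshold: float
    min_p_domain_given_article: float


# Values copied from the three settings tables.
RUNS = {
    "model 1": RunSettings(40, 0.5, 0.1),
    "model 2": RunSettings(40, 0.5, 0.4),
    "model 3": RunSettings(25, 0.8, 0.5),
}


def keep_articles(p_by_article: Dict[str, float], settings: RunSettings) -> List[str]:
    """Return articles whose p(Domain|Article) meets the minimum (assumed cutoff)."""
    return sorted(
        article
        for article, p in p_by_article.items()
        if p >= settings.min_p_domain_given_article
    )


# Toy probabilities, invented for illustration only.
toy_scores = {
    "Glioblastoma": 0.92,
    "Chemotherapy": 0.45,
    "Cell (biology)": 0.15,
    "Hospital": 0.05,
}

for name, settings in RUNS.items():
    print(name, keep_articles(toy_scores, settings))
```

Under this assumption, the stricter minimum in model 3 keeps fewer, more domain-specific articles than the lenient minimum in model 1.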
Evaluation
Evaluation of the above Neoplasm models.
The following model was manually refined from model 3, a process that took about 5 minutes. The example shows that, although it is almost impossible to create a model automatically without any false positives, Doozer greatly reduces the manual labor involved in building these models.
MeSH-Neoplasms comparison model 3 - manually modified