- Dr. Delroy Cameron
- Dr. Karthik Gomadam
- Dr. Cory Henson
- Dr. Prateek Jain
- Dr. Pablo Mendes
- Dr. Meena Nagarajan
- Dr. Matthew Perry
- Dr. Hemant Purohit
- Dr. Cartic Ramakrishnan
- Dr. Ajith Ranabhu
- Dr. Satya Sahoo
- Dr. Christopher Thomas
- Dr. Wenbo Wang
Dr. Delroy Cameron
A Context-Driven Subgraph Model for Literature-Based Discovery
Literature-Based Discovery (LBD) refers to the process of uncovering hidden connections that are implicit in scientific literature. Numerous hypotheses have been generated from scientific literature, which influenced innovations in diagnosis, treatment, prevention, and overall public health. However, much of the existing research on discovering hidden connections among concepts has used distributional statistics and graph-theoretic measures to capture implicit associations. Such metrics do not explicitly capture the semantics of hidden connections. Rather, they only allude to the existence of meaningful underlying associations. To gain in-depth insights into the meaning of hidden (and other) connections, complementary methods have often been employed. Some of these methods include: 1) the use of domain expertise for concept filtering and knowledge exploration, 2) leveraging structured background knowledge for context and to supplement concept filtering, and 3) developing heuristics a priori to help eliminate spurious connections.
While effective in some situations, the practice of relying on domain expertise, structured background knowledge and heuristics to complement distributional and graph-theoretic approaches has serious limitations. The main issue is that the intricate context of complex associations is not always known a priori and cannot easily be computed without understanding the underlying semantics of the associations. Complex associations should not be overlooked, since they are often needed to elucidate the mechanisms of interaction and causal relationships among concepts. Moreover, they can capture the broader aspects of a domain by segregating associations along different thematic dimensions, such as Metabolic Function, Pharmaceutical Treatment and Neurological Activity.
This dissertation proposes an innovative context-driven, automatic subgraph creation method for finding hidden and complex associations among concepts, along multiple thematic dimensions. It outlines definitions for context and shared context, based on implicit and explicit (or formal) semantics, which compensate for deficiencies in statistical and graph-based metrics. It also eliminates the need for heuristics a priori. An evidence-based evaluation of the proposed framework showed that 8 out of 9 existing scientific discoveries could be recovered using this approach. Additionally, insights into the meaning of associations could be obtained using provenance provided by the system. In a statistical evaluation to determine the interestingness of the generated subgraphs, it was observed that an arbitrary association is mentioned in only approximately 4 articles in MEDLINE, on average. These results suggest that leveraging implicit and explicit context, as defined in this dissertation, is an advancement of the state-of-the-art in LBD research.
- D. Cameron, R. Kavuluru, T. C. Rindflesch, A. P. Sheth, K. Thirunarayan, O. Bodenreider, Context-Driven Automatic Subgraph Creation for Literature-Based Discovery. Journal of Biomedical Informatics. 54: 141-157 (2015).
- R. Daniulaityte, R. Carlson, G. Brigham, D. Cameron, A. P. Sheth, "Sub is a weird drug:" A Web-based study of lay attitudes about use of buprenorphine to self-treat opioid withdrawal symptoms (to appear in the American Journal on Addictions).
- D. Cameron, A. P. Sheth, N. Jaykumar, G. Anand, K. Thirunarayan, G. A. Smith, A Hybrid Approach to Finding Relevant Social Media Content for Complex Domain Specific Information Needs. Journal of Web Semantics. 29: 39-52 (2014) [ScienceDirect].
- D. Cameron, G. A. Smith, R. Daniulaityte, A. P. Sheth, D. Dave, L. Chen, G. Anand, R. Carlson, K. Z. Watkins, R. Falck, PREDOSE: A Semantic Web Platform for Drug Abuse Epidemiology using Social Media. Journal of Biomedical Informatics. 46(6): 985-997, 2013. [ScienceDirect], [PMID 23892295].
- D. Cameron, O. Bodenreider, H. Yalamanchili, T. Danh, S. Vallabhaneni, K. Thirunarayan, A. P. Sheth, T. C. Rindflesch, A Graph-Based Recovery and Decomposition of Swanson's Hypothesis using Semantic Predications, Journal of Biomedical Informatics. 46(2): 238-251, (2013). [ScienceDirect] [PMID 23026233] [PMC 4031661].
- R. Daniulaityte, R. Carlson, R. Falck, D. Cameron, S. Perera, L. Chen, A. P. Sheth, "I Just Wanted to Tell You That Loperamide WILL WORK": A Web-Based Study of Extra-Medical Use of Loperamide. Journal of Drug and Alcohol Dependence. 130(1-3): 241-244, 2013. [ScienceDirect], [PMID 23201175] [PMC 3633632].
- D. Cameron, R. Kavuluru, O. Bodenreider, P. N. Mendes, A. P. Sheth, K. Thirunarayan, Semantic Predications for Complex Information Needs in Biomedical Literature, 5th International Conference on Bioinformatics and Biomedicine BIBM2011, Atlanta GA, November 12-15, 2011. p. 512-519 (acceptance rate=19.4%).
- D. Cameron, B. Aleman-Meza, I. B. Arpinar, S. L. Decker, A. P. Sheth, A Taxonomy-based Model for Expertise Extrapolation, 4th International Conference on Semantic Computing ICSC2010, Pittsburgh PA, September 22-24, 2010 (acceptance rate=32%).
- D. Cameron, P. N. Mendes, A. P. Sheth, V. Chan, Semantics-Empowered Text Exploration for Knowledge Discovery, 48th ACM Southeast Conference, ACMSE2010, Oxford Mississippi, April 15-17, 2010.
- D. Cameron, V. Bhagwan, A. P. Sheth, Towards Comprehensive Longitudinal Healthcare Data Capture. The 1st International Workshop on the role of Semantic Web in Literature-Based Discovery SWLBD2012, (co-located with the IEEE International Conference on Bioinformatics and Biomedicine BIBM 2012) Philadelphia PA USA, October 4, 2012. p. 241-247.
- D. Cameron, B. Aleman-Meza, I. B. Arpinar, Collecting Expertise of Researchers for Finding Relevant Experts in a Peer-Review Setting, 1st International ExpertFinder Workshop EFW 2007, (Co-located with 7th Knowledge Web General Assembly) Berlin Germany, January 16, 2007.
- R. Daniulaityte, R. Carlson, D. Cameron, G. A. Smith, A. P. Sheth, When less is more: A web-based study of user beliefs about buprenorphine dosing in self-treatment of opioid withdrawal symptoms. The College on Problems of Drug Dependence CPDD 2014, San Juan, Puerto Rico, June 14-17, 2014.
- R. Daniulaityte, R. Carlson, R. Falck, D. Cameron, S. Perera, L. Chen, A. P. Sheth, A Web-Based Study of Self-Treatment of Opioid Withdrawal Symptoms with Loperamide. The College on Problems of Drug Dependence CPDD 2012, Palm Springs, CA USA, June 9-14, 2012.
- C. Thomas, W. Wang, P. Mehra, D. Cameron, P. N. Mendes, A. P. Sheth, What Goes Around Comes Around - Improving Linked Open Data through On-Demand Model Creation, Web Science Conference, WebSci10, (Co-located with 19th International World Wide Web Conference - WWW10) Raleigh NC, April 26-27, 2010.
- P. N. Mendes, P. Kapanipathi, D. Cameron, A. P. Sheth, Dynamic Associative Relationships on the Linked Open Data Web, Web Science Conference, WebSci10, (Co-located with 19th International World Wide Web Conference - WWW10) Raleigh NC, April 26-27, 2010
- B. Aleman-Meza, S. L. Decker, D. Cameron, I. B. Arpinar, Association Analytics for Network Connectivity in a Bibliographic and Expertise Dataset, In J. Cardoso, & M. Lytras (Eds.), Semantic Web Engineering in the Knowledge Society (pp. 188-207). 2009.
Dr. Karthik Gomadam
Semantics Enriched Service Environments
During the past seven years, services-centric computing has emerged as the preferred approach to architect complex software. Software is increasingly developed by integrating remotely existing components, popularly called services. This architectural paradigm, also called Service Oriented Architecture (SOA), brings with it the benefits of interoperability, agility and flexibility to software design and development. One can easily add or change features in existing systems, either by the addition of new services or by replacing existing ones. Two popular approaches have emerged for realizing SOA. The first approach is based on the SOAP protocol for communication and the Web Service Description Language (WSDL) for service interface description. SOAP and WSDL are built over XML, thus guaranteeing minimal structural and syntactic interoperability. In addition to SOAP and WSDL, the WS-* (WS-Star) stack or SOAP stack comprises other standards and specifications that enable features such as security and services integration. More recently, the RESTful approach has emerged as an alternative to the SOAP stack. This approach advocates the use of the HTTP operations of GET/PUT/POST/DELETE as standard service operations and the REpresentational State Transfer (REST) paradigm for maintaining service states. The RESTful approach leverages the HTTP protocol and has gained a lot of traction, especially in the context of consumer Web applications such as Maps.
Despite their growing adoption, the stated objectives of interoperability, agility, and flexibility have been hard to achieve using either of the two approaches. This is largely because of the various heterogeneities that exist between different service providers. These heterogeneities are present both at the data and the interaction levels. Fundamental to addressing these heterogeneities are the problems of service Description, Discovery, Data mediation and Dynamic configuration. Currently, service descriptions capture the various operations, the structure of the data, and the invocation protocol. They do not, however, capture the semantics of either the data or the interactions. This minimal description impedes the ability to find the right set of services for a given task, thus affecting the important task of service discovery. Data mediation is by far the most arduous task in service integration. This has been a well-studied problem in the areas of workflow management, multi-database systems and services computing. Data models that describe real world data, such as enterprise data, often involve hundreds of attributes. Approaches for automatic mediation have not been very successful, while the complexity of the models requires considerable human effort. The above-mentioned problems in description, discovery and data mediation pose a considerable challenge to creating software that can be dynamically configured.
This dissertation is one of the first attempts to address the problems of description, discovery, data mediation and dynamic configuration in the context of both SOAP and RESTful services. This work builds on past research in the areas of Semantic Web, Semantic Web services and Service Oriented Architectures. In addition to addressing these problems, this dissertation also extends the principles of services computing to the emerging area of social and human computation. The core contributions of this work include a mechanism to add semantic metadata to RESTful services and resources on the Web, an algorithm for service discovery and ranking, and techniques for aiding data mediation and dynamic configuration. This work also addresses the problem of identifying events during service execution, and data integration in the context of socially powered services.
Dr. Cory Henson
A Semantics-based Approach to Machine Perception
Machine perception can be formalized using semantic web technologies in order to derive abstractions from sensor data using background knowledge on the Web, and efficiently executed on resource-constrained devices.
Advances in sensing technology hold the promise to revolutionize our ability to observe and understand the world around us. Yet the gap between observation and understanding is vast. As sensors are becoming more advanced and cost-effective, the result is an avalanche of data of high volume, velocity, and of varied type, leading to the problem of too much data and not enough knowledge (i.e., insights leading to actions). Current estimates predict over 50 billion sensors connected to the Web by 2020. While the challenge of data deluge is formidable, a resolution has profound implications. The ability to translate low-level data into high-level abstractions closer to human understanding and decision-making has the potential to disrupt data-driven interdisciplinary sciences, such as environmental science, healthcare, and bioinformatics, as well as enable other emerging technologies, such as the Internet of Things.
The ability to make sense of sensory input is called perception; and while people are able to perceive their environment almost instantaneously, and seemingly without effort, machines continue to struggle with the task. Machine perception is a hard problem in computer science, with many fundamental issues that are yet to be adequately addressed, including: (a) annotation of sensor data, (b) interpretation of sensor data, and (c) efficient implementation and execution. This dissertation presents a semantics-based machine perception framework to address these issues.
Dr. Prateek Jain
Linked Open Data Alignment and Querying
The recent emergence of the “Linked Data” approach for publishing data represents a major step forward in realizing the original vision of a web that can "understand and satisfy the requests of people and machines to use the web content" – i.e. the Semantic Web. This new approach has resulted in the Linked Open Data (LOD) Cloud, which includes more than 295 large datasets contributed by experts belonging to diverse communities such as geography, entertainment, and life sciences. However, the current interlinks between datasets in the LOD Cloud – as we will illustrate – are too shallow to realize many of the benefits promised. If this limitation is left unaddressed, then the LOD Cloud will merely be more data that suffers from the same kinds of problems that plague the Web of Documents, and hence the vision of the Semantic Web will fall short.
This thesis presents a comprehensive solution to address the issue of alignment and relationship identification using a bootstrapping-based approach. By alignment we mean the process of determining correspondences between classes and properties of ontologies. We identify subsumption, equivalence and part-of relationships between classes. The work identifies part-of relationships between instances. Between properties we establish subsumption and equivalence relationships. By bootstrapping we mean the process of utilizing the information contained within the datasets to improve the data within them. The work showcases the use of bootstrapping-based methods to identify and create richer relationships between LOD datasets. The BLOOMS project (http://wiki.knoesis.org/index.php/BLOOMS) and the PLATO project, both built as part of this research, have provided evidence for the feasibility and the applicability of the solution.
Video link - Prateek Jain dissertation defense
Dr. Pablo Mendes
Adaptive Semantic Annotation of Entity and Concept Mentions in Text
The recent years have seen an increase in interest for knowledge repositories that are useful across applications, in contrast to the creation of ad hoc or application-specific databases. These knowledge repositories figure as a central provider of unambiguous identifiers and semantic relationships between entities. As such, these shared entity descriptions serve as a common vocabulary to exchange and organize information in different formats and for different purposes. Therefore, there has been remarkable interest in systems that are able to automatically tag textual documents with identifiers from shared knowledge repositories so that the content in those documents is described in a vocabulary that is unambiguously understood across applications.
Tagging textual documents according to these knowledge bases is a challenging task. It involves recognizing the entities and concepts that have been mentioned in a particular passage and attempting to resolve eventual ambiguity of language in order to choose one of many possible meanings for a phrase. There has been substantial work on recognizing and disambiguating entities for specialized applications, or constrained to limited entity types and particular types of text. In the context of shared knowledge bases, since each application has potentially very different needs, systems must have unprecedented breadth and flexibility to ensure their usefulness across applications. Documents may exhibit different language and discourse characteristics, discuss very diverse topics, or require the focus on parts of the knowledge repository that are inherently harder to disambiguate. In practice, for developers looking for a system to support their use case, it is often unclear if an existing solution is applicable, leading those developers to trial-and-error and ad hoc usage of multiple systems in an attempt to achieve their objective.
In this dissertation, I propose a conceptual model that unifies related techniques in this space under a common multi-dimensional framework that enables the elucidation of strengths and limitations of each technique, supporting developers in their search for a suitable tool for their needs. Moreover, the model serves as the basis for the development of flexible systems that have the ability to support document tagging for different use cases. I describe such an implementation, DBpedia Spotlight, along with extensions that we made to the knowledge base DBpedia to support this implementation. I report evaluations of this tool on several well-known data sets, and demonstrate applications to diverse use cases for further validation.
- 2013, Pablo N. Mendes, Dirk Weissenborn, Chris Hokamp: DBpedia Spotlight at the MSM2013 Challenge. #MSM 2013 : 57-61, at WWW 2013.
- 2012, Pablo N. Mendes, Peter Mika, Hugo Zaragoza, and Roi Blanco. Measuring website similarity using an entity-aware click graph. In 21st ACM International Conference on Information and Knowledge Management (CIKM’12), pages 1697–1701, 2012.
- 2012, Mihály Héder, and Pablo N. Mendes. Round-trip semantics with sztakipedia and dbpedia spotlight. In Proceedings of the 21st World Wide Web Conference, WWW 2012 (Companion Volume), pages 357–360, 2012.
- 2012, Pablo N. Mendes, Joachim Daiber, Rohana Rajapakse, Felix Sasaki, and Christian Bizer. Evaluating the impact of phrase recognition on concept tagging. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, 2012.
- 2011, Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. DBpedia Spotlight: shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, pages 1–8, 2011.
- 2011, Pablo N. Mendes, Joachim Daiber, Max Jakob, and Christian Bizer. Evaluating DBpedia Spotlight for the TAC-KBP entity linking task. In Proceedings of the TAC-KBP 2011 Workshop, 2011.
- 2010, Pablo N. Mendes, Alexandre Passant, Pavan Kapanipathi: Twarql: tapping into the wisdom of the crowd. I-SEMANTICS 2010
- 2010, Pablo N. Mendes, Alexandre Passant, Pavan Kapanipathi, Amit P. Sheth: Linked Open Social Signals. Web Intelligence 2010: 224-231
Dr. Meena Nagarajan
Understanding User-generated Content on Social Media
Over the last few years, there has been a growing public and enterprise fascination with 'social media' and its role in modern society. At the heart of this fascination is the ability for users to participate, collaborate, consume, create and share content via a variety of platforms such as blogs, micro-blogs, email, instant messaging services, social network services, collaborative wikis, social bookmarking sites, and multimedia sharing sites. Today, in addition to any factual information, we are also able to access conversations, opinions and emotions that these facts evoke among other users. We are able to ask questions such as, what are people saying about any news-worthy event or entity? Can we use this information to assess a population's preference? Can we study how these preferences propagate in a network of friends? Are such crowd-sourced preferences a good substitute for traditional polling methods?
This dissertation is devoted to understanding informal user-generated textual content on social media platforms and using the results of the analysis to build Social Intelligence Applications.
The body of research presented in this thesis focuses on understanding what a piece of user-generated content is about via two sub-goals of Named Entity Recognition and Key Phrase Extraction on informal text. In light of the poor context and informal nature of content on social media platforms, we investigate the role of contextual information from documents, domain models and the social medium to supplement and improve the reliability and performance of existing text mining algorithms for Named Entity Recognition and Key Phrase Extraction.
In all cases we find that using multiple contextual cues together leads to reliable inter-dependent decisions, better than using the cues in isolation, and that such improvements are robust across domains and content of varying characteristics, from micro-blogs like Twitter, social networking forums such as those on MySpace and Facebook, and blogs on the Web.
Finally, we showcase two deployed Social Intelligence applications that build over the results of Named Entity Recognition and Key Phrase Extraction algorithms to provide near real-time information about the pulse of an online populace. Specifically, we describe what it takes to build applications that wish to exploit the wisdom of the crowds, highlighting challenges in data collection, processing informal English text, metadata extraction and presentation of the resulting information.
Amit P. Sheth, Ph.D.
John M. Flach, Ph.D.
Daniel Gruhl, Ph.D.
Kevin Haas, M.S.
Michael L. Raymer, Ph.D.
Shaojun Wang, Ph.D.
Dr. Matthew Perry
A Framework to Support Spatial, Temporal and Thematic Analytics over Semantic Web Data
Spatial and temporal data are critical components in many applications. This is especially true in analytical applications ranging from scientific discovery to national security and criminal investigation. The analytical process often requires uncovering and analyzing complex thematic relationships between disparate people, places and events. Fundamentally new query operators based on the graph structure of Semantic Web data models, such as semantic associations, are proving useful for this purpose. However, these analysis mechanisms are primarily intended for thematic relationships. This dissertation proposes a framework built around the RDF data model for analysis of thematic, spatial and temporal relationships between named entities. We present a spatiotemporal modeling approach that uses an upper-level ontology in combination with temporal RDF graphs. A set of query operators that use graph patterns to specify a form of context are formally defined, and an extension of the W3C-recommended SPARQL query language to support these query operators is presented. We also describe an efficient implementation of the framework that extends a state-of-the-art commercial database system. We demonstrate the scalability of our approach with a performance study using both synthetic and real-world RDF datasets of over 25 million triples.
Dr. Hemant Purohit
Mining Behavior of Citizen Sensor Communities to Improve Cooperation with Organizational Actors
Social media provides a natural platform for dynamic emergence of citizen (as) sensor communities, where the citizens share information, express opinions, and engage in discussions. Often such an online Citizen Sensor Community (CSC) has stated or implied goals related to workflows of organizational actors with defined roles and responsibilities. For example, a community of crisis response volunteers may inform the prioritization of responses to resource needs (e.g., medical) to assist the managers of crisis response organizations. However, in CSC, there are challenges related to information overload for organizational actors, including finding reliable information providers and finding the actionable information from citizens. This threatens awareness and articulation of workflows to enable cooperation between citizens and organizational actors. CSCs supported by Web 2.0 social media platforms offer new opportunities and pose new challenges.
This work addresses issues of ambiguity in interpreting unconstrained natural language (e.g., ‘wanna help’ appearing in both types of messages for asking and offering help during crises), sparsity of user and group behaviors (e.g., expression of specific intent), and diversity of user demographics (e.g., medical or technical professional) for interpreting user-generated data of citizen sensors. Interdisciplinary research involving social and computer sciences is essential to address these socio-technical issues in CSC, and allow better accessibility to user-generated data at a higher level of information abstraction for organizational actors. This study presents a novel web information processing framework focused on actors and actions in cooperation, called Identify-Match-Engage (IME), which fuses top-down and bottom-up computing approaches to design a cooperative web information system between citizens and organizational actors. It includes (a) identification of action-related seeking-offering intent behaviors from short, unstructured text documents using a classification model based on both declarative and statistical knowledge, (b) matching of intentions about seeking and offering, and (c) engagement models of users and groups in CSC to prioritize whom to engage, by modeling context with social theories using features of users, their generated content, and their dynamic network connections in the user interaction networks.
The results show that fusing top-down knowledge-driven and bottom-up data-driven approaches improves modeling efficiency over conventional bottom-up approaches alone for modeling intent and engagement. Applications of this work include use of the engagement interface tool during recent crises to enable efficient citizen engagement for spreading critical information of prioritized needs to ensure donation of only required supplies by the citizens. The engagement interface application also won the United Nations ICT agency ITU's Young Innovator 2014 award. Additionally, the intent classification technology for identifying seeking-offering of help during a crisis was integrated by the crisis-mapping pioneer Ushahidi’s project, CrisisNET, for broader impact.
Dr. Cartic Ramakrishnan
Extracting, Representing and Mining Semantic Metadata from Text: Facilitating Knowledge Discovery in Biomedicine
The information access paradigm offered by most contemporary text information systems is a search-and-sift paradigm where users have to manually glean and aggregate relevant information from the large number of documents that are typically returned in response to keyword queries. Expecting the users to glean and aggregate information has led to several inadequacies in these information systems. Owing to the size of many text databases, search-and-sift is very tedious, often requiring repeated keyword searches that refine or generalize query terms. A more serious limitation arises from the lack of automated mechanisms to aggregate content across different documents to discover new knowledge. This dissertation focuses on processing text to assign semantic interpretations to its content (extracting semantic metadata) and the design of algorithms and heuristics to utilize the extracted semantic metadata to support knowledge discovery operations over text content. Contributions in extracting semantic metadata in this dissertation cover the extraction of compound entities and complex relationships connecting entities. Extraction results are represented using a standard Semantic Web representation language (RDF) and are manually evaluated for accuracy. Knowledge discovery algorithms presented herein operate on RDF data. To further improve access mechanisms to text content, applications supporting semantic browsing and semantic search of text are presented.
Dr. Ajith Ranabhu
Abstraction Driven Application and Data Portability in Cloud Computing
Cloud computing has changed the way organizations create, manage, and evolve their applications. While many organizations are eager to use the cloud, tempted by substantial cost savings and convenience, the implications of using clouds are not yet well understood. One of the major concerns in cloud adoption is the vendor lock-in of applications, caused by the heterogeneity of the numerous cloud service offerings. Vendor-locked applications are difficult, if not impossible, to port from one cloud system to another. This forces cloud service consumers to use undesired or suboptimal solutions and makes it difficult to incorporate the redundancy needed by some organizations for high availability.
Given the current state-of-the-art, supporting multiple cloud systems requires multiple development efforts; thus avoiding vendor lock-in is an expensive proposition. In the long run, this problem negatively affects the adoption of cloud technologies.
This dissertation investigates a comprehensive solution to address the issue of application lock-in in cloud computing. Our primary principle is the use of carefully designed abstractions in a manner that makes the heterogeneity of the clouds invisible. The first part of this dissertation investigates the development of cloud applications using abstract specifications. Given the domain specific nature of many cloud workloads, we focused on using Domain Specific Languages (DSLs). We applied DSL-based development techniques to two domains with different characteristics and learnt that our solution indeed results in significant savings in cost and effort when building portable cloud applications. The second part of this dissertation presents the use of process abstractions for application deployment and management in clouds. Many cloud service consumers are focused on specific application-oriented tasks; thus we provided abstractions for the most useful cloud interactions via a middleware layer. Our middleware system, Altocumulus, not only provided independence from the various process differences, but also provided the means to reuse known best practices. The success of Altocumulus also influenced a commercial product, the IBM Workload Deployer (http://www-01.ibm.com/software/webservers/workload-deployer/).
Finally, we showcase two publicly hosted Web tools, MobiCloud (http://mobicloud.knoesis.org/) and SCALE (http://metabolink.knoesis.org/SCALE), that encapsulate the abstractions in every step of the application life-cycle. These tools allow domain experts to quickly create applications and deploy them to clouds, irrespective of the target cloud system, highlighting the applicability of our solutions in practice.
Dr. Satya Sahoo
Semantic Provenance: Modeling, Querying, and Application in Scientific Discovery
Semantic Provenance: Modeling, Querying, and Application in Scientific Discovery Provenance metadata, describing the history or lineage of an entity, is essential for ensuring data quality, correctness of process execution, and computing trust values. Traditionally, provenance management issues have been dealt with in the context of workflow or relational database systems. However, existing provenance systems are inadequate to address the requirements of an emerging set of applications in the new eScience or Cyberinfrastructure paradigm and the Semantic Web. Provenance in these applications incorporates complex domain semantics on a large scale with a variety of uses, including accurate interpretation by software agents, trustworthy data integration, reproducibility, attribution for commercial or legal applications, and trust computation. In this dissertation, we introduce the notion of 'semantic provenance' to address these requirements for eScience and Semantic Web applications. In addition, we describe a framework for management of semantic provenance by addressing the three issues of, (a) provenance representation, (b) query & analysis, and (c) scalable implementation. First, we introduce a foundational model of provenance called Provenir to serve as an upper-level reference ontology to facilitate provenance interoperability. Second, we define a classification scheme for provenance queries based on the query characteristics and use this scheme to define a set of specialized provenance query operators. Third, we describe the implementation of a highly scalable query engine to support the provenance query operators, which uses a new class of materialized views based on the Provenir ontology, called Materialized Provenance Views (MPV), for query optimization. We also define a novel provenance tracking approach called Provenance Context Entity (PaCE) for the Resource Description Framework (RDF) model used in Semantic Web applications. 
PaCE, defined in terms of the Provenir ontology, is an effective and scalable approach for RDF provenance tracking in comparison to the currently used RDF reification vocabulary. Finally, we describe the application of the semantic provenance framework in biomedical and oceanography research projects.
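For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of what attaching a single provenance fact to a single RDF triple looks like with the standard RDF reification vocabulary (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`). It only illustrates the reification overhead; it does not show PaCE or the Provenir ontology. The `ex:` identifiers, the `ex:derivedFrom` property and the function name are hypothetical and chosen purely for illustration.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def reify_with_source(triple, statement_id, source):
    """Return the triples needed to attach one provenance fact to `triple`
    using the standard RDF reification vocabulary."""
    subject, predicate, obj = triple
    return [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", subject),
        (statement_id, RDF + "predicate", predicate),
        (statement_id, RDF + "object", obj),
        # Hypothetical provenance property, used only for illustration.
        (statement_id, "ex:derivedFrom", source),
    ]


# Hypothetical example fact and source identifiers.
fact = ("ex:geneA", "ex:interactsWith", "ex:geneB")
reified = reify_with_source(fact, "ex:stmt1", "ex:article42")
for t in reified:
    print(t)
```

In this sketch, one original triple plus one provenance statement expands into five triples, which gives a sense of why reification can become costly for large datasets.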
Amit P. Sheth, Ph.D.
Olivier Bodenreider, Ph.D.
Michael L. Raymer, Ph.D.
Nicholas V. Reo, Ph.D.
Krishnaprasad Thirunarayan, Ph.D.
William S. York, Ph.D.
Dr. Christopher Thomas
Knowledge Acquisition in a System
I present a method for growing the amount of knowledge available on the Web using a hermeneutic approach that involves background knowledge, Information Extraction techniques and validation through discourse and use of the extracted information. I present the metaphor of the “Circle of Knowledge on the Web”. In this context, knowledge acquisition on the Web is seen as analogous to the way scientific disciplines gradually increase the knowledge available in their field. Here, formal models of interest domains are created automatically or manually and then validated by implicit and explicit validation methods before the statements in the created models can be added to larger knowledge repositories, such as the Linked Open Data cloud. This knowledge is then available for the next iteration of the knowledge acquisition cycle. I will give both a theoretical underpinning and practical methods for the acquisition of knowledge in collaborative systems. I will cover both the Knowledge Engineering angle and the Information Extraction angle of this problem. Unlike traditional approaches, however, this dissertation will show how Information Extraction can be incorporated into a mostly Knowledge Engineering based approach as well as how an Information Extraction-based approach can make use of engineered concept repositories. Validation is seen as an integral part of this systemic approach to knowledge acquisition. The centerpiece of the dissertation is a domain model extraction framework that implements the idea of the “Circle of Knowledge” to automatically create semantic models for domains of interest. It splits the involved Information Extraction tasks into that of Domain Definition, in which pertinent concepts are identified and categorized, and that of Domain Description, in which facts are extracted from free text that describe the extracted concepts.
I then outline a social computing strategy for information validation in order to create knowledge from the extracted models.
Dr. Wenbo Wang
Automatic Emotion Identification from Text
People’s emotions can be gleaned from their text using machine learning techniques to build models that exploit large self-labeled emotion data from social media. Further, the self-labeled emotion data can be effectively adapted to train emotion classifiers in different target domains where training data are sparse.
Emotions are both prevalent in and essential to most aspects of our lives. They influence our decision-making, affect our social relationships and shape our daily behavior. With the rapid growth of emotion-rich textual content, such as microblog posts, blog posts, and forum discussions, there is a growing need to develop algorithms and techniques for identifying people's emotions expressed in text. Such identification has valuable implications for studies of suicide prevention, employee productivity, personal well-being, customer relationship management, etc. However, emotion identification is challenging, partly for the following reasons: i) It is a multi-class classification problem that usually involves at least six basic emotions. Text describing an event or situation that causes the emotion can be devoid of explicit emotion-bearing words, so the distinction between different emotions can be very subtle, which makes it difficult to glean emotions from keywords alone. ii) Manual annotation of emotion data by human experts is very labor-intensive and error-prone. iii) Existing labeled emotion datasets are relatively small and thus fail to provide comprehensive coverage of emotion-triggering events and situations.
This dissertation aims at understanding the emotion identification problem and developing general techniques to tackle the above challenges. First, to address the challenge of fine-grained emotion classification, we investigate a variety of lexical, syntactic, knowledge-based, context-based and class-specific features, and show how much these features contribute to the performance of the machine learning classifiers. We also propose a method that automatically extracts syntactic patterns to build a rule-based classifier to improve the accuracy of identifying minority emotions. Second, to deal with the challenge of manual annotation, we leverage emotion hashtags to harvest Twitter 'big data' and collect millions of self-labeled emotion tweets, the labeling quality of which is further improved by filtering heuristics. We discover that the size of the training data plays an important role in the emotion identification task, as it provides comprehensive coverage of different emotion-triggering events/situations. Further, unigram and bigram features alone can achieve performance that is competitive with the best performance obtained using a combination of n-gram, knowledge-based and syntactic features. Third, to handle the paucity of labeled emotion datasets in many domains, we seek to exploit the abundant self-labeled tweet collection to improve emotion identification in text from other domains, e.g., blog posts and fairy tales. We propose an effective data selection approach to iteratively select source data that are informative about the target domain, and use the selected data to enrich the target domain training data. Experimental results show that the proposed method outperforms state-of-the-art domain adaptation techniques on datasets from four different domains: blog, experience, diary and fairy tales.
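The hashtag-based self-labeling and the unigram/bigram features described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the dissertation's pipeline: the hashtag-to-emotion map, the rule that the hashtag must be the final token, and the function names `self_label` and `ngram_features` are all hypothetical stand-ins for the actual hashtag list and filtering heuristics, which are not given here.

```python
import re

# Hypothetical hashtag-to-emotion map; the actual hashtag list is not given in the text.
EMOTION_HASHTAGS = {"#joy": "joy", "#sad": "sadness", "#angry": "anger"}


def self_label(tweet):
    """Return (text, emotion) if the tweet ends with a known emotion hashtag,
    otherwise None. Requiring the hashtag to be the last token is an assumed
    filtering heuristic, used here only as an example."""
    tokens = tweet.split()
    if len(tokens) < 2:
        return None
    emotion = EMOTION_HASHTAGS.get(tokens[-1].lower())
    if emotion is None:
        return None
    # Strip the label hashtag so a classifier cannot simply memorize it.
    return " ".join(tokens[:-1]), emotion


def ngram_features(text, n_values=(1, 2)):
    """Extract lowercase word unigrams and bigrams from the text."""
    words = re.findall(r"[a-z']+", text.lower())
    features = []
    for n in n_values:
        features += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return features


labeled = self_label("Got the job today #joy")
print(labeled)
print(ngram_features("Got the job"))
```

Tweets without a trailing emotion hashtag are simply discarded in this sketch, which trades recall for label quality; other filtering choices are equally plausible.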
Finally, we apply the proposed research to analyze cursing, an emotion-rich activity, on Twitter. We explore a set of questions that prior studies have recognized as crucial for understanding cursing in offline communications, including ubiquity, utility, contextual dependencies, and people factors.
- Wenbo Wang, Lei Duan, Anirudh Koul, Amit P. Sheth. YouRank: Let User Engagement Rank Microblog Search Results. In the Eighth International AAAI Conference on Weblogs and Social Media (ICWSM'14) 2014
- Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Cursing in English on Twitter. In ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW'14) 2014
- Lu Chen, Wenbo Wang, Amit P. Sheth. Are Twitter Users Equal in Predicting Elections? A Study of User Groups in Predicting 2012 U.S. Republican Presidential Primaries. In Proceedings of the Fourth International Conference on Social Informatics (SocInfo'12) 2012
- Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Harnessing Twitter ‘Big Data’ for Automatic Emotion Identification. 2012 ASE International Conference on Social Computing (SocialCom 2012)
- Lu Chen, Wenbo Wang, Meenakshi Nagarajan, Shaojun Wang, Amit P. Sheth. Extracting Diverse Sentiment Expressions with Target-dependent Polarity from Twitter. In Proceedings of the 6th International AAAI Conference on Weblogs and Social Media (ICWSM), 2012
- Wenbo Wang, Lu Chen, Ming Tan, Shaojun Wang, Amit P. Sheth. Discovering Fine-grained Sentiment in Suicide Notes. Biomedical Informatics Insights, 2012
- Ramakanth Kavuluru, Christopher Thomas, Amit Sheth, Victor Chan, Wenbo Wang, Alan Smith. An Up-to-date Knowledge-Based Literature Search and Exploration Framework for Focused Bioscience Domains. IHI 2012 - 2nd ACM SIGHIT Intl Health Informatics Symposium, January 28-30, 2012.
- Alan Smith, Amit Sheth, Ashutosh Jadhav, Hemant Purohit, Lu Chen, Michael Cooney, Pavan Kapanipathi, Pramod Anantharam, Pramod Koneru and Wenbo Wang. Twitris+: Social Media Analytics Platform for Effective Coordination. NSF SoCS Symposium, 2012
- Wenbo Wang, Christopher Thomas, Amit Sheth, Victor Chan. Pattern-Based Synonym and Antonym Extraction. 48th ACM Southeast Conference, ACMSE2010, Oxford Mississippi, April 15-17, 2010
- Christopher J. Thomas, Wenbo Wang, Pankaj Mehra, Delroy Cameron, Pablo N. Mendes, and Amit P. Sheth. What Goes Around Comes Around – Improving Linked Open Data through On-Demand Model Creation. In: Proceedings of the WebSci10: Extending the Frontiers of Society On-Line, April 26-27th, 2010, Raleigh, NC: US.
- Ashutosh Jadhav, Wenbo Wang, Raghava Mutharaju, Pramod Anantharam, Vinh Nguyen, Amit P. Sheth, Karthik Gomadam, Meenakshi Nagarajan, and Ajith Ranabahu. Twitris: Socially Influenced Browsing. Semantic Web Challenge 2009, demo at 8th International Semantic Web Conference, Oct. 25-29 2009, Washington, DC, USA