Thesis Defense


Dr. Pramod Anantharam

Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social Systems


Abstract

There is a rapid intertwining of sensors and mobile devices into the fabric of our lives. This has resulted in unprecedented growth in the number of observations from the physical and social worlds reported in the cyber world. A system of sensing and computational components embedded in the physical world is termed a Cyber-Physical System (CPS). The current science of CPS has yet to effectively integrate citizen observations into CPS analysis. We demonstrate the role of citizen observations in CPS and propose a novel approach to perform a holistic analysis of machine and citizen sensor observations. Specifically, we demonstrate the complementary, corroborative, and timely aspects of citizen sensor observations compared to machine sensor observations in Physical-Cyber-Social (PCS) Systems.

Physical processes are inherently complex and embody uncertainties. They manifest as machine and citizen sensor observations in PCS Systems. We propose a generic framework to move from observations to decision-making and actions in PCS systems, consisting of: (a) PCS event extraction, (b) PCS event understanding, and (c) PCS action recommendation. We demonstrate the role of Probabilistic Graphical Models (PGMs) as a unified framework to deal with the uncertainty, complexity, and dynamism that must be handled to translate observations into actions. Data-driven approaches alone are not guaranteed to synthesize PGMs that reflect real-world dependencies accurately. To overcome this limitation, we propose to empower PGMs using declarative domain knowledge. Specifically, we propose four techniques: (a) automatic creation of massive training data for Conditional Random Fields (CRFs) using domain knowledge of entities, used in PCS event extraction; (b) Bayesian Network structure refinement using causal knowledge from ConceptNet, used in PCS event understanding; (c) knowledge-driven piecewise linear approximation of nonlinear time series dynamics using Linear Dynamical Systems (LDS), used in PCS event understanding; and (d) transformation of knowledge of goals and actions into a Markov Decision Process (MDP) model, used in PCS action recommendation.
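To make technique (a) concrete, the sketch below shows one way declarative domain knowledge of entities could be turned into token-level training data for a sequence model such as a CRF: a gazetteer of known entity phrases is matched against raw text to emit BIO tags automatically. The gazetteer, entity types, and sample sentence are invented for illustration; the dissertation's actual pipeline is of course richer than this.

```python
# Minimal sketch: auto-generating BIO-tagged training data for a CRF from
# a domain gazetteer. GAZETTEER and the sample text are hypothetical.

GAZETTEER = {
    ("slow", "traffic"): "DELAY",
    ("accident",): "EVENT",
    ("i-75",): "LOCATION",
}

def bio_label(tokens, gazetteer):
    """Greedily match gazetteer phrases (longest first) and emit BIO tags."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for phrase, etype in sorted(gazetteer.items(), key=lambda kv: -len(kv[0])):
            n = len(phrase)
            if tuple(t.lower() for t in tokens[i:i + n]) == phrase:
                labels[i] = "B-" + etype
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + etype
                i += n
                break
        else:
            i += 1
    return list(zip(tokens, labels))

print(bio_label("Accident on I-75 causing slow traffic".split(), GAZETTEER))
```

(token, label) sequences produced this way can be fed directly to a standard CRF trainer, which is the sense in which domain knowledge substitutes for manual annotation.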

We evaluate the benefits of the proposed techniques on real-world applications involving traffic analytics and Internet of Things (IoT).




Dissertation in the Knoesis Library
http://knoesis.org/node/2702


Dr. Delroy Cameron

A Context-Driven Subgraph Model for Literature-Based Discovery


Abstract

Literature-Based Discovery (LBD) refers to the process of uncovering hidden connections that are implicit in scientific literature. Numerous hypotheses have been generated from scientific literature in this way, influencing innovations in diagnosis, treatment, prevention, and overall public health. However, much of the existing research on discovering hidden connections among concepts has used distributional statistics and graph-theoretic measures to capture implicit associations. Such metrics do not explicitly capture the semantics of hidden connections. Rather, they only allude to the existence of meaningful underlying associations. To gain in-depth insights into the meaning of hidden (and other) connections, complementary methods have often been employed. These methods include: 1) the use of domain expertise for concept filtering and knowledge exploration, 2) leveraging structured background knowledge for context and to supplement concept filtering, and 3) developing heuristics a priori to help eliminate spurious connections.

While effective in some situations, the practice of relying on domain expertise, structured background knowledge, and heuristics to complement distributional and graph-theoretic approaches has serious limitations. The main issue is that the intricate context of complex associations is not always known a priori and cannot easily be computed without understanding the underlying semantics of the associations. Complex associations should not be overlooked, since they are often needed to elucidate mechanisms of interaction and causality relationships among concepts. Moreover, they can capture the broader aspects of a domain by segregating associations along different thematic dimensions, such as Metabolic Function, Pharmaceutical Treatment, and Neurological Activity.

This dissertation proposes an innovative context-driven, automatic subgraph creation method for finding hidden and complex associations among concepts along multiple thematic dimensions. It outlines definitions for context and shared context, based on implicit and explicit (or formal) semantics, which compensate for deficiencies in statistical and graph-based metrics and eliminate the need for a priori heuristics. An evidence-based evaluation of the proposed framework showed that 8 out of 9 existing scientific discoveries could be recovered using this approach. Additionally, insights into the meaning of associations could be obtained from the provenance provided by the system. A statistical evaluation of the interestingness of the generated subgraphs observed that an arbitrary association is mentioned in only about 4 MEDLINE articles on average. These results suggest that leveraging implicit and explicit context, as defined in this dissertation, advances the state of the art in LBD research.
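As a minimal illustration of dimension-constrained subgraph discovery, the sketch below enumerates association paths between two concepts while keeping only predicates from one thematic dimension. The triples echo Swanson's classic fish oil / Raynaud disease discovery but are invented here; the dissertation's subgraphs are built from MEDLINE-scale data with far richer semantics.

```python
# Sketch: enumerating association paths between two concepts, keeping only
# edges whose predicate belongs to one thematic dimension. Toy data only.
from collections import defaultdict

TRIPLES = [
    ("Fish Oil", "affects", "Blood Viscosity"),
    ("Blood Viscosity", "associated_with", "Raynaud Disease"),
    ("Fish Oil", "isa", "Dietary Supplement"),   # filtered out below
]
DIMENSION = {"affects", "associated_with"}       # a hypothetical dimension

graph = defaultdict(list)
for s, p, o in TRIPLES:
    if p in DIMENSION:
        graph[s].append((p, o))

def paths(src, dst, depth=3, trail=()):
    """Depth-bounded DFS over dimension-filtered edges."""
    if src == dst:
        yield trail
        return
    if depth == 0:
        return
    for p, o in graph[src]:
        yield from paths(o, dst, depth - 1, trail + ((src, p, o),))

for path in paths("Fish Oil", "Raynaud Disease"):
    print(" ".join(f"{s} -[{p}]->" for s, p, o in path), path[-1][2])
```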



Dr. Karthik Gomadam

Semantics Enriched Service Environments


Abstract

During the past seven years, services-centric computing has emerged as the preferred approach for architecting complex software. Software is increasingly developed by integrating remotely hosted components, popularly called services. This architectural paradigm, also called Service Oriented Architecture (SOA), brings with it the benefits of interoperability, agility, and flexibility in software design and development. One can easily add features to existing systems or change them, either by adding new services or by replacing existing ones. Two popular approaches have emerged for realizing SOA. The first is based on the SOAP protocol for communication and the Web Service Description Language (WSDL) for service interface description. SOAP and WSDL are built over XML, thus guaranteeing minimal structural and syntactic interoperability. In addition to SOAP and WSDL, the WS-* (WS-Star) stack, or SOAP stack, comprises other standards and specifications that enable features such as security and services integration. More recently, the RESTful approach has emerged as an alternative to the SOAP stack. This approach advocates the use of the HTTP operations GET/PUT/POST/DELETE as standard service operations and the REpresentational State Transfer (REST) paradigm for maintaining service state. The RESTful approach leverages the HTTP protocol and has gained a lot of traction, especially in the context of consumer Web applications such as online maps.

Despite their growing adoption, the stated objectives of interoperability, agility, and flexibility have been hard to achieve using either of the two approaches. This is largely because of the various heterogeneities that exist between different service providers. These heterogeneities are present at both the data and the interaction levels. Fundamental to addressing them are the problems of service description, discovery, data mediation, and dynamic configuration. Currently, service descriptions capture the various operations, the structure of the data, and the invocation protocol. They do not, however, capture the semantics of either the data or the interactions. This minimal description impedes the ability to find the right set of services for a given task, thus affecting the important task of service discovery. Data mediation is by far the most arduous task in service integration. It has been a well-studied problem in the areas of workflow management, multi-database systems, and services computing. Data models that describe real-world data, such as enterprise data, often involve hundreds of attributes. Approaches for automatic mediation have not been very successful, while the complexity of the models requires considerable human effort. The above-mentioned problems in description, discovery, and data mediation pose considerable challenges to creating software that can be dynamically configured.

This dissertation is one of the first attempts to address the problems of description, discovery, data mediation, and dynamic configuration in the context of both SOAP and RESTful services. This work builds on past research in the areas of the Semantic Web, Semantic Web services, and Service Oriented Architectures. In addition to addressing these problems, this dissertation also extends the principles of services computing to the emerging area of social and human computation. The core contributions of this work include a mechanism to add semantic metadata to RESTful services and resources on the Web, an algorithm for service discovery and ranking, and techniques for aiding data mediation and dynamic configuration. This work also addresses the problem of identifying events during service execution, and data integration in the context of socially powered services.
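As a toy illustration of semantics-aware discovery and ranking, the sketch below represents each service operation by the set of ontology concepts annotating it and ranks operations against a request by concept overlap. The operation names, concepts, and Jaccard scoring are invented stand-ins; the dissertation's actual algorithm uses much richer signals.

```python
# Sketch: ranking semantically annotated service operations against a
# request by concept overlap. Operation names and concepts are invented.

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

SERVICES = {
    "GeoCoderService#getCoordinates": {"Address", "Latitude", "Longitude"},
    "WeatherService#getForecast": {"Latitude", "Longitude", "Forecast"},
    "ZipService#lookupZip": {"Address", "ZipCode"},
}

request = {"Address", "Latitude", "Longitude"}  # concepts the client needs
for op, concepts in sorted(SERVICES.items(),
                           key=lambda kv: jaccard(kv[1], request),
                           reverse=True):
    print(f"{jaccard(concepts, request):.2f}  {op}")
```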



Dr. Cory Henson

A Semantics-based Approach to Machine Perception


Abstract

Machine perception can be formalized using Semantic Web technologies in order to derive abstractions from sensor data using background knowledge on the Web, and can be efficiently executed on resource-constrained devices.

Advances in sensing technology hold the promise to revolutionize our ability to observe and understand the world around us. Yet the gap between observation and understanding is vast. As sensors become more advanced and cost-effective, the result is an avalanche of data of high volume, high velocity, and varied type, leading to the problem of too much data and not enough knowledge (i.e., insights leading to actions). Current estimates predict over 50 billion sensors connected to the Web by 2020. While the challenge of this data deluge is formidable, a resolution has profound implications. The ability to translate low-level data into high-level abstractions closer to human understanding and decision-making has the potential to disrupt data-driven interdisciplinary sciences, such as environmental science, healthcare, and bioinformatics, as well as enable other emerging technologies, such as the Internet of Things.

The ability to make sense of sensory input is called perception; and while people are able to perceive their environment almost instantaneously, and seemingly without effort, machines continue to struggle with the task. Machine perception is a hard problem in computer science, with many fundamental issues that are yet to be adequately addressed, including: (a) annotation of sensor data, (b) interpretation of sensor data, and (c) efficient implementation and execution. This dissertation presents a semantics-based machine perception framework to address these issues.
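To give a flavor of the interpretation issue (b), here is a minimal sketch of perception-as-explanation: an observed set of properties is explained by exactly those features in a background knowledge base whose expected properties cover all of the observations. The knowledge base entries are hypothetical toy stand-ins for background knowledge on the Web.

```python
# Sketch: explanation as coverage. A feature (e.g., a condition) explains
# the observations if its known properties include every observed property.
# The KB below is invented for illustration.

KB = {
    "flu": {"fever", "cough", "fatigue"},
    "cold": {"cough", "sneezing"},
    "allergy": {"sneezing", "itchy eyes"},
}

def explain(observed):
    """Return features whose expected properties cover all observations."""
    return [f for f, props in KB.items() if observed <= props]

print(explain({"fever", "cough"}))   # ['flu']
print(explain({"cough"}))            # ['flu', 'cold'] -- still ambiguous
```

In this framing, perception iterates between explanation (narrowing candidate features) and seeking further observations that discriminate among the remaining candidates.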


Dr. Ashutosh Jadhav

Knowledge-driven Search Intent Mining


Abstract

Understanding the latent intents behind users' search queries is essential for satisfying their search needs. Search intent mining can help search engines enhance their ranking of search results, enabling new search features like instant answers, personalization, search result diversification, and the recommendation of more relevant ads. Consequently, there has been increasing attention on studying how to effectively mine search intents by analyzing search engine query logs. While state-of-the-art techniques can identify the domain of a query (e.g., sports, movies, health), identifying domain-specific intent is still an open problem. Among all the topics available on the Internet, health is one of the most important in terms of impact on the user, and it is one of the most frequently searched areas. This dissertation presents a knowledge-driven approach for domain-specific search intent mining with a focus on health-related search queries.

First, we identified 14 consumer-oriented health search intent classes based on inputs from focus group studies, analyses of popular health websites, literature surveys, and an empirical study of search queries. We formulated the task of classifying millions of health search queries into zero or more intent classes as a multi-label classification problem. Popular machine learning approaches for multi-label classification (namely, problem transformation and algorithm adaptation methods) were not feasible due to the limitations of labeled-data creation and health-domain constraints. Another challenge in search intent identification was mapping terms used by laymen to medical terms. To address these challenges, we developed a semantics-driven, rule-based search intent mining approach leveraging rich background knowledge encoded in the Unified Medical Language System (UMLS) and a crowdsourced encyclopedia (Wikipedia). The approach can identify search intent in a disease-agnostic manner and has been evaluated on three major diseases.
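A minimal sketch of the semantics-driven, rule-based idea follows: layman phrases are first normalized to medical concepts (standing in for UMLS/Wikipedia lookups), and trigger-term rules then assign zero or more intent labels, making the classifier naturally multi-label. All mappings, rules, and class names here are invented examples, not the dissertation's actual 14 classes.

```python
# Sketch: rule-based, multi-label health search intent classification.
# LAYMAN_TO_MEDICAL stands in for UMLS/Wikipedia lookups; all terms are toy.

LAYMAN_TO_MEDICAL = {"sugar": "diabetes", "high bp": "hypertension"}

INTENT_RULES = {
    "symptoms": {"symptom", "signs", "feel"},
    "treatment": {"treatment", "cure", "remedy"},
    "medication": {"drug", "medicine", "side effects"},
}

def classify(query):
    q = query.lower()
    for layman, medical in LAYMAN_TO_MEDICAL.items():
        q = q.replace(layman, medical)        # normalize layman terms
    labels = {intent for intent, triggers in INTENT_RULES.items()
              if any(t in q for t in triggers)}
    return q, sorted(labels)                  # zero or more labels

print(classify("home remedy for sugar"))      # treatment, about diabetes
print(classify("high bp drug side effects"))  # medication, about hypertension
```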

While users often turn to search engines to learn about health conditions, a surprising amount of health information is also shared and consumed via social media, such as the public social platform Twitter. Although Twitter is an excellent information source, identifying informative tweets within the deluge of tweets is a major challenge. We used a hybrid approach consisting of supervised machine learning, rule-based classifiers, and biomedical domain knowledge to facilitate the retrieval of relevant and reliable health information shared on Twitter in real time. Furthermore, we extended our search intent mining algorithm to classify health-related tweets into health categories. Finally, we performed a large-scale study comparing health search intents, and the features that contribute to the expression of search intent, across more than 100 million search queries from smart devices (smartphones or tablets) and personal computers (desktops or laptops).






Dr. Prateek Jain

Linked Open Data Alignment and Querying


Abstract

The recent emergence of the “Linked Data” approach for publishing data represents a major step forward in realizing the original vision of a web that can "understand and satisfy the requests of people and machines to use the web content" – i.e., the Semantic Web. This new approach has resulted in the Linked Open Data (LOD) Cloud, which includes more than 295 large datasets contributed by experts belonging to diverse communities such as geography, entertainment, and life sciences. However, the current interlinks between datasets in the LOD Cloud – as we will illustrate – are too shallow to realize much of the promised benefit. If this limitation is left unaddressed, the LOD Cloud will merely be more data suffering from the same kinds of problems that plague the Web of Documents, and the vision of the Semantic Web will fall short.

This thesis presents a comprehensive solution to the issues of alignment and relationship identification using a bootstrapping-based approach. By alignment, we mean the process of determining correspondences between classes and properties of ontologies. We identify subsumption, equivalence, and part-of relationships between classes, as well as part-of relationships between instances; between properties, we establish subsumption and equivalence relationships. By bootstrapping, we mean utilizing the information contained within the datasets themselves to improve the data within them. The work showcases the use of bootstrapping-based methods to identify and create richer relationships between LOD datasets. The BLOOMS project (http://wiki.knoesis.org/index.php/BLOOMS) and the PLATO project, both built as part of this research, have provided evidence of the feasibility and applicability of the solution.
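The sketch below gives a loose, simplified flavor of overlap-based alignment: each class is represented by a set of ancestor categories (standing in for the category trees an approach like BLOOMS builds), and the overlap ratios decide the relationship. The decision rules, class names, and category sets are all hypothetical simplifications.

```python
# Sketch, loosely inspired by overlap-based ontology alignment: compare two
# classes via their ancestor-category sets. All data here is invented.

def align(anc_a, anc_b):
    common = anc_a & anc_b
    if not common:
        return "no alignment"
    ra, rb = len(common) / len(anc_a), len(common) / len(anc_b)
    if ra == 1.0 and rb == 1.0:
        return "equivalentClass"
    if rb == 1.0:
        return "A subClassOf B"   # B's context is fully contained in A's
    if ra == 1.0:
        return "B subClassOf A"
    return f"related (overlap {ra:.2f} / {rb:.2f})"

dataset_a_jaguar = {"Felines", "Mammals", "Animals"}
dataset_b_mammal = {"Mammals", "Animals"}
print(align(dataset_a_jaguar, dataset_b_mammal))  # A subClassOf B
```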



Video link - Prateek Jain dissertation defense


Dr. Pavan Kapanipathi

Personalized and Adaptive Semantic Information Filtering for Social Media


Abstract

Social media has experienced immense growth in recent times. These platforms are becoming increasingly common for information seeking and consumption, and with this growing popularity, information overload poses a significant challenge to users. For instance, Twitter alone generates around 500 million tweets per day, and it is impractical for users to parse through such an enormous stream to find information that is interesting to them. This situation necessitates efficient personalized filtering mechanisms for users to consume relevant, interesting information from social media.

Building a personalized filtering system involves understanding users' interests and utilizing those interests to deliver relevant information. These tasks primarily involve analyzing and processing social media text, which is challenging due to its short length and the real-time nature of the medium. The challenges include: (1) Lack of semantic context: social media posts are, on average, short in length, which provides limited semantic context for textual analysis. This is particularly detrimental to topic identification, a necessary task for mining users' interests. (2) Dynamically changing vocabulary: most social media websites, such as Twitter and Facebook, generate posts that are of current (timely) interest to users. Due to this real-time nature, information relevant to dynamic topics of interest evolves, reflecting changes in the real world. This in turn changes the vocabulary associated with these dynamic topics, making it harder to filter relevant information. (3) Scalability: the number of users on social media platforms is significantly large, making it difficult for centralized systems to scale to deliver relevant information to every user.

This dissertation is devoted to exploring semantic techniques and Semantic Web technologies to address the above challenges in building a personalized information filtering system for social media. In particular, the necessary semantics (knowledge) is derived from crowdsourced knowledge bases such as Wikipedia to improve context for understanding short text and dynamic topics on social media.
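As a small sketch of how hierarchical background knowledge can enrich sparse user signals, the code below propagates observed interest scores up a toy category hierarchy (a stand-in for Wikipedia's category graph) with a decay factor, so that broader interests such as "Football" accumulate evidence from specific pages. The hierarchy, seed scores, and decay value are purely illustrative.

```python
# Sketch: decayed upward propagation of interest scores through a toy
# category hierarchy. PARENTS and the seed scores are hypothetical.

PARENTS = {
    "FC Barcelona": ["Football Clubs"],
    "Football Clubs": ["Football"],
    "Lionel Messi": ["Football Players"],
    "Football Players": ["Football"],
}

def propagate(seed_scores, decay=0.5):
    scores = dict(seed_scores)
    frontier = list(seed_scores.items())
    while frontier:
        node, score = frontier.pop()
        for parent in PARENTS.get(node, []):
            gained = score * decay
            if gained > 0.01:  # stop once contributions become negligible
                scores[parent] = scores.get(parent, 0.0) + gained
                frontier.append((parent, gained))
    return scores

print(propagate({"FC Barcelona": 1.0, "Lionel Messi": 0.8}))
```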





Dr. Pablo Mendes

Adaptive Semantic Annotation of Entity and Concept Mentions in Text



Abstract

Recent years have seen an increase in interest in knowledge repositories that are useful across applications, in contrast to the creation of ad hoc or application-specific databases. These knowledge repositories figure as a central provider of unambiguous identifiers and semantic relationships between entities. As such, these shared entity descriptions serve as a common vocabulary for exchanging and organizing information in different formats and for different purposes. Therefore, there has been remarkable interest in systems that can automatically tag textual documents with identifiers from shared knowledge repositories, so that the content of those documents is described in a vocabulary that is unambiguously understood across applications.

Tagging textual documents according to these knowledge bases is a challenging task. It involves recognizing the entities and concepts that are mentioned in a particular passage and resolving any ambiguity of language in order to choose one of many possible meanings for a phrase. There has been substantial work on recognizing and disambiguating entities for specialized applications, or constrained to limited entity types and particular types of text. In the context of shared knowledge bases, since each application has potentially very different needs, systems must have unprecedented breadth and flexibility to ensure their usefulness across applications. Documents may exhibit different language and discourse characteristics, discuss very diverse topics, or require a focus on parts of the knowledge repository that are inherently harder to disambiguate. In practice, for developers looking for a system to support their use case, it is often unclear whether an existing solution is applicable, leading those developers to trial and error and ad hoc usage of multiple systems in an attempt to achieve their objective.

In this dissertation, I propose a conceptual model that unifies related techniques in this space under a common multi-dimensional framework, enabling the elucidation of the strengths and limitations of each technique and supporting developers in their search for a suitable tool for their needs. Moreover, the model serves as the basis for the development of flexible systems that are able to support document tagging for different use cases. I describe such an implementation, DBpedia Spotlight, along with extensions we made to the DBpedia knowledge base to support it. I report evaluations of this tool on several well-known data sets, and demonstrate applications to diverse use cases for further validation.
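A stripped-down sketch of the disambiguation step at the heart of such a tagger: each candidate sense of an ambiguous surface form is scored by the similarity between the input context and the words associated with that sense. The candidate contexts below are toy data, and the plain cosine-over-counts scoring is a simplification of DBpedia Spotlight's actual model.

```python
# Sketch: context-based sense disambiguation for an ambiguous surface form.
# CANDIDATES maps each sense to toy context words; all data is invented.
from collections import Counter
from math import sqrt

CANDIDATES = {
    "Washington (state)": "state pacific northwest seattle rainier",
    "Washington, D.C.":   "capital president congress white house",
}

def cosine(a, b):
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

context = "the president met congress in washington near the white house"
best = max(CANDIDATES, key=lambda sense: cosine(context, CANDIDATES[sense]))
print(best)  # Washington, D.C.
```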


Publications
  • 2013, Pablo N. Mendes, Dirk Weissenborn, Chris Hokamp. DBpedia Spotlight at the MSM2013 Challenge. #MSM 2013: 57-61, at WWW 2013.
  • 2012, Pablo N. Mendes, Peter Mika, Hugo Zaragoza, and Roi Blanco. Measuring website similarity using an entity-aware click graph. In 21st ACM International Conference on Information and Knowledge Management (CIKM’12), pages 1697–1701, 2012.
  • 2012, Mihály Héder and Pablo N. Mendes. Round-trip semantics with Sztakipedia and DBpedia Spotlight. In Proceedings of the 21st World Wide Web Conference, WWW 2012 (Companion Volume), pages 357–360, 2012.
  • 2012, Pablo N. Mendes, Joachim Daiber, Rohana Rajapakse, Felix Sasaki, and Christian Bizer. Evaluating the impact of phrase recognition on concept tagging. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, 2012.
  • 2011, Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. DBpedia Spotlight: shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, pages 1–8, 2011.
  • 2011, Pablo N. Mendes, Joachim Daiber, Max Jakob, and Christian Bizer. Evaluating DBpedia Spotlight for the TAC-KBP entity linking task. In Proceedings of the TAC-KBP 2011 Workshop, 2011.
  • 2010, Pablo N. Mendes, Alexandre Passant, Pavan Kapanipathi. Twarql: tapping into the wisdom of the crowd. I-SEMANTICS 2010.
  • 2010, Pablo N. Mendes, Alexandre Passant, Pavan Kapanipathi, Amit P. Sheth. Linked Open Social Signals. Web Intelligence 2010: 224-231.

Dr. Meena Nagarajan

Understanding User-generated Content on Social Media


Abstract

Over the last few years, there has been a growing public and enterprise fascination with 'social media' and its role in modern society. At the heart of this fascination is the ability for users to participate, collaborate, consume, create, and share content via a variety of platforms such as blogs, micro-blogs, email, instant messaging services, social network services, collaborative wikis, social bookmarking sites, and multimedia sharing sites. Today, in addition to any factual information, we are also able to access the conversations, opinions, and emotions that these facts evoke among other users. We are able to ask questions such as: what are people saying about any newsworthy event or entity? Can we use this information to assess a population's preference? Can we study how these preferences propagate in a network of friends? Are such crowd-sourced preferences a good substitute for traditional polling methods?

This dissertation is devoted to understanding informal user-generated textual content on social media platforms and using the results of the analysis to build Social Intelligence Applications. The body of research presented in this thesis focuses on understanding what a piece of user-generated content is about via two sub-goals: Named Entity Recognition and Key Phrase Extraction on informal text. In light of the poor context and informal nature of content on social media platforms, we investigate the role of contextual information from documents, domain models, and the social medium to supplement and improve the reliability and performance of existing text mining algorithms for Named Entity Recognition and Key Phrase Extraction. In all cases we find that using multiple contextual cues together leads to reliable, inter-dependent decisions, better than using the cues in isolation, and that such improvements are robust across domains and content of varying characteristics, from micro-blogs like Twitter, to social networking forums such as those on MySpace and Facebook, to blogs on the Web.

Finally, we showcase two deployed Social Intelligence applications that build over the results of the Named Entity Recognition and Key Phrase Extraction algorithms to provide near real-time information about the pulse of an online populace. Specifically, we describe what it takes to build applications that wish to exploit the wisdom of the crowds, highlighting challenges in data collection, processing informal English text, metadata extraction, and presentation of the resulting information.
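As a minimal sketch of the multiple-cues idea, the code below lets a domain-model cue and a local-context cue vote on a candidate entity mention in informal text, accepting the mention only when the cues corroborate each other. The cue implementations, lexicons, and threshold are illustrative stand-ins, not the thesis's actual feature functions.

```python
# Sketch: combining contextual cues to accept/reject a candidate entity
# mention in informal text. All lexicons and thresholds are toy values.

def domain_model_cue(mention, domain_terms):
    return 1.0 if mention.lower() in domain_terms else 0.0

def local_context_cue(text, indicators):
    window = text.lower()
    return 1.0 if any(w in window for w in indicators) else 0.0

def is_entity(text, mention, domain_terms, indicators, threshold=1.5):
    score = (domain_model_cue(mention, domain_terms)
             + local_context_cue(text, indicators))
    return score >= threshold  # corroborating cues beat any cue alone

domain_terms = {"yesterday"}                       # e.g., song titles
indicators = {"listen", "song", "track", "album"}  # music-y context words

print(is_entity("listening to Yesterday on repeat", "Yesterday",
                domain_terms, indicators))         # True: both cues fire
print(is_entity("yesterday was a long day", "Yesterday",
                domain_terms, indicators))         # False: no context support
```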



Committee Members

Amit P. Sheth, Ph.D. (Advisor)

John M. Flach, Ph.D.

Daniel Gruhl, Ph.D.

Kevin Haas, M.S.

Michael L. Raymer, Ph.D.

Shaojun Wang, Ph.D.

Dr. Matthew Perry

A Framework to Support Spatial, Temporal and Thematic Analytics over Semantic Web Data


Abstract

Spatial and temporal data are critical components in many applications. This is especially true in analytical applications ranging from scientific discovery to national security and criminal investigation. The analytical process often requires uncovering and analyzing complex thematic relationships between disparate people, places and events. Fundamentally new query operators based on the graph structure of Semantic Web data models, such as semantic associations, are proving useful for this purpose. However, these analysis mechanisms are primarily intended for thematic relationships. This dissertation proposes a framework built around the RDF data model for analysis of thematic, spatial and temporal relationships between named entities. We present a spatiotemporal modeling approach that uses an upper-level ontology in combination with temporal RDF graphs. A set of query operators that use graph patterns to specify a form of context are formally defined, and an extension of the W3C-recommended SPARQL query language to support these query operators is presented. We also describe an efficient implementation of the framework that extends a state-of-the-art commercial database system. We demonstrate the scalability of our approach with a performance study using both synthetic and real-world RDF datasets of over 25 million triples.
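The sketch below conveys the temporal side of such queries in miniature: each triple carries a validity interval, and a graph-pattern match is restricted to triples valid at a time of interest. The data and the single operator are invented; the framework itself defines a family of SPARQL-extending operators grounded in an upper-level ontology and temporal RDF graphs.

```python
# Sketch: matching a (s, p, o) pattern over a temporal RDF graph, keeping
# only triples valid at a given time. The triples below are hypothetical.
from datetime import date

# (subject, predicate, object, valid_from, valid_to)
TEMPORAL_TRIPLES = [
    ("alice", "employed_by", "AcmeCorp", date(2004, 1, 1), date(2006, 6, 30)),
    ("alice", "located_in", "Dayton", date(2005, 3, 1), date(2007, 1, 1)),
    ("bob", "employed_by", "AcmeCorp", date(2007, 1, 1), date(2009, 1, 1)),
]

def match(pattern, at):
    """Yield triples matching the pattern (None = variable) valid at `at`."""
    s, p, o = pattern
    for ts, tp, to, t0, t1 in TEMPORAL_TRIPLES:
        if ((s is None or s == ts) and (p is None or p == tp)
                and (o is None or o == to) and t0 <= at <= t1):
            yield ts, tp, to

# Who was employed by AcmeCorp in mid-2005?
print(list(match((None, "employed_by", "AcmeCorp"), date(2005, 7, 1))))
```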

Committee Members

Amit P. Sheth, Ph.D. (Advisor)

Krishnaprasad Thirunarayan, Ph.D.

Soon Chung, Ph.D.

Christopher Barton, Ph.D.

Kate Beard, Ph.D.

Dr. Hemant Purohit

Mining Behavior of Citizen Sensor Communities to Improve Cooperation with Organizational Actors


Abstract

Social media provides a natural platform for the dynamic emergence of citizen (as) sensor communities, in which citizens share information, express opinions, and engage in discussions. Often such an Online Citizen Sensor Community (CSC) has stated or implied goals related to the workflows of organizational actors with defined roles and responsibilities; for example, a community of crisis response volunteers may inform the prioritization of responses to resource needs (e.g., medical) to assist the managers of crisis response organizations. However, in a CSC there are challenges related to information overload for organizational actors, including finding reliable information providers and finding actionable information from citizens. This threatens the awareness and articulation of workflows needed to enable cooperation between citizens and organizational actors. CSCs supported by Web 2.0 social media platforms offer new opportunities and pose new challenges.

This work addresses issues of ambiguity in interpreting unconstrained natural language (e.g., 'wanna help' appearing in messages both asking for and offering help during crises), sparsity of user and group behaviors (e.g., expression of specific intent), and diversity of user demographics (e.g., medical or technical professionals) for interpreting the user-generated data of citizen sensors. Interdisciplinary research involving the social and computer sciences is essential to address these socio-technical issues in CSC and to give organizational actors better access to user-generated data at a higher level of information abstraction. This study presents a novel web information processing framework focused on actors and actions in cooperation, called Identify-Match-Engage (IME), which fuses top-down and bottom-up computing approaches to design a cooperative web information system between citizens and organizational actors. It includes: (a) identification of action-related seeking and offering intent behaviors in short, unstructured text documents using a classification model based on both declarative and statistical knowledge; (b) matching of seeking and offering intentions; and (c) engagement models of users and groups in CSC to prioritize whom to engage, by modeling context with social theories using features of users, their generated content, and their dynamic connections in user interaction networks.
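A toy sketch of steps (a) and (b) of IME follows: a small declarative pattern layer labels short messages as seeking or offering a resource, and seekers are then matched to offerers of the same resource type. The patterns, resource lexicon, and messages are invented; the dissertation combines such declarative knowledge with statistical classifiers rather than using rules alone.

```python
# Sketch: label messages as seek/offer plus resource type, then match.
# Patterns, resources, and messages are hypothetical toy data.
import re

SEEK = re.compile(r"\b(need|want|looking for|require)\b", re.I)
OFFER = re.compile(r"\b(donat\w*|offering|can provide|giving away)\b", re.I)
RESOURCES = {"blankets", "water", "medical supplies"}

def label(msg):
    intent = "seek" if SEEK.search(msg) else "offer" if OFFER.search(msg) else None
    resource = next((r for r in sorted(RESOURCES) if r in msg.lower()), None)
    return intent, resource

msgs = ["We urgently need blankets at the shelter",
        "Happy to donate blankets and water this weekend"]
labeled = [(m, *label(m)) for m in msgs]
seeks = {r: m for m, i, r in labeled if i == "seek"}
offers = {r: m for m, i, r in labeled if i == "offer"}
for r in seeks.keys() & offers.keys():
    print(f"match on {r!r}:\n  ask:   {seeks[r]}\n  offer: {offers[r]}")
```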

The results show a greater improvement in modeling efficiency from the fusion of top-down knowledge-driven and bottom-up data-driven approaches than from conventional bottom-up approaches alone, for modeling both intent and engagement. Applications of this work include the use of the engagement interface tool during recent crises to enable efficient citizen engagement, spreading critical information about prioritized needs so that citizens donate only the supplies that are required. The engagement interface application also won the Young Innovator 2014 award from the ITU, the United Nations ICT agency. Additionally, the intent classification technology for identifying the seeking and offering of help during a crisis was integrated into CrisisNET, a project of the crisis-mapping pioneer Ushahidi, for broader impact.


Dr. Cartic Ramakrishnan

Extracting, Representing and Mining Semantic Metadata from Text: Facilitating Knowledge Discovery in Biomedicine


Abstract

The information access paradigm offered by most contemporary text information systems is a search-and-sift paradigm, where users have to manually glean and aggregate relevant information from the large number of documents typically returned in response to keyword queries. Expecting users to glean and aggregate information has led to several inadequacies in these systems. Owing to the size of many text databases, search-and-sift is very tedious, often requiring repeated keyword searches that refine or generalize query terms. A more serious limitation arises from the lack of automated mechanisms to aggregate content across different documents to discover new knowledge. This dissertation focuses on processing text to assign semantic interpretations to its content (extracting semantic metadata) and on the design of algorithms and heuristics that utilize the extracted semantic metadata to support knowledge discovery operations over text content. The contributions in extracting semantic metadata cover the extraction of compound entities and of complex relationships connecting entities. Extraction results are represented using a standard Semantic Web representation language (RDF) and are manually evaluated for accuracy. The knowledge discovery algorithms presented herein operate on RDF data. To further improve access to text content, applications supporting semantic browsing and semantic search of text are presented.
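To illustrate the representation choice, the snippet below encodes one extracted relationship as RDF triples using the rdflib library. The namespace, entities, relation, and provenance note are all invented for illustration; the dissertation's extraction pipeline is far more involved.

```python
# Sketch: representing an extracted relationship as RDF with rdflib.
# The ex: namespace and all entities/relations are hypothetical.
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/biomed#")
g = Graph()
g.bind("ex", EX)

# "Ketamine induces hyperactivity in mice" -> triples plus a source note
g.add((EX.Ketamine, EX.induces, EX.Hyperactivity))
g.add((EX.Hyperactivity, EX.observedIn, EX.Mice))
g.add((EX.Ketamine, EX.extractedFrom, Literal("sentence-42 (toy source id)")))

print(g.serialize(format="turtle"))  # rdflib >= 6 returns a str
```

Once in RDF, such assertions can be traversed and aggregated across documents, which is what enables the knowledge discovery operations described above.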


Committee Members

Amit P. Sheth, Ph.D. (Advisor)

Vasant Honavar, Ph.D.

Michael L. Raymer, Ph.D.

Thaddeus Tarpey, Ph.D.

Shaojun Wang, Ph.D.

Dr. Ajith Ranabahu

Abstraction Driven Application and Data Portability in Cloud Computing


Abstract

Cloud computing has changed the way organizations create, manage, and evolve their applications. While many organizations are eager to use the cloud, tempted by substantial cost savings and convenience, the implications of using clouds are not yet well understood. One of the major concerns in cloud adoption is vendor lock-in of applications, caused by the heterogeneity of the numerous cloud service offerings. Vendor-locked applications are difficult, if not impossible, to port from one cloud system to another. This forces cloud service consumers to use undesired or suboptimal solutions and makes it difficult to incorporate the redundancy needed by some organizations for high availability.

Given the current state of the art, supporting multiple cloud systems requires multiple development efforts, so avoiding vendor lock-in is an expensive proposition. In the long run, this problem negatively affects the adoption of cloud technologies.

This dissertation investigates a comprehensive solution to the issue of application lock-in in cloud computing. Our primary principle is the use of carefully designed abstractions in a manner that makes the heterogeneity of the clouds invisible. The first part of this dissertation investigates the development of cloud applications using abstract specifications. Given the domain-specific nature of many cloud workloads, we focused on using Domain Specific Languages (DSLs). We applied DSL-based development techniques to two domains with different characteristics and found that our solution indeed results in significant savings in cost and effort when building portable cloud applications. The second part of this dissertation presents the use of process abstractions for application deployment and management in clouds. Many cloud service consumers are focused on specific application-oriented tasks, so we provide abstractions for the most useful cloud interactions via a middleware layer. Our middleware system, Altocumulus, not only provides independence from the various process differences but also provides the means to reuse known best practices. The success of Altocumulus also influenced a commercial product, the IBM Workload Deployer (http://www-01.ibm.com/software/webservers/workload-deployer/).
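In miniature, the abstraction principle looks like the sketch below: a single declarative application specification, with per-cloud generators that translate it into target-specific deployment artifacts. The spec fields and both targets are invented and far simpler than the dissertation's DSLs; the point is that only the generators know about cloud-specific details.

```python
# Sketch: one abstract app spec, two hypothetical cloud-specific generators.

app_spec = {
    "name": "orders-api",
    "runtime": "python",
    "instances": 2,
    "env": {"DB_URL": "postgres://..."},
}

def to_target_a(spec):
    """Render a manifest-style artifact for hypothetical cloud A."""
    lines = [f"application: {spec['name']}",
             f"runtime: {spec['runtime']}",
             f"scale: {spec['instances']}"]
    lines += [f"env {k}={v}" for k, v in spec["env"].items()]
    return "\n".join(lines)

def to_target_b(spec):
    """Render a CLI-style invocation for hypothetical cloud B."""
    envs = " ".join(f"--env {k}={v}" for k, v in spec["env"].items())
    return (f"cloudb deploy {spec['name']} --runtime {spec['runtime']} "
            f"--replicas {spec['instances']} {envs}")

print(to_target_a(app_spec))
print(to_target_b(app_spec))
```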

Finally, we showcase two publicly hosted Web tools, MobiCloud (http://mobicloud.knoesis.org/) and SCALE (http://metabolink.knoesis.org/SCALE), that encapsulate the abstractions in every step of the application life-cycle. These tools allow domain experts to quickly create applications and deploy them to clouds, irrespective of the target cloud system, highlighting the applicability of our solutions in practice.



Committee Members

Amit P. Sheth, Ph.D. (Advisor)

Keke Chen, Ph.D.

E. Michael Maximilien, Ph.D. (IBM Research)

Krishnaprasad Thirunarayan, Ph.D.

Dr. Satya Sahoo

Semantic Provenance: Modeling, Querying, and Application in Scientific Discovery


Abstract

Provenance metadata, describing the history or lineage of an entity, is essential for ensuring data quality, correctness of process execution, and computing trust values. Traditionally, provenance management issues have been dealt with in the context of workflow or relational database systems. However, existing provenance systems are inadequate to address the requirements of an emerging set of applications in the new eScience or Cyberinfrastructure paradigm and the Semantic Web. Provenance in these applications incorporates complex domain semantics on a large scale with a variety of uses, including accurate interpretation by software agents, trustworthy data integration, reproducibility, attribution for commercial or legal applications, and trust computation. In this dissertation, we introduce the notion of 'semantic provenance' to address these requirements for eScience and Semantic Web applications. In addition, we describe a framework for the management of semantic provenance that addresses three issues: (a) provenance representation, (b) query and analysis, and (c) scalable implementation. First, we introduce a foundational model of provenance called Provenir to serve as an upper-level reference ontology that facilitates provenance interoperability. Second, we define a classification scheme for provenance queries based on query characteristics and use this scheme to define a set of specialized provenance query operators. Third, we describe the implementation of a highly scalable query engine to support the provenance query operators, which uses a new class of materialized views based on the Provenir ontology, called Materialized Provenance Views (MPV), for query optimization. We also define a novel provenance tracking approach called Provenance Context Entity (PaCE) for the Resource Description Framework (RDF) model used in Semantic Web applications. PaCE, defined in terms of the Provenir ontology, is an effective and scalable approach for RDF provenance tracking in comparison to the currently used RDF reification vocabulary. Finally, we describe the application of the semantic provenance framework in biomedical and oceanography research projects.
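A tiny sketch of provenance querying in this spirit: given 'derived from' assertions (a plain-Python stand-in for Provenir-style provenance triples), a query operator transitively collects everything an entity was derived from. The toy workflow below is hypothetical.

```python
# Sketch: transitive lineage query over toy 'derived_from' assertions.
from collections import defaultdict

DERIVED_FROM = [
    ("figure_3", "analysis_output"),
    ("analysis_output", "normalized_data"),
    ("normalized_data", "raw_spectra"),
]

upstream = defaultdict(list)
for derived, source in DERIVED_FROM:
    upstream[derived].append(source)

def provenance(entity):
    """Transitively collect everything `entity` was derived from."""
    lineage, stack = [], [entity]
    while stack:
        for src in upstream[stack.pop()]:
            lineage.append(src)
            stack.append(src)
    return lineage

print(provenance("figure_3"))
# ['analysis_output', 'normalized_data', 'raw_spectra']
```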


Committee Members

Amit P. Sheth, Ph.D. (Advisor)

Olivier Bodenreider, Ph.D.

Michael L. Raymer, Ph.D.

Nicholas V. Reo, Ph.D.

Krishnaprasad Thirunarayan, Ph.D.

William S. York, Ph.D.

Dr. Christopher Thomas

Knowledge Acquisition in a System


Abstract

I present a method for growing the amount of knowledge available on the Web using a hermeneutic method that involves background knowledge, Information Extraction techniques, and validation through discourse and use of the extracted information. I present the metaphor of the “Circle of Knowledge on the Web”. In this context, knowledge acquisition on the Web is seen as analogous to the way scientific disciplines gradually increase the knowledge available in their field. Formal models of interest domains are created automatically or manually and then validated by implicit and explicit validation methods before the statements in the created models can be added to larger knowledge repositories, such as the Linked Open Data cloud. This knowledge is then available for the next iteration of the knowledge acquisition cycle.

I give both a theoretical underpinning and practical methods for the acquisition of knowledge in collaborative systems, covering both the Knowledge Engineering angle and the Information Extraction angle of this problem. Unlike traditional approaches, this dissertation shows how Information Extraction can be incorporated into a mostly Knowledge Engineering-based approach, as well as how an Information Extraction-based approach can make use of engineered concept repositories. Validation is seen as an integral part of this systemic approach to knowledge acquisition.

The centerpiece of the dissertation is a domain model extraction framework that implements the idea of the “Circle of Knowledge” to automatically create semantic models for domains of interest. It splits the involved Information Extraction tasks into Domain Definition, in which pertinent concepts are identified and categorized, and Domain Description, in which facts that describe the extracted concepts are extracted from free text. I then outline a social computing strategy for information validation in order to create knowledge from the extracted models.
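The Definition/Description split can be illustrated in miniature with one extraction pattern per step, as below. The corpus sentences and the two patterns are toy examples, far simpler than the framework's extractors.

```python
# Sketch: Domain Definition (find concepts and categories) followed by
# Domain Description (find facts about known concepts). Toy data only.
import re

CORPUS = [
    "Ebola is a viral hemorrhagic fever.",
    "Ribavirin is an antiviral drug.",
    "Ribavirin inhibits viral replication.",
]

# Domain Definition: "X is a/an Y" sentences yield concept -> category.
DEFINE = re.compile(r"^(\w+) is an? ([\w ]+)\.$")
# Domain Description: "X <relation> Y" sentences yield facts.
DESCRIBE = re.compile(r"^(\w+) (inhibits|causes|treats) ([\w ]+)\.$")

concepts = {}
for sent in CORPUS:
    if m := DEFINE.match(sent):
        concepts[m.group(1)] = m.group(2)

facts = [m.groups() for sent in CORPUS if (m := DESCRIBE.match(sent))]

print(concepts)  # {'Ebola': 'viral hemorrhagic fever', 'Ribavirin': 'antiviral drug'}
print(facts)     # [('Ribavirin', 'inhibits', 'viral replication')]
```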


Committee Members

Amit P. Sheth, Ph.D. (Advisor)

Pascal Hitzler, Ph.D.

Pankaj Mehra, Ph.D.

Shaojun Wang, Ph.D.

Gerhard Weikum, Ph.D.

Dr. Wenbo Wang

Automatic Emotion Identification from Text


Abstract
People's emotions can be gleaned from their text by using machine learning techniques to build models that exploit large amounts of self-labeled emotion data from social media. Further, this self-labeled emotion data can be effectively adapted to train emotion classifiers in different target domains where training data are sparse.

Emotions are both prevalent in and essential to most aspects of our lives. They influence our decision-making, affect our social relationships, and shape our daily behavior. With the rapid growth of emotion-rich textual content, such as microblog posts, blog posts, and forum discussions, there is a growing need for algorithms and techniques that identify people's emotions as expressed in text. This has valuable implications for studies of suicide prevention, employee productivity, personal well-being, customer relationship management, and more. However, emotion identification is quite challenging, partly for the following reasons: i) it is a multi-class classification problem that usually involves at least six basic emotions; moreover, text describing an event or situation that causes an emotion can be devoid of explicit emotion-bearing words, so the distinction between different emotions can be very subtle, making it difficult to glean emotions purely from keywords; ii) manual annotation of emotion data by human experts is very labor-intensive and error-prone; iii) existing labeled emotion datasets are relatively small, failing to provide comprehensive coverage of emotion-triggering events and situations.

This dissertation aims at understanding the emotion identification problem and developing general techniques to tackle the above challenges. First, to address the challenge of fine-grained emotion classification, we investigate a variety of lexical, syntactic, knowledge-based, context-based, and class-specific features, and show how much these features contribute to the performance of machine learning classifiers. We also propose a method that automatically extracts syntactic patterns to build a rule-based classifier that improves the accuracy of identifying minority emotions. Second, to deal with the challenge of manual annotation, we leverage emotion hashtags to harvest Twitter 'big data' and collect millions of self-labeled emotion tweets, whose labeling quality is further improved by filtering heuristics. We discover that the size of the training data plays an important role in the emotion identification task, as it provides comprehensive coverage of different emotion-triggering events and situations. Further, unigram and bigram features alone can achieve performance competitive with the best performance obtained using a combination of n-gram, knowledge-based, and syntactic features. Third, to handle the paucity of labeled emotion datasets in many domains, we exploit the abundant self-labeled tweet collection to improve emotion identification in text from other domains, e.g., blog posts and fairy tales. We propose an effective data selection approach that iteratively selects source data informative about the target domain and uses the selected data to enrich the target-domain training data. Experimental results show that the proposed method outperforms state-of-the-art domain adaptation techniques on datasets from four different domains: blogs, experiences, diaries, and fairy tales.
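The self-labeling step can be sketched briefly: a trailing emotion hashtag becomes the tweet's label and is removed from the text, and an n-gram classifier is trained on the result. The five example posts are invented, and a real run would use millions of tweets plus the filtering heuristics mentioned above.

```python
# Sketch: hashtag self-labeling plus a unigram+bigram classifier.
# Example posts are invented; real training data would be far larger.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

EMOTIONS = {"joy", "sadness", "anger", "fear", "surprise", "love"}
TAG = re.compile(r"#(\w+)\s*$")

def self_label(post):
    """Use a trailing emotion hashtag as the label; drop it from the text."""
    m = TAG.search(post)
    if m and m.group(1).lower() in EMOTIONS:
        return TAG.sub("", post).strip(), m.group(1).lower()
    return None  # discard posts without a trailing emotion hashtag

raw = ["just got the job of my dreams #joy",
       "missing my grandmother so much today #sadness",
       "they cancelled my flight again #anger",
       "aced the exam somehow #joy",
       "alone in the house hearing noises #fear"]
pairs = [p for p in map(self_label, raw) if p]
texts, labels = zip(*pairs)

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["got the promotion today"]))
```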

Finally, we apply the proposed research to analyze cursing, an emotion rich activity, on Twitter. We explore a set of questions that have been recognized as crucial for understanding cursing in offline communications by prior studies, including ubiquity, utility, contextual dependencies, and people factors.
