Recent advances in computing, communication, and digital storage technologies have made enormous volumes of data accessible remotely across geographical and administrative boundaries. There is an increasing demand for summarizing, understanding, monitoring, learning from, and collaboratively mining large, evolving, and possibly private data stores. In the DIAC lab, we study research problems and applications related to such large datasets.
Data clouds, consisting of hundreds or thousands of cheap multi-core PCs and disks, are available for rent at low cost (e.g., Amazon EC2 and S3 services). Many cloud-based applications generate large amounts of data in the cloud, which need to be processed with cloud-based data analytics tools. Powered by a distributed file system, such as the Hadoop Distributed File System, and the MapReduce programming model, the cloud becomes an economical and scalable platform for large-scale data analytics. We study a visual cluster exploration framework (CloudVista) for analyzing large data hosted in the cloud, and a cost model for resource-aware cloud computing.
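The MapReduce model mentioned above can be illustrated with a small in-process sketch. This is a hypothetical single-machine simulation of the map, shuffle, and reduce phases (a real job would run distributed, e.g., on Hadoop); the function names are our own illustrative choices, not any framework's API.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key's group of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# The classic word-count job expressed in this model.
def wc_mapper(line):
    for word in line.split():
        yield word, 1

def wc_reducer(word, counts):
    return sum(counts)

lines = ["big data in the cloud", "data analytics in the cloud"]
counts = reduce_phase(shuffle(map_phase(lines, wc_mapper)), wc_reducer)
```

The same mapper and reducer could be handed to a cluster framework unchanged; only the shuffle and scheduling are replaced by distributed machinery, which is what makes the model scale economically.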
Large datasets are also characterized by high complexity and uncertainty, and clustering is an effective tool for understanding both. In the DIAC lab, we investigate novel techniques that combine visual analytics and statistical analysis to help better understand the clustering patterns in large datasets. In particular, we are interested in visually exploring and validating clustering patterns in large multi-dimensional datasets (VISTA, iVIBRATE), finding the optimal number of clusters in categorical (ACE and BestK) and transactional datasets (Weighted Coverage Density and DMDI), and monitoring changes of clustering patterns in categorical data streams (CatStream).
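To give a flavor of the "optimal number of clusters" problem, here is a minimal toy sketch that picks k by the largest drop in within-cluster sum of squares (the "elbow" heuristic) over a tiny pure-Python k-means. This is only an illustration on numeric data; ACE and BestK use entropy-based criteria designed for categorical data, which this sketch does not implement.

```python
import random

def kmeans_1d(points, k, iters=50, seed=0):
    """Toy Lloyd's k-means on 1-D points; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[nearest].append(p)
        # Recompute each center as its cluster mean; keep old center if empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

def wcss(centers, clusters):
    """Within-cluster sum of squared distances to the cluster centers."""
    return sum((p - c) ** 2 for c, cl in zip(centers, clusters) for p in cl)

def best_k(points, k_max=6):
    """Choose k at the largest drop in WCSS (the elbow heuristic)."""
    scores = {k: wcss(*kmeans_1d(points, k)) for k in range(1, k_max + 1)}
    drops = {k: scores[k - 1] - scores[k] for k in range(2, k_max + 1)}
    return max(drops, key=drops.get)

pts = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # two obvious groups
chosen = best_k(pts)
```

For two well-separated groups, the WCSS collapses when k reaches 2 and flattens afterward, so the heuristic selects k = 2; validity criteria for categorical and transactional data replace squared distance with measures like entropy or coverage density.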
For large-scale, complicated learning problems, it is very expensive to collect a sufficient amount of labeled training data. Learning to rank in web search is one such problem. There are multiple ways to extend the training dataset, such as leveraging large amounts of unlabeled data (i.e., semi-supervised learning), or searching over the unlabeled data to find the most effective candidate examples for labeling (i.e., active learning). In learning to rank, we study novel strategies for enhancing the training data. Concretely, we develop new algorithms to utilize pairwise preference training data mined from implicit user feedback (GBRank), to adapt a model trained with a small amount of labeled data to the pairwise preference data (ClickAdapt), and to adapt a ranking function trained on one search domain to another (Tree Adaptation, or Trada). Recent developments include understanding the effectiveness of Tree Adaptation for ranking and tree adaptation methods for pairwise data.
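The idea of training on pairwise preferences rather than absolute labels can be sketched as follows. This is a hypothetical minimal example using a linear scorer with a logistic loss on preference pairs; it is not GBRank itself, which fits gradient-boosted regression trees to the pairwise data, and the feature vectors are made up for illustration.

```python
import math

def score(w, x):
    """Linear ranking score: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_pairwise(pairs, dim, lr=0.1, epochs=200):
    """Gradient descent on log(1 + exp(-(s_pref - s_other))) per pair.

    pairs: list of (x_pref, x_other), where x_pref was preferred
    (e.g., clicked) and should receive the higher score.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for x_pref, x_other in pairs:
            margin = score(w, x_pref) - score(w, x_other)
            # Gradient of the logistic pairwise loss w.r.t. the margin.
            g = -1.0 / (1.0 + math.exp(margin))
            for i in range(dim):
                w[i] -= lr * g * (x_pref[i] - x_other[i])
    return w

# Hypothetical preference pairs mined from implicit click feedback.
pairs = [([3.0, 1.0], [1.0, 2.0]),
         ([2.0, 0.5], [0.5, 1.5])]
w = train_pairwise(pairs, dim=2)
```

The appeal of the pairwise formulation is that such pairs can be mined cheaply from click logs, so the training set grows without manual labeling; the adaptation methods above then transfer a model trained this way across label types or search domains.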