Open Science Data Cloud PIRE

Providing training in data intensive computing using the Open Science Data Cloud.

AIST 2015 Projects

2015 Projects

Scientists working at AIST in Tsukuba, Japan, conduct research on data mining over large datasets and on data visualization. Below are the proposed projects for summer 2015 fellows.

Visual Social Media Mining

Mentor: Dr. Kyoungsook Kim

Introduction: Social media has recently become an important source of information for understanding diverse aspects of our lives, from personal activities to conditions during disasters. However, it contains a great deal of noise and heterogeneous data, such as time, location, and relationships between people, as well as unstructured content (text, images, and video). In this project, we focus on developing visual mining tools for social media data, especially Twitter. Our group previously developed the Sophy framework (Kim et al. LBSN2014) to present the spatiotemporal proximity of social topics from real-time streaming, geo-tagged tweets. The Sophy framework supports users' efforts to structure, cluster, and find hidden spatiotemporal relationships or patterns in Twitter messages. However, the system lacks user-interaction elements for exploring the data interactively. This project aims to extend the functionality of the Sophy visualization tool, especially its user interactions, so that it encompasses geospatial, temporal, and topic data and enables users to find relationships and patterns among them.
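As a rough illustration of what grouping geo-tagged tweets by space, time, and topic can look like, the sketch below bins tweets into coarse latitude/longitude cells and time windows and counts hashtags per bin. It is not the Sophy framework's method; the record fields (`lat`, `lon`, `ts`, `hashtags`), the function names, and the bin sizes are all assumptions made for this example.

```python
from collections import Counter, defaultdict

def cell_key(lat, lon, ts, cell_deg=0.5, window_s=3600):
    """Map a point in space and time to a coarse (lat, lon, time) bin.

    cell_deg and window_s are illustrative defaults, not values from Sophy.
    """
    return (int(lat // cell_deg), int(lon // cell_deg), int(ts // window_s))

def hashtags_by_cell(tweets, cell_deg=0.5, window_s=3600):
    """Count lower-cased hashtags per spatiotemporal bin."""
    bins = defaultdict(Counter)
    for tweet in tweets:
        key = cell_key(tweet["lat"], tweet["lon"], tweet["ts"], cell_deg, window_s)
        for tag in tweet["hashtags"]:
            bins[key][tag.lower()] += 1
    return bins

# Hypothetical records; "ts" is seconds since an arbitrary start time.
tweets = [
    {"lat": 36.08, "lon": 140.11, "ts": 1000, "hashtags": ["Tsukuba", "rain"]},
    {"lat": 36.20, "lon": 140.30, "ts": 2000, "hashtags": ["rain"]},
    {"lat": 35.68, "lon": 139.76, "ts": 1500, "hashtags": ["Tokyo"]},
]

bins = hashtags_by_cell(tweets)
for key, counts in sorted(bins.items()):
    print(key, counts.most_common(3))
```

Per-bin counts of this kind are the sort of structure an interactive map-and-timeline view could be built on, for example by letting the user brush a time range and see the dominant hashtags per cell.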

Skills desired: We are looking for a talented student who is interested in international collaborations and projects with real-world applications. The student ideally will have the following skills:

  • Familiarity with/interest in agile development
  • Experience with R and JavaScript (especially d3.js and three.js), or general programming experience
  • Service orchestration with RESTful web services
  • Interest in user-interface design

Multi-scale Functionality for HYDRA Visualization Interface

Mentor: Dr. Jason Haga

Introduction: Multiple disciplines are showing increased interest in the rapid growth of data. The biomedical sciences in particular are discovering new ways of leveraging technologies to access, visualize, and analyze biomedical data. Our group previously created HYDRA, an easy-to-use, scalable, high-throughput visualization tool for drug discovery (Zhao and Haga, Supercomputing 14, Accessed online 5 Jan. 2015). This HTML5/JavaScript application replicates the functionality of ViewDock TDW (Lau et al., Bioinformatics, 26(15), p. 1915, 2010), but with an intuitive interface and greater platform independence. One area for improvement is HYDRA's ability to connect to multiple chemical databases, as well as to publications and patent information. This would give users access to additional information and help them decide whether to pursue a particular discovery. By leveraging available open data sources, this project will facilitate the analysis of chemical interactions and the overall process of drug discovery.

Objective: To create “multi-scale” visualization functionality for drug discovery by extracting and integrating data from multiple databases into a convenient access point on the HYDRA interface. The goal for this PIRE project is to have a working prototype version by the end of the internship.

Skills desired: We are looking for a talented student who is interested in collaborating with groups on an international scale and who wants to participate in a project with real-world applications. The student ideally will have:

  • Familiarity with/interest in agile development
  • HTML5/JavaScript experience or general experience with scripting languages
  • Experience with database structures and information extraction
  • Interest in user-interface design

Analysis of Linked Open Data to support Distributed Query Processing

Mentor: Steven Lynden

Introduction: The growing amount of RDF-based Linked Open Data [] (LOD) has created a need for efficient search and query tools to provide the basis for emerging Semantic Web applications. Although indexes such as Sindice [] support efficient query answering over cached parts of the Semantic Web, approaches based on distributed querying can provide more complete results by directly accessing the sources from which LOD is published, and they have the potential to maximize Information Retrieval (IR)-related attributes of query results, such as freshness and diversity. At AIST we have developed several software tools for distributed RDF query processing; however, they rely on metadata and statistics about LOD data sources to optimize query processing. More accurate and up-to-date metadata, together with insightful observations about the current state of LOD on the Web, can help us improve this software.

Objective: Analyze LOD on the Web with respect to various characteristics, such as how dynamic and inter-linked it is. Starting with LOD corpora such as the Billion Triple Challenge data set [], the two goals of the research are:

  1. Select, retrieve and analyze a portion of this data focusing on various IR-related characteristics of LOD (e.g. freshness, diversity). The goal is to be able to identify which parts of the LOD cloud [] are more dynamic, which parts are more inter-linked, in addition to other insights that may be gathered from the data.

  2. Produce a summary of the data, with the goal of presenting interesting statistics, insights and trends to Semantic Web researchers similar to the one that can be found here []; there is scope for producing a more interactive summary of the relevant aspects, for example using the Google Visualisation API or other tools.
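One simple way to approach the inter-linking question in the first goal is to count, for each source, how many links point outside that source. The sketch below is a minimal illustration under several assumptions: the corpus is available as N-Quads-style lines, the fourth term identifies the source document, and "inter-linked" is approximated by comparing host names. The function names and the sample data are hypothetical, and a real analysis would likely use a proper RDF parser rather than a regular expression.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Matches lines whose subject, predicate, object, and graph are all IRIs.
# Lines with literal objects or blank nodes are skipped on purpose, since
# only IRI-to-IRI links count toward inter-linking in this sketch.
QUAD_RE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$')

def host(iri):
    """Return the lower-cased host part of an IRI."""
    return urlparse(iri).netloc.lower()

def interlinking(lines):
    """Return {source_host: (external_links, total_iri_links)}."""
    total = Counter()
    external = Counter()
    for line in lines:
        match = QUAD_RE.match(line.strip())
        if not match:
            continue
        _subj, _pred, obj, graph = match.groups()
        src = host(graph)
        total[src] += 1
        if host(obj) != src:
            external[src] += 1
    return {h: (external[h], total[h]) for h in total}

# Hypothetical sample data using reserved example domains.
sample = [
    '<http://a.example/x> <http://xmlns.com/foaf/0.1/knows> <http://b.example/y> <http://a.example/doc> .',
    '<http://a.example/x> <http://xmlns.com/foaf/0.1/knows> <http://a.example/z> <http://a.example/doc> .',
    '<http://b.example/y> <http://xmlns.com/foaf/0.1/name> "Y" <http://b.example/doc> .',
]

stats = interlinking(sample)
for src, (ext, tot) in sorted(stats.items()):
    print(f"{src}: {ext}/{tot} links point to other hosts")
```

Because the function accepts any iterable of lines, it can be fed a file object directly and never needs the whole corpus in memory. Measuring how dynamic a source is would additionally require comparing snapshots taken at different times, which this sketch does not attempt.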

Skills desired: We desire a student who is interested in international collaborations and real-world projects. The student should have:

  • An interest in the Semantic Web and large-scale data mining/analysis
  • The ability to solve practical issues when dealing with Big Data
  • The ability to produce an attractive visual summary of the data, for example using HTML5 and JavaScript