AIST 2015 Projects
Scientists working at AIST in Tsukuba, Japan are involved with research in data mining over large datasets and data visualization. Below are proposed projects for summer 2015 fellows.
Visual Social Media Mining
Mentor: Dr. Kyoungsook Kim
Introduction:
Recently, social media has become an important source of information for understanding
diverse aspects of our lives, from personal activities to situations in disasters. However, it
contains a lot of noise and heterogeneous data such as time, location, and relationships between
people, as well as unstructured content (text/images/videos). In this project, we are focusing on
the development of visual mining tools using social media data, especially from Twitter. Our
group previously developed the Sophy framework (Kim et al. LBSN2014) to present the
spatiotemporal proximity of social topics from real-time streaming, geo-tagged tweets. The
Sophy framework supports users' efforts to structure, cluster and find hidden spatiotemporal
relationships or patterns in Twitter messages. However, the system lacks interactive
elements that let users explore the data.
This project aims to extend the functionality, especially the user interactions, of the Sophy
visualization tool, which encompasses geospatial, time and topic data and enables finding
relationships and patterns among the data.
Skills Desired:
We are looking for a talented student who is interested in international collaborations and
projects with real-world application. The student ideally will have the following skills:
- Familiarity with/interest in agile development
- Experience with R and JavaScript, especially d3.js and three.js, or comparable general experience
- Service orchestration with RESTful web services
- Interest in user-interface design
Multi-scale Functionality for HYDRA Visualization Interface
Mentor: Dr. Jason Haga
Introduction: Multiple disciplines are showing increased interest in the rapid growth of data. The
biomedical sciences in particular are discovering new ways of leveraging technologies to access,
visualize, and analyze biomedical data. Our group previously created easy-to-use, scalable,
high-throughput visualization software for drug discovery called HYDRA (Zhao and Haga,
Supercomputing 14, Accessed online 5 Jan. 2015). This HTML5/JavaScript application replicates
the functionality of ViewDock TDW (Lau et al., Bioinformatics, 26(15), p 1915, 2010), but with
an intuitive interface and greater platform independence. One area to be improved is the ability
of HYDRA to connect to multiple chemical databases, as well as publications and patent
information. This would give the user access to additional information and help them
decide whether to pursue a particular discovery. By leveraging available open data sources, this
project will facilitate the analysis of chemical interactions and the overall process of drug discovery.
Objective:
To create “multi-scale” visualization functionality for drug discovery by extracting and
integrating data from multiple databases into a convenient access point on the HYDRA interface.
The goal for this PIRE project is to have a working prototype version by the end of the
internship.
Skills desired: We are looking for a talented student who is interested in collaborating with groups on an
international scale and who wants to participate in a project with real-world application. The student ideally will have:
- Familiarity with/interest in agile development
- HTML5/JavaScript experience or general experience with scripting languages
- Experience with database structures and information extraction
- Interest in user-interface design
Analysis of Linked Open Data to support Distributed Query Processing
Mentor: Steven Lynden
Introduction:
The growing amount of RDF-based Linked Open Data [http://linkeddata.org] (LOD) has created a need
for efficient search and query tools to provide the basis for emerging Semantic Web applications.
Although indexes such as Sindice [http://sindice.com] support efficient query answering
over cached parts of the Semantic Web, distributed querying-based approaches can provide more
complete results by directly accessing the sources from which LOD is published, while having the
potential to maximize Information Retrieval (IR)-related attributes such as freshness and diversity
within query results. At AIST we have developed various software tools for distributed RDF query
processing; however, they rely on metadata/statistics about LOD data sources to optimize query
processing. Providing more accurate and up-to-date metadata, in addition to insightful observations
about the current state of LOD on the Web, can help us to improve such software.
Objective:
Analyze LOD on the Web with respect to various characteristics such as how dynamic and
inter-linked it is. Starting with LOD corpora such as the Billion Triple Challenge data set
[http://km.aifb.kit.edu/projects/btc-2014/], the two goals of the research are:
- Select, retrieve and analyze a portion of this data focusing on various IR-related characteristics of LOD (e.g. freshness, diversity). The goal is to be able to identify which parts of the LOD cloud [http://lod-cloud.net/] are more dynamic, which parts are more inter-linked, in addition to other insights that may be gathered from the data.
- Produce a summary of the data, with the goal of presenting interesting statistics, insights and trends to Semantic Web researchers, similar to the summary found at [http://gromgull.net/blog/category/semantic-web/billion-triple-challenge/]. There is scope for producing a more interactive summary of the relevant aspects, for example using the Google Visualisation API or other tools.
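As a rough illustration of the inter-linkage part of the first goal, the following minimal Python sketch counts, for each publishing host, how many IRI-to-IRI statements stay within that host and how many point to a different host. It is only a sketch under stated assumptions, not the group's method: it assumes the corpus is available as N-Quads text lines, it uses the host name of the subject IRI as a crude stand-in for the data source, it skips statements with literal or blank-node terms, and the names link_stats and external_ratio are hypothetical. It does not address freshness or how dynamic the data is.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Simplified pattern: subject, predicate and object must all be IRIs.
# An optional fourth IRI (the graph label in N-Quads) is tolerated.
# Statements with literals or blank nodes do not match and are skipped.
IRI_STATEMENT = re.compile(
    r'^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>(?:\s+<[^>]*>)?\s*\.\s*$'
)


def host(iri):
    """Return the lowercase host name of an IRI, or '' if it has none."""
    return (urlparse(iri).hostname or "").lower()


def link_stats(lines):
    """Count IRI-to-IRI links per subject host, split into internal/external."""
    internal, external = Counter(), Counter()
    for line in lines:
        match = IRI_STATEMENT.match(line)
        if not match:
            continue
        subject, _predicate, obj = match.groups()
        src, dst = host(subject), host(obj)
        if not src or not dst:
            continue
        if src == dst:
            internal[src] += 1
        else:
            external[src] += 1
    return internal, external


def external_ratio(internal, external):
    """Fraction of each host's counted links that point to another host."""
    hosts = set(internal) | set(external)
    return {h: external[h] / (internal[h] + external[h]) for h in hosts}


if __name__ == "__main__":
    # Made-up example statements; the hosts are placeholders, not real sources.
    sample = [
        '<http://a.example/x> <http://purl.example/p> <http://a.example/y> <http://a.example/doc> .',
        '<http://a.example/x> <http://purl.example/p> <http://b.example/z> <http://a.example/doc> .',
        '<http://b.example/z> <http://purl.example/label> "Z" <http://b.example/doc> .',
    ]
    inner, outer = link_stats(sample)
    for name, ratio in sorted(external_ratio(inner, outer).items()):
        print(name, ratio)
```

Because link_stats accepts any iterable of lines, an open file handle can be passed directly, which allows a large dump to be processed in one streaming pass. Grouping by registered domain rather than by host name would need a public suffix list, which this sketch leaves out.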
Skills desired: We desire a student who is interested in international collaborations and real-world projects. The student should have:
- An interest in the Semantic Web and large-scale data mining/analysis
- The ability to solve practical issues when dealing with Big Data
- The ability to produce an attractive visual summary of the data, for example using HTML5 and JavaScript