University of Edinburgh 2015 Projects
The University of Edinburgh’s Data Intensive Research group has several possible
projects for student PIRE fellows. All of them use dispel4py, a Python library for
describing abstract workflows for distributed data-intensive applications.
The dispel4py system (dispel4py.org) is a new framework for describing abstract stream-based workflows for distributed data-intensive applications, developed as part of the VERCE European Project (verce.eu). The aim of dispel4py is to let scientists focus on their data analysis and computation instead of being distracted by the details of the computing infrastructure they use. With the dispel4py implementation, users can run a workflow on their own laptops or draw on the power of a wide range of distributed computing infrastructures. See www2.epcc.ed.ac.uk/~amrey/VERCE/Dispel4Py.
We invite others to help us by taking on one of the following dispel4py projects.
Monitoring data-intensive workflows with dispel4py
In this project we are interested in implementing an ‘observer’ that can
monitor and record the performance of a dispel4py workflow at run time.
The information collected by the ‘observer’ could be used to learn about the
workflows and the systems on which they were executed, in order to predict the
performance of future workflows.
There are several research questions that could be addressed with the profiling analysis, for example:
- Could we extract information from the performance analysis to predict the performance of an application?
- Which information will best support those predictions?
- Where are the bottlenecks and could we ameliorate them?
- In which layer should the profiling toolkit be implemented?
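As a rough illustration of what such an ‘observer’ might record, the sketch below times each step of a toy pipeline in plain Python. It does not use the dispel4py API; the names Observer, watch, report and run_pipeline are hypothetical and exist only for this example.

```python
import time
from collections import defaultdict


class Observer:
    """Collects per-step item counts and cumulative processing time."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.seconds = defaultdict(float)

    def watch(self, name, func):
        """Wrap a one-item processing function so that each call is timed."""
        def timed(item):
            start = time.perf_counter()
            try:
                return func(item)
            finally:
                self.counts[name] += 1
                self.seconds[name] += time.perf_counter() - start
        return timed

    def report(self):
        """Return {step name: (items processed, mean seconds per item)}."""
        return {name: (n, self.seconds[name] / n)
                for name, n in self.counts.items()}


def run_pipeline(source, steps):
    """Push each item from the source through the steps, in order."""
    results = []
    for item in source:
        for step in steps:
            item = step(item)
        results.append(item)
    return results


# Hypothetical two-step workflow: square each value, then add one.
observer = Observer()
steps = [
    observer.watch("square", lambda x: x * x),
    observer.watch("increment", lambda x: x + 1),
]
results = run_pipeline(range(5), steps)
stats = observer.report()
```

Wrapping the step functions is only one possible placement. The same measurements could instead be taken inside the workflow engine or at the communication layer, which is the design choice the last question above asks about.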
Exploiting the VarPy library with dispel4py
VarPy is a new Python library (bitbucket.org/effort/varpy) for volcanic and rock
physics data analysis. Its aim is to facilitate rapid application development
for those communities. Users can develop their own computational models,
analyses and visualization routines with VarPy or use the ones that are already
Combining dispel4py with the VarPy library will make it possible to run users’ models for long periods, allowing live analysis of seismicity and other data streamed directly from volcano observatories.
Exploring astrophysics data and models with dispel4py
The existing dispel4py system would benefit from being applied in new application domains and from having its performance measured against established benchmarks. Astrophysics observations that handle streams of data, such as the aperture synthesis undertaken by the multi-antenna radio telescopes LOFAR (www.lofar.org) and SKA (www.skatelescope.org), are good examples of data-intensive challenges. Sample data, synthetic data streams, and the required algorithms are available. The challenge is to evaluate how well dispel4py serves to organize such data analyses and how scalable the resulting system is. Alternative application domains or established benchmarks can also be tried in order to push dispel4py to higher standards. Visiting OSDC fellows can propose their own ideas or tackle a standard benchmark such as:
- Hadoop and MapReduce:
Implementing asynchronous communications in dispel4py
The stream-handling communications within dispel4py use synchronous communication protocols. However, asynchronous communication services, such as RabbitMQ (www.rabbitmq.com) or ZeroMQ (zeromq.org), offer more flexible communication because (a) buffering and discarding are handled automatically by these services and (b) interest in transmitted streams can be changed dynamically. This flexibility may be used for diagnostic observations, visualizations of processing, and user controls. Lightweight dispel4py processing elements (PEs) would be developed to explore and evaluate these possibilities. Further details and suggestions for useful PEs are available for any OSDC fellow who is considering this project.
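The two properties named above, automatic buffering with discarding and dynamically changing interest in a stream, can be illustrated with a small in-process toy in plain Python. It uses neither RabbitMQ nor ZeroMQ, nor the dispel4py API; the Broker class, its methods, and the stream name "seismic" are assumptions made only for this sketch.

```python
from collections import deque


class Broker:
    """Toy in-process broker: publishers never wait for slow consumers."""

    def __init__(self):
        # stream name -> {subscriber name: bounded buffer}
        self._subscribers = {}

    def subscribe(self, stream, subscriber, buffer_size=3):
        """Register interest in a stream and return the subscriber's buffer."""
        # A deque with maxlen silently discards its oldest item when full.
        buffer = deque(maxlen=buffer_size)
        self._subscribers.setdefault(stream, {})[subscriber] = buffer
        return buffer

    def unsubscribe(self, stream, subscriber):
        """Withdraw interest; later items are no longer delivered."""
        self._subscribers.get(stream, {}).pop(subscriber, None)

    def publish(self, stream, item):
        """Append the item to every current subscriber's buffer and return."""
        for buffer in self._subscribers.get(stream, {}).values():
            buffer.append(item)


broker = Broker()

# A monitoring consumer that keeps only the three most recent samples.
monitor = broker.subscribe("seismic", "monitor", buffer_size=3)
for sample in range(5):
    broker.publish("seismic", sample)

# A second consumer joins while the stream is already running.
plotter = broker.subscribe("seismic", "plotter", buffer_size=10)
broker.publish("seismic", 5)

# The monitor loses interest; only the plotter receives the next sample.
broker.unsubscribe("seismic", "monitor")
broker.publish("seismic", 6)
```

In a real deployment the buffers would live in the messaging service rather than in the publisher's process, and a processing element would act as the publisher or subscriber.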