University of Amsterdam: 2015 Projects

Below are possible projects for 2015 summer PIRE fellows at the University of Amsterdam's Informatics Institute. The projects cover service workflows, big data transfer, deep learning, and software-defined networking (SDN).

Service workflow comparison in the Open Science Data Cloud

Contacts: Zhiming Zhao and Adam Belloum

Orchestration and choreography are two ways of coordinating service workflows. In service choreography, the logic of the message-based interactions among the participants is specified from a global perspective; in service orchestration, it is specified from the local point of view of a single participant. With orchestration, a centralized engine coordinates the execution sequence of the services, while with choreography the execution sequence is realised implicitly through the messages exchanged among the services. Each approach has its advantages and disadvantages. This PIRE project will compare the two coordination methods using existing workflow systems, with the OSDC as the test bed; a minimal sketch of the distinction follows the list below. In this project, the student will:

  1. select test workflow systems for both orchestration and choreography,
  2. select data use cases from the OSDC, and
  3. benchmark and compare the performance of the two coordination types.
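
To make the distinction concrete, here is a minimal Python sketch; the three "services" are hypothetical placeholders, not actual OSDC services or a real workflow engine. A central function sequences the services in the orchestration case, while in the choreography case each service reacts to a message and publishes the next one.

```python
# Placeholder services standing in for real workflow steps.
def fetch(data):      return data + ["fetched"]
def transform(data):  return data + ["transformed"]
def store(data):      return data + ["stored"]

# Orchestration: a central engine owns the execution sequence.
def orchestrate(data):
    for service in (fetch, transform, store):
        data = service(data)
    return data

# Choreography: no central engine; each service subscribes to an event
# and publishes the next one, so the sequence is implicit in the messages.
subscribers = {}

def subscribe(event, handler):
    subscribers.setdefault(event, []).append(handler)

def publish(event, data):
    for handler in subscribers.get(event, []):
        handler(data)

subscribe("start",       lambda d: publish("fetched",     fetch(d)))
subscribe("fetched",     lambda d: publish("transformed", transform(d)))
subscribe("transformed", lambda d: print("done:", store(d)))

print("done:", orchestrate([]))   # centralized sequencing
publish("start", [])              # message-driven sequencing
```

A real benchmark would measure how the two styles behave as the number of services and the volume of exchanged data grow, which is where the OSDC use cases come in.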

On demand network adaptation for large data transfers

Contact: Miroslav Zivkovic

One of the major requirements for Big Data systems is the ability to efficiently process (analyze) large amounts of both structured and unstructured data. To achieve speed and efficiency, the algorithms used for Big Data analytics are usually parallelized and distributed over clusters of hundreds of servers connected via high-speed (Ethernet) networks. Data processing speed, one of the biggest bottlenecks for Big Data, can therefore only be as fast as the network's capability to transfer data between servers in the different phases of the analysis. A recent study of Facebook traces [1] shows that the data transfer between successive stages may account for more than 50% of job completion time.

In this project we aim to investigate an intelligent network that adaptively scales during each stage of a Big Data analysis to match the bandwidth and processing requirements of the data transfers. This should improve both data processing time and the overall utilization of the system. SDN is an ideal candidate for building such an intelligent adaptive network: it can be used to configure the network on demand to the optimal size and shape for the computing servers. Because the SDN controller has an overview of the underlying network (e.g., network utilization), a solution built on it can accurately translate the needs of the Big Data analysis into network configurations by programming the network on demand.
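
As a rough illustration, the sketch below drives a hypothetical northbound REST API on the controller from the application side. The URL, endpoints, and JSON fields are invented for this example; real controllers such as OpenDaylight expose comparable but differently named APIs.

```python
import requests  # the controller URL and endpoints below are hypothetical

CONTROLLER = "http://sdn-controller.example.org:8181"

def provision_path(src, dst, bandwidth_mbps):
    """Ask the controller's (hypothetical) northbound API to reserve
    bandwidth between two servers for the upcoming transfer stage."""
    resp = requests.post(
        f"{CONTROLLER}/paths",
        json={"src": src, "dst": dst, "bandwidth_mbps": bandwidth_mbps},
    )
    resp.raise_for_status()
    return resp.json()["path_id"]

def release_path(path_id):
    """Release the reservation once the stage's transfer is done."""
    requests.delete(f"{CONTROLLER}/paths/{path_id}").raise_for_status()

# Scale the network up only for the shuffle-heavy stage, then release it.
stages = [("map", 100), ("shuffle", 1000), ("reduce", 100)]
for name, mbps in stages:
    path_id = provision_path("server-a", "server-b", mbps)
    # ... run this stage's data transfer here ...
    release_path(path_id)
```

The point of the sketch is the shape of the interaction: the analysis framework declares its per-stage bandwidth needs, and the controller, which sees the whole topology, decides how to satisfy them.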

Project content

  • literature review
  • SDN capabilities inventory
  • requirements scoping to specify the initial architecture and design of the enhanced, intelligent SDN controller

The ideal candidate should have expertise in at least one of the following: networking, SDN, performance engineering, software engineering.

Deep learning for image classification

Given an image of tumor tissue, we want to diagnose the tissue as healthy or sick. We can apply classification models to the data, such as convolutional neural networks, but we may not have enough data. Scientists have built simulators that create simulated cancer images with properties similar to the real ones. Can we use such a simulator together with deep learning methods to generate images and automatically extract features to improve the classifier? In this project, the student will (a sketch of the classifier follows the list):

  1. Gather cancer image data.
  2. Find a simulator appropriate to the problem.
  3. Design methods to automatically extract features for cancer diagnosis and classify tumors using both the original data and the simulated data. This may involve iteratively estimating parameters of the images, feeding them into the simulator, and repeating with the newly generated images.
  4. Evaluate the methods and assess the incremental value of the simulator-based features.
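
A minimal PyTorch sketch of the classifier in step 3, training a small convolutional network on a mix of real and simulator-generated images. The architecture, image size, and labels are illustrative assumptions, and both datasets are stubbed with random tensors rather than actual cancer images.

```python
import torch
import torch.nn as nn

class TumorCNN(nn.Module):
    """Small convolutional classifier for 1x64x64 grayscale images."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, 2)  # healthy vs. sick

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

# Stand-ins for the real and simulated datasets (random placeholders;
# in the project these would come from the image archive and the simulator).
real_x, real_y = torch.randn(64, 1, 64, 64), torch.randint(0, 2, (64,))
sim_x,  sim_y  = torch.randn(256, 1, 64, 64), torch.randint(0, 2, (256,))
x = torch.cat([real_x, sim_x])
y = torch.cat([real_y, sim_y])

model = TumorCNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(5):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```

Step 4 would then compare a model trained on the real data alone against one trained on the mixed set, holding out real images for evaluation so the incremental value of the simulated data can be measured.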
Reference

[1] Mosharaf Chowdhury, Matei Zaharia, Justin Ma, Michael I. Jordan, and Ion Stoica. Managing Data Transfers in Computer Clusters with Orchestra. In Proceedings of the ACM SIGCOMM 2011 Conference, pp. 98–109, 2011.