Open Science Data Cloud PIRE

Providing training in data intensive computing using the Open Science Data Cloud.

The objective of the following tutorial is for you to be able to develop “solutions” for data-intensive applications by becoming familiar with RADICAL-Cybertools – SAGA and Pilot. In particular, you’ll be able to develop and execute task-level parallel applications on different infrastructure (of your choice).

As discussed in lecture today, many data-intensive analytics can be constructed with simple extensions to “uncoupled multiple tasks”, i.e., a bag of tasks. I hope you recall the K-means application as a representative example: once you have the ability to marshal a number of tasks concurrently, that ability can be extended. For the rest of this “tutorial” we’ll focus on the simple/base case.
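To make the bag-of-tasks idea concrete, here is a minimal sketch using only the Python standard library, not RADICAL-Pilot. It assumes a toy one-dimensional K-means in which each iteration is a bag of uncoupled tasks (one per data chunk) followed by a small combine step. The names `assign_chunk` and `kmeans_1d` are made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def assign_chunk(chunk, centroids):
    """One uncoupled task: partial sums and counts per centroid for one chunk."""
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for x in chunk:
        # Index of the centroid closest to x (ties go to the lower index).
        nearest = min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
        sums[nearest] += x
        counts[nearest] += 1
    return sums, counts


def kmeans_1d(chunks, centroids, iterations=10, max_workers=4):
    """Each iteration runs a bag of tasks, then combines their partial results."""
    centroids = list(centroids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _ in range(iterations):
            # The tasks share no state, so they can run concurrently.
            partials = list(
                pool.map(assign_chunk, chunks, [centroids] * len(chunks))
            )
            # Combine step: the only point where the tasks are coupled.
            for k in range(len(centroids)):
                total = sum(p[0][k] for p in partials)
                count = sum(p[1][k] for p in partials)
                if count:  # an empty cluster keeps its previous centroid
                    centroids[k] = total / count
    return centroids


if __name__ == "__main__":
    data_chunks = [[1.0, 2.0], [3.0, 10.0], [11.0, 12.0]]
    print(kmeans_1d(data_chunks, [0.0, 15.0], iterations=3))
```

The thread pool only stands in for whatever actually runs the tasks; with a pilot, each call to `assign_chunk` would instead be a separate task submitted to the pilot.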

A starting point is the RADICAL-Cybertools landing page. From there you can get to the RADICAL-Pilot and RADICAL-SAGA websites. In particular you need to access the tutorials for RADICAL-SAGA and RADICAL-Pilot.


See the SAGA-python documentation.

Everyone should have installed SAGA on das4 [1]. Assuming that is the case, please focus on Sections 2.1-2.5 (i.e., the Tutorial). Be sure to understand how the Mandelbrot example submits 4 different jobs; you might consider changing the endpoints to different clusters.
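As a rough picture of how one image can turn into 4 independent jobs, the sketch below builds one job specification per tile. It assumes a 2x2 tiling of a 2048x2048 image and assigns endpoints round-robin. The tile layout, output file names, and endpoint URLs are assumptions made for illustration, not values taken from the tutorial. No SAGA calls are made here; each dictionary would be turned into one SAGA job description.

```python
def tile_jobs(endpoints, img_x=2048, img_y=2048, tiles_x=2, tiles_y=2):
    """Return one job specification per image tile.

    `endpoints` is a list of job-service URLs; tiles are assigned to
    them round-robin, so 4 tiles over 2 endpoints gives 2 jobs each.
    """
    jobs = []
    for x in range(tiles_x):
        for y in range(tiles_y):
            index = len(jobs)
            jobs.append({
                "endpoint": endpoints[index % len(endpoints)],
                "output": "tile_x%d_y%d.gif" % (x, y),  # hypothetical name
                # Full image size, then the pixel bounds of this tile.
                "arguments": [
                    img_x, img_y,
                    img_x // tiles_x * x, img_x // tiles_x * (x + 1),
                    img_y // tiles_y * y, img_y // tiles_y * (y + 1),
                ],
            })
    return jobs


if __name__ == "__main__":
    # Hypothetical endpoints; replace them with clusters you can reach.
    for job in tile_jobs(["ssh://cluster-a.example.org",
                          "ssh://cluster-b.example.org"]):
        print(job)
```

Because the tiles do not depend on each other, changing the `endpoints` list is all it takes to spread the same 4 jobs over different clusters.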

Furthermore, as an example of the interoperability that SAGA provides, as well as an example of running Hadoop on an HPC system, here are instructions on how to run Hadoop (and a simple word count example) on DAS4 machines.


See the RADICAL-Pilot documentation.

Similarly, everyone should have access to RADICAL-Pilot [2]. Assuming that is the case, please focus on Section 3 (3.1-3.3). You might also want to work through Section 4 (for completeness and for greater understanding).

After you have done the Pilot tutorials, you’ll be a step closer to reproducing the graphs [Figure 2] in the paper A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures.

K-means exercise

For those of you who would like to try the K-means exercise, the code is in the following directory, which you can clone from GitHub with these commands:

     git clone
     cd radical.pilot/examples/tutorial/ 

You will find two kmeans:


You will probably want to think about how best to execute multiple K-means tasks. For example, for the given data set and a pilot of 64 cores, would you be best served by running 8 K-means jobs of 8 cores each, or would you be better off running 64 K-means jobs of 1 core each?
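One way to start reasoning about this trade-off before measuring anything is a back-of-the-envelope model. The sketch below assumes each K-means job follows Amdahl's law with a fixed parallel fraction and that jobs run in full "waves" on the pilot. The single-core runtime `t1` and the `parallel_fraction` are made-up parameters, and the model ignores scheduling overhead and data movement, so treat it as a thinking aid and not as a prediction.

```python
import math


def job_time(t1, cores, parallel_fraction):
    """Amdahl's-law runtime of one job on `cores` cores."""
    return t1 * ((1.0 - parallel_fraction) + parallel_fraction / cores)


def makespan(n_jobs, pilot_cores, cores_per_job, t1, parallel_fraction):
    """Time to finish `n_jobs` when the pilot runs them in waves."""
    concurrent = pilot_cores // cores_per_job  # jobs that fit at once
    waves = math.ceil(n_jobs / concurrent)     # rounds needed for all jobs
    return waves * job_time(t1, cores_per_job, parallel_fraction)


def compare(n_jobs, pilot_cores=64, t1=100.0, parallel_fraction=0.9):
    """Makespan for each power-of-two job size that fits in the pilot."""
    sizes = [c for c in (1, 2, 4, 8, 16, 32, 64) if c <= pilot_cores]
    return {
        c: makespan(n_jobs, pilot_cores, c, t1, parallel_fraction)
        for c in sizes
    }


if __name__ == "__main__":
    for n_jobs in (8, 64):
        print(n_jobs, "jobs:", compare(n_jobs))
```

Under these assumed parameters, the answer depends on how many jobs you have: with only 8 jobs, giving each one 8 cores keeps the whole pilot busy, while with 64 jobs, 1 core each avoids paying the serial fraction repeatedly. Real measurements on the given data set are what should settle the question.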

Feel free to contact me for any help + guidance: Shantenu Jha

  1. I’m assuming all fellows have access to das4. Please see me if you need access. 

  2. Before using RADICAL-Pilot, please remember to set the environment variable: export RADICAL_PILOT_DBURL="mongodb://" (Note that this is different from the URL provided during the lecture).
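A forgotten or incomplete `RADICAL_PILOT_DBURL` is an easy mistake to make, so a small pre-flight check can save a confusing failure later. The sketch below only inspects the environment variable named above; the helper name `dburl_is_set` and the example host are hypothetical, and it does not try to contact the database.

```python
import os


def dburl_is_set(environ=None):
    """True if RADICAL_PILOT_DBURL holds more than the bare mongodb:// scheme."""
    if environ is None:
        environ = os.environ
    url = environ.get("RADICAL_PILOT_DBURL", "")
    scheme = "mongodb://"
    # The URL must start with the scheme and name something after it.
    return url.startswith(scheme) and len(url) > len(scheme)


if __name__ == "__main__":
    if not dburl_is_set():
        print("RADICAL_PILOT_DBURL is missing or incomplete; export it first.")
```

For example, `dburl_is_set({"RADICAL_PILOT_DBURL": "mongodb://db.example.org:27017/"})` passes the check (the host here is a placeholder, not the course database).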