RADICAL - Cybertools - SAGA and Pilot tutorial

The objective of the following tutorial is for you to be able to develop “solutions” for data-intensive applications by becoming familiar with RADICAL-Cybertools – SAGA and Pilot. In particular, you’ll be able to develop and execute task-level parallel applications on different infrastructure (of your choice).

As discussed in lecture today, many data-intensive analytics applications can be constructed with simple extensions to “uncoupled multiple tasks”, i.e., a bag of tasks. I hope you recall the K-means application as a representative example: once you have the ability to marshal a number of tasks concurrently, that ability can be extended. For the rest of this “tutorial” we’ll focus on the simple base case.
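To make the base case concrete, here is a minimal sketch of a bag of tasks using only the Python standard library. It is not RADICAL-Pilot or SAGA code: `run_task` and `run_bag_of_tasks` are hypothetical names, and the task body is a placeholder computation standing in for a real executable.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id):
    """Stand-in for one independent task; a real task would launch an executable."""
    return task_id, sum(i * i for i in range(task_id * 1000))


def run_bag_of_tasks(task_ids, max_workers=4):
    """Run all tasks concurrently and collect the results keyed by task id."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # The tasks share no state, so they can run in any order.
        return dict(pool.map(run_task, task_ids))


if __name__ == "__main__":
    results = run_bag_of_tasks(range(8))
    for task_id in sorted(results):
        print(task_id, results[task_id])
```

The tasks never communicate with each other, which is what makes the bag "uncoupled". An iterative method such as K-means can be thought of as repeating a bag like this, with a step that combines the results between rounds.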

A starting point is the RADICAL-Cybertools landing page. From there you can get to the RADICAL-Pilot and RADICAL-SAGA websites. In particular you need to access the tutorials for RADICAL-SAGA and RADICAL-Pilot.

SAGA

See the SAGA-python documentation.

Everyone should have installed SAGA on DAS4¹. Assuming that is the case, please focus on Sections 2.1-2.5 (i.e., the Tutorial). Be sure to understand how the Mandelbrot example submits 4 different jobs; you might consider changing the endpoints to different clusters.
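As a way of thinking about the 4 jobs and their endpoints, the sketch below splits an image into 4 tiles and assigns each tile to an endpoint in round-robin order. It deliberately does not call the SAGA API. The 2 x 2 tiling, the image size, the function names, and the endpoint URLs are all assumptions made for illustration, so check them against the tutorial code.

```python
def make_tiles(width, height, tiles_x, tiles_y):
    """Split a width x height image into tiles_x * tiles_y rectangles (x0, x1, y0, y1)."""
    tiles = []
    for ty in range(tiles_y):
        for tx in range(tiles_x):
            x0 = width * tx // tiles_x
            x1 = width * (tx + 1) // tiles_x
            y0 = height * ty // tiles_y
            y1 = height * (ty + 1) // tiles_y
            tiles.append((x0, x1, y0, y1))
    return tiles


def assign_round_robin(tiles, endpoints):
    """Pair each tile with an endpoint URL, cycling through the endpoints."""
    return [(endpoints[i % len(endpoints)], tile) for i, tile in enumerate(tiles)]


if __name__ == "__main__":
    # Hypothetical endpoint URLs; substitute the clusters you have access to.
    endpoints = ["ssh://cluster-a.example.org", "ssh://cluster-b.example.org"]
    for endpoint, tile in assign_round_robin(make_tiles(2048, 2048, 2, 2), endpoints):
        print(endpoint, tile)
```

Each (endpoint, tile) pair would correspond to one job description submitted to the job service for that endpoint.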

Furthermore, as an example of the interoperability that SAGA entails, as well as an example of running Hadoop on an HPC system, here are instructions on how to run Hadoop (and a simple word count example) on DAS4 machines.

Pilot

See the RADICAL-Pilot documentation.

Similarly, everyone should have access to RADICAL-Pilot². Assuming that is the case, please focus on Section 3 (3.1-3.3). You might also want to work through Section 4 (for completeness and for greater understanding).

After you have done the Pilot tutorials, you’ll be a step closer to reproducing the graphs [Figure 2] in the paper A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures.

K-means exercise

For those of you who would like to try the K-means exercise, the code is available on GitHub. Clone the repository and change to the tutorial directory with the following commands:

     git clone https://github.com/radical-cybertools/radical.pilot.git
     cd radical.pilot/examples/tutorial/

You will find two K-means implementations:

kmeans_openmp.py
kmeans_seq.py

You will probably want to think about how to determine the best way to execute multiple K-means tasks. For example, for the given data set and a pilot of, say, 64 cores, would you be best served by running 8 K-means jobs of 8 cores each (kmeans_openmp.py), or would you be better off running 64 K-means jobs of 1 core each (kmeans_seq.py)?
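One way to reason about this question is a simple back-of-the-envelope model: tasks run in "waves" that fill the pilot, and the total time is the number of waves times the runtime of one task. The sketch below is only such a model, under the assumption that all tasks take the same time and that scheduling overhead is negligible. The function name and the runtimes in the usage example are hypothetical placeholders, not measurements; you would replace them with timings you obtain yourself.

```python
import math


def estimated_makespan(n_tasks, pilot_cores, cores_per_task, task_runtime):
    """Estimate total time when equal-length tasks run in waves that fill the pilot."""
    concurrent = pilot_cores // cores_per_task  # tasks that fit in the pilot at once
    if concurrent < 1:
        raise ValueError("a task needs more cores than the pilot provides")
    waves = math.ceil(n_tasks / concurrent)
    return waves * task_runtime


if __name__ == "__main__":
    n_tasks = 64  # hypothetical workload size
    # Placeholder runtimes in seconds; measure the real ones on your data set.
    t_seq, t_omp = 80.0, 16.0
    print("1 core per task :", estimated_makespan(n_tasks, 64, 1, t_seq))
    print("8 cores per task:", estimated_makespan(n_tasks, 64, 8, t_omp))
```

Under this model the 8-core configuration only wins if the OpenMP version is more than 8 times faster than the sequential one, which is why measuring the actual speedup of kmeans_openmp.py matters.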

Feel free to contact me for any help + guidance: Shantenu Jha

¹ I’m assuming all fellows have access to DAS4. Please see me if you need access.
² Before using RADICAL-Pilot, please remember to set the following environment variable: export RADICAL_PILOT_DBURL="mongodb://ec2-54-83-29-124.compute-1.amazonaws.com:27017/" (Note that this is different from the URL provided during the lecture.)

Open Science Data Cloud PIRE
