The objective of this tutorial is to enable you to develop “solutions” for data-intensive applications by becoming familiar with RADICAL-Cybertools, specifically SAGA and Pilot. In particular, you’ll be able to develop and execute task-level parallel applications on the infrastructure of your choice.
As discussed in lecture today, many data-intensive analytics can be constructed with simple extensions to the “uncoupled multiple tasks” model, i.e., Bag-of-tasks. I hope you recall the K-means application as a representative example of how, once you have the ability to marshal a number of tasks concurrently, that ability can be extended. For the rest of this “tutorial” we’ll focus on the simple/base case.
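To make the base case concrete, here is a minimal sketch of a bag of tasks using only the Python standard library. It is an analogy, not RADICAL-Pilot code: the `task` function and the worker count are hypothetical stand-ins for real, independent units of work and for the cores a pilot would provide.

```python
from concurrent.futures import ThreadPoolExecutor


def task(task_id: int) -> int:
    """Hypothetical stand-in for one independent unit of work."""
    return task_id * task_id


def run_bag_of_tasks(n_tasks: int, max_workers: int = 4) -> list:
    """Run n_tasks uncoupled tasks concurrently and collect their results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, even if tasks finish out of order.
        return list(pool.map(task, range(n_tasks)))


if __name__ == "__main__":
    print(run_bag_of_tasks(8))
```

The tasks never communicate with each other, which is what makes them “uncoupled”; extensions such as K-means add a step that gathers the results and launches a new round of tasks.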
A starting point is the RADICAL-Cybertools landing page. From there you can get to the RADICAL-Pilot and RADICAL-SAGA websites. In particular you need to access the tutorials for RADICAL-SAGA and RADICAL-Pilot.
SAGA
See the SAGA-python documentation.
Everyone should have installed SAGA on das4 [1]. Assuming that is the case, please focus on Sections 2.1-2.5 (i.e., Tutorial). Be sure to understand how the Mandelbrot example submits 4 different jobs; you might consider changing the endpoints to different clusters.
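As a way to think about spreading those 4 jobs over several clusters, the sketch below plans one job per image tile and assigns endpoints round-robin. It does not call SAGA and does not reproduce the tutorial's job descriptions. The 2x2 tiling, the image size, and the endpoint URLs are all assumptions made for illustration.

```python
from itertools import cycle


def plan_tile_jobs(img_size, tiles_per_side, endpoints):
    """Return one plan per tile: its pixel bounds and the endpoint to run on."""
    step = img_size // tiles_per_side
    targets = cycle(endpoints)  # round-robin over the available endpoints
    plans = []
    for row in range(tiles_per_side):
        for col in range(tiles_per_side):
            plans.append({
                "x_range": (col * step, (col + 1) * step),
                "y_range": (row * step, (row + 1) * step),
                "endpoint": next(targets),
            })
    return plans


if __name__ == "__main__":
    # Hypothetical endpoints; replace with clusters you actually have access to.
    endpoints = ["ssh://cluster-a.example.org", "ssh://cluster-b.example.org"]
    for plan in plan_tile_jobs(1024, 2, endpoints):
        print(plan)
```

Each plan would then become one job description submitted to the job service at the corresponding endpoint.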
Furthermore, as an example of the interoperability that SAGA entails, as well as an example of running Hadoop on an HPC system, here are instructions on how to run Hadoop (and a simple word count example) on DAS4 machines.
Pilot
See the RADICAL-Pilot documentation.
Similarly, everyone should have access to RADICAL-Pilot [2]. Assuming that is the case, please focus on Section 3 (3.1-3.3). You might also want to work through Section 4 (for completeness and for greater understanding).
After you have done the Pilot tutorials, you’ll be a step closer to reproducing the graphs [Figure 2] in the paper A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures.
K-means exercise
For those of you who would like to try the K-means exercise, the code is available on GitHub. Clone the repository and change to the tutorial directory with the following commands:
git clone https://github.com/radical-cybertools/radical.pilot.git
cd radical.pilot/examples/tutorial/
You will find two K-means implementations:
- kmeans_openmp.py
- kmeans_seq.py
You will probably want to think about the best way to execute multiple
K-means tasks. For example, for the given data set and a pilot of, say, 64 cores, would you
be best served by running 8 K-means jobs, each on 8 cores (kmeans_openmp.py), or would you be
better off running 64 K-means jobs, each on 1 core (kmeans_seq.py)?
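One way to reason about this trade-off before measuring is a simple “waves” model: the pilot runs as many tasks at once as fit in its cores, and the remaining tasks wait for the next wave. The sketch below is such a model under stated assumptions. The runtimes are made-up numbers, not measurements of kmeans_seq.py or kmeans_openmp.py, and the model ignores scheduling overhead and data movement.

```python
import math


def concurrent_tasks(pilot_cores: int, cores_per_task: int) -> int:
    """Number of tasks that fit in the pilot at the same time."""
    return pilot_cores // cores_per_task


def makespan(n_tasks, pilot_cores, cores_per_task, task_runtime):
    """Estimated time to finish all tasks, assuming equal-length tasks run in waves."""
    slots = concurrent_tasks(pilot_cores, cores_per_task)
    if slots == 0:
        raise ValueError("a task needs more cores than the pilot provides")
    waves = math.ceil(n_tasks / slots)
    return waves * task_runtime


if __name__ == "__main__":
    # Hypothetical runtimes: 80 s on 1 core, 16 s on 8 cores (a 5x speedup).
    for n_tasks in (8, 64):
        seq = makespan(n_tasks, 64, 1, 80.0)
        omp = makespan(n_tasks, 64, 8, 16.0)
        print(f"{n_tasks} tasks: sequential {seq} s, OpenMP {omp} s")
```

With these made-up numbers the 8-core version wins when there are only 8 tasks, while the 1-core version wins when there are 64, because the 8-core speedup is less than 8x. The real answer depends on the measured speedup for the given data set.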
Feel free to contact me for any help + guidance:
Shantenu Jha
[1] I’m assuming all fellows have access to das4. Please see me if you need access.
[2] Before using RADICAL-Pilot, please remember to set the environment variable as follows:
export RADICAL_PILOT_DBURL="mongodb://ec2-54-83-29-124.compute-1.amazonaws.com:27017/"
(Note that this is different from the URL provided during the lecture.)
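Before launching a pilot script, it can save time to confirm that the variable is actually set in the current shell. The following is a small standard-library sketch of such a check; the helper name `parse_dburl` is hypothetical and not part of RADICAL-Pilot, and it only inspects the URL's form without contacting the database.

```python
import os
from urllib.parse import urlparse


def parse_dburl(environ=os.environ):
    """Return (host, port) from RADICAL_PILOT_DBURL, or raise if it is unusable."""
    url = environ.get("RADICAL_PILOT_DBURL")
    if not url:
        raise RuntimeError("RADICAL_PILOT_DBURL is not set")
    parsed = urlparse(url)
    if parsed.scheme != "mongodb" or not parsed.hostname:
        raise ValueError(f"not a mongodb:// URL: {url!r}")
    return parsed.hostname, parsed.port


if __name__ == "__main__":
    host, port = parse_dburl()
    print(f"RADICAL_PILOT_DBURL points at {host}, port {port}")
```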