5.5.1 MPI-Parallelization with NGSolve

This tutorial was prepared by Lukas Kogler for the 2018 NGSolve user meeting. It has not been updated since then, and some parts may be outdated. Nevertheless, it might be a source of inspiration.

This talk will be split into two parts.

In part one of the talk, we will look at the basics:

  • How do we start a distributed computation

    • mpirun/mpiexec

    • batch systems

    • jupyter-magic

  • Basics of MPI-parallel NGSolve

    • distributed meshes

    • distributed Finite Element Spaces

    • distributed Linear Algebra

    • solvers & preconditioners

Part two will focus on the FETI-DP method and its implementation in NGSolve, and will be presented in collaboration with Stephan Köhler from TU Bergakademie Freiberg.

MPI - Message Passing Interface

With an MPI-library, multiple separate processes can exchange data very easily and thus work together to do large computations.

Within an MPI job, each of the \(n_p\) processes in a computation is given a "rank", a number from \(0\) to \(n_p-1\), which is used as its identifier.

Communication happens within so-called 'mpi-communicators', which are contexts within which messages can be exchanged.
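As a small illustration of ranks and communicators, the following sketch uses plain mpi4py (assumed to be available alongside the MPI installation; it is not specific to NGSolve) to query the rank and size within MPI.COMM_WORLD and to pass one message from rank 0 to rank 1:

# illustration only: query rank/size and exchange one message via mpi4py
from mpi4py import MPI

comm = MPI.COMM_WORLD          # the communicator containing all processes of this job
rank = comm.Get_rank()         # this process' identifier, 0 .. size-1
size = comm.Get_size()         # total number of processes

print("I am rank", rank, "of", size)

if size > 1:
    if rank == 0:
        comm.send("hello from rank 0", dest=1, tag=0)   # point-to-point send within the communicator
    elif rank == 1:
        msg = comm.recv(source=0, tag=0)
        print("rank 1 received:", msg)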

Running computations with MPI

If you are familiar with MPI, you already know the dos and don'ts, and if you are following the presentation on your own machine, I cannot tell you what to do.

Directly - mpiexec

In the simplest case, we can start an MPI program with mpiexec -np N some_program.

In our case, we want to start N instances of python:

mpiexec -np N ngspy my_awesome_computation.py
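To give an idea of what such a script could contain, here is a rough sketch of a hypothetical my_awesome_computation.py that distributes a mesh, assuming the NGSolve/netgen MPI interface of that time (mpi_world, ngmesh.Distribute, Mesh.Receive); the exact names may differ in newer versions:

# my_awesome_computation.py -- rough sketch, assuming the NGSolve MPI interface of ~2018
from ngsolve import *
from netgen.geom2d import unit_square
import netgen.meshing

comm = mpi_world                # communicator containing all ranks of this job

if comm.rank == 0:
    # the master rank generates the mesh and sends the pieces to the other ranks
    ngmesh = unit_square.GenerateMesh(maxh=0.1)
    ngmesh.Distribute(comm)
else:
    # all other ranks receive their part of the distributed mesh
    ngmesh = netgen.meshing.Mesh.Receive(comm)

mesh = Mesh(ngmesh)
print("rank", comm.rank, "has", mesh.ne, "local elements")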

On clusters, however, this is usually not an option.

We ask you not to do this if you use the cluster (it will run the computation on the login node!)

Jupyter-Magic

For the purposes of this presentation, we have set up jupyter-notebooks on the COEUS cluster at Portland State University.

We thank PICS, the Portland Institute for Computational Science, for granting us access and organizing user accounts.

We would also like to acknowledge NSF Grant #DMS-1624776, which provided the funding for the cluster.

To connect, follow these steps:

You should have received an email with two attached files:

connectme.py  usrmtg_user_data

Download those and call

python3 connectme.py your_lastname

Follow the instructions, and you will be connected to your own jupyter-notebook running on COEUS.

How to use it

We can start a "cluster" of python-processes. The cluster will be identified by some "user_id". While it is running, it will allocate N cores (in this case 5) to this specific cluster.

[1]:
num_procs = '5'
from usrmeeting_jupyterstuff import start_cluster, connect_cluster
start_cluster(num_procs)
connect_cluster()
Waiting for connection file: ~/.ipython/profile_ngsolve/security/ipcontroller-kogler-client.json
connecting ... try:6 succeeded!

Python code in a normal cell will be executed as usual.

[2]:
print('this is executed right here')
this is executed right here

Python code in a cell that has %%px in the first line will be executed by all workers in the cluster in parallel.

[3]:
%%px
print('hello from everyone in the cluster!')
[stdout:0] hello from everyone in the cluster!
[stdout:1] hello from everyone in the cluster!
[stdout:2] hello from everyone in the cluster!
[stdout:3] hello from everyone in the cluster!
[stdout:4] hello from everyone in the cluster!
[4]:
%%px --targets 3:5
print('hello from some in the cluster')
[stdout:3] hello from some in the cluster
[stdout:4] hello from some in the cluster

We can shut down the cluster again. This frees the resources allocated for the cluster!

[5]:
from usrmeeting_jupyterstuff import stop_cluster
stop_cluster()

Please clean up your clusters!

Batch System

On clusters, we usually have to make use of a batch system. The details depend on the specific system.

COEUS uses SLURM (Simple Linux Utility for Resource Management), and we have prepared ready-to-go job submission scripts.

For each file.ipynb, there is a file file.py and a slurm-script slurm_file, which can be submitted with the command

sbatch slurm_file

You can check the status of your jobs with squeue -u username.

The slurm-scripts can be opened and modified with a text editor if you want to experiment.

The Rest of the talk:

For the introductory part, I will be using:

  • Basics: Distributed Meshes, Finite Element Spaces and Linear Algebra

For the FETI-DP part, the files are:

[ ]: