HPC Intermediate Python
This article is a work in progress
The HPC Intermediate workshop is for any researcher who uses UConn's computer cluster and has at least some experience with the command line or with clusters, even if not with our particular cluster. Fluency in any programming language is also expected, so that concepts like variables and functions are already familiar. This event will be more informal, and we expect to fine-tune the topics based on your experiences.
This workshop covers:
- 70% Strategies to parallelize programs.
- 20% Good practices and habits for numerical programming.
- 10% A little vocabulary and theory.
If you are in the workshop classroom, sign in as a student to UCONNHPC on socrative.com.
If you don't see the lesson materials under /scratch/lesson-intermediate, they are also available on our public GitHub repository: HPC/lesson-intermediate
# Make your own copy of the lessons
mkdir -p /scratch/$USER
cd /scratch/$USER
git clone https://github.uconn.edu/HPC/lesson-intermediate.git
Learning doesn't work simply by explaining a bunch of theory and doing practical problems. Memorable learning comes from forming associations with existing knowledge. Therefore, many of the topics will be explained by example: first a less optimal way of solving a problem, then the more accepted, preferred way.
To explain the broad topic of parallelism, we will be switching between coding in Python and using SLURM. Instead of using the plain, vanilla Python shell we will introduce IPython and use it throughout.
We will cover the following topics:
- Profiling code
- Introduction to IPython
- %timeit on loops, lists and arrays
- Review of pandas library API
- Built-in SLURM parallelism
- Job arrays are unpredictable
- srun --multi-prog works but has limitations
- Resume jobs using "checkpointing"
- Using traps.
- Guarding against data corruption.
- Disk usage best practices
- Parallel IO
- Binary formats like HDF5
- Reading large files with MPI
- Multi-process parallelism
- External process managers like GNU Parallel
- Internal process management
- Multi-threading not usually worth it
- General debugging
- Recognize when you run out of memory
- Disable MPI to see the underlying error
- Learning to be a better programmer
- Writing a short package
- Working on projects with real consequences
- Reading good packages
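As one illustration of the checkpointing topics in the list above (resuming jobs and guarding against data corruption), here is a minimal sketch in plain Python. It is not taken from the lesson materials: the function names `save_checkpoint`, `load_checkpoint` and `run`, the JSON layout, and the `stop_after` parameter used to simulate an interruption are all hypothetical. The sketch assumes the checkpoint file and its temporary copy live on the same filesystem, so that the final rename replaces the old file in one step.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over the old checkpoint.

    A job killed mid-write leaves the previous checkpoint untouched,
    because the rename only happens after the new file is complete.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())  # push the data to disk before renaming
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_step": 0, "total": 0}


def run(path, n_steps, stop_after=None):
    """Sum the squares of 0..n_steps-1, saving progress after every step.

    `stop_after` is a hypothetical knob that simulates the job being killed.
    """
    state = load_checkpoint(path)
    for step in range(state["next_step"], n_steps):
        if stop_after is not None and step >= stop_after:
            return state  # pretend the scheduler interrupted us here
        state["total"] += step * step
        state["next_step"] = step + 1
        save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        checkpoint = os.path.join(workdir, "state.json")
        print("interrupted run:", run(checkpoint, 10, stop_after=4))
        print("resumed run:    ", run(checkpoint, 10))
```

The second call to `run` starts from step 4 instead of step 0, which is the "respawn" behaviour a checkpoint is meant to give you. Saving after every step is only sensible here because each step is tiny; a real job would likely checkpoint less often.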
To be able to learn any deep subject well we also need to know a little bit of the vocabulary:
|Library / Module||Reusable collection of code that accomplishes a specific task. "Library" is the more general term. In Python, libraries are called "modules". Always try to write the minimum code possible to solve your problem by using libraries.|
|Documentation||Flavorful explanation of why you should use a particular program or library and how it can make your life easier. Often accompanied by short examples.|
|API|| Less flavorful explanation of libraries. Libraries don't just provide functions: they can also provide variables (e.g. |
Stands for "Application Programming Interface".
|Checkpoint||Saving progress so that an interrupted job can pick up close to where it left off. Think of video game checkpoints: a convenient location where you can respawn after being killed :)|
|Multi-processing||Make temporary copies of your program with the same functions and variables.|
|Multi-threading||Run several threads inside one program that all share the same memory. More efficient, but also more complicated. Usually done with "OpenMP" in HPC.|
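The "temporary copies" wording in the Multi-processing entry can be made concrete with a small sketch using Python's standard `multiprocessing` module. The names `counter`, `bump_counter` and `run_workers` are made up for this illustration. The sketch assumes a Linux system such as the cluster, because it explicitly requests the "fork" start method, which is not available on Windows.

```python
import multiprocessing as mp

counter = 0  # module-level variable; every forked worker gets its own copy


def bump_counter(worker_id, results):
    """Change this worker's private copy of `counter` and report its value."""
    global counter
    counter += worker_id
    results.put((worker_id, counter))


def run_workers(n_workers=3):
    """Start n_workers processes and collect what each one saw."""
    # Assumption: "fork" is chosen because it matches the "temporary copy"
    # picture most directly; other start methods re-import the program.
    ctx = mp.get_context("fork")
    results = ctx.Queue()
    workers = [
        ctx.Process(target=bump_counter, args=(worker_id, results))
        for worker_id in range(1, n_workers + 1)
    ]
    for worker in workers:
        worker.start()
    # Read one result per worker before joining, so no worker is left
    # waiting on a queue that nobody is reading.
    reported = sorted(results.get() for _ in workers)
    for worker in workers:
        worker.join()
    return reported


if __name__ == "__main__":
    print("workers saw:", run_workers())
    print("parent still sees counter =", counter)
```

Each worker starts from the parent's value of `counter` and changes only its own copy, so the parent's value is unchanged afterwards. With threads, all workers would instead update the same variable, which is where the extra complication of multi-threading comes from.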
Let's quickly recall the 7 elements common to every programming language:
- Store individual things (the number 2, the word "Hello")
- Store groups of things (lists, dictionaries, DataFrames, arrays)
- Commands that operate on things (the
- Ways to create chunks (functions, objects/classes, and packages)
- Ways to repeat yourself (
- Ways to make choices (
- Ways to combine chunks (function composition)
For the 20% of the workshop on numerical programming today, we will mainly focus on #3: being mindful of efficient operations.
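To give a feel for what "efficient operations" means in practice, the sketch below computes the same sum two ways and times both with the standard `timeit` module. The function names and the list size are arbitrary choices for this illustration, and the timings will differ from machine to machine.

```python
import timeit

values = list(range(100_000))


def total_with_loop(items):
    """Add the items one at a time in an explicit Python loop."""
    total = 0
    for item in items:
        total += item
    return total


def total_with_builtin(items):
    """Let the built-in sum() do the looping internally."""
    return sum(items)


if __name__ == "__main__":
    # Run each version 20 times and report the total elapsed seconds.
    loop_time = timeit.timeit(lambda: total_with_loop(values), number=20)
    builtin_time = timeit.timeit(lambda: total_with_builtin(values), number=20)
    print(f"explicit loop: {loop_time:.4f} s")
    print(f"built-in sum:  {builtin_time:.4f} s")
```

Both functions return the same answer; the built-in version is usually faster because the loop runs inside the interpreter's compiled code rather than in Python bytecode. Inside IPython, the same comparison can be made interactively with `%timeit total_with_loop(values)` and `%timeit total_with_builtin(values)`.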