Shared-Memory Parallelism: CPUs, GPUs, and In-Between

A course I developed around the systems substrate beneath modern AI: parallel programming on shared-memory machines, spanning CPUs, GPUs, accelerators, and the practical space between them.

Developed course

How computation actually reaches modern AI hardware

I developed this course to teach parallelism as a systems discipline: how code, memory movement, synchronization, vectorization, and heterogeneous hardware fit together when students move from CPUs to GPUs and accelerators.

Every modern AI stack ultimately runs on parallel systems. This course is about that interface: how humans describe computation, how software maps it to hardware, and how performance constraints appear in the real machine underneath training and inference.

  • The systems layer beneath AI
  • CPUs, GPUs, and oneAPI
  • Code-to-hardware reasoning
  • Public video archive

Lecture videos and practice sessions

Lecture video

Lecture 1

Watch on YouTube

The lecture stream traces the course arc from the computational foundations of shared-memory parallelism to the hardware interface that modern AI workloads rely on.

Practice video

Practice 1

Watch on YouTube

The practice stream translates the lecture concepts into working code, tools, and exercises on parallel execution, accelerators, and performance reasoning.

Channel

YouTube channel

The public course video archive, including lectures, practice sessions, and companion material.

Open channel

Lectures, training, homework, and projects

Lectures and training slides

Lecture decks and practice materials used throughout the course.

Open folder

Homework and solutions

Graded exercises and accompanying solutions for practice and self-study.

Open folder

Project example

Example project material showing the scope and level expected in the competitive project component.

Open folder

Specifications and examples

Supplementary examples, specs, and reference material used around the project and the course assignments.

Open folder
Teaching philosophy

As an educator, my primary goal is to inspire my students to learn and grow while providing them with the tools and knowledge necessary to succeed in their careers. In my teaching, I strive to create an engaging, challenging, and supportive environment where students can feel confident and motivated to explore new concepts and ideas.

In this course, I aim to give students a solid foundation in parallel programming principles and equip them with the skills and knowledge required to develop high-performance, scalable applications on modern multi-core and many-core processors. Through a combination of lectures, graded programming exercises, and competitive projects, students learn to exploit data parallelism with vector instructions and SIMD extensions, manage task parallelism in shared memory with threads, efficiently offload kernels to accelerators, and reason about more advanced topics such as thread affinity, memory models, and NUMA.

At the heart of this course is the recognition that the field of high-performance computing is undergoing a profound transformation, driven by the end of Dennard scaling and the rise of multi-core and many-core architectures. As we move away from the era of Moore's law, it is more important than ever to understand the fundamental principles of parallelism and to be able to write code that can take advantage of the available hardware.

To ensure that all students start from a strong foundation, the course requires, from its early weeks, prior knowledge of C/C++ or Fortran, experience programming in the Linux environment, and proficiency with the Linux shell.

Ultimately, my goal is to provide students with the skills and knowledge they need to succeed in future careers, whether in HPC, data center workloads, artificial intelligence, or any other field that requires expertise in shared-memory parallelism.