Devices
Pty-Chi supports GPU acceleration through PyTorch’s native CUDA support. Currently,
multi-GPU support is only available for the AutodiffPtychographyReconstructor
and LSQMLReconstructor engines. Other engines support only a single GPU.
On a computer with multiple GPUs, you can choose which device to use by setting the
CUDA_VISIBLE_DEVICES environment variable. For example, to use the first GPU, run:
export CUDA_VISIBLE_DEVICES=0
To disable GPU acceleration, set the variable to an empty string.
Note that it is always recommended to set the variable in the terminal before running the code.
If you have to set the variable in Python code, make sure to set it before importing PyTorch,
using os.environ["CUDA_VISIBLE_DEVICES"] = "<GPU index>". Setting the variable in Python
after PyTorch has been imported will not take effect.
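The ordering requirement above can be sketched as a small helper. This is only an illustration: select_gpus is a hypothetical name, not part of Pty-Chi or PyTorch, and the check on sys.modules is an assumed way to detect that PyTorch was imported too early.

```python
import os
import sys
import warnings


def select_gpus(indices):
    """Expose only the given GPU indices to CUDA (hypothetical helper).

    An empty sequence yields an empty string, which hides all GPUs.
    """
    if "torch" in sys.modules:
        # Per the note above, setting the variable after importing PyTorch
        # is too late, so warn instead of failing silently.
        warnings.warn("torch is already imported; CUDA_VISIBLE_DEVICES may be ignored")
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


select_gpus([0])  # same effect as `export CUDA_VISIBLE_DEVICES=0`

# Import PyTorch (and Pty-Chi) only after the variable is set.
# import torch
```

Calling select_gpus([]) sets the variable to an empty string, which corresponds to disabling GPU acceleration as described above.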
Non-Nvidia GPUs
Pty-Chi also works on GPUs from vendors other than NVIDIA, such as Intel. To run Pty-Chi on Intel GPUs, add these lines right after importing torch and ptychi:
torch.set_default_device("xpu")
ptychi.device.set_torch_accelerator_module(torch.xpu)
Multi-GPU and multi-processing
Some reconstruction engines support multi-processing. This allows you to use multiple GPUs
(one in each process) to split the computation of update vectors across different devices.
The multi-processing capability is realized using PyTorch’s torch.distributed module
(for analytical engines) and torch.nn.parallel.DistributedDataParallel (for autodiff).
The biggest benefit of multi-GPU/multi-processing is lower per-device VRAM usage, because less data is processed on each device. Note that multi-processing does not necessarily make the computation faster: it incurs communication overhead, so a speedup is expected only when the data size is very large.
Currently, the engines that support multi-processing are:
Autodiff
LSQML
To enable multi-processing, you must launch the reconstruction script using torchrun:
torchrun --nnodes=1 --nproc_per_node=2 reconstruction_script.py
The --nnodes and --nproc_per_node arguments specify the number of nodes and
the number of processes per node, respectively. For single-node machines, keep --nnodes at 1.
When a job is launched in this way, Pty-Chi assigns each rank to the GPU indexed
rank % n_gpus, where n_gpus is the number of GPUs available, so as to use as many
GPUs as possible while minimizing the number of ranks on each GPU. It is
not recommended, and in some cases not allowed, to launch more processes than
the number of GPUs.
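The rank-to-GPU rule above can be illustrated with plain Python. This is a sketch of the arithmetic only, not Pty-Chi's actual code; gpu_for_rank and ranks_per_gpu are hypothetical names introduced here.

```python
from collections import Counter


def gpu_for_rank(rank, n_gpus):
    """Return the GPU index for a rank under the rank % n_gpus rule."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return rank % n_gpus


def ranks_per_gpu(n_ranks, n_gpus):
    """Count how many ranks land on each GPU."""
    return Counter(gpu_for_rank(r, n_gpus) for r in range(n_ranks))


# 2 processes on 2 GPUs: one rank per GPU.
print([gpu_for_rank(r, 2) for r in range(2)])  # [0, 1]
# 3 processes on 2 GPUs: GPU 0 is shared by ranks 0 and 2.
print([gpu_for_rank(r, 2) for r in range(3)])  # [0, 1, 0]
```

The second case shows why launching more processes than GPUs is discouraged: at least one device ends up hosting more than one rank.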
torchrun spawns all processes at the beginning, so the reconstruction script
will also be executed in all processes. If you have post-analysis or data saving
routines in that script, make sure they don’t produce unexpected results when executed
in multiple processes. It is generally advised to execute such routines only on rank 0:
import torch.distributed as dist
# Set up and run the task
if dist.get_rank() == 0:
    # Do post-analysis or data saving
    pass
Note that dist.get_rank() can only be called after the task object has been instantiated,
because the process group is initialized when the task is created.
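If a rank check is needed before the process group exists, one possible alternative is to read the RANK environment variable that torchrun sets for each process. This is a sketch rather than something Pty-Chi prescribes, and is_main_process is a hypothetical helper name; it assumes that a script started without torchrun should behave like rank 0.

```python
import os


def is_main_process():
    """Return True on rank 0, or when the script was not launched by torchrun."""
    # torchrun exports RANK for every process it spawns; default to "0"
    # so that a plain `python script.py` run is treated as the main process.
    return int(os.environ.get("RANK", "0")) == 0


if is_main_process():
    # Do post-analysis or data saving here.
    print("Running post-analysis on the main process only")
```

Because this check does not touch torch.distributed, it can be placed anywhere in the script, including before the task object is created.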