I'm using a Slurm-based HPC cluster at my university to run memory-intensive software, and I need to know whether it's possible to distribute the required RAM across multiple nodes and partitions. My lab has exclusive access to one node in the 'uni' partition with 384 GB of RAM. However, my current models require more memory than that, so I also need to use an additional partition called 'work'.
I'm aiming to combine RAM from both the 'uni' and 'work' partitions. The 'uni' partition provides 384 GB, and I plan to use two nodes from the 'work' partition, each with approximately 255 GB of idle RAM, which would give roughly 900 GB in total, covering the ~800 GB my models need.
I've attempted the following configurations:
Attempt 1 (using both partitions):
#SBATCH --partition=uni,work
#SBATCH --time=10:00:00
#SBATCH --nodes=3
#SBATCH --nodelist=n008,n010,n011
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=1
#SBATCH --mem=800G
Attempt 2 (using only 'work' partition):
#SBATCH --partition=work
#SBATCH --time=10:00:00
#SBATCH --nodes=4
#SBATCH --nodelist=n010,n011,n012,n013
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=1
#SBATCH --mem=800G
Attempt 3 (without specifying node list):
#SBATCH --partition=work
#SBATCH --time=10:00:00
#SBATCH --nodes=4
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=1
#SBATCH --mem=800G
All attempts resulted in the following error: "sbatch: error: Batch job submission failed: Requested node configuration is not available"
Can you advise on how to configure my job so that the required memory is usable across multiple nodes or partitions? Am I right in thinking that --mem requests memory per node rather than for the job as a whole?
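If --mem really is per node, I'd guess the request should look something like this instead. This is just a sketch: the 200G figure assumes each 'work' node can actually spare that much, and ./my_model is a placeholder for my program:

```shell
#!/bin/bash
#SBATCH --partition=work
#SBATCH --time=10:00:00
#SBATCH --nodes=4
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=1
#SBATCH --mem=200G      # interpreted per node, so 4 x 200G = 800G aggregate

srun ./my_model         # placeholder binary
```

Is that the right way to think about it, or does the aggregate memory still not help unless the program itself is distributed?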
You can use --mem=0 to request all available memory on each node you are allocated. However, this doesn't get past the restriction that a single process cannot access memory beyond what's on its own node. As Ian mentioned, you need a multi-process program in which every node runs at least one process, possibly more.