SLURM job priority
One of the most common questions mésocentre users ask is: “Why is my job still pending while someone else’s job is already running?” The answer almost always lies in SLURM’s priority calculation system.
SLURM doesn’t run jobs in a simple “first come, first served” order. Instead, it uses the multifactor priority plugin (PriorityType=priority/multifactor), which combines several weighted factors into a single priority score for each pending job. Here is the formula SLURM uses to compute each job’s priority (from the official documentation):
Job_priority =
site_factor +
(PriorityWeightAge) * (age_factor) +
(PriorityWeightAssoc) * (assoc_factor) +
(PriorityWeightFairshare) * (fair-share_factor) +
(PriorityWeightJobSize) * (job_size_factor) +
(PriorityWeightPartition) * (partition_factor) +
(PriorityWeightQOS) * (QOS_factor) +
SUM(TRES_weight_* * TRES_factor_*)
- nice_factor
Each term in this formula represents a “component” of the final score. Each factor is a floating-point value between 0.0 and 1.0.
The relative importance of each factor is controlled by its weight, which is defined in slurm.conf.
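As a rough illustration of how the weighted sum behaves, here is a minimal Python sketch of the formula. The function name job_priority, the dictionary keys, and the example weights and factor values are all hypothetical and chosen only to make the arithmetic easy to follow. They are not this cluster's settings, and SLURM's own implementation is more involved.

```python
def job_priority(weights, factors, site_factor=0, tres_terms=(), nice_factor=0):
    """Combine weighted factors into one score, following the formula above.

    weights    -- dict of PriorityWeight* values keyed by factor name
    factors    -- dict of factor values (0.0 to 1.0) keyed by the same names
    tres_terms -- iterable of (TRES weight, TRES factor) pairs
    """
    weighted = sum(w * factors.get(name, 0.0) for name, w in weights.items())
    tres = sum(w * f for w, f in tres_terms)
    return round(site_factor + weighted + tres - nice_factor)


# Hypothetical weights and factors, for illustration only.
weights = {"age": 1000, "assoc": 0, "fairshare": 2000,
           "job_size": 3000, "partition": 0, "qos": 2000}
factors = {"age": 0.5, "fairshare": 0.25, "job_size": 0.125, "qos": 1.0}

print(job_priority(weights, factors))                   # 500 + 500 + 375 + 2000 = 3375
print(job_priority(weights, factors, nice_factor=100))  # 3275
```

A factor with a weight of zero contributes nothing, whatever its value, and a positive nice value is subtracted from the total.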
What Each Factor Means:
- Age Factor: Prioritizes jobs based on their wait time in the queue.
- Association Factor: Reflects the priority value assigned to the job’s association (the user and account combination it runs under).
- Job Size Factor: Accounts for the number of nodes or CPUs requested.
- Nice Factor: Reflects the user’s specified priority adjustment; a positive nice value lowers the job’s priority.
- Partition Factor: Adjusts priorities based on the partition’s configuration.
- Quality of Service (QOS) Factor: Influences priority based on the assigned QOS.
- Site Factor: Allows site-specific adjustments to job priorities.
- TRES Factor: Accounts for specific resource types like GPUs or memory.
- Fairshare Factor: Balances resource usage among users or groups over time.
These factors are weighted and combined to compute a final job priority, which SLURM uses to determine the order in which jobs are scheduled.
Among all factors, the Job Size Factor often generates the most confusion among users. This factor is based on how many computing resources your job requests (in our case, the number of CPUs). It helps SLURM balance between throughput (running many small jobs quickly) and efficiency (ensuring large parallel jobs eventually start). The mésocentre’s policy goal is to prioritize large MPI jobs.
Here is the mésocentre’s priority configuration:
PriorityFavorSmall = No
PriorityWeightAge = 400,000
PriorityWeightAssoc = 0
PriorityWeightFairShare = 900,000
PriorityWeightJobSize = 1,500,000
PriorityWeightPartition = 0
PriorityWeightQOS = 900,000
The takeaways from these values are as follows:
PriorityWeightJobSize is the highest weight
PriorityFavorSmall=No → bigger jobs get higher job size factor scores.
This helps large parallel jobs start sooner and reduces fragmentation of the cluster by small jobs.
Fairshare and QOS Matter
Fairshare weight is high (900,000) → users who have used fewer resources recently gain higher priority.
QOS weight is also high (900,000) → jobs with a higher-priority QOS would get a boost, but all projects currently have the same QOS, so this term adds the same amount to every job and does not change their relative order.
Age Has Moderate Influence
Jobs waiting a long time gradually gain priority (PriorityWeightAge=400,000), so small or medium jobs won’t be starved indefinitely.
Other Factors Ignored
Association and Partition weights are zero, so association-level and partition-level priority settings contribute nothing to the score.
Implication for Users
- If your job is large, it will likely run sooner than smaller jobs submitted at the same time.
- If your job is small, it may wait behind large jobs unless older age or fairshare boosts it.
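To make these implications concrete, here is a small worked example in Python that plugs the weights from the configuration above into the weighted sum. Several things are assumptions for illustration only: the cluster size (TOTAL_CPUS), the maximum age of 7 days (SLURM's default PriorityMaxAge; the site value may differ), a QOS factor of 1.0 for every job, the fairshare values, and the simplified job size factor computed as the share of the cluster's CPUs requested. The real values on the cluster will differ.

```python
# Weights from the configuration shown above (zero weights omitted).
WEIGHTS = {"age": 400_000, "fairshare": 900_000,
           "job_size": 1_500_000, "qos": 900_000}

# ASSUMPTIONS, not taken from the site configuration:
TOTAL_CPUS = 10_000   # hypothetical cluster size
MAX_AGE_DAYS = 7.0    # SLURM's default PriorityMaxAge; the site value may differ


def age_factor(days_waiting):
    # Grows linearly with wait time and is capped at 1.0.
    return min(days_waiting / MAX_AGE_DAYS, 1.0)


def job_size_factor(cpus):
    # Simplified: share of the cluster's CPUs requested (PriorityFavorSmall=No).
    return cpus / TOTAL_CPUS


def priority(cpus, days_waiting, fairshare, qos=1.0):
    # qos=1.0 is assumed for every job, since all projects share the same QOS.
    score = (WEIGHTS["age"] * age_factor(days_waiting)
             + WEIGHTS["fairshare"] * fairshare
             + WEIGHTS["job_size"] * job_size_factor(cpus)
             + WEIGHTS["qos"] * qos)
    return round(score)


large_new = priority(cpus=2000, days_waiting=0, fairshare=0.5)
small_mid = priority(cpus=100, days_waiting=3.5, fairshare=0.5)
small_old = priority(cpus=100, days_waiting=7, fairshare=0.5)

print(large_new, small_mid, small_old)  # 1650000 1565000 1765000
```

Under these assumptions, a freshly submitted 2000-CPU job outranks a 100-CPU job that has waited 3.5 days, and the small job only overtakes it once its age factor approaches the maximum.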
So, if your job is pending, it doesn’t necessarily mean something is wrong. It simply means that, according to our cluster configuration, other jobs currently have a higher combined priority. Over time, your job’s age factor grows (up to the limit set by PriorityMaxAge) and your fairshare factor recovers as your recent usage decays, so the job moves up the queue and eventually runs.