Adjust the Workload's Runtime Budgets #836

Open
fsschneider opened this issue Jan 20, 2025 · 3 comments
fsschneider commented Jan 20, 2025

Based on the inaugural AlgoPerf competition results, we believe we can adjust the per-workload runtime budgets, mostly to reduce the required computational resources without significantly affecting the meaningfulness of the results.

Normalized submission runtimes across workloads

External Tuning

| Submission | Criteo 1TB | fastMRI | ResNet | ViT | Conformer | DeepSpeech | OGBG | WMT |
|---|---|---|---|---|---|---|---|---|
| Amos | inf | 0.33 | inf | 0.65 | 0.71 | 0.57 | 0.60 | 0.68 |
| Baseline | 0.94 | 0.23 | inf | 0.91 | 0.90 | 0.65 | 0.42 | 0.86 |
| CASPR Adaptive | NaN | 0.13 | inf | 0.58 | inf | 0.75 | 0.12 | 0.67 |
| Cyclic LR | 0.67 | 0.25 | inf | 0.81 | 0.94 | 0.70 | 0.38 | 0.49 |
| Generalized Adam | 0.83 | 0.18 | 0.97 | 0.84 | inf | 0.68 | 0.31 | 0.63 |
| LAWA EMA | 0.69 | 0.29 | inf | 0.80 | inf | inf | 0.57 | 0.89 |
| LAWA Queue | inf | 0.22 | inf | 0.66 | inf | inf | 0.25 | 0.56 |
| NadamP | 0.80 | 0.22 | inf | 0.88 | 0.94 | 0.60 | 0.43 | 0.80 |
| Schedule Free AdamW | 0.67 | 0.13 | inf | 0.57 | 0.92 | 0.78 | 0.29 | 0.33 |
| Schedule Free Prodigy | NaN | 0.21 | inf | inf | inf | inf | 0.61 | inf |
| PyTorch Distr. Shampoo | 0.65 | 0.15 | inf | 0.43 | 0.78 | 0.62 | 0.18 | 0.80 |

Self-Tuning

| Submission | Criteo 1TB | fastMRI | ResNet | ViT | Conformer | DeepSpeech | OGBG | WMT |
|---|---|---|---|---|---|---|---|---|
| AdamG | inf | inf | inf | inf | inf | inf | inf | inf |
| Baseline | 0.75 | 0.22 | inf | 0.95 | 0.94 | 0.65 | 0.46 | 0.84 |
| NadamW Sequential | 2.96 | 0.27 | inf | 1.58 | inf | 1.45 | 0.55 | 2.36 |
| Schedule Free AdamW | 0.75 | 0.15 | inf | 0.68 | 0.97 | 0.88 | 0.32 | 0.94 |
| Sinv6 | NaN | 0.49 | inf | inf | inf | 2.47 | 1.35 | 2.32 |
| Sinv6 75 | NaN | 0.45 | inf | inf | inf | 2.21 | 1.50 | 1.82 |
@fsschneider added the 👷 In Progress and 🛑 AlgoPerf Leaderboard labels on Jan 20, 2025
@fsschneider self-assigned this on Jan 20, 2025
fsschneider (Contributor, Author) commented:

External Tuning Ruleset

The main motivation for reducing the runtime budgets is to save compute resources. The following plots show how much compute we could save (y-axis) without (substantially) affecting the benchmark results.

Plot details: The x-axis sweeps hypothetical runtime budget cuts, given as percentages of the original budget. They range from 100% (the original budget) down to the smallest possible budget (i.e., the fastest a submission reached the target on a single run). The blue line denotes the total required compute (on this workload) as a percentage of the original cost, so at x = 100% we always have 100% of the cost. The blue line always lies above the gray dashed identity line: since training runs stopped once they hit the target (or failed with a NaN), runs that finish before the new cutoff keep their already reduced cost, so cutting the budget saves less than proportionally.
The (orange) vertical lines indicate the median per-workload runtime scores of the submissions, with the winner of the external tuning ruleset (Shampoo) in green and the baseline in black. They give us a sense of the point at which reducing the budget would start to affect the results.

I bolded my preferred option, but would be happy to discuss it.
Note: The compute reductions are computed per workload. For the overall compute, it is more important to save compute on resource-intensive workloads.
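For concreteness, here is a minimal sketch (in Python, with made-up per-run runtimes rather than the actual competition logs) of how the blue compute curve can be computed: the total compute at a reduced budget is the sum of each run's time truncated at the new cutoff, relative to the compute spent under the original budget.

```python
import numpy as np

# Hypothetical normalized runtimes (fraction of the original budget) of the
# scored runs on one workload; 1.0 means the run used the full budget
# (it missed the target or hit it right at the end).
runtimes = np.array([0.33, 0.23, 1.0, 0.25, 0.18, 0.29, 1.0, 0.22, 0.13, 0.21, 0.15])

def relative_compute(cut, runtimes):
    """Total compute with the budget reduced to `cut` (fraction of the original
    budget), relative to the compute spent under the original budget. Runs that
    finished before the new cutoff are unaffected; longer runs are truncated."""
    return np.minimum(runtimes, cut).sum() / runtimes.sum()

for cut in (1.0, 0.95, 0.85, 0.70):
    print(f"budget {cut:.0%} -> compute {relative_compute(cut, runtimes):.0%}")
# The compute curve always lies above the identity line y = x because runs
# shorter than the cutoff keep their full (already reduced) cost.
```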

Criteo 1TB

Image

This likely gives us four options (from least to most competitive):

  1. Don't reduce the budget.
  2. Reduce to 95%, saving roughly 3% compute. No submission would be affected.
  3. Reduce to 85%, saving roughly 11% compute. The Baseline would receive an infinite score.
  4. Reduce to 70%, saving roughly 24% compute. The Baseline, Generalized Adam, and NadamP would receive an infinite score.

fastMRI

Image

Due to our 4x rule (i.e., submissions get an infinite score when they need more than 4x the time of the fastest submission), we should reduce the budget at least down to 52% (fastest submission: CASPR Adaptive with 13%; 4 × 13% ≈ 52%). Given that the slowest submission took 33%, we have (roughly) the following options (see the short sketch after the list):

  1. Reduce to 50%, saving roughly 27%. No submission would be affected.
  2. Reduce to 35%, saving roughly 37%. No submission would be affected.
  3. Reduce to 25%, saving roughly 49%. Amos and LAWA EMA would receive an infinite score. (Cyclic LR requires exactly 25%).
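As a sanity check, here is a minimal sketch of the arithmetic behind these options, using the per-submission normalized fastMRI runtimes from the table above (the real evaluation aggregates multiple studies and runs, so this is only an approximation):

```python
# Normalized fastMRI runtimes from the table above (fraction of the original budget).
fastmri = {
    "Amos": 0.33, "Baseline": 0.23, "CASPR Adaptive": 0.13, "Cyclic LR": 0.25,
    "Generalized Adam": 0.18, "LAWA EMA": 0.29, "LAWA Queue": 0.22, "NadamP": 0.22,
    "Schedule Free AdamW": 0.13, "Schedule Free Prodigy": 0.21,
    "PyTorch Distr. Shampoo": 0.15,
}

fastest = min(fastmri.values())
print(f"4x rule: budget beyond {4 * fastest:.0%} of the original is wasted")  # ~52%

for cut in (0.50, 0.35, 0.25):
    affected = [name for name, t in fastmri.items() if t > cut]
    print(f"cut to {cut:.0%}: infinite score for {affected or 'no submission'}")
```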

ResNet

Discussed in a separate comment below

ViT

Image

ViT shows a very large spread between the fastest submission and the baseline. We could thus consider the following options:

  1. Reduce to 95%, saving roughly 3%. No submission would be affected.
  2. Reduce to 85%, saving roughly 11%. The Baseline and NadamP would receive an infinite score.
  3. Reduce to 70%, saving roughly 25%. The Baseline, Cyclic LR, Generalized Adam, LAWA EMA, and NadamP would receive an infinite score.

Conformer

Image

Here, I see the following options:

  1. Don't reduce the budget.
  2. Reduce to 95%, saving roughly 4%. No submission would be affected.
  3. Reduce to 90%, saving roughly 9%. Cyclic LR, NadamP, and Schedule Free AdamW would receive an infinite score.

DeepSpeech

Image

  1. Reduce to 80%, saving roughly 16%. No submission would be affected.
  2. Reduce to 70%, saving roughly 24%. CASPR, Cyclic LR, and Schedule Free AdamW would receive an infinite score.

OGBG

Image

Due to our 4x rule (i.e., submissions get an infinite score when they need more than 4x the time of the fastest submission), we should reduce the budget at least down to 48% (fastest submission: CASPR Adaptive with 12%; 4 × 12% = 48%). However, since the fastest submission might not necessarily be part of the next round of submissions, we could also consider 72% (since Shampoo required 18%, and 4 × 18% = 72%). I see the following options:

  1. Reduce to 65%, saving roughly 26%. No submission would be affected.
  2. Reduce to 48%, saving roughly 41%. Amos, LAWA EMA, and Schedule Free Prodigy would receive an infinite score.

WMT

Image

  1. Reduce to 90%, saving roughly 6%. No submission would be affected.
  2. Reduce to 70%, saving roughly 23%. The Baseline, LAWA EMA, NadamP, and Shampoo would receive an infinite score.

fsschneider (Contributor, Author) commented:

ResNet

Here, reducing the budget is not really an option. However, we did consider increasing the budget, as currently only a single submission hits the target (across both rulesets).
The following plots show that three additional submissions, the Baseline, NadamP, and Shampoo, came very close to hitting the target reliably.

Image
Image
Image

Increasing the budget slightly, e.g. by 5%, could result in them hitting the target, giving us a less binary score for ResNet.

Hitting the target on ResNet would result in a benchmark score increase of roughly 0.125 (i.e., 1/8, since the workload runtime on ResNet would then be roughly the same as the fastest submission's -> $\tau = 1$, increasing the normalized performance profile integral by ~1/8). This is quite a significant increase: the Baseline would jump from 6th to 3rd place (very close to 2nd), and NadamP from 5th to 2nd.
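A minimal sketch of why one additional solved workload adds roughly 1/8 to the score, assuming the score is the normalized area under the performance profile over the 8 workloads (the per-workload ratios below are made up, and the $\tau$ range used for integration is an assumption here):

```python
import numpy as np

# Hypothetical per-workload performance ratios for one submission
# (ratio = submission runtime / fastest runtime on that workload;
# inf = target not hit within the budget). 8 workloads, ResNet missed.
before = np.array([1.3, 1.1, np.inf, 1.6, 1.2, 1.4, 1.5, 1.0])
after = before.copy()
after[2] = 1.0  # ResNet now hit at roughly the fastest time, i.e. tau ~ 1

def profile_score(ratios, tau_max=4.0, n=10_000):
    """Normalized area under the performance profile rho(tau) on [1, tau_max],
    where rho(tau) is the fraction of workloads with ratio <= tau."""
    taus = np.linspace(1.0, tau_max, n)
    rho = (ratios[None, :] <= taus[:, None]).mean(axis=1)
    return rho.mean()  # mean over a uniform grid ~ normalized integral

print(profile_score(before), profile_score(after))
# The difference is ~1/8 = 0.125: the newly solved workload raises rho(tau)
# by exactly 1/8 for every tau >= 1.
```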

adefazio (Contributor) commented:

Thanks for preparing the hard numbers here! I am in agreement with all suggestions. As for ResNet, as I mentioned in the meeting, increasing the budget by 5% will result in more repeatable, lower-variance results across the board, which I am in favor of. Right now the performance profiles are very dependent on the seed values used for the ResNet runs, which is undesirable.
