Without talking about Gustafson's law and the assumptions both make.
Let me preface this by saying there is nothing wrong with Amdahl's law[1] itself. After all, it is just an observation of the result of applying simple math.
But I believe there is a lot wrong with its current use in pedagogy around parallel programming[2].
Amdahl's law
Amdahl's law is a formula for calculating the speedup of a task from using more computing resources, e.g. multiple threads (assuming perfect parallelization). Formally written
\[
S(n) = \frac{1}{(1-p) + \frac{p}{n}}
\]
where
- \(n\) is the number of threads you have
- \(p\) is the fraction of runtime taken up by parallelizable code
There are two important things to notice about the formula.
- The result is the improvement in latency of the task, that is, how much less real time you have to wait from start to finish.
- For large \(n\), the result approaches just \(\frac{1}{1-p}\), as the short limit below shows.
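To spell that out: as \(n\) grows, the \(\frac{p}{n}\) term vanishes, so
\[
\lim_{n \to \infty} S(n) = \lim_{n \to \infty} \frac{1}{(1-p) + \frac{p}{n}} = \frac{1}{1-p}
\]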
Informally we can also generalize it a bit:
Optimizing a single part of code cannot provide overall improvements larger than the fraction of time initially taken by the optimized part.
We expect experienced programmers to understand the underlying principle naturally: the colder a section of code is, the less payoff there is in optimizing it. After all, if a section of code takes up 1% of the overall runtime, you can, at most, knock 1% off the overall runtime by optimizing it.
My issue with Amdahl's law when talking about multithreading is simple. Amdahl's law is used to explain the limit of parallelizing code across a large number of threads. If the parallelizable code takes up 95% of the runtime before parallelization, even perfect parallelization across thousands of cores won't decrease the runtime by more than 20x (down to 5% of the original time).
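To make the plateau concrete, here is a minimal Python sketch (the helper name is mine, just for illustration) that plugs \(p = 0.95\) into the formula for growing \(n\):

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Overall speedup when a fraction p of the runtime is
    perfectly parallelized across n threads."""
    return 1.0 / ((1.0 - p) + p / n)

# With p = 0.95 the speedup creeps towards, but never exceeds,
# 1 / (1 - 0.95) = 20x, no matter how many threads we add.
for n in (1, 2, 8, 32, 128, 1024, 1_000_000):
    print(f"{n:>9} threads -> {amdahl_speedup(0.95, n):5.2f}x")
```

Going from 1024 threads to a million barely moves the needle; the serial 5% dominates almost immediately.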
This limit is correct, but further context is necessary for students to avoid leaving with the idea that parallelizing past a few cores is pointless. The issue is that Amdahl's law makes an important assumption about the problem, and students who have just been exposed to a bunch of new complex ideas (concurrency, synchronization, etc.) are unlikely to realize what that assumption is. Can you figure out the assumption?
Amdahl's law assumes that there is always a fixed amount of work.
This assumption holds for a lot of important software.
Take grep as a simple example. No matter how quickly you can search for a pattern in a single[3] file, you will always have to wait for the time it takes to transfer the bytes from the drive[4] to memory so you can read and match them.
Filters in Photoshop are another example. Most of the time, they will be applied to a single image, sometimes to a well-defined batch of images. No matter how fast/parallelized the computational part of the filter becomes, there will always be non-parallelizable steps, like dealing with IO.
Gustafson's law
Gustafson's law instead talks about the slowdown from forcing parallel code to run on a serial computer. Formally written
\[
S(n) = (1 - p) + p \cdot n
\]
as with Amdahl's law,
- \(n\) is the number of threads you have
- \(p\) is the fraction of runtime taken up by parallelizable code
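Plugging the same \(p = 0.95\) into this formula (a sketch mirroring the Amdahl one above, helper name again mine):

```python
def gustafson_speedup(p: float, n: int) -> float:
    """Scaled speedup: the serial fraction (1 - p) stays fixed,
    while the parallel fraction p scales with the n threads."""
    return (1.0 - p) + p * n

# With p = 0.95 the speedup grows linearly with n: roughly 0.95x
# of extra speedup per added thread, with no plateau in sight.
for n in (1, 2, 8, 32, 128, 1024):
    print(f"{n:>5} threads -> {gustafson_speedup(0.95, n):7.2f}x")
```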
Notice that there are no diminishing returns on the parallelism level applied to the task. Where does this difference come from? From a different basic assumption about the shape of the work. Can you figure out the assumption underlying Gustafson's law?
Gustafson's law assumes a fixed amount of time to do the work.
This assumption also holds for a lot of important software.
For example, models for weather forecasting are massively parallel but must finish running within a set amount of time[5]. So when we get more parallel resources, we can run more granular[6] models or larger model cohorts, improving our ability to accurately forecast short-term weather.
Various other physics simulations have the same property, where we can saturate an arbitrary number of cores[7] to achieve better and more precise results, but we need to get the results within some predetermined amount of time.
But the tasks do not have to be inherently parallel for the software to obey Gustafson's law. The work that uses up the extra resources can be orthogonal to the work that is always done; e.g. an old word processor running on an old machine might not even have the resources available to run a simple spell check, while a new one can do online machine learning on your writing style to tell you when your voice has changed.
Conclusion
Both Amdahl's and Gustafson's laws are correct while holding very different positions on the efficiency of parallelism. This difference comes from their contradictory assumptions about the properties of the problem our code is solving, and thus, at most one of them applies to our code.
Somehow it became common to show students just Amdahl's law as part of multithreading courses without explaining the underlying assumption it makes. But showing your students just Amdahl's law[8] without explaining it is doing them a huge disservice, so, please:
stop talking about Amdahl's law without also explaining Gustafson's law
If you do not have the time to do both, skip showing either.
Quick terminology note: keeping the problem size constant while increasing the number of cores is called strong scaling. Increasing the size of the problem along with the number of cores is called weak scaling.
The fact that it is called "law" annoys me a bit. There is nothing factually wrong with Amdahl's law though[9]. ↩︎
Amdahl's law is perfectly applicable to single-threaded programming, but I have never seen it brought up in that context. I find this interesting because we teach people about its implications, but we never name the rule as such. ↩︎
Arguably, as the implementation of your grep-like utility becomes faster, the users will use it to search through more files. This is because a common usage pattern is to look for the file containing a specific pattern, and once users realize that they can get the results in a reasonable time even if they are lazy about prefiltering the files, they will get lazy. ↩︎
Assuming a cold FS cache. ↩︎
If the weather model does not finish in time, we won't get the results (prediction) before the weather is actually happening. And once the weather is happening, well, we can just observe it for much more accurate data. ↩︎
Higher granularity means that the "cells" used to simplify the simulations become smaller, which in turn means that there will be more of them, and thus there is more work to do. ↩︎
To quote a friend working on physics simulations in a field I won't pretend to understand: "If it got me 10x cores, I would turn Hindu". ↩︎
Obviously, this is also true for showing just Gustafson's law. But I have never met someone who said somebody showed them just Gustafson's law without further explanation. I have only seen this happen with Amdahl's law. ↩︎
Technically, second-order effects can violate Amdahl's law in some specific cases[10], but that requires code explicitly crafted to do that. Let's assume we are talking about plain old boring programs instead. ↩︎
The simplest way to violate Amdahl's law is through CPU cache effects on microbenchmark results. If optimizing one part of the code reduces how much instruction/data cache it uses, that can provide a speedup to the non-optimized parts because they are no longer evicted from the CPU cache. ↩︎