The study provides an analysis of ML model energy usage on a state-of-the-art NVIDIA chip:
We ran all of our experiments on a single NVIDIA A100-SXM4-80GB GPU
Looking these devices up, they have a power draw of 400W when they’re running at full pelt. Your phone probably uses something like 30-40W when fast charging, and your laptop probably uses 60-120W when it’s charging up. Gaming-grade GPUs like the RTX 4090 have a similar power draw to the A100 (450W). My NVIDIA 4070 has a power draw of 200W.
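To put those wattages in context, here’s a quick back-of-envelope conversion to energy per hour. These are the peak draw figures from above, so real-world consumption will vary with load:

```python
# Back-of-envelope energy comparison using the peak power figures above.
# Real consumption depends on utilisation, so treat these as upper bounds.
devices_watts = {
    "NVIDIA A100 (peak)": 400,
    "RTX 4090 (peak)": 450,
    "RTX 4070 (peak)": 200,
    "laptop charging": 120,
    "phone fast-charging": 40,
}

for name, watts in devices_watts.items():
    kwh_per_hour = watts / 1000  # 1 kWh = 1000 W sustained for one hour
    print(f"{name}: {kwh_per_hour:.2f} kWh per hour at full draw")
```

So an hour of a single A100 at full tilt is roughly ten hours of fast-charging your phone.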
We know that the big players are running data centres filled with racks and racks of A100s and similar chips, and that is concerning. We should collectively be worried about how much energy we’re burning using these systems.
I’m a bit wary of the Gizmodo article’s conclusion that all models – including Dall-e and Midjourney – should be tarred with the same brush, not because I’m naively optimistic that they’re not burning the same amount of energy, but simply because they are an unknown quantity at this point. It’s possible that they are doing something clever behind the scenes (see the quantization section below).
Industry Pivot Away From Task-Appropriate Models
I think those of us in the AI/ML space had an intuition that custom-trained models would probably be cheaper and more efficient than generative models but this study provides some great empirical validation of that hunch:
…The difference is much more drastic if comparing BERT-based models for tasks such as text classification with the larger multi-purpose models: for instance, bert-base-multilingual-uncased-sentiment emits just 0.32g of CO2 per 1,000 queries, compared to 2.66g for Flan-T5-XL and 4.67g for… (pg 14, https://arxiv.org/pdf/2311.16863.pdf)
…While we see the benefit of deploying generative zero-shot models given their ability to carry out multiple tasks, we do not see convincing evidence for the necessity of their deployment in contexts where tasks are well-defined, for instance web search and navigation, given these models’ energy requirements.
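The per-1,000-query figures from the paper are easy to scale up to see how quickly the gap compounds. The one-million-queries-per-day volume here is my own illustrative assumption, not a figure from the study:

```python
# Scaling the paper's per-1,000-query CO2 figures (pg 14) up to a
# hypothetical daily query volume to make the gap concrete.
grams_co2_per_1k_queries = {
    "bert-base-multilingual-uncased-sentiment": 0.32,
    "Flan-T5-XL": 2.66,
}

queries_per_day = 1_000_000  # illustrative assumption, not from the study

for model, grams in grams_co2_per_1k_queries.items():
    kg_per_day = grams * (queries_per_day / 1000) / 1000
    print(f"{model}: {kg_per_day:.2f} kg CO2 per day")
```

At any volume, Flan-T5-XL emits roughly 8x what the task-specific BERT model does for the same classification workload.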
Generative models that can “solve” problems out of the box may seem like an easy way to save many person-weeks of effort – defining and scoping an ML problem, building and refining datasets and so on. However, the cost to the environment (heck even the fiscal cost) of training and using these models is higher in the long term.
If we look at the recent history of the software industry to understand this current trend, we can see a similar sort of pattern in the switch away from platform-specific development frameworks like QT or Java on Android towards cross-platform frameworks like Electron.js and React Native. These frameworks generally produce more power-hungry, bloated apps, but they offer a much faster and cheaper development experience for companies that need to support apps across multiple systems. This is why your banking app takes up several hundred megabytes on your phone.
The key difference when applying this general “write once, run everywhere” approach to AI is that, once you scratch the surface of your problem space and realise that prompt engineering is more alchemy than wizardry and that the behaviour of these models is opaque and almost impossible to explain, it may make sense to start with a simple model anyway. If you have a well-defined classification problem, you might find that a random forest model that can run on a potato computer will do the job for you.
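To make the “simple model first” point concrete, here’s a minimal sketch using scikit-learn’s RandomForestClassifier. The synthetic dataset and hyperparameters are placeholders standing in for whatever labelled data your well-defined problem has; both training and inference run comfortably on a CPU:

```python
# A toy "well-defined classification problem" solved with a random forest.
# make_classification stands in for your real labelled dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 small trees: trains in well under a second, no GPU required.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

No prompt engineering, no opaque behaviour: you can inspect feature importances and explain exactly why the model predicts what it does.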
Quantization and Optimisation
A topic that this study doesn’t broach is model optimisation and quantization. For those unfamiliar with the term, quantization is a compression mechanism which allows us to shrink neural network models so that they can run on older/slower computers or run much more quickly and efficiently on state-of-the-art hardware. Quantization has been making big waves this year, starting with llama.cpp (which I built Turbopilot on top of).
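As a rough illustration of the mechanism (this is not the actual scheme llama.cpp uses, which works on blocks of weights with per-block scales and lower bit widths), here’s a handful of float weights squeezed into 8-bit integers and back:

```python
# Minimal sketch of quantization: map float weights onto an int8 grid,
# trading a little precision for a 4x smaller representation than fp32.
weights = [0.12, -0.5, 0.33, 0.91, -0.07]

# One scale for the whole tensor; real schemes use per-block scales.
scale = max(abs(w) for w in weights) / 127

quantized = [round(w / scale) for w in weights]  # integers in [-127, 127]
dequantized = [q * scale for q in quantized]     # approximate originals

max_error = max(abs(w - d) for w, d in zip(weights, dequantized))
print(quantized)   # [17, -70, 46, 127, -10]
print(f"worst-case round-trip error: {max_error:.4f}")
```

The round-trip error is bounded by half the scale, which is why neural networks – famously tolerant of small weight perturbations – survive the compression so well.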
Language models like Llama and Llama 2 typically need tens of gigabytes of VRAM to run (hence the A100 with 80GB of RAM). However, quantized models can run in 8-12GiB of RAM and will happily tick along on your gaming GPU or even a MacBook with an Apple M-series chip. For example, to run Llama 2 without quantization you need 28GiB of RAM; to run it in 5-bit quantized mode you need 7.28GB. Not only does compressing the model mean it can run on smaller hardware, but it also means that inference can be carried out in fewer compute cycles, since we can do more calculations at once.
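The weight-memory arithmetic is simple to sketch: roughly parameter count times bits per weight. Note this covers the raw weights only – real runtimes also need room for activations, the KV cache and quantization metadata, which is one reason quoted in-practice figures run higher than these numbers:

```python
# Raw weight-memory arithmetic for a 7-billion-parameter model at
# different precisions. Ignores activations, KV cache and quantization
# metadata, so actual runtime footprints are larger.
params = 7_000_000_000

for label, bits in [("fp32", 32), ("fp16", 16), ("8-bit", 8), ("5-bit", 5)]:
    gb = params * bits / 8 / 1e9  # decimal gigabytes
    print(f"{label}: {gb:.2f} GB of weights")
```

Going from fp32 to 5-bit shrinks the weights by more than 6x, which is what moves these models from data-centre cards into gaming-GPU and laptop territory.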
Whilst I stand by the idea that we should use appropriate models for specific tasks, I’d love to see this same study done with quantized models. Furthermore, there’s nothing stopping us from applying quantization to pre-GPT models to make them even more efficient, as this repository attempts to do with BERT.
I haven’t come across a stable runtime for quantized stable diffusion models yet but there are promising early signs that such an approach is possible for image generation models too.
However, I’d wager that companies like OpenAI are currently not under any real pressure (commercial or technical) to quantize their models when they can just throw racks of A100s at the problem and chew through gigawatt-hours in the process.
It seems pretty clear that transformer-based and diffusion-based ML models are energy-intensive and difficult to deploy at scale. Whilst there are some use cases where it may make sense to deploy generative models, in well-defined problem spaces the advantages that these models bring may simply never manifest. In cases where a generative model does make sense, we should be using optimisation and quantization to make its usage as energy-efficient as possible.