Source: Raja's posts on X.
Very encouraging to see the steady increase of viable hardware options that can handle various AI models.
At the beginning of the year, there was only one practical option: Nvidia. Now we see at least three vendors providing reasonable options: Apple, AMD, and Intel. We have been profiling several options and I will share some of our findings here.
The good stuff
- Apple Macs were a pleasant surprise in how easy it is to get various models running
- AMD also made impressive progress with PyTorch and a lot more models run now than even 4-5 months ago on MI2XX and Radeon
- We tried both Intel Arc and Ponte Vecchio and they were able to execute everything we have thrown at them.
- Intel Gaudi has very impressive performance on the models that work on that architecture. It's our current best option for LLM inference on select models.
- Ponte Vecchio surprised us with its performance on our custom face swap model, beating everyone including the mighty H100. We suspect that our model may be fitting largely in the Rambo cache.
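As a rough illustration of what "running on Apple, AMD and Intel" looks like from the PyTorch side, here is a minimal device-probing sketch (my own, not from Raja's post). It assumes a recent PyTorch build; the intel_extension_for_pytorch import is an extra assumption, since late-2023 PyTorch only exposes the XPU backend through that extension.

```python
import torch

def pick_device() -> torch.device:
    """Pick the first available accelerator backend, falling back to CPU."""
    # CUDA and ROCm builds of PyTorch both report through torch.cuda,
    # so Nvidia and AMD GPUs are covered by the same check.
    if torch.cuda.is_available():
        return torch.device("cuda")
    # Apple Silicon GPUs via the Metal Performance Shaders backend.
    if torch.backends.mps.is_available():
        return torch.device("mps")
    # Intel GPUs (Arc / Ponte Vecchio) via the XPU backend; the extension
    # import is an assumption about the setup, not part of stock PyTorch 2.1.
    try:
        import intel_extension_for_pytorch  # noqa: F401
        if hasattr(torch, "xpu") and torch.xpu.is_available():
            return torch.device("xpu")
    except ImportError:
        pass
    return torch.device("cpu")

device = pick_device()
x = torch.randn(4, 4, device=device)
print(device, (x @ x).sum().item())
```

The point of the sketch is simply that the same PyTorch script can target all four vendors once the right backend build is installed, which is what makes the compatibility progress described above meaningful.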
The wishlist
- For training and inference of large models that don't fit in memory, Nvidia is still the only practical option. Wishing for more options here in 2024.
- While compatibility is getting better, a ton of performance is still left on the table on Apple, AMD and Intel. Wishing that software will keep getting better and increase their HW utilization. There is still room on compatibility as well, particularly with supporting various encodings and model parameter sizes on AMD.
- Intel Gaudi looks very promising performance-wise and wishing that more models seamlessly work out of the box without Intel intervention.
- Wishing that both AMD and Intel release new gaming GPUs with more memory capacity and bandwidth.
- Wishing that Intel releases a PVC kicker (a follow-on part) with more memory capacity and bandwidth. Currently it's the best option we have to bring our artists' face-swap training workflow from 3 days down to a few hours, and it scales linearly from 1 GPU to 16 GPUs.
- Wishing that Intel's support for PyTorch were as frictionless as AMD's and Nvidia's. Maybe Intel should consider supporting PyTorch ROCm or upstreaming OneAPI support under the CUDA device.
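The 1-GPU-to-16-GPU linear scaling mentioned two items up is the standard data-parallel pattern. Below is a minimal sketch with PyTorch DistributedDataParallel; the tiny linear model and random tensors are my own stand-ins (Raja's face-swap model is not public), and the nccl backend assumes Nvidia or AMD GPUs, with Intel XPU devices instead going through the oneCCL bindings.

```python
# Minimal DistributedDataParallel sketch (a stand-in, not Raja's trainer).
# Launch one process per GPU with, for example:
#   torchrun --nproc_per_node=16 ddp_sketch.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group("nccl")                  # one process per GPU
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    # Stand-in model and data; a real face-swap network would go here.
    model = DDP(nn.Linear(512, 512).cuda(rank), device_ids=[rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()

    data = TensorDataset(torch.randn(4096, 512), torch.randn(4096, 512))
    sampler = DistributedSampler(data)               # shards batches across ranks
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(model(x.cuda(rank)), y.cuda(rank))
        loss.backward()                              # gradients all-reduced here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Linear scaling here just means each of the 16 ranks processes 1/16 of the batches while the all-reduce keeps the weights in sync, so wall-clock time per epoch drops roughly 16x as long as the interconnect keeps up.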
Really grateful to all vendors for providing access to hardware and developer support.
Looking forward to continuing to fill our data center with an interesting mix of architectures.
https://twitter.com/RajaXg/status/1728465097243406482
On open LLMs, we still find notable quality improvement with large parameter counts and 16-bit quantization. While there is exciting performance with 4-bit quantization and smaller models, we haven't yet found any of them viable relative to the best closed model.
We have also seen that data scientists developing new models start with FP32! This is one of the reasons they vastly prefer GPUs in their dev setups, and very many of them are developing on a PC. Until recently, Nvidia was the only GPU vendor that supported a full AI compute software stack on a PC. Others are catching up now.
https://x.com/RajaXg/status/1728569646343987340?s=20
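For readers who want to see what the 16-bit vs 4-bit trade-off looks like in practice, here is a hedged sketch using Hugging Face transformers with accelerate and bitsandbytes (all assumed installed; the checkpoint name is only an example, not one Raja mentions). The arithmetic behind the memory pressure is simple: a 70B-parameter model needs roughly 140 GB of weights at 16-bit (2 bytes per parameter) but only about 35 GB at 4-bit, which is why the smaller encodings are so attractive despite the quality gap he describes.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

ckpt = "meta-llama/Llama-2-70b-hf"   # example checkpoint, an assumption

# 16-bit weights: highest quality of the two options, ~2 bytes per parameter.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.float16, device_map="auto"
)

# 4-bit quantized weights: ~0.5 bytes per parameter, fits much smaller GPUs,
# at some cost in output quality.
model_4bit = AutoModelForCausalLM.from_pretrained(
    ckpt,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)
```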
In addition, on the topic of AI hardware approaches, three industry experts, Raja among them, had the following exchange:
Bryan Beal, Director of Compute Solutions Architecture at AWS:
The future of AI compute isn’t GPUs. It’s purpose-built silicon.
There will be a lot of good competitors.
Let's not forget Amazon, Google, and Microsoft, all of whom are designing their own silicon tailor-made for AI and LLMs.
Raja:
We have heard this statement since 2016..but GPUs still rule..why? I'm still learning..but my observations so far:
- the "purpose" of purpose-built silicon is not stable. AI is not as static as some people imagine and trivialize.."it's just a bunch of matrix multiplies"
- the system architecture (things like page tables, memory management, interrupt handling, debugging etc.) of GPUs evolved over two decades and is a necessary evil to support production software stacks...much of the purpose-built silicon is deficient here and throws the burden onto "software" people. There isn't much new young system software talent coming into the workforce these days..so everyone competes for the same small pool of aging talent.
- but I'm still optimistic that a new architecture with a new purpose will evolve from the lessons learned so far.
Naveen Rao, founder of MosaicML and Nervana, now VP of Generative AI at Databricks:
Well, the GPU is the thing that's not stable. A "GPU" like the H100 is 75% AI accelerator; it isn't really built and optimized for anything else. So I'd argue that the GPU simply became AI silicon, not that AI silicon wasn't well defined.
Raja:
It doesn't matter what % is matrix multiplication if the developer can't develop on it..it's the small percentage of gates that handle the whole system software stack that matters disproportionately. GPU tools for compute (mainly Nvidia) have been pretty stable for 17 years..not that I'm counting.