
Large language models like Llama 2 and ChatGPT are where much of the action is in AI. But how well do today’s data center–class computers execute them? Pretty well, according to the latest set of benchmark results for machine learning, with the best able to summarize more than 100 articles in a second. MLPerf’s twice-a-year data delivery was released on 11 September and included, for the first time, a test of a large language model (LLM), GPT-J. Fifteen computer companies submitted performance results in this first LLM trial, adding to the more than 13,000 other results submitted by a total of 26 companies.

In one of the highlights of the data-center category, Nvidia revealed the first benchmark results for its Grace Hopper, an H100 GPU linked to the company’s new Grace CPU in the same package as if they were a single “superchip.” This was version 3.1 of the inferencing contest, and as in previous iterations, Nvidia dominated both in the number of machines using its chips and in performance. However, Intel’s Habana Gaudi2 continued to nip at the Nvidia H100’s heels, and Qualcomm’s Cloud AI 100 chips made a strong showing in benchmarks focused on power consumption.

Sometimes called “the Olympics of machine learning,” MLPerf consists of seven benchmark tests: image recognition, medical-imaging segmentation, object detection, speech recognition, natural-language processing, a new recommender system, and now an LLM. This set of benchmarks tested how well an already-trained neural network executed on different computer systems, a process called inferencing.

The LLM, called GPT-J and released in 2021, is on the small side for such AIs. It’s made up of some 6 billion parameters compared to GPT-3’s 175 billion. But going small was on purpose, according to MLCommons executive director David Kanter, because the organization wanted the benchmark to be achievable by a big swath of the computing industry. It’s also in line with a trend toward more compact but still capable neural networks.
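
To give a sense of the workload being measured, here is a minimal, illustrative sketch of running GPT-J on an article-summarization query with the Hugging Face Transformers library. It is not the MLPerf test harness; the checkpoint name (EleutherAI/gpt-j-6b), the prompt format, and the generation settings are assumptions made purely for demonstration.

    # Illustrative sketch only: not the MLPerf harness.
    # Assumes a GPU with enough memory for GPT-J in half precision,
    # plus the torch and transformers packages installed.
    import time

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "EleutherAI/gpt-j-6b"  # ~6 billion parameters, released in 2021

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16
    ).to("cuda")
    model.eval()

    article = "..."  # placeholder: the article text to be summarized

    # A simple instruction-style prompt; the real benchmark defines its own format.
    prompt = f"Summarize the following article:\n\n{article}\n\nSummary:"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    start = time.perf_counter()
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=128,                   # cap the summary length
            do_sample=False,                      # greedy decoding for repeatable output
            pad_token_id=tokenizer.eos_token_id,  # GPT-J has no dedicated pad token
        )
    elapsed = time.perf_counter() - start

    # Strip the prompt tokens and keep only the newly generated summary.
    summary = tokenizer.decode(
        output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    print(f"Summarized one article in {elapsed:.2f} s:\n{summary}")

A real MLPerf submission additionally fixes the dataset, accuracy targets, and serving scenarios, and streams thousands of such queries through the system, which is how throughput figures like the 100-plus articles per second cited above are produced.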

This set of benchmarks saw the arrival of the Grace Hopper superchip, an Arm-based 72-core CPU fused to an H100 through Nvidia’s proprietary C2C link. Most other H100 systems rely on Intel Xeon or AMD Epyc CPUs housed in a separate package. The nearest comparable system to the Grace Hopper was an Nvidia DGX H100 computer that combined two Intel Xeon CPUs with an H100 GPU. The Grace Hopper machine beat that in every category by 2 to 14 percent, depending on the benchmark. The biggest difference was achieved in the recommender system test and the smallest difference in the LLM test. Dave Salvatore, director of AI inference, benchmarking, and cloud at Nvidia, attributed much of the Grace Hopper advantage to memory access.
