Last week, Google published performance results for its TPU, or tensor processing unit. Google’s results revealed that the TPU is much faster than a conventional GPU for processing inference workloads, at a fraction of the power consumption. While machine learning training still takes place on GPUs, Google’s TPU results are a significant leap forward for inference processing, and Nvidia, as one might expect, has its own take on those numbers.
According to a new blog post published by Nvidia, the comparison would’ve been quite different if Google had used Nvidia’s Pascal-class GPUs instead of relying on the older, Kepler-based, dual-GPU K80. Here’s Nvidia:
Its team released technical information about the benefits of TPUs this past week. It asserts, among other things, that the TPU has 13x the inferencing performance of the K80. However, it doesn’t compare the TPU to the current generation Pascal-based P40.
Nvidia’s claim that the TPU has 13x the performance of K80 is provisionally true, but there’s a snag. That 13x figure is the geometric mean of all the various workloads combined, as shown below:
As Google notes, it’s a good analysis method when you don’t know how much each application contributes to the program mix. In this case, however, we do know that — and therefore, the more appropriate column to use is “WM,” which stands for “weighted mean.” Adjusted for application contributions, the gap between the TPU and the K80 increases to 15.3x. And, of course, that gap varies substantially from workload to workload, from no gap on LSTM1 to a 60x gap on MLP1.
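To see why the choice of mean matters, here is a minimal Python sketch. The per-workload speedups and the mix weights are made-up placeholders, not Google’s published figures; only the 1x and 60x endpoints echo the range mentioned above. The function names are illustrative as well.

```python
import math

def geometric_mean(speedups):
    """Geometric mean: treats every workload as equally important."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

def weighted_mean(speedups, weights):
    """Weighted arithmetic mean: each workload counts by its share of the mix."""
    total = sum(weights)
    return sum(s * w for s, w in zip(speedups, weights)) / total

# Hypothetical speedups for four workloads (placeholders, not measured data).
speedups = [1.0, 4.0, 15.0, 60.0]
# Hypothetical share of total demand for each workload (placeholders).
weights = [0.10, 0.20, 0.30, 0.40]

gm = geometric_mean(speedups)
wm = weighted_mean(speedups, weights)
print(f"geometric mean: {gm:.1f}x, weighted mean: {wm:.1f}x")
```

With these placeholder inputs the weighted mean lands well above the geometric mean, because the most heavily weighted workload is also the one with the largest speedup. Shift the weights toward the slow workloads and the ordering can flip, which is why knowing the real application mix matters.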
For reference, here’s the slideshow we used in last week’s story, with competitive performance figures for Haswell, K80, and Google’s TPU:
Nvidia’s argument is that Pascal has vastly higher memory bandwidth and far more resources to throw at inference performance than K80. The net result of these improvements, according to Nvidia, is that the P40 offers 26x more inference performance than one die of a K80.
It’s not clear which inference tests Nvidia is referring to with its claim of 26x improvement, and the varying results in the slideshow above demonstrate that the relative performance gap between Nvidia and Google is highly workload-dependent. It’s also not clear if Nvidia’s claim takes Google’s tight latency caps into account. At the small batch sizes Google requires for its 8ms latency threshold, K80 utilization is just 37 percent of maximum theoretical performance. The vagueness of the claims makes it difficult to evaluate them for accuracy.
Google’s paper (PDF) also anticipates this kind of claim. The researchers disclosed that they’ve modeled the expected performance improvement of a TPU with GDDR5 instead of DDR3, with substantially more memory bandwidth. Scaling memory bandwidth up by 4x would improve overall performance by 3x, at the cost of ~10% more die space. There are ways, in other words, to boost the TPU side of the equation as well.
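As a rough back-of-the-envelope check on that trade-off, the sketch below divides the quoted performance gain by the quoted area increase. It assumes both figures apply to the same chip and treats "~10%" as exactly 1.10x; the function name is illustrative, not taken from Google’s paper.

```python
def perf_per_area_gain(perf_scale, area_scale):
    """Relative change in performance per unit of die area."""
    if area_scale <= 0:
        raise ValueError("area_scale must be positive")
    return perf_scale / area_scale

# Figures quoted above: about 3x performance for roughly 10% more die space.
# Treating ~10% as exactly 1.10x is an assumption made for this sketch.
gain = perf_per_area_gain(perf_scale=3.0, area_scale=1.10)
print(f"performance per unit area: {gain:.2f}x")
```

Under those assumptions, performance per unit of die area would still rise by roughly 2.7x, so the extra silicon would be a comparatively cheap way to buy throughput.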
There’s no arguing the P40 is much faster than the K80; a 3-4x boost in inference performance has been reported between GPU generations. Even so, Google’s data shows a huge advantage for TPU performance-per-watt compared with GPUs, particularly once host server power is subtracted from the equation. This is the classic problem with trying to use a GPU against a custom-built ASIC: at the end of the day, a GPU contains a great deal of power-chewing hardware that a chip like Google’s TPU simply doesn’t need.
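The effect of subtracting host power can be illustrated with a short sketch. Every wattage and throughput number here is a hypothetical placeholder chosen for illustration, not a value from Google’s paper or Nvidia’s post, and the function names are illustrative too.

```python
def perf_per_watt_total(throughput, host_watts, accel_watts):
    """Performance per watt counting the host server plus the accelerator."""
    return throughput / (host_watts + accel_watts)

def perf_per_watt_incremental(throughput, accel_watts):
    """Performance per watt counting only the accelerator's own power."""
    return throughput / accel_watts

# Hypothetical systems sharing the same host server power (placeholder values).
HOST_WATTS = 300.0
gpu = {"throughput": 100.0, "accel_watts": 200.0}
asic = {"throughput": 400.0, "accel_watts": 50.0}

total_ratio = (
    perf_per_watt_total(asic["throughput"], HOST_WATTS, asic["accel_watts"])
    / perf_per_watt_total(gpu["throughput"], HOST_WATTS, gpu["accel_watts"])
)
incremental_ratio = (
    perf_per_watt_incremental(asic["throughput"], asic["accel_watts"])
    / perf_per_watt_incremental(gpu["throughput"], gpu["accel_watts"])
)
print(f"total: {total_ratio:.1f}x, incremental: {incremental_ratio:.1f}x")
```

Because the host server’s power draw is the same on both sides, it dilutes the comparison when included. Removing it leaves only the accelerator’s own draw in the denominator, so a low-power ASIC’s advantage grows.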
A matter of resources
It would have been interesting to see how Google’s TPU matched up against Nvidia’s newest and most powerful Pascal architecture, but I strongly suspect that it wouldn’t tell us much about which kind of solutions vendors are likely to use. For companies like Google, Microsoft, and Facebook, custom-built ASICs offer the prospect of vastly improved efficiency. The cost of researching and building the ASIC can be tolerated because each company has the pockets to fund it and a guaranteed market for the final product. Google’s TPU is custom-designed for very specific workloads and excels at them. A GPU like Nvidia’s P40 is designed to perform well in a wider range of workloads with varying characteristics.
Most companies, including plenty of Fortune 500 companies that might like to deploy deep learning or AI software, lack the expertise to handle in-house development. Companies that have this ability may well build custom circuits to handle future development, but the majority of firms will probably stick to using GPUs, at least for the foreseeable future.
Source: https://www.extremetech.com/computing/247403-nvidia-claims-pascal-gpus-challenge-googles-tensorflow-tpu-updated-benchmarks