News

Using a kernel-level profiler, I found that TensorFlow uses the DepthwiseConv2dGPUKernelNHWC kernel, which takes approximately 6.8 ms per iteration in the following test case, while PyTorch uses ...