Generative AI features under the Apple Intelligence banner have steered clear of NVIDIA GPUs for handling cloud-based requests, with the California-based giant sticking with the custom silicon in its servers, which will eventually be replaced by the unreleased M4 Ultra, to speed up its Large Language Models. However, a recent blog post from the iPhone maker reveals that Apple’s engineers are not shying away from partnering with NVIDIA when both companies share a common goal: faster text generation with LLMs.
Apple has published and open-sourced a new ‘Recurrent Drafter’ technique that it says ‘achieves state of the art performance’
Known as ‘ReDrafter’ for short, the technique combines two approaches, according to the blog post: beam search and tree attention. Both are designed to improve text generation performance. ReDrafter is a speculative decoding method: a small recurrent draft model cheaply proposes several candidate tokens ahead of time, and the main LLM then verifies them in a single pass, allowing multiple tokens to be accepted per decoding step. After validating the approach in its own research, Apple collaborated with NVIDIA to integrate ReDrafter into TensorRT-LLM, a framework that helps Large Language Models run faster on NVIDIA GPUs. Another benefit is that the technology can reduce latency while consuming less power.
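To make the draft-and-verify idea concrete, here is a minimal, self-contained Python sketch of plain speculative decoding with greedy verification. It illustrates the general technique only, not Apple’s ReDrafter: the real method uses a trained recurrent draft head plus beam search and dynamic tree attention, which are omitted here, and the two toy stand-in “models” below are hypothetical.

```python
# Toy sketch of draft-and-verify speculative decoding (greedy verification).
# NOT Apple's ReDrafter: the real method uses a recurrent draft head plus
# beam search and tree attention; both "models" here are stand-in functions.

def target_next_token(context: list[int]) -> int:
    """Stand-in for the large LLM's greedy next-token choice."""
    return (sum(context) * 31 + 7) % 100  # deterministic toy rule

def draft_next_token(context: list[int]) -> int:
    """Stand-in for the small draft model; agrees with the target often."""
    tok = target_next_token(context)
    return tok if tok % 5 != 0 else (tok + 1) % 100  # occasionally wrong

def speculative_decode(prompt: list[int], num_tokens: int,
                       draft_len: int = 4) -> list[int]:
    out = list(prompt)
    while len(out) - len(prompt) < num_tokens:
        # 1) Draft: cheaply propose draft_len candidate tokens.
        draft, ctx = [], list(out)
        for _ in range(draft_len):
            tok = draft_next_token(ctx)
            draft.append(tok)
            ctx.append(tok)

        # 2) Verify: the target model checks all candidates
        #    (a loop here; on a GPU this is one batched forward pass).
        accepted = 0
        for i, tok in enumerate(draft):
            if target_next_token(out + draft[:i]) == tok:
                accepted += 1
            else:
                break

        # 3) Keep the verified prefix, then take one guaranteed token
        #    from the target model, so every iteration makes progress.
        out.extend(draft[:accepted])
        out.append(target_next_token(out))
    return out[: len(prompt) + num_tokens]

print(speculative_decode([1, 2, 3], num_tokens=10))
```

The important property is in step 3: whenever a drafted token is rejected, the target model’s own choice is used instead, so the output matches what plain greedy decoding would have produced, only faster whenever the drafter guesses well.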
“This research work demonstrated strong results, but its greater impact comes from being applied in production to accelerate LLM inference. To make this advancement production-ready for NVIDIA GPUs, we collaborated with NVIDIA to integrate ReDrafter into the NVIDIA TensorRT-LLM inference acceleration framework.
Although TensorRT-LLM supports numerous open source LLMs and the Medusa speculative decoding method, ReDrafter’s beam search and tree attention algorithms rely on operators that had never been used in previous applications. To enable the integration of ReDrafter, NVIDIA added new operators or exposed existing ones, which considerably improved TensorRT-LLM’s capability to accommodate sophisticated models and decoding methods. ML developers using NVIDIA GPUs can now easily benefit from ReDrafter’s accelerated token generation for their production LLM applications with TensorRT-LLM.
In benchmarking a tens-of-billions parameter production model on NVIDIA GPUs, using the NVIDIA TensorRT-LLM inference acceleration framework with ReDrafter, we have seen 2.7x speed-up in generated tokens per second for greedy decoding. These benchmark results indicate this tech could significantly reduce latency users may experience, while also using fewer GPUs and consuming less power.”
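For developers curious about TensorRT-LLM itself, the library ships a high-level Python `LLM` API; the snippet below is a minimal sketch of ordinary greedy generation with it, based on our reading of the project’s documentation. Enabling ReDrafter specifically requires a separately trained draft head and the speculative-decoding build steps documented in the TensorRT-LLM repository, which we do not reproduce here, and the model name below is just an example.

```python
# Minimal sketch of greedy generation with TensorRT-LLM's high-level
# Python API (per the project's LLM API docs). ReDrafter-specific setup
# (draft-head checkpoint, speculative-decoding engine build) is covered
# in the TensorRT-LLM repository's examples and is not shown here.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model id

# temperature=0 -> greedy decoding, the setting Apple benchmarked.
params = SamplingParams(max_tokens=64, temperature=0.0)

for output in llm.generate(["The fastest way to decode tokens is"], params):
    print(output.outputs[0].text)
```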
While this collaboration shows there is a sliver of a chance for Apple and NVIDIA to strike a broader agreement, we strongly believe that such a partnership will not materialize given the two technology giants’ fraught history. We may well see short-term team-ups like this one again in the future, but a meaningful business relationship appears to be off the table.
News Source: Apple