New Compute Units, Upgraded Raytracing Cores, AI Enhancements, & Path Tracing

by admin

AMD has finally unveiled the full architectural details of its next-gen RDNA 4 GPU architecture, which is built from the ground up for the Radeon RX 9000 series.

AMD RDNA 4 Is A GPU Architecture Designed From The Ground Up For Gamers: New Compute Units, Ray Tracing & AI Cores, Ready For Path Tracing

AMD’s RDNA 4 architecture has been highly anticipated since the launch of the previous RDNA 3 and its upgraded RDNA 3.5 variation. While the RDNA 4 architecture isn’t going to see any ultra-enthusiast SKUs, it does come with brand-new changes that should elevate gaming performance since it is designed primarily for gaming audiences.

As such, AMD has brought the following new changes to RDNA 4:

  • Heavily Optimized for high-end gaming workloads
  • Improved Rasterization & Compute Efficiency
  • A step change in Raytracing performance
  • Comprehensive high-performance ML support
  • Enhanced Bandwidth efficiency for all workloads
  • Multimedia improvements for gamers and creators

Compared to RDNA 2, the RDNA 4 GPUs see almost a 2x uplift in rasterization, close to 2.5x uplift in raytracing and a 3.5x uplift in ML (FP16 dense matrix) workloads per compute unit. So next, we dive into the building blocks of the RDNA 4 architectural block diagram to see how the entire chip comes together.

RDNA 4’s New Core IPs

The core building block of the RDNA 4 GPU architecture is the Compute Engine.

The new Compute Units come with Dual SIMD32 Vector Units and Enhanced Matrix Operations, which include:

  • 2x-16b & 4x-8b/4b dense matrix rates
  • 4:2 Structured Sparsity for +2x rate
  • New 8b Float Data Types
  • Matrix load w/transpose
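To make the structured sparsity bullet concrete, here is a minimal sketch assuming the "4:2" pattern means that only two values in every group of four are non-zero, which is how this kind of sparsity is commonly described. The function name `prune_4_2` and the sample weights are hypothetical, and the code says nothing about how RDNA 4 actually stores or schedules sparse matrices.

```python
def prune_4_2(values):
    """Keep the 2 largest-magnitude entries in every group of 4, zero the rest.

    Illustrative only: the real hardware format is not described in the article.
    """
    if len(values) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(values), 4):
        group = values[start:start + 4]
        # Indices of the two largest magnitudes within this group of four.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]  # hypothetical weights
sparse = prune_4_2(weights)
nonzero_per_group = [
    sum(1 for v in sparse[i:i + 4] if v != 0.0) for i in range(0, len(sparse), 4)
]
```

Because half of each group is known to be zero, the matrix unit can skip those multiplications, which is where the quoted "+2x rate" would come from.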

RDNA 4 also brings shading improvements, with RDNA 4 shaders allocating registers dynamically. They can request registers from the pool when needed and release them back to the pool once that work is complete, while the software manages the conditions when there’s a wait time for an allocation. This results in better handling of memory latency, and the overall efficiency of the shader core can increase significantly.
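The request/release behavior described above can be pictured with a toy model. This is only a sketch: the `RegisterPool` class, the wave names and the pool size of 128 are all hypothetical and are not AMD's implementation or real register counts.

```python
class RegisterPool:
    """Toy model of a shared register pool (all sizes are made up)."""

    def __init__(self, total):
        self.total = total
        self.free = total
        self.held = {}  # wave id -> registers currently held

    def request(self, wave, count):
        """Grant `count` more registers to `wave`; return False if it must wait."""
        if count > self.free:
            return False
        self.free -= count
        self.held[wave] = self.held.get(wave, 0) + count
        return True

    def release(self, wave, count):
        """Return up to `count` registers held by `wave` to the pool."""
        count = min(count, self.held.get(wave, 0))
        self.held[wave] = self.held.get(wave, 0) - count
        self.free += count


pool = RegisterPool(total=128)
first = pool.request("wave0", 96)   # granted, 32 registers left
second = pool.request("wave1", 64)  # not enough free registers, has to wait
pool.release("wave0", 64)           # wave0 finishes its register-heavy phase
retry = pool.request("wave1", 64)   # now succeeds
```

The point of the model is that a wave only holds its peak register count while it needs it, so other waves can start sooner than they would with a fixed up-front allocation.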

On the scalar unit side, you get new Float32 operations while scheduling updates include Split & Named barriers, Accelerated spill/fill operations, and improved instruction prefetch.

Then we have the 3rd Generation Ray tracing units offering doubled ray intersection rates, improved BVH compression, accelerated ray traversal and shading, and Oriented Bounding Boxes. These new ray tracing cores offer one of the biggest performance increases on the chip. Each Ray accelerator has also been improved with:

  • 2x box & triangle intersection units
  • Hardware instance transforms
  • Improved RT stack management
  • BVH8 and improved node compression
  • Oriented Bounding Boxes

These new ray tracing upgrades also result in much lower memory requirements for BVH. On average, RDNA 4 reduces the memory requirements to less than 60% versus RDNA 3 thanks to the 8-wide design.
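One way to see why a wider tree needs fewer nodes is to count the minimum number of interior nodes for a given number of leaves. The sketch below assumes one primitive per leaf and a leaf count of one million, both hypothetical, and it counts nodes only: it does not model node size or compression, so it does not reproduce the 60% figure.

```python
def min_interior_nodes(leaves, width):
    """Fewest interior nodes a tree with `width` children per node needs.

    Every interior node adds at most (width - 1) leaves to the tree, so we
    need at least ceil((leaves - 1) / (width - 1)) of them.
    """
    if leaves <= 1:
        return 0
    return (leaves - 1 + width - 2) // (width - 1)  # integer ceiling division


leaf_count = 1_000_000  # hypothetical scene size, one primitive per leaf
bvh4_nodes = min_interior_nodes(leaf_count, 4)
bvh8_nodes = min_interior_nodes(leaf_count, 8)
node_ratio = bvh8_nodes / bvh4_nodes
```

Under these assumptions the 8-wide tree needs fewer than half as many interior nodes as the 4-wide one, although each node is larger, which is why node compression matters as well.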

But that’s not it. AMD has also implemented a new solution to reduce traversal costs: Oriented Bounding Boxes. A rotation is encoded with each box so that it bounds the contained geometry more tightly, since aligning the box to the geometry removes much of the empty space around it. The ray direction is then transformed on entry to the box to match the encoded rotation. This results in fewer traversal steps, a reduction in peak cost by eliminating traversal hotspots, and a 10% improvement in traversal performance.
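The benefit of aligning a box to its geometry is easy to see in two dimensions. The sketch below compares the area of an axis-aligned box with the area of an oriented box around a thin, rotated rectangle. The dimensions (10 by 1, rotated 45 degrees) are hypothetical and the example is 2D only; it is not how the hardware encodes rotations.

```python
import math


def aabb_area(length, width, angle_deg):
    """Area of the axis-aligned box around a rectangle rotated by `angle_deg`."""
    angle = math.radians(angle_deg)
    c, s = abs(math.cos(angle)), abs(math.sin(angle))
    extent_x = length * c + width * s
    extent_y = length * s + width * c
    return extent_x * extent_y


# A thin sliver of geometry, such as a diagonal beam or cable.
length, width, angle = 10.0, 1.0, 45.0
obb = length * width                      # an oriented box fits it exactly
aabb = aabb_area(length, width, angle)    # an axis-aligned box has to cover the diagonal
wasted_factor = aabb / obb
```

With these numbers the axis-aligned box is roughly six times larger than the oriented one, so many more rays would hit the box without hitting the geometry inside it.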

The result of these changes is that RDNA 4 CUs offer 2x ray traversal performance compared to RDNA 3 at equal clock rates and bandwidth.

There’s also an improved Command Processor which features enhanced packet accelerators. The cache hierarchy has been upgraded as well and is now more balanced, with up to 64 MB of 3rd Gen Infinity Cache, 8 MB of L2 cache and 2 MB of aggregate CU cache. On the memory side, the RDNA 4 GPU architecture retains GDDR6 support but has been upgraded to faster speeds of up to 20 Gbps, with up to 16 GB capacity across a 256-bit bus interface. RDNA 4 also employs enhanced memory compression techniques to lessen the stress on the available bandwidth.
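For reference, peak memory bandwidth follows directly from the pin speed and bus width quoted above. This is a simple arithmetic sketch; the function name is ours, and the result is a theoretical peak rather than a measured figure.

```python
def peak_bandwidth_gb_s(pin_speed_gbps, bus_width_bits):
    """Theoretical peak bandwidth in GB/s: per-pin rate times pins, bits to bytes."""
    return pin_speed_gbps * bus_width_bits / 8


# Figures from the article: up to 20 Gbps GDDR6 on a 256-bit bus.
bandwidth = peak_bandwidth_gb_s(20, 256)
```

That works out to 640 GB/s of raw bandwidth, before the Infinity Cache and memory compression are taken into account.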

For AI, AMD is leveraging its 3rd Generation Matrix Acceleration engine which comes with improved Tensor Dense Rates, New 8b float data types, Structured Sparsity support and ML-based upscaling or Super Resolution.

Compared to RDNA 3, the RDNA 4 CUs offer a 2x boost in image generation performance (SDXL 1.5) in a normalized scenario with FP16.

The Media Engine moves to a dual-width design with updated Encode/Decode engines, up to 25% quality improvement across H.264 (AVC) and H.265 (HEVC), and double the AV1 throughput, and it is optimized for low-latency streaming. Finally, there’s the updated Radiance Display Engine, which now supports DisplayPort 2.1a and HDMI 2.1b outputs along with an updated scaling and sharpening engine.

The RDNA 4 Block Diagram (Top Navi 48 Die)

Next, we move to the RDNA 4 block diagram, which represents the full Navi 48 GPU SKU. RDNA 4 GPUs are fabricated on the TSMC 4nm process node, and the full die features up to 53.9 billion transistors and measures 356.5 mm². The chip is also fully compliant with PCIe Gen5.

Now it’s time to break apart the RDNA 4 chip. The Navi 48 GPU (Radeon RX 9070 XT) is composed of four shader engines and each of those houses several “Dual Compute Units”, not WGPs. Each Dual Compute Unit features two Compute units and there are a total of 8 DCUs or 16 CUs per Shader Engine. That’s a total of 32 DCUs or 64 CUs on the chip itself for a total of 4096 stream processors or shader units.

Each DCU has two Ray Accelerator engines for a total of 16 RAs per Shader Engine or 64 RAs in total, while each DCU also packs 4 Matrix Acceleration Engines for a total of 32 MAs per Shader Engine and 128 MAs in total. Each Shader Engine also packs four RB+ blocks, a rasterizer engine & a Prim Unit block. There are four sections of 3rd Gen Infinity Caches and four 4×16-bit memory controllers on the outskirts of the chip.
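The unit counts above can be cross-checked with a few multiplications. The sketch below uses only figures from this article; the 64 stream processors per CU come from the two SIMD32 vector units mentioned earlier, and the variable names are ours.

```python
# Figures quoted in the article for the full Navi 48 die.
SHADER_ENGINES = 4
DCUS_PER_ENGINE = 8
CUS_PER_DCU = 2
SIMDS_PER_CU = 2        # "Dual SIMD32 Vector Units"
LANES_PER_SIMD = 32
RAS_PER_DCU = 2         # Ray Accelerators
MAS_PER_DCU = 4         # Matrix Acceleration Engines
MEMORY_CONTROLLERS = 4  # each described as 4x16-bit

total_dcus = SHADER_ENGINES * DCUS_PER_ENGINE
total_cus = total_dcus * CUS_PER_DCU
stream_processors = total_cus * SIMDS_PER_CU * LANES_PER_SIMD
ray_accelerators = total_dcus * RAS_PER_DCU
matrix_engines = total_dcus * MAS_PER_DCU
bus_width_bits = MEMORY_CONTROLLERS * 4 * 16
```

The totals come out to 32 DCUs, 64 CUs, 4096 stream processors, 64 Ray Accelerators, 128 Matrix Acceleration Engines and a 256-bit memory bus, which matches the numbers quoted for the chip.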

The L2 caches sit right in the middle of the GPU, which also includes two Geometry processors, two ACE units, and one each of the HWS and DMA blocks. The chip is connected using Infinity Fabric.

The Path Tracing Future Ahead For AMD

Raytracing is often seen as an outdated term in the PC gaming space. Sure, it’s one form of tracing rays to make scenes look more realistic and has only started to gain traction in the console space, but the competition is often seen using a different type of ray tracer, called Path Tracing. While ray tracing as typically used in games casts a limited number of rays for specific effects such as reflections, shadows, and refractions, path tracing follows many randomly sampled light paths through multiple bounces to approximate the full lighting of a scene, which makes it a far more expensive technique.
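Path tracers are generally built on Monte Carlo integration: lighting at a point is estimated by averaging many randomly sampled directions. The sketch below is a deliberately tiny stand-in, not a renderer. It estimates the light arriving at a surface from a uniformly bright sky, whose exact answer is pi, and the sample counts and seed are arbitrary choices of ours.

```python
import math
import random


def estimate_irradiance(samples, rng):
    """Monte Carlo estimate of the integral of cos(theta) over the hemisphere.

    Directions are sampled uniformly over the hemisphere, for which
    cos(theta) is uniformly distributed in [0, 1) and the pdf is 1 / (2 * pi).
    """
    total = 0.0
    for _ in range(samples):
        cos_theta = rng.random()
        total += cos_theta * 2.0 * math.pi  # sample value divided by the pdf
    return total / samples


rng = random.Random(0)
few = estimate_irradiance(16, rng)        # cheap, but noisy
many = estimate_irradiance(50_000, rng)   # close to pi, but far more work
error_many = abs(many - math.pi)
```

The estimate only settles down with a large number of samples, which is why real-time path tracing leans so heavily on upscaling and denoising to clean up images rendered with very few samples per pixel.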

NVIDIA’s Path Tracing expertise can be seen in games like Cyberpunk 2077 or Alan Wake II, which are regarded as some of the most graphics-demanding titles and also look absolutely stunning. Path Tracing was made practical through techniques such as upscaling and frame-gen, but the Green team also invested in a brand-new technology called ray reconstruction, which helps achieve path tracing more efficiently by removing the in-engine denoisers and using AI/ML to help re-evaluate and reconstruct the image.

It looks like AMD is also following that approach with its own Neural Supersampling and Denoising technique for RDNA 4’s Path Tracing capabilities.

Upgraded Media & Display Capabilities

We can’t end this deep-dive without talking about the Media and Display Engines. So, to start it off, we first have the new Media Engines which offer enhanced game streaming and recording through:

  • 25% gain in H.264 low latency encode quality
  • 11% HEVC encode quality improvement
  • AV1 encoding efficiency improved with B Frames
  • Encoding performance boost of up to 30% at 720p
  • Optimized for FFMPEG, OBS & Handbrake
  • VCN low power video playback (50% performance uplift for AV1 & VP9)

The Display Experiences have also improved. Enhanced FreeSync Power Optimization modes deliver lower idle power in most 2-display configs, while hardware flip queue support offloads video frame scheduling to the GPU and saves CPU power during video playback. Radeon Image Sharpening 2 delivers high-quality images and scenes and works across all APIs through a single toggle.

That’s all for the deep-dive. You can also check out our coverage of the AMD Radeon RX 9000 graphics cards and FSR 4 technology.
