[v10] gpu: nova-core: firmware: Hopper/Blackwell support

[PATCH v10 00/28] gpu: nova-core: firmware: Hopper/Blackwell support

Posted by John Hubbard 1 day, 10 hours ago

This is based on today's snapshot of Alexandre Courbot's
drm-rust-next-staging branch [1], and a branch for this v10 is here:

https://github.com/johnhubbard/linux/tree/nova-core-blackwell-v10

My prerequisite nova-core SizeConstants patch [2] is posted separately,
and also included in the above -v10 branch.

This has been re-tested on Turing, Ampere, and Blackwell. (Just
recently, I installed one of each into the same test machine, without
running out of PCI BAR space, woohoo!):

NovaCore 0000:c1:00.0: NVIDIA (Chipset: TU117, Architecture: Turing, Revision: a.1)
NovaCore 0000:c1:00.0: GPU name: NVIDIA T400 4GB

NovaCore 0000:c2:00.0: NVIDIA (Chipset: GB202, Architecture: BlackwellGB20x, Revision: a.1)
NovaCore 0000:c2:00.0: GPU name: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition

NovaCore 0000:01:00.0: NVIDIA (Chipset: GA104, Architecture: Ampere, Revision: a.1)
NovaCore 0000:01:00.0: GPU name: NVIDIA RTX A4000

Changes in v10:

* Reordered per review (and direct assistance--thanks again) from
Alexandre Courbot: the two refactoring patches (factor .fwsignature*
selection, use GPU Architecture to simplify HALs) now come first,
before GPU identification. The boot_via_fsp stub is introduced early
and completed as FSP features arrive. The SEC2 refactoring, PCI config
mirror, and reserved heap size patches are moved earlier in the
series.

* Made pmuReservedSize conditional on Blackwell dGPU architectures.
Open RM only sets this field for Blackwell (Turing/Ampere/Ada/Hopper
all leave it zero). Added calc_pmu_reserved_size() helper and
FbLayout.pmu_reserved_size field to route the value through the
layout instead of using the constant unconditionally. Replaced
`as u32` cast with usize_into_u32 for PMU_RESERVED_SIZE. (Alexandre)

* Split the GFW boot wait HAL change into two patches: one that moves
the existing behavior into a GpuHal trait, and a second that adds the
Hopper/Blackwell skip.

* Removed the Spec::chipset() accessor (no longer needed after
restructuring). Updated the Copy/Clone commit message accordingly.

* Rebased onto drm-rust-next-staging, which includes
const_align_up(), "move firmware image parsing code to firmware.rs",
"factor out an elf_str() function", and "make WPR heap sizing
fallible" from the v9 series. Series is now 28 patches (was 31).

* Depends on the "rust: sizes: SizeConstants trait" series[N], which
adds typed SZ_* constants (u64::SZ_1M, u32::SZ_4K, etc.). The
nova-core conversion patch ("use SizeConstants trait for u64 size
constants") is posted separately [2], and is already included in my
git branch. The Blackwell patches that introduce new SZ_* usage
(larger non-WPR heap, FSP Chain of Trust boot, larger WPR2 heap) use
the trait form from the start.

* Fixed the PCI config mirror commit message: corrected hex offsets to
match the code (older architectures use 0x088000, Hopper/Blackwell
use 0x092000).

* Dropped the never-used nvdm_type_raw() method from the MCTP/NVDM
introducing patch.

* Removed stale Co-developed-by tag from the FSP Chain of Trust boot
commit per Alex's request. Rewrote the commit message to remove
references to the no-longer-existent fmc_full field.

* Added missing #[expect(dead_code)] on GspFmcBootParams in the FSP
secure boot commit, removed when the struct becomes used in the
Chain of Trust boot commit.
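
A minimal sketch of the pmuReservedSize item above, for readers who want
to see the shape of the change. It assumes a reduced Architecture enum.
The PMU_RESERVED_SIZE value and the usize_into_u32_const() helper are
placeholders for illustration only, not the driver's real constant or
its usize_into_u32 helper:

```rust
/// GPU architectures, reduced to what this sketch needs.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Architecture {
    Turing,
    Ampere,
    Ada,
    Hopper,
    BlackwellGB10x,
    BlackwellGB20x,
}

/// Placeholder value for illustration; NOT the real driver constant.
const PMU_RESERVED_SIZE: usize = 8 * 1024 * 1024;

/// Stand-in for a checked usize -> u32 conversion. When used to
/// initialize a const, an out-of-range value fails the build instead of
/// being silently truncated, which is what a bare `as u32` would do.
const fn usize_into_u32_const(value: usize) -> u32 {
    assert!(value <= u32::MAX as usize);
    value as u32
}

const PMU_RESERVED_SIZE_U32: u32 = usize_into_u32_const(PMU_RESERVED_SIZE);

/// Only Blackwell reserves PMU space; every other architecture gets 0.
pub fn calc_pmu_reserved_size(arch: Architecture) -> u32 {
    match arch {
        Architecture::BlackwellGB10x | Architecture::BlackwellGB20x => PMU_RESERVED_SIZE_U32,
        Architecture::Turing
        | Architecture::Ampere
        | Architecture::Ada
        | Architecture::Hopper => 0,
    }
}

/// Reduced layout: the value is computed once and carried in the layout,
/// so later users read the field instead of the constant.
pub struct FbLayout {
    pub pmu_reserved_size: u32,
}

impl FbLayout {
    pub fn new(arch: Architecture) -> Self {
        Self {
            pmu_reserved_size: calc_pmu_reserved_size(arch),
        }
    }
}
```

With this shape, FbLayout::new(Architecture::Hopper).pmu_reserved_size
is 0, and only the two Blackwell variants see a non-zero value.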

Changes in v9:

* Rebased onto today's drm-rust-next.

* Split Architecture::Blackwell into BlackwellGB10x and BlackwellGB20x,
after Gary Guo and Sashiko pointed out that GB10x and GB20x are
distinct enough to warrant separate architecture variants. This
surfaced several bugs where all Blackwell chips were incorrectly
treated as a single group:
  * Fixed the FSP boot completion register address for GB10x. GB10x
    uses the same address as Hopper (0x000200bc), not the GB20x
    address (0x00ad00bc).
  * Made the FSP secure boot timeout architecture-dependent. GB20x
    now gets 5000ms while Hopper and GB10x keep 4000ms.
  * Removed chipset-level match arms that were working around the
    single-variant design in fb/hal.rs, firmware/gsp.rs, and regs.rs.

* Simplified find_gsp_sigs_section() to return &'static str instead of
Option<&'static str>, since the Architecture enum is now exhaustive
and every variant has a known signature section name.

* Moved dma_set_mask_and_coherent from probe() into Gpu::new(), with
the unsafe block narrowed to just that call. Gpu::new() now takes
pci::Device<device::Core> instead of device::Bound to support this.

* Dropped the local `chipset` variable in Gpu::new() and accessed
spec.chipset() directly, since Spec is now Copy.

* Changed Spec::chipset() to take self instead of &self, since Spec is
Copy.

* Removed the unnecessary Tu102/Gh100 consts in gpu/hal.rs and used the
unit structs directly.

* Kept a hold on the Firmware object in FspFirmware instead of copying
the FMC ELF into a KVec<u8>.

* Moved the dev_info formatting fix and the GFW_BOOT comment removal
out of the Copy/Clone patch and into the patches that actually touch
those lines.

* Added Reviewed-by tags from Gary Guo and Alice Ryhl.
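
To illustrate why the BlackwellGB10x/BlackwellGB20x split matters, here
is a rough sketch using the register addresses and timeouts quoted
above. The method names are hypothetical, and returning None for
pre-Hopper architectures is an assumption of this sketch, not
necessarily how the series models it:

```rust
use std::time::Duration;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Architecture {
    Turing,
    Ampere,
    Ada,
    Hopper,
    BlackwellGB10x,
    BlackwellGB20x,
}

impl Architecture {
    /// FSP boot completion register address (hypothetical method name).
    /// GB10x shares the Hopper address; only GB20x differs.
    pub fn fsp_boot_complete_reg(self) -> Option<u32> {
        match self {
            Architecture::Hopper | Architecture::BlackwellGB10x => Some(0x0002_00bc),
            Architecture::BlackwellGB20x => Some(0x00ad_00bc),
            // Assumption: these architectures do not take the FSP path.
            Architecture::Turing | Architecture::Ampere | Architecture::Ada => None,
        }
    }

    /// FSP secure boot timeout (hypothetical method name).
    pub fn fsp_secure_boot_timeout(self) -> Option<Duration> {
        match self {
            Architecture::Hopper | Architecture::BlackwellGB10x => {
                Some(Duration::from_millis(4000))
            }
            Architecture::BlackwellGB20x => Some(Duration::from_millis(5000)),
            Architecture::Turing | Architecture::Ampere | Architecture::Ada => None,
        }
    }
}
```

Because each match is exhaustive with no wildcard arm, adding a new
Architecture variant forces every such site to be revisited at compile
time. With a single Blackwell variant, the GB10x cases above could not
be expressed without chipset-level match arms.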

Changes in v8:

* Added Clone/Copy derives to Spec and Revision. Removed the
unnecessary pin_init_scope wrapping in Gpu::new() that the lack of
Copy had forced. Added a Spec::chipset() accessor.

* Removed implementation-detail sentence from the
Architecture::dma_mask() doccomment.

* Simplified the GPU HAL to two variants (Tu102, Gh100) instead of
four. Renamed "Fsp" to "Gh100" to follow the HAL naming convention.
Removed the spurious GA100 special case. Moved the GFW_BOOT wait into
the HAL method itself instead of returning a bool.

* Increased the GFW_BOOT wait timeout from 4 seconds to 30 seconds,
after Joel found that a different Blackwell SKU required extra time.

* Removed stray Cc lines from each patch.

* Fixed rustfmt issues in gsp/fw.rs and gsp/boot.rs reported by the
kernel test robot against v7 patches 27 and 31.
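
The two-variant GPU HAL described above could look roughly like the
following userspace sketch. The trait and method names, the closure used
in place of a register read, and the 1 ms poll interval are all
assumptions for illustration. Only the Tu102/Gh100 split, the 30 second
timeout, and the Hopper/Blackwell skip come from this series:

```rust
use std::thread;
use std::time::{Duration, Instant};

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Architecture {
    Turing,
    Ampere,
    Ada,
    Hopper,
    BlackwellGB10x,
    BlackwellGB20x,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Error {
    TimedOut,
}

const GFW_BOOT_TIMEOUT: Duration = Duration::from_secs(30);
/// Assumed poll interval; the real value is not stated in this letter.
const POLL_INTERVAL: Duration = Duration::from_millis(1);

pub trait GpuHal {
    /// Waits for GFW boot completion, if this GPU family requires it.
    /// `gfw_booted` stands in for reading the hardware status register.
    fn wait_gfw_boot_completion(
        &self,
        gfw_booted: &mut dyn FnMut() -> bool,
    ) -> Result<(), Error>;
}

pub struct Tu102;
pub struct Gh100;

impl GpuHal for Tu102 {
    fn wait_gfw_boot_completion(
        &self,
        gfw_booted: &mut dyn FnMut() -> bool,
    ) -> Result<(), Error> {
        let deadline = Instant::now() + GFW_BOOT_TIMEOUT;
        loop {
            if gfw_booted() {
                return Ok(());
            }
            if Instant::now() >= deadline {
                return Err(Error::TimedOut);
            }
            // Sleep between polls rather than spinning.
            thread::sleep(POLL_INTERVAL);
        }
    }
}

impl GpuHal for Gh100 {
    /// Hopper/Blackwell skip the GFW boot wait entirely.
    fn wait_gfw_boot_completion(
        &self,
        _gfw_booted: &mut dyn FnMut() -> bool,
    ) -> Result<(), Error> {
        Ok(())
    }
}

/// Selects the HAL by architecture, using the unit structs directly.
pub fn gpu_hal(arch: Architecture) -> &'static dyn GpuHal {
    match arch {
        Architecture::Turing | Architecture::Ampere | Architecture::Ada => &Tu102,
        Architecture::Hopper
        | Architecture::BlackwellGB10x
        | Architecture::BlackwellGB20x => &Gh100,
    }
}
```

Putting the wait inside the HAL method, instead of returning a bool that
the caller must act on, keeps the policy and the mechanism in one place.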

Changes in v7:

* Rebased onto Alexandre Courbot's rust register!() series in
drm-rust-next, including the related generic I/O accessor and
IoCapable changes.

* Rebased onto drm-rust-next (v7.0-rc4 based).

* Dropped the v6 patches that are already in drm-rust-next: the
aux-device fix, the pdev helper macro patch, and the one-item-per-line
use cleanup.

* Reworked the GPU init pieces per review. DMA mask setup now stays in
driver probe, with the mask width selected by GPU architecture, and
the GFW boot policy now lives in a dedicated GPU HAL.

* Reworked firmware image parsing per review around a single ElfFormat
trait with associated header types. Also added support for both ELF32
and ELF64 images, with automatic format detection.

* Reworked the MCTP/NVDM protocol code to use bitfield! and typed
accessors, removing the open-coded bit handling.

* Reworked the FSP messaging part of the series so that the message
structures are introduced in the first patches that use them, instead
of as a standalone dead-code-only patch. Also changed fmc_full to use
KVec<u8> from the start.

* Split the WPR heap overflow handling out into a separate prep patch.
That patch makes management_overhead() and wpr_heap_size() fallible,
uses checked arithmetic, and leaves the larger WPR2 heap patch with
only the Hopper and Blackwell sizing changes.

* Added a code comment documenting the Hopper and Blackwell PCI config
mirror base change.
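
As a sketch of what automatic ELF32/ELF64 detection can look like: the
ELF identification bytes carry the class at offset 4 (EI_CLASS), with 1
meaning 32-bit and 2 meaning 64-bit. The names below are hypothetical
and this is not the ElfFormat trait from the series, only the detection
step in isolation:

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ElfClass {
    Elf32,
    Elf64,
}

const ELF_MAGIC: [u8; 4] = [0x7f, b'E', b'L', b'F'];
/// Index of the class byte within e_ident.
const EI_CLASS: usize = 4;
const ELFCLASS32: u8 = 1;
const ELFCLASS64: u8 = 2;

impl ElfClass {
    /// Size in bytes of the ELF file header for this class.
    pub const fn ehdr_size(self) -> usize {
        match self {
            ElfClass::Elf32 => 52,
            ElfClass::Elf64 => 64,
        }
    }
}

/// Returns the ELF class of `image`, or None if the magic is missing,
/// the image is too short, or the class byte is not recognized.
pub fn detect_elf_class(image: &[u8]) -> Option<ElfClass> {
    if !image.starts_with(&ELF_MAGIC) {
        return None;
    }
    match *image.get(EI_CLASS)? {
        ELFCLASS32 => Some(ElfClass::Elf32),
        ELFCLASS64 => Some(ElfClass::Elf64),
        _ => None,
    }
}
```

A parser can then branch once on the detected class and use the matching
header layout for everything that follows.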

Changes in v6:

* Rebased onto drm-rust-next (v7.0-rc1 based).

* Dropped the first two patches from v5 (aux device fix and pdev
macros), which have since been merged independently.

* const_align_up(): reworked per review from Gary Guo, Miguel Ojeda,
and Danilo Krummrich: now returns Option<usize> instead of panicking,
takes an Alignment argument instead of a const generic, and no longer
needs the inline_const feature addition in scripts/Makefile.build.

* The rust/sizes and SZ_*_U64 patches from v5 are no longer included.
I plan to post those as a separate series that depends on this one.
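
For reference, a non-panicking const_align_up() along the lines
described above might look like this. The Alignment type here is a local
stand-in written for this sketch, not the kernel's Alignment type, and
the body is an assumption about one reasonable implementation:

```rust
/// Local stand-in for a power-of-two alignment type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Alignment(usize);

impl Alignment {
    /// Returns None unless `align` is a power of two.
    pub const fn new(align: usize) -> Option<Self> {
        if align.is_power_of_two() {
            Some(Self(align))
        } else {
            None
        }
    }

    pub const fn as_usize(self) -> usize {
        self.0
    }
}

/// Rounds `value` up to the next multiple of `align`.
/// Returns None on overflow instead of panicking.
pub const fn const_align_up(value: usize, align: Alignment) -> Option<usize> {
    let mask = align.as_usize() - 1;
    // `?` is not usable in a const fn, so match explicitly.
    match value.checked_add(mask) {
        Some(v) => Some(v & !mask),
        None => None,
    }
}

/// Example of use in a const context.
pub const ALIGN_4K: Alignment = match Alignment::new(4096) {
    Some(a) => a,
    None => panic!("alignment must be a power of two"),
};
pub const ALIGNED: Option<usize> = const_align_up(5000, ALIGN_4K);
```

Taking the alignment as a value of a validated type moves the
power-of-two check out of the function body, and returning Option leaves
the overflow policy to the caller.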

Changes in v5:

* Rebased onto linux.git master.

* Split MCTP protocol into its own module and file.

* Many Rust-idiom improvements, especially more use of types, and wider
use of Result and Option.

* Cleaned up comments, print output, and error handling.

* Added const_align_up() to rust/ and used it in nova-core. This
required enabling a Rust feature: inline_const, as recommended by
Miguel Ojeda.

* Refactored several things, such as making Gpu::new() own Spec
creation.

* Fixed three Delta::ZERO busy-polls (patches 21, 24, 31) to use
non-zero sleep intervals (after just realizing that it was a bad
choice to have zero in there).

* Reduced GH100/GB100 HAL duplication. Made FSP_PKEY_SIZE/FSP_SIG_SIZE
consistent across patches. Replaced fragile architecture checks with
chipset.arch(). Renamed LIBOS_BLACKWELL.

* Narrowed the scope of some of the #![expect(dead_code)] cases,
although that really only matters within the series, not once it is
fully applied.

[1] https://github.com/Gnurou/linux/commits/drm-rust-next-staging/
[2] https://lore.kernel.org/20260411024118.471294-1-jhubbard@nvidia.com

John Hubbard (28):
  gpu: nova-core: factor .fwsignature* selection into a new
    find_gsp_sigs_section()
  gpu: nova-core: use GPU Architecture to simplify HAL selections
  gpu: nova-core: Hopper/Blackwell: basic GPU identification
  gpu: nova-core: add Copy/Clone to Spec and Revision
  gpu: nova-core: set DMA mask width based on GPU architecture
  gpu: nova-core: move GFW boot wait into a GPU HAL
  gpu: nova-core: Hopper/Blackwell: skip GFW boot waiting
  gpu: nova-core: Blackwell: calculate reserved FB heap size
  gpu: nova-core: Hopper/Blackwell: new location for PCI config mirror
  gpu: nova-core: refactor SEC2 booter loading into
    BooterFirmware::run()
  gpu: nova-core: Hopper/Blackwell: integrate FSP boot path into boot()
  gpu: nova-core: don't assume 64-bit firmware images
  gpu: nova-core: add support for 32-bit firmware images
  gpu: nova-core: add auto-detection of 32-bit, 64-bit firmware images
  gpu: nova-core: Hopper/Blackwell: add FSP falcon engine stub
  gpu: nova-core: Hopper/Blackwell: add FMC firmware image, in support
    of FSP
  gpu: nova-core: Hopper/Blackwell: add FSP secure boot completion
    waiting
  gpu: nova-core: Hopper/Blackwell: add FMC signature extraction
  gpu: nova-core: Hopper/Blackwell: add FSP falcon EMEM operations
  gpu: nova-core: Hopper/Blackwell: add FSP message infrastructure
  gpu: nova-core: add MCTP/NVDM protocol types for firmware
    communication
  gpu: nova-core: Hopper/Blackwell: add FSP send/receive messaging
  gpu: nova-core: Hopper/Blackwell: add FspCotVersion type
  gpu: nova-core: Hopper/Blackwell: larger non-WPR heap
  gpu: nova-core: Hopper/Blackwell: add FSP Chain of Trust boot
  gpu: nova-core: Blackwell: use correct sysmem flush registers
  gpu: nova-core: Hopper/Blackwell: larger WPR2 (GSP) heap
  gpu: nova-core: Hopper/Blackwell: add GSP lockdown release polling

 drivers/gpu/nova-core/driver.rs          |  16 -
 drivers/gpu/nova-core/falcon.rs          |   1 +
 drivers/gpu/nova-core/falcon/fsp.rs      | 234 ++++++++++
 drivers/gpu/nova-core/falcon/hal.rs      |  21 +-
 drivers/gpu/nova-core/fb.rs              |  51 ++-
 drivers/gpu/nova-core/fb/hal.rs          |  36 +-
 drivers/gpu/nova-core/fb/hal/ga102.rs    |   2 +-
 drivers/gpu/nova-core/fb/hal/gb100.rs    |  80 ++++
 drivers/gpu/nova-core/fb/hal/gb202.rs    |  77 ++++
 drivers/gpu/nova-core/fb/hal/gh100.rs    |  38 ++
 drivers/gpu/nova-core/firmware.rs        | 176 ++++++--
 drivers/gpu/nova-core/firmware/booter.rs |  30 ++
 drivers/gpu/nova-core/firmware/fsp.rs    |  44 ++
 drivers/gpu/nova-core/firmware/gsp.rs    |  35 +-
 drivers/gpu/nova-core/fsp.rs             | 523 +++++++++++++++++++++++
 drivers/gpu/nova-core/gfw.rs             |  76 ----
 drivers/gpu/nova-core/gpu.rs             |  62 ++-
 drivers/gpu/nova-core/gpu/hal.rs         |  28 ++
 drivers/gpu/nova-core/gpu/hal/gh100.rs   |  18 +
 drivers/gpu/nova-core/gpu/hal/tu102.rs   |  86 ++++
 drivers/gpu/nova-core/gsp/boot.rs        | 286 ++++++++++---
 drivers/gpu/nova-core/gsp/commands.rs    |   8 +-
 drivers/gpu/nova-core/gsp/fw.rs          |  62 ++-
 drivers/gpu/nova-core/gsp/fw/commands.rs |  22 +-
 drivers/gpu/nova-core/mctp.rs            | 119 ++++++
 drivers/gpu/nova-core/nova_core.rs       |   3 +-
 drivers/gpu/nova-core/regs.rs            | 104 +++++
 27 files changed, 2000 insertions(+), 238 deletions(-)
 create mode 100644 drivers/gpu/nova-core/falcon/fsp.rs
 create mode 100644 drivers/gpu/nova-core/fb/hal/gb100.rs
 create mode 100644 drivers/gpu/nova-core/fb/hal/gb202.rs
 create mode 100644 drivers/gpu/nova-core/fb/hal/gh100.rs
 create mode 100644 drivers/gpu/nova-core/firmware/fsp.rs
 create mode 100644 drivers/gpu/nova-core/fsp.rs
 delete mode 100644 drivers/gpu/nova-core/gfw.rs
 create mode 100644 drivers/gpu/nova-core/gpu/hal.rs
 create mode 100644 drivers/gpu/nova-core/gpu/hal/gh100.rs
 create mode 100644 drivers/gpu/nova-core/gpu/hal/tu102.rs
 create mode 100644 drivers/gpu/nova-core/mctp.rs

base-commit: dcba1b1052c8a7381df762597fa50a50ad2efb8d
--
2.53.0