[PATCH v6 0/7] gpu: nova-core: run unload sequence upon unbinding

Alexandre Courbot posted 7 patches 3 days, 5 hours ago
Currently the GSP is left running and the WPR2 memory region untouched
when the driver is unbound. This is obviously not ideal for at least two
reasons:

- Probing requires setting up the WPR2 region, which cannot be done if
  there is already one in place. Hence the current requirement to reset
  the GPU (using e.g. `echo 1 >/sys/bus/pci/devices/.../reset`) before
  the driver can be probed again after removal.
- The running GSP may still attempt to access shared memory regions
  which the kernel might recycle.

On top of that, there is a nasty bug in the Blackwell VBIOS that
sometimes borks the GPU upon PCI reset, requiring a reboot. So relying
on the PCI reset to unload/reload Nova is really not practical here.

This series does what is needed to leave the GPU in a clean state after
unbind, for all currently supported GPUs. Blackwell support is basic for
now and will be added alongside the Blackwell series if this series can
be merged first.

This revision rebases on top of the Device HRT series [1] and moves the
unload bundle from `Gsp` into `NovaCore`. This makes GSP unload
initiation only possible at the driver module level, which is the only
place that can consume the unload bundle.
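
To illustrate the ownership pattern described above, here is a minimal
standalone sketch in plain Rust (not kernel code). `UnloadBundle`,
`Driver`, `probe` and `unbind` are hypothetical stand-ins, not the types
from this series; the point is only that a method taking `self` by value
can run at most once, and that the owner decides when to give the bundle
up.

```rust
/// Hypothetical stand-in for the firmware prepared at probe time.
struct UnloadBundle {
    booter_unloader: Vec<u8>,
}

impl UnloadBundle {
    /// Takes `self` by value: once called, the bundle is gone, so the
    /// unload sequence cannot be started a second time.
    fn run(self) -> Result<usize, &'static str> {
        if self.booter_unloader.is_empty() {
            return Err("empty firmware image");
        }
        // A real implementation would run the firmware here; the sketch
        // only reports how many bytes it would have used.
        Ok(self.booter_unloader.len())
    }
}

/// Hypothetical driver-level owner of the bundle.
struct Driver {
    unload_bundle: Option<UnloadBundle>,
}

impl Driver {
    /// Prepares everything needed for unloading while firmware is
    /// still accessible.
    fn probe(firmware: Vec<u8>) -> Self {
        Self {
            unload_bundle: Some(UnloadBundle {
                booter_unloader: firmware,
            }),
        }
    }

    /// Consumes the bundle if it is still there. Returns `None` when the
    /// unload sequence has already been run.
    fn unbind(&mut self) -> Option<Result<usize, &'static str>> {
        self.unload_bundle.take().map(UnloadBundle::run)
    }
}
```

Calling `unbind()` twice on the same `Driver` runs the sequence once and
yields `None` the second time, since `Option::take` leaves `None` behind.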

A branch with the series and its required dependencies is available at
[2].

[1] https://lore.kernel.org/20260506215113.851360-1-dakr@kernel.org
[2] https://github.com/Gnurou/linux/tree/b4/nova-unload

Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
---
Changes in v6:
- Inline TU102 local `run_booter` method in its unique call site.
- Rename unload bundle field to `unload_bundle`.
- Make `Sec2UnloadBundle` private.
- Continue GSP teardown upon partial failure.
- Store the unload bundle into `NovaCore`.
- Take the unload bundle by value to make it one-shot.
- Link to v5: https://patch.msgid.link/20260515-nova-unload-v5-0-c4d6250ad160@nvidia.com
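
As a rough illustration of the "continue teardown upon partial failure"
item, the sketch below (plain Rust, not the code from this series) runs
every step even when an earlier one fails, and reports the first error
at the end. The `Step` alias, `run_teardown` and the `i32` error codes
are assumptions made for the example.

```rust
/// Hypothetical teardown step: a label plus a function that may fail
/// with an errno-like code.
type Step<'a> = (&'a str, fn() -> Result<(), i32>);

/// Runs all steps in order. A failing step is logged and remembered,
/// but does not prevent the remaining steps from running.
fn run_teardown(steps: &[Step<'_>]) -> (Vec<String>, Result<(), i32>) {
    let mut log = Vec::new();
    let mut first_err: Result<(), i32> = Ok(());

    for (name, step) in steps {
        match step() {
            Ok(()) => log.push(format!("{}: ok", name)),
            Err(e) => {
                log.push(format!("{}: failed ({})", name, e));
                // Keep only the first error; later ones are still logged.
                if first_err.is_ok() {
                    first_err = Err(e);
                }
            }
        }
    }

    (log, first_err)
}
```

The alternative of returning early with `?` would skip the later steps,
which is exactly what a best-effort teardown wants to avoid.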

Changes in v5:
- Rebase on top of the Device HRT series.
- Drop the now unneeded "gpu: nova-core: split BAR acquisition in unbind()".
- Link to v4: https://patch.msgid.link/20260427-nova-unload-v4-0-e145ccddae66@nvidia.com

Changes in v4:
- Remove `warn_on_err` macro as it isn't performing as expected and
  distracts from the goal of the series.
- Add John's patch from the Blackwell series refactoring the Booter
  Loader runner code.
- Add a GSP HAL and move the existing TU102/SEC2 boot sequence into it
  in preparation for the Hopper/Blackwell FSP boot path.
- Prepare the firmware required for unloading at probe time and save it
  into an unload bundle, as we cannot guarantee filesystem access at
  unload time.
- Constrain `UNLOADING_GUEST_DRIVER`'s visibility to the parent module.
- Also write the sentinel value `0xff` into `mbox1` when running Booter
  Unloader to align with OpenRM.
- Link to v3: https://patch.msgid.link/20260422-nova-unload-v3-0-1d2c81bd3ced@nvidia.com

Changes in v3:
- Disambiguate doc comment for `warn_on_err`.
- Test the correct bit instead of the whole register value to determine
  that the GSP has stopped.
- Use an enum instead of a boolean to encode the power level when
  shutting down the GSP.
- Add missing newline to `dev_err`.
- Add missing doc comments for new types.
- Use values from bindings instead of magic numbers.
- Remove the redundant `get_gsp_info` function.
- Better document Booter Unloader mailbox sentinel value, and check the
  value of mbox0 upon return.
- Link to v2: https://patch.msgid.link/20260421-nova-unload-v2-0-2fe54963af8b@nvidia.com
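
Two of the v3 items lend themselves to a short sketch: testing a single
bit rather than the whole register value, and using an enum rather than
a boolean for the power level. The bit position, the `TargetPower`
variants and the values returned by `wire_value` below are all
hypothetical and chosen for illustration; they are not the ones used by
the driver or the firmware bindings.

```rust
/// Hypothetical "halted" flag. The real bit is defined in the driver's
/// register definitions.
const HALTED: u32 = 1 << 4;

/// Tests only the flag of interest. Comparing the whole register
/// against `HALTED` would fail as soon as any other bit is set.
fn gsp_stopped(reg: u32) -> bool {
    (reg & HALTED) != 0
}

/// Hypothetical power levels. Unlike a `bool`, each call site names the
/// state it requests, and new states can be added later.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum TargetPower {
    Off,
    Suspended,
}

/// Hypothetical encoding of the power level for a firmware command.
fn wire_value(level: TargetPower) -> u32 {
    match level {
        TargetPower::Off => 0,
        TargetPower::Suspended => 1,
    }
}
```

With these definitions, a register value of `0x13` is reported as
stopped even though it is not equal to `HALTED`, and
`wire_value(TargetPower::Off)` reads unambiguously where a bare `false`
argument would not.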

Changes in v2:
- Rebase on top of `master` and remove unneeded/obsolete preparatory patches.
- Tidy up the imports of commands from the `fw` module in the `gsp` module.
- Link to v1: https://patch.msgid.link/20251216-nova-unload-v1-0-6a5d823be19d@nvidia.com

---
Alexandre Courbot (6):
      gpu: nova-core: remove unneeded get_gsp_info proxy function
      gpu: nova-core: do not import firmware commands into GSP command module
      gpu: nova-core: send UNLOADING_GUEST_DRIVER GSP command upon unloading
      gpu: nova-core: gsp: shuffle boot code a bit to keep chipset-specific parts close
      gpu: nova-core: gsp: move chipset-specific parts of the boot process into a HAL
      gpu: nova-core: run Booter Unloader and FWSEC-SB upon unbinding

John Hubbard (1):
      gpu: nova-core: refactor SEC2 booter loading into BooterFirmware::run()

 drivers/gpu/nova-core/driver.rs                   |  23 +-
 drivers/gpu/nova-core/firmware/booter.rs          |  32 +-
 drivers/gpu/nova-core/firmware/fwsec.rs           |   1 -
 drivers/gpu/nova-core/gpu.rs                      |  38 ++-
 drivers/gpu/nova-core/gsp.rs                      |   4 +
 drivers/gpu/nova-core/gsp/boot.rs                 | 262 ++++++-----------
 drivers/gpu/nova-core/gsp/commands.rs             |  72 +++--
 drivers/gpu/nova-core/gsp/fw.rs                   |   4 +
 drivers/gpu/nova-core/gsp/fw/commands.rs          |  45 +++
 drivers/gpu/nova-core/gsp/fw/r570_144/bindings.rs |  11 +
 drivers/gpu/nova-core/gsp/hal.rs                  |  93 ++++++
 drivers/gpu/nova-core/gsp/hal/gh100.rs            |  53 ++++
 drivers/gpu/nova-core/gsp/hal/tu102.rs            | 341 ++++++++++++++++++++++
 drivers/gpu/nova-core/regs.rs                     |   5 +
 14 files changed, 783 insertions(+), 201 deletions(-)
---
base-commit: 293c8393b49c9fc017168ddb46aa2012d508c921
change-id: 20251216-nova-unload-4029b3b76950

Best regards,
--  
Alexandre Courbot <acourbot@nvidia.com>