drm/nouveau: nv04 FIFO cleanup + recovery for Tesla

[PATCH 0/2] drm/nouveau: nv04 FIFO cleanup + recovery for Tesla

Posted by Marek Czernohous 4 weeks, 1 day ago

Hi all,

Two-patch series for the legacy nv04_fifo path covering Tesla
(MCP77/MCP79 and G80-GT218). Daily-driven on the reference NVAC
hardware (Apple Mac mini Late 2009, GeForce 9400M) since 2026-05-05.

Patch 1 demotes a benign CACHE_ERROR that fires once per Mesa session
start on Tesla GPUs. The Mesa NV50 userspace driver issues a
method-0x0060 / data-0xbeef02xx binding probe that recovers cleanly via
nv04_fifo_swmthd(), but the error is currently logged at error level on
every X or Wayland session start and dominates dmesg on this hardware
class. Demoting it lets patch 2 distinguish real faults from noise.

Patch 2 adds a two-tier fault-recovery path for Tesla FIFO faults:

  Tier 1 (per fault). Look up the channel via nvkm_chan_get_chid,
  call nvkm_chan_error(chan, true), fire tracepoint
  nouveau:fifo_chan_killed. Idempotent through the existing
  chan->errored short-circuit.

  Tier 2 (sliding window). When the per-fifo fault count in a
  configurable window reaches the threshold, schedule a worker that
  calls drm_dev_wedged_event(drm, DRM_WEDGE_RECOVERY_REBIND, NULL)
  and fires tracepoint nouveau:fifo_dev_wedged. Worker context is
  needed because kobject_uevent_env may sleep.
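
The two tiers can be modelled in userspace roughly as follows. This is
a minimal Python sketch of the logic described above, not the kernel
code from recover.c: the class and method names are hypothetical, and it
assumes that every fault counts toward the window (including repeat
faults on an already-killed channel) and that the wedge worker is
scheduled at most once.

```python
from collections import deque


class FifoRecoveryModel:
    """Userspace model of the two tiers (hypothetical names, not kernel code)."""

    def __init__(self, wedge_count=10, window_ms=60000):
        self.wedge_count = wedge_count  # 0 disables Tier 2
        self.window_ms = window_ms
        self.errored = set()            # channel IDs already killed
        self.faults = deque()           # fault timestamps (ms) still in the window
        self.wedged = False             # assumption: wedge is scheduled only once

    def fault(self, chid, now_ms):
        """Handle one fault; return (channel_killed_now, wedge_scheduled_now)."""
        # Tier 1: kill the channel once; later faults on it are no-ops.
        killed = chid not in self.errored
        self.errored.add(chid)

        # Tier 2: count faults for the whole FIFO inside the sliding window.
        if self.wedge_count == 0:
            return killed, False
        self.faults.append(now_ms)
        while self.faults and now_ms - self.faults[0] >= self.window_ms:
            self.faults.popleft()       # drop faults that left the window
        schedule = not self.wedged and len(self.faults) >= self.wedge_count
        if schedule:
            self.wedged = True
        return killed, schedule
```

With wedge_count=3 and window_ms=1000, three faults within one second
return a scheduled wedge on the third call, while two faults 1.5 s
apart never do.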

Motivation: Fermi+ gets channel-kill and device-wedge automatically
through nvkm_runl_rc; Tesla was feature-frozen before the DRM wedge
uAPI existed. Three observable consequences on the reference
hardware:

  1. Silent state corruption (channel produces wrong output after a
     fault, no notice to userspace).
  2. Observability gap (no counters, tracepoints, or wedge event,
     only dmesg).
  3. Repeated-fault loop (the log-and-reset cycle repeats forever on
     a persistently faulting channel instead of killing it).

Validation. A debugfs fault-injector (kept on a separate
DO-NOT-MERGE branch, not part of this submission) was used to drive
both Tier-1 and Tier-2 paths through their full state space. Phases
1-5 of the test plan were exercised that way. Phase 6 (no manual
injection, real workload soak) ran 2026-05-05 through 2026-05-13:
one organic DRM_WEDGE_RECOVERY_REBIND event was captured on
2026-05-05 09:08; the rest of the soak was fault-free.

Companion userland tool nouveau-pstate-daemon v0.2.0 [1] subscribes
to the WEDGED=rebind uevent in log-only mode and was used to confirm
end-to-end propagation through udev.
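
For readers who want to check the uevent themselves, here is a minimal
Python sketch of how a listener might pick the WEDGED key out of a raw
kernel uevent payload. It is not taken from nouveau-pstate-daemon; the
function names and the sample device path are hypothetical, and it
assumes the payload is the usual NUL-separated KEY=VALUE list with
WEDGED carrying a comma-separated list of recovery methods.

```python
def parse_uevent(payload: bytes) -> dict:
    """Split a NUL-separated uevent payload into a KEY -> VALUE dict."""
    fields = {}
    for part in payload.split(b"\0"):
        key, sep, value = part.decode("utf-8", "replace").partition("=")
        if sep:  # skips parts without "=", e.g. the header and empty tails
            fields[key] = value
    return fields


def wedge_methods(fields: dict) -> list:
    """Return the recovery methods listed in WEDGED, or [] if absent."""
    value = fields.get("WEDGED", "")
    return [method for method in value.split(",") if method]


# Hypothetical sample payload; the device path is made up for illustration.
sample = (b"change@/devices/pci0000:00/0000:03:00.0/drm/card0\0"
          b"ACTION=change\0SUBSYSTEM=drm\0WEDGED=rebind\0")
print(wedge_methods(parse_uevent(sample)))
```

A log-only consumer would stop at printing the method list; acting on
"rebind" (unbinding and rebinding the driver) is a separate policy
decision.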

Module parameters:

  nouveau.fifo_wedge_count     (uint, 0..32, default 10)
  nouveau.fifo_wedge_window_ms (uint, 100..600000, default 60000)

Setting fifo_wedge_count=0 disables Tier-2 entirely while keeping
Tier-1 channel-kill active.
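
As a convenience, the ranges above can be checked before writing a
modprobe.d entry. The helper below is a hypothetical Python sketch, not
part of the series; it only encodes the documented ranges and the
standard "options <module> key=value" syntax, and it rejects
out-of-range values rather than guessing how the driver treats them.

```python
# Documented ranges from the cover letter: (min, max), inclusive.
LIMITS = {
    "fifo_wedge_count": (0, 32),
    "fifo_wedge_window_ms": (100, 600000),
}


def modprobe_options(**params) -> str:
    """Build an 'options nouveau ...' line after range-checking each value."""
    parts = []
    for name, value in params.items():
        if name not in LIMITS:
            raise ValueError(f"unknown parameter: {name}")
        lo, hi = LIMITS[name]
        if not lo <= value <= hi:
            raise ValueError(f"{name}={value} outside {lo}..{hi}")
        parts.append(f"{name}={value}")
    return "options nouveau " + " ".join(parts)


# Keep Tier-1 channel-kill but disable the Tier-2 wedge event.
print(modprobe_options(fifo_wedge_count=0))
```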

A note on MAINTAINERS. The series adds a new file
drivers/gpu/drm/nouveau/nvkm/engine/fifo/recover.c. The change is
covered by the existing nouveau MAINTAINERS section
(drivers/gpu/drm/nouveau/), so no MAINTAINERS update is included.
checkpatch.pl emits its generic "does MAINTAINERS need updating?"
warning for the new file; no update is needed here.

This is a follow-up to the April 9 NVAC stability series [2], which
is still awaiting review. The two patches here are independent of
that series and apply against current Linus master.

[1] https://github.com/hibbes/nouveau-pstate-daemon (v0.2.0)
[2] https://lore.kernel.org/dri-devel/20260409-nouveau-nvac-stability-series

Marek Czernohous (2):
  drm/nouveau/fifo/nv04: filter benign CACHE_ERROR from Mesa NV50 bind
    probe
  drm/nouveau/fifo: add recovery path for Tesla cache_error/dma_pusher

 .../drm/nouveau/include/nvkm/engine/fifo.h    |  12 ++
 .../include/trace/events/nouveau_fifo.h       |  58 +++++++++
 drivers/gpu/drm/nouveau/nouveau_drm.c         |  29 +++++
 .../gpu/drm/nouveau/nvkm/engine/fifo/Kbuild   |   1 +
 .../gpu/drm/nouveau/nvkm/engine/fifo/base.c   |   3 +
 .../gpu/drm/nouveau/nvkm/engine/fifo/nv04.c   |  29 ++++-
 .../gpu/drm/nouveau/nvkm/engine/fifo/priv.h   |  10 ++
 .../drm/nouveau/nvkm/engine/fifo/recover.c    | 121 ++++++++++++++++++
 8 files changed, 257 insertions(+), 6 deletions(-)
 create mode 100644 drivers/gpu/drm/nouveau/include/trace/events/nouveau_fifo.h
 create mode 100644 drivers/gpu/drm/nouveau/nvkm/engine/fifo/recover.c


base-commit: 1f63dd8ca0dc05a8272bb8155f643c691d29bb11
-- 
2.53.0

Re: [PATCH 0/2] drm/nouveau: nv04 FIFO cleanup + recovery for Tesla

Posted by lyude@redhat.com 3 weeks, 2 days ago

Hey! I truly apologize for asking, and thank you tremendously in
advance if the answer is no. But I had to check since I haven't seen
you around before, and these are unusually long commit messages:

These patches were written by a person, correct? If not, you need to
follow the coding assistants guidelines here:

https://docs.kernel.org/process/coding-assistants.html

On Wed, 2026-05-13 at 19:50 +0200, Marek Czernohous wrote:
> [...]

Re: [PATCH 0/2] drm/nouveau: nv04 FIFO cleanup + recovery for Tesla

Posted by Marek Czernohous 3 weeks, 2 days ago

Hi Lyude,

No need to apologise, it is a completely fair question and I should have 
got ahead of it.

To answer directly: these patches were not written by a person alone. I 
used Anthropic's Claude (Opus 4.7) while developing the series, to help 
investigate the nv04_fifo fault behaviour and to draft both the code and 
the commit messages. I reviewed every change myself, the series was 
soak-tested on the reference hardware as described in the cover letter, 
and I take full responsibility for it under my Signed-off-by.

I missed Documentation/process/coding-assistants.rst when I prepared the
series; apologies for that. The fix is to add this trailer to both patches:

Assisted-by: Claude:claude-opus-4-7

I can send a v2 with that trailer right away, or fold it into the next 
revision together with any review feedback you have, whichever keeps 
things tidiest for you.

For full disclosure: my earlier April series (NVAC stability, ref [2] in 
the cover letter) was likewise developed with Claude's assistance. It is 
still awaiting review, and I will make sure the trailer is in place when 
I revise it.

Thanks for the careful review,
Marek

On 20.05.26 at 00:01, lyude@redhat.com wrote:
> Hey! I truly apologize for asking, and thank you tremendously in
> advance if the answer is no. But I had to check since I haven't seen
> you around before, and these are unusually long commit messages:
>
> These patches were written by a person correct? If not, you need to
> follow the coding assistants guidelines here:
>
> https://docs.kernel.org/process/coding-assistants.html