Hi all,

Two-patch series for the legacy nv04_fifo path covering Tesla
(MCP77/MCP79 and G80-GT218). Daily-driven on the reference NVAC
hardware (Apple Mac mini Late 2009, GeForce 9400M) since 2026-05-05.

Patch 1 demotes a benign CACHE_ERROR that fires once per Mesa session
start on Tesla GPUs. The Mesa NV50 userspace driver issues a binding
probe (method 0x0060, data 0xbeef02xx) that recovers cleanly via
nv04_fifo_swmthd(), but it is currently logged at error level on every
X or Wayland session start and dominates dmesg on this hardware class.
Filtering it out lets patch 2 tell real faults apart from this noise.

Patch 2 adds a two-tier fault-recovery path for Tesla FIFO faults:

  Tier 1 (per fault). Look up the channel via nvkm_chan_get_chid,
  call nvkm_chan_error(chan, true), fire tracepoint
  nouveau:fifo_chan_killed. Idempotent through the existing
  chan->errored short-circuit.

  Tier 2 (sliding window). When the per-fifo fault count in a
  configurable window reaches the threshold, schedule a worker that
  calls drm_dev_wedged_event(drm, DRM_WEDGE_RECOVERY_REBIND, NULL)
  and fires tracepoint nouveau:fifo_dev_wedged. Worker context is
  needed because kobject_uevent_env may sleep.
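
For illustration only, the Tier-2 trigger can be sketched in plain
userspace C. This is not the code in recover.c: the struct and function
names are hypothetical, and the ring of timestamps is an assumption
about one way to count faults in a sliding window. The ring size of 32
mirrors the upper bound of fifo_wedge_count.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FAULT_WINDOW_MAX 32 /* upper bound of fifo_wedge_count */

/* Hypothetical state; names are illustrative, not taken from recover.c. */
struct fault_window {
	uint64_t stamp_ms[FAULT_WINDOW_MAX]; /* ring of recent fault times */
	unsigned int head;                   /* next slot to overwrite */
	unsigned int filled;                 /* valid entries in the ring */
	unsigned int threshold;              /* 0 disables the check */
	uint64_t window_ms;                  /* length of the sliding window */
};

static void fault_window_init(struct fault_window *w, unsigned int threshold,
			      uint64_t window_ms)
{
	memset(w, 0, sizeof(*w));
	w->threshold = threshold > FAULT_WINDOW_MAX ? FAULT_WINDOW_MAX : threshold;
	w->window_ms = window_ms;
}

/*
 * Record one fault at time now_ms (assumed monotonic). Returns true when
 * at least `threshold` faults, including this one, lie within the last
 * window_ms milliseconds.
 */
static bool fault_window_record(struct fault_window *w, uint64_t now_ms)
{
	unsigned int i, recent = 0;

	if (w->threshold == 0)
		return false;

	/* Overwrite the oldest slot once the ring is full. */
	w->stamp_ms[w->head] = now_ms;
	w->head = (w->head + 1) % FAULT_WINDOW_MAX;
	if (w->filled < FAULT_WINDOW_MAX)
		w->filled++;

	/* Count the stored faults that are still inside the window. */
	for (i = 0; i < w->filled; i++) {
		if (now_ms - w->stamp_ms[i] <= w->window_ms)
			recent++;
	}
	return recent >= w->threshold;
}
```

In a driver, a true return would be the point at which the worker is
scheduled; the sketch deliberately leaves out locking and the worker
itself.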

Motivation: Fermi+ gets channel-kill and device-wedge automatically
through nvkm_runl_rc; Tesla was feature-frozen before the DRM wedge
uAPI existed. Three observable consequences on the reference
hardware:

  1. Silent state corruption (channel produces wrong output after a
     fault, no notice to userspace).
  2. Observability gap (no counters, tracepoints, or wedge event,
     only dmesg).
  3. Repeated-fault loop (the log-and-reset cycle repeats forever on
     a persistently faulting channel instead of killing it).

Validation. A debugfs fault-injector (kept on a separate
DO-NOT-MERGE branch, not part of this submission) was used to drive
both Tier-1 and Tier-2 paths through their full state space. Phases
1-5 of the test plan were exercised that way. Phase 6 (no manual
injection, real workload soak) ran 2026-05-05 through 2026-05-13:
one organic DRM_WEDGE_RECOVERY_REBIND event was captured on
2026-05-05 09:08; the rest of the soak was fault-free.

Companion userland tool nouveau-pstate-daemon v0.2.0 [1] subscribes
to the WEDGED=rebind uevent in log-only mode and was used to confirm
end-to-end propagation through udev.
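
As a rough sketch of what a listener has to do with such an event, the
following userspace C scans a uevent payload for the WEDGED key. It
assumes the payload is a buffer of NUL-terminated KEY=VALUE records and
that WEDGED may carry a comma-separated list of recovery methods. The
function names are hypothetical, and this is not how
nouveau-pstate-daemon is implemented.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Find `key` in a uevent payload made of NUL-terminated "KEY=VALUE"
 * records. Returns a pointer to the value inside buf, or NULL if the
 * key is absent. Records without '=' (such as a leading
 * "action@devpath" header) are skipped.
 */
static const char *uevent_get(const char *buf, size_t len, const char *key)
{
	size_t klen = strlen(key);
	size_t off = 0;

	while (off < len) {
		const char *rec = buf + off;
		const char *end = memchr(rec, '\0', len - off);
		size_t rlen;

		if (!end)
			break; /* unterminated trailing record: ignore it */
		rlen = (size_t)(end - rec);
		if (rlen > klen && rec[klen] == '=' && !strncmp(rec, key, klen))
			return rec + klen + 1;
		off += rlen + 1;
	}
	return NULL;
}

/* True if the WEDGED value lists `method`, e.g. "rebind". */
static bool uevent_wedged_has(const char *buf, size_t len, const char *method)
{
	const char *val = uevent_get(buf, len, "WEDGED");
	size_t mlen = strlen(method);

	while (val && *val) {
		size_t tlen = strcspn(val, ",");

		if (tlen == mlen && !strncmp(val, method, mlen))
			return true;
		val += tlen;
		if (*val == ',')
			val++;
	}
	return false;
}
```

A real consumer would more likely go through libudev than parse raw
payloads; the sketch only shows the matching step.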

Module parameters:

  nouveau.fifo_wedge_count     (uint, 0..32, default 10)
  nouveau.fifo_wedge_window_ms (uint, 100..600000, default 60000)

Setting fifo_wedge_count=0 disables Tier-2 entirely while keeping
Tier-1 channel-kill active.
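
The ranges above can be read as the following small C sketch. Clamping
out-of-range values is an assumption made here for illustration (the
patch may reject them instead), and all names are hypothetical.

```c
#include <stdbool.h>

/* Limits as listed in the cover letter. */
#define FIFO_WEDGE_COUNT_MAX     32u
#define FIFO_WEDGE_WINDOW_MS_MIN 100u
#define FIFO_WEDGE_WINDOW_MS_MAX 600000u

struct wedge_cfg {
	unsigned int count;     /* faults per window; 0 disables Tier 2 */
	unsigned int window_ms; /* sliding-window length in milliseconds */
};

/* Clamp both values into their documented ranges (assumed policy). */
static struct wedge_cfg wedge_cfg_sanitize(unsigned int count,
					   unsigned int window_ms)
{
	struct wedge_cfg cfg = { count, window_ms };

	if (cfg.count > FIFO_WEDGE_COUNT_MAX)
		cfg.count = FIFO_WEDGE_COUNT_MAX;
	if (cfg.window_ms < FIFO_WEDGE_WINDOW_MS_MIN)
		cfg.window_ms = FIFO_WEDGE_WINDOW_MS_MIN;
	if (cfg.window_ms > FIFO_WEDGE_WINDOW_MS_MAX)
		cfg.window_ms = FIFO_WEDGE_WINDOW_MS_MAX;
	return cfg;
}

/* Tier 1 always runs; only the device-wedge tier depends on count. */
static bool wedge_tier2_enabled(const struct wedge_cfg *cfg)
{
	return cfg->count != 0;
}
```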

A note on MAINTAINERS. The series adds a new file
drivers/gpu/drm/nouveau/nvkm/engine/fifo/recover.c. The file is
covered by the existing nouveau MAINTAINERS section
(drivers/gpu/drm/nouveau/), so no MAINTAINERS update is included.
checkpatch.pl still flags the new file with its generic MAINTAINERS
reminder; no action is needed for it here.

This is a follow-up to the April 9 NVAC stability series [2], which
is still awaiting review. The two patches here are independent of
that series and apply against current Linus master.

[1] https://github.com/hibbes/nouveau-pstate-daemon (v0.2.0)
[2] https://lore.kernel.org/dri-devel/20260409-nouveau-nvac-stability-series

Marek Czernohous (2):
  drm/nouveau/fifo/nv04: filter benign CACHE_ERROR from Mesa NV50 bind
    probe
  drm/nouveau/fifo: add recovery path for Tesla cache_error/dma_pusher

 .../drm/nouveau/include/nvkm/engine/fifo.h    |  12 ++
 .../include/trace/events/nouveau_fifo.h       |  58 +++++++++
 drivers/gpu/drm/nouveau/nouveau_drm.c         |  29 +++++
 .../gpu/drm/nouveau/nvkm/engine/fifo/Kbuild   |   1 +
 .../gpu/drm/nouveau/nvkm/engine/fifo/base.c   |   3 +
 .../gpu/drm/nouveau/nvkm/engine/fifo/nv04.c   |  29 ++++-
 .../gpu/drm/nouveau/nvkm/engine/fifo/priv.h   |  10 ++
 .../drm/nouveau/nvkm/engine/fifo/recover.c    | 121 ++++++++++++++++++
 8 files changed, 257 insertions(+), 6 deletions(-)
 create mode 100644 drivers/gpu/drm/nouveau/include/trace/events/nouveau_fifo.h
 create mode 100644 drivers/gpu/drm/nouveau/nvkm/engine/fifo/recover.c

base-commit: 1f63dd8ca0dc05a8272bb8155f643c691d29bb11
--
2.53.0

Hey! I truly apologize for asking, and thank you tremendously in
advance if the answer is no. But I had to check since I haven't seen
you around before, and these are unusually long commit messages:

These patches were written by a person, correct? If not, you need to
follow the coding assistants guidelines here:

https://docs.kernel.org/process/coding-assistants.html

On Wed, 2026-05-13 at 19:50 +0200, Marek Czernohous wrote:
> [...]

Hi Lyude,

No need to apologise, it is a completely fair question and I should
have got ahead of it.

To answer directly: these patches were not written by a person alone.
I used Anthropic's Claude (Opus 4.7) while developing the series, to
help investigate the nv04_fifo fault behaviour and to draft both the
code and the commit messages. I reviewed every change myself, the
series was soak-tested on the reference hardware as described in the
cover letter, and I take full responsibility for it under my
Signed-off-by.

I missed Documentation/process/coding-assistants.rst when I prepared
the series, apologies for that. The fix is to add this trailer to both
patches:

  Assisted-by: Claude:claude-opus-4-7

I can send a v2 with that trailer right away, or fold it into the next
revision together with any review feedback you have, whichever keeps
things tidiest for you.

For full disclosure: my earlier April series (NVAC stability, ref [2]
in the cover letter) was likewise developed with Claude's assistance.
It is still awaiting review, and I will make sure the trailer is in
place when I revise it.

Thanks for the careful review,
Marek

On 20.05.26 at 00:01, lyude@redhat.com wrote:
> [...]