From nobody Fri Jun 12 17:16:56 2026
From: Marek Czernohous
To: nouveau@lists.freedesktop.org
Cc: Lyude Paul, Danilo Krummrich, dri-devel@lists.freedesktop.org,
 linux-kernel@vger.kernel.org, Marek Czernohous
Subject: [PATCH 1/2] drm/nouveau/fifo/nv04: filter benign CACHE_ERROR from Mesa NV50 bind probe
Date: Wed, 13 May 2026 19:50:12 +0200
Message-ID: <20260513175014.96599-2-marek@czernohous.de>
X-Mailer: git-send-email 2.53.0
In-Reply-To: <20260513175014.96599-1-marek@czernohous.de>
References: <20260513175014.96599-1-marek@czernohous.de>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

The Mesa userspace driver issues a method-0x0060 / data-0xbeef02xx
binding probe on Tesla GPUs that ends up triggering CACHE_ERROR in the
PFIFO interrupt handler. The probe is harmless and recovers cleanly, but
it floods dmesg at error level on every X/Wayland session start.

Filter that specific pattern down to debug level so dmesg stays clean
while real CACHE_ERROR conditions are still logged at error level.

Tested on Apple Mac Mini (MCP79, NVAC 0xac080b1) and a G94: dmesg has no
CACHE_ERROR spam during normal operation, and the previously visible
beef02xx pattern now only appears at debug level.

Signed-off-by: Marek Czernohous --- .../gpu/drm/nouveau/nvkm/engine/fifo/nv04.c | 25 ++++++++++++++----- 1 file changed, 19 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/nv04.c b/drivers/gpu/= drm/nouveau/nvkm/engine/fifo/nv04.c index c4b8e567d86f..fa13cd55b593 100644 --- a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/nv04.c +++ b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/nv04.c @@ -327,12 +327,25 @@ nv04_fifo_intr_cache_error(struct nvkm_fifo *fifo, u3= 2 chid, u32 get) =20 if (!(pull0 & 0x00000100) || !nv04_fifo_swmthd(device, chid, mthd, data)) { - chan =3D nvkm_chan_get_chid(&fifo->engine, chid, &flags); - nvkm_error(subdev, "CACHE_ERROR - " - "ch %d [%s] subc %d mthd %04x data %08x\n", - chid, chan ? chan->name : "unknown", - (mthd >> 13) & 7, mthd & 0x1ffc, data); - nvkm_chan_put(&chan, flags); + /* + * Filter benign Mesa NV50 bind probe: mthd 0x0060 with + * data 0xbeef02xx is a harmless userspace probe on Tesla + * GPUs and does not indicate an actual error condition. + * Demote to debug to keep dmesg clean while still catching + * real CACHE_ERROR events. + */ + if ((mthd & 0x1ffc) =3D=3D 0x0060 && + (data & 0xffffff00) =3D=3D 0xbeef0200) { + nvkm_debug(subdev, "CACHE_ERROR - ch %d subc %d mthd %04x data %08x (be= nign, skipped)\n", + chid, (mthd >> 13) & 7, mthd & 0x1ffc, data); + } else { + chan =3D nvkm_chan_get_chid(&fifo->engine, chid, &flags); + nvkm_error(subdev, "CACHE_ERROR - " + "ch %d [%s] subc %d mthd %04x data %08x\n", + chid, chan ? 
chan->name : "unknown", + (mthd >> 13) & 7, mthd & 0x1ffc, data); + nvkm_chan_put(&chan, flags); + } } =20 nvkm_wr32(device, NV04_PFIFO_CACHE1_DMA_PUSH, 0); --=20 2.53.0
From nobody Fri Jun 12 17:16:56 2026
From: Marek Czernohous
To: nouveau@lists.freedesktop.org
Cc: Lyude Paul, Danilo Krummrich, dri-devel@lists.freedesktop.org,
 linux-kernel@vger.kernel.org, Marek Czernohous
Subject: [PATCH 2/2] drm/nouveau/fifo: add recovery path for Tesla cache_error/dma_pusher
Date: Wed, 13 May 2026 19:50:13 +0200
Message-ID: <20260513175014.96599-3-marek@czernohous.de>
X-Mailer: git-send-email 2.53.0
In-Reply-To: <20260513175014.96599-1-marek@czernohous.de>
References: <20260513175014.96599-1-marek@czernohous.de>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

On Tesla / NV50 family chipsets (nv50, g84, g94, g98, mcp77, mcp79),
FIFO fault handling in nv04_fifo_intr_cache_error and
nv04_fifo_intr_dma_pusher logs the fault and resets hardware registers
but leaves the offending channel running. Unlike Fermi and later, which
call nvkm_chan_error from nvkm_runl_rc, Tesla has no escalation: silent
state corruption is possible, there is no telemetry beyond dmesg, and
repeated faults on the same channel keep firing indefinitely.

Add a shared recovery helper nv04_fifo_recover() that both intr handlers
call once the fault has been logged. It implements two tiers:

  Tier-1 (per-fault): nvkm_chan_get_chid + nvkm_chan_error(chan, true).
  The atomic chan->errored short-circuit means re-faults on the same
  channel are a no-op; other channels are unaffected.

  Tier-2 (sliding-window): per-fifo lock-protected ring of fault
  timestamps. When the count within fifo_wedge_window_ms reaches
  fifo_wedge_count, schedule a workqueue job that emits a
  drm_dev_wedged_event with DRM_WEDGE_RECOVERY_REBIND.

The drm_dev_wedged_event call cannot run from IRQ context because
kobject_uevent_env may sleep; the workqueue indirection handles this.

Tracepoints (nouveau:nouveau_fifo_chan_killed,
nouveau:nouveau_fifo_dev_wedged) provide low-overhead telemetry
consumable via perf or bpftrace; they cost close to nothing while
disabled.

Module parameters fifo_wedge_count (default 10, range 0..32, where 0
disables Tier-2) and fifo_wedge_window_ms (default 60000, range
100..600000) allow tuning at module load time without a rebuild.

Validated on Apple Mac mini Late 2009 (NVAC, MCP79).

Signed-off-by: Marek Czernohous --- .../drm/nouveau/include/nvkm/engine/fifo.h | 12 ++ .../include/trace/events/nouveau_fifo.h | 58 +++++++++ drivers/gpu/drm/nouveau/nouveau_drm.c | 29 +++++ .../gpu/drm/nouveau/nvkm/engine/fifo/Kbuild | 1 + .../gpu/drm/nouveau/nvkm/engine/fifo/base.c | 3 + .../gpu/drm/nouveau/nvkm/engine/fifo/nv04.c | 4 + .../gpu/drm/nouveau/nvkm/engine/fifo/priv.h | 10 ++ .../drm/nouveau/nvkm/engine/fifo/recover.c | 121 ++++++++++++++++++ 8 files changed, 238 insertions(+) create mode 100644 drivers/gpu/drm/nouveau/include/trace/events/nouveau_fi= fo.h create mode 100644 drivers/gpu/drm/nouveau/nvkm/engine/fifo/recover.c diff --git a/drivers/gpu/drm/nouveau/include/nvkm/engine/fifo.h b/drivers/g= pu/drm/nouveau/include/nvkm/engine/fifo.h index 96c16cfccf16..7c27b4c8a212 100644 --- a/drivers/gpu/drm/nouveau/include/nvkm/engine/fifo.h +++ b/drivers/gpu/drm/nouveau/include/nvkm/engine/fifo.h @@ -55,6 +55,17 @@ void nvkm_chan_put(struct nvkm_chan **, unsigned long ir= qflags); =20 struct nvkm_chan *nvkm_uchan_chan(struct nvkm_object *); =20 +#define NVKM_FIFO_WEDGE_RING_MAX 32 + +struct nvkm_fifo_wedge { + spinlock_t lock; + u32 count; /* current window depth */ + ktime_t ts[NVKM_FIFO_WEDGE_RING_MAX]; /* ring of timestamps */ + u32 head; /* ring head */ + struct work_struct work; /* schedules drm_= dev_wedged_event */ + atomic_t wedged; /* Tier-2 already= fired?
*/ +}; + struct nvkm_fifo { const struct nvkm_fifo_func *func; struct nvkm_engine engine; @@ -86,6 +97,7 @@ struct nvkm_fifo { =20 spinlock_t lock; struct mutex mutex; + struct nvkm_fifo_wedge wedge; }; =20 void nvkm_fifo_fault(struct nvkm_fifo *, struct nvkm_fault_data *); diff --git a/drivers/gpu/drm/nouveau/include/trace/events/nouveau_fifo.h b/= drivers/gpu/drm/nouveau/include/trace/events/nouveau_fifo.h new file mode 100644 index 000000000000..46d043a82850 --- /dev/null +++ b/drivers/gpu/drm/nouveau/include/trace/events/nouveau_fifo.h @@ -0,0 +1,58 @@ +/* SPDX-License-Identifier: MIT */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM nouveau + +#if !defined(_TRACE_NOUVEAU_FIFO_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_NOUVEAU_FIFO_H + +#include +#include + +TRACE_EVENT(nouveau_fifo_chan_killed, + TP_PROTO(struct drm_device *dev, u32 chid, u32 fault_type, u64 info), + TP_ARGS(dev, chid, fault_type, info), + TP_STRUCT__entry( + __string(devname, dev_name(dev->dev)) + __field(u32, chid) + __field(u32, fault_type) + __field(u64, info) + ), + TP_fast_assign( + __assign_str(devname); + __entry->chid =3D chid; + __entry->fault_type =3D fault_type; + __entry->info =3D info; + ), + TP_printk("dev=3D%s chid=3D%u fault=3D%s info=3D0x%llx", + __get_str(devname), + __entry->chid, + __entry->fault_type =3D=3D 0 ? 
"CACHE_ERROR" : "DMA_PUSHER", + __entry->info) +); + +TRACE_EVENT(nouveau_fifo_dev_wedged, + TP_PROTO(struct drm_device *dev, u32 fault_count, u32 window_ms), + TP_ARGS(dev, fault_count, window_ms), + TP_STRUCT__entry( + __string(devname, dev_name(dev->dev)) + __field(u32, fault_count) + __field(u32, window_ms) + ), + TP_fast_assign( + __assign_str(devname); + __entry->fault_count =3D fault_count; + __entry->window_ms =3D window_ms; + ), + TP_printk("dev=3D%s wedged after %u faults in %u ms", + __get_str(devname), + __entry->fault_count, + __entry->window_ms) +); + +#endif /* _TRACE_NOUVEAU_FIFO_H */ + +#undef TRACE_INCLUDE_PATH +#define TRACE_INCLUDE_PATH ../../drivers/gpu/drm/nouveau/include/trace/eve= nts +#undef TRACE_INCLUDE_FILE +#define TRACE_INCLUDE_FILE nouveau_fifo +#include diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouvea= u/nouveau_drm.c index 517ff2c31dce..c62b7fc3a1d3 100644 --- a/drivers/gpu/drm/nouveau/nouveau_drm.c +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c @@ -22,6 +22,8 @@ * Authors: Ben Skeggs */ =20 +#define CREATE_TRACE_POINTS + #include #include #include @@ -74,6 +76,9 @@ #include "nouveau_uvmm.h" #include "nouveau_sched.h" =20 +#include +#include + DECLARE_DYNDBG_CLASSMAP(drm_debug_classes, DD_CLASS_TYPE_DISJOINT_BITS, 0, "DRM_UT_CORE", "DRM_UT_DRIVER", @@ -111,6 +116,18 @@ MODULE_PARM_DESC(runpm, "disable (0), force enable (1)= , optimus only default (-1 static int nouveau_runtime_pm =3D -1; module_param_named(runpm, nouveau_runtime_pm, int, 0400); =20 +MODULE_PARM_DESC(fifo_wedge_count, + "FIFO faults within window before drm_dev_wedged_event " + "(0=3Ddisable Tier-2, max 32, default 10)"); +unsigned int nouveau_fifo_wedge_count =3D 10; +module_param_named(fifo_wedge_count, nouveau_fifo_wedge_count, uint, 0400); + +MODULE_PARM_DESC(fifo_wedge_window_ms, + "Sliding-window width in milliseconds for fifo_wedge_count " + "(default 60000)"); +unsigned int nouveau_fifo_wedge_window_ms =3D 60000; 
+module_param_named(fifo_wedge_window_ms, nouveau_fifo_wedge_window_ms, uin= t, 0400); + static struct drm_driver driver_stub; static struct drm_driver driver_pci; static struct drm_driver driver_platform; @@ -1495,6 +1512,18 @@ nouveau_drm_init(void) if (!nouveau_modeset) return 0; =20 + if (nouveau_fifo_wedge_count > NVKM_FIFO_WEDGE_RING_MAX) { + pr_warn("nouveau: fifo_wedge_count=3D%u exceeds max %u; clamping\n", + nouveau_fifo_wedge_count, NVKM_FIFO_WEDGE_RING_MAX); + nouveau_fifo_wedge_count =3D NVKM_FIFO_WEDGE_RING_MAX; + } + if (nouveau_fifo_wedge_window_ms < 100 || + nouveau_fifo_wedge_window_ms > 600000) { + pr_warn("nouveau: fifo_wedge_window_ms=3D%u out of range; resetting to 6= 0000\n", + nouveau_fifo_wedge_window_ms); + nouveau_fifo_wedge_window_ms =3D 60000; + } + nouveau_module_debugfs_init(); =20 #ifdef CONFIG_NOUVEAU_PLATFORM_DRIVER diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/Kbuild b/drivers/gpu/= drm/nouveau/nvkm/engine/fifo/Kbuild index 376e9c3bcb1a..1ff29753731d 100644 --- a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/Kbuild +++ b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/Kbuild @@ -5,6 +5,7 @@ nvkm-y +=3D nvkm/engine/fifo/chan.o nvkm-y +=3D nvkm/engine/fifo/chid.o nvkm-y +=3D nvkm/engine/fifo/runl.o nvkm-y +=3D nvkm/engine/fifo/runq.o +nvkm-y +=3D nvkm/engine/fifo/recover.o =20 nvkm-y +=3D nvkm/engine/fifo/nv04.o nvkm-y +=3D nvkm/engine/fifo/nv10.o diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/base.c b/drivers/gpu/= drm/nouveau/nvkm/engine/fifo/base.c index 9dd924694306..a61183fa38af 100644 --- a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/base.c +++ b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/base.c @@ -337,6 +337,8 @@ nvkm_fifo_dtor(struct nvkm_engine *engine) struct nvkm_runl *runl, *runt; struct nvkm_runq *runq, *rtmp; =20 + nv04_fifo_wedge_fini(fifo); + if (fifo->userd.bar1) nvkm_vmm_put(nvkm_bar_bar1_vmm(engine->subdev.device), &fifo->userd.bar1= ); nvkm_memory_unref(&fifo->userd.mem); @@ -390,6 +392,7 @@ nvkm_fifo_new_(const 
struct nvkm_fifo_func *func, struc= t nvkm_device *device, fifo->timeout.chan_msec =3D 10000; spin_lock_init(&fifo->lock); mutex_init(&fifo->mutex); + nv04_fifo_wedge_init(fifo); =20 return nvkm_engine_ctor(&nvkm_fifo, device, type, inst, true, &fifo->engi= ne); } diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/nv04.c b/drivers/gpu/= drm/nouveau/nvkm/engine/fifo/nv04.c index fa13cd55b593..cb81941ecccd 100644 --- a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/nv04.c +++ b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/nv04.c @@ -345,6 +345,8 @@ nv04_fifo_intr_cache_error(struct nvkm_fifo *fifo, u32 = chid, u32 get) chid, chan ? chan->name : "unknown", (mthd >> 13) & 7, mthd & 0x1ffc, data); nvkm_chan_put(&chan, flags); + nv04_fifo_recover(fifo, chid, NV04_FAULT_CACHE_ERROR, + ((u64)mthd << 32) | data); } } =20 @@ -410,6 +412,8 @@ nv04_fifo_intr_dma_pusher(struct nvkm_fifo *fifo, u32 c= hid) } nvkm_chan_put(&chan, flags); =20 + nv04_fifo_recover(fifo, chid, NV04_FAULT_DMA_PUSHER, state); + nvkm_wr32(device, 0x003228, 0x00000000); nvkm_wr32(device, 0x003220, 0x00000001); nvkm_wr32(device, 0x002100, NV_PFIFO_INTR_DMA_PUSHER); diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/priv.h b/drivers/gpu/= drm/nouveau/nvkm/engine/fifo/priv.h index fff1428ef267..bf551906dcd4 100644 --- a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/priv.h +++ b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/priv.h @@ -83,6 +83,16 @@ void nv04_chan_start(struct nvkm_chan *); void nv04_chan_stop(struct nvkm_chan *); void nv04_eobj_ramht_del(struct nvkm_chan *, int); =20 +/* Recovery helper for Tesla cache_error/dma_pusher (recover.c). 
*/ +#define NV04_FAULT_CACHE_ERROR 0 +#define NV04_FAULT_DMA_PUSHER 1 + +void nv04_fifo_recover(struct nvkm_fifo *fifo, u32 chid, u32 fault_type, u= 64 info); +void nv04_fifo_wedge_init(struct nvkm_fifo *fifo); +void nv04_fifo_wedge_fini(struct nvkm_fifo *fifo); +extern unsigned int nouveau_fifo_wedge_count; +extern unsigned int nouveau_fifo_wedge_window_ms; + int nv10_fifo_chid_nr(struct nvkm_fifo *); =20 int nv50_fifo_chid_nr(struct nvkm_fifo *); diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/recover.c b/drivers/g= pu/drm/nouveau/nvkm/engine/fifo/recover.c new file mode 100644 index 000000000000..14c0eebf7040 --- /dev/null +++ b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/recover.c @@ -0,0 +1,121 @@ +// SPDX-License-Identifier: MIT +/* + * nv04_fifo_recover - shared recovery helper for Tesla cache_error and + * dma_pusher fault paths. + * + * Tier-1: kill the offending channel via nvkm_chan_error. + * Tier-2: after a configurable burst of faults within a sliding time + * window, request a device-wide drm_dev_wedged_event so userspace + * can rebind the driver. + */ + +#include "priv.h" +#include "chan.h" + +#include +#include + +#include +#include +#include +#include +#include + +#include "nouveau_drv.h" +#include + +static struct drm_device * +nv04_fifo_drm_device(struct nvkm_fifo *fifo) +{ + struct nvkm_device *device =3D fifo->engine.subdev.device; + struct nouveau_drm *drm =3D dev_get_drvdata(device->dev); + + return (drm && drm->dev) ? 
drm->dev : NULL; +} + +void +nv04_fifo_recover(struct nvkm_fifo *fifo, u32 chid, u32 fault_type, u64 in= fo) +{ + struct drm_device *drm_dev =3D nv04_fifo_drm_device(fifo); + struct nvkm_chan *chan; + unsigned long flags; + ktime_t now, cutoff; + u32 i, count; + + if (drm_dev) + trace_nouveau_fifo_chan_killed(drm_dev, chid, fault_type, info); + + chan =3D nvkm_chan_get_chid(&fifo->engine, chid, &flags); + if (chan) { + nvkm_chan_error(chan, true); + nvkm_chan_put(&chan, flags); + } + + if (nouveau_fifo_wedge_count =3D=3D 0) + return; + + now =3D ktime_get(); + cutoff =3D ktime_sub_ms(now, nouveau_fifo_wedge_window_ms); + + spin_lock_irqsave(&fifo->wedge.lock, flags); + + /* Insert current first, then purge expired and count survivors. */ + fifo->wedge.ts[fifo->wedge.head] =3D now; + fifo->wedge.head =3D (fifo->wedge.head + 1) % NVKM_FIFO_WEDGE_RING_MAX; + + count =3D 0; + for (i =3D 0; i < NVKM_FIFO_WEDGE_RING_MAX; i++) { + if (!ktime_to_ns(fifo->wedge.ts[i])) + continue; + if (ktime_before(fifo->wedge.ts[i], cutoff)) + fifo->wedge.ts[i] =3D 0; + else + count++; + } + fifo->wedge.count =3D count; + + if (count >=3D nouveau_fifo_wedge_count) + schedule_work(&fifo->wedge.work); + + spin_unlock_irqrestore(&fifo->wedge.lock, flags); +} + +static void +nv04_fifo_wedge_work(struct work_struct *work) +{ + struct nvkm_fifo_wedge *w =3D container_of(work, struct nvkm_fifo_wedge, = work); + struct nvkm_fifo *fifo =3D container_of(w, struct nvkm_fifo, wedge); + struct drm_device *drm_dev =3D nv04_fifo_drm_device(fifo); + u32 fault_count; + + if (atomic_xchg(&w->wedged, 1) !=3D 0) + return; /* already wedged this cycle */ + + if (!drm_dev) + return; + + fault_count =3D w->count; + + dev_info(drm_dev->dev, + "nouveau: fifo wedged after %u faults in %u ms\n", + fault_count, nouveau_fifo_wedge_window_ms); + + trace_nouveau_fifo_dev_wedged(drm_dev, fault_count, + nouveau_fifo_wedge_window_ms); + + drm_dev_wedged_event(drm_dev, DRM_WEDGE_RECOVERY_REBIND, NULL); +} + +void 
+nv04_fifo_wedge_init(struct nvkm_fifo *fifo) +{ + spin_lock_init(&fifo->wedge.lock); + INIT_WORK(&fifo->wedge.work, nv04_fifo_wedge_work); + atomic_set(&fifo->wedge.wedged, 0); +} + +void +nv04_fifo_wedge_fini(struct nvkm_fifo *fifo) +{ + cancel_work_sync(&fifo->wedge.work); +} --=20 2.53.0
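
Note for readers (not part of the patch): the Tier-2 sliding-window count
in nv04_fifo_recover() can be sketched in plain userspace C. This is only
an illustration under stated assumptions, not the kernel code: struct
fault_ring, fault_ring_record() and fault_ring_should_wedge() are
hypothetical names, timestamps are plain milliseconds instead of ktime_t,
and the spinlock and workqueue are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RING_MAX 32u /* mirrors NVKM_FIFO_WEDGE_RING_MAX from the patch */

/* Hypothetical userspace stand-in for struct nvkm_fifo_wedge. */
struct fault_ring {
	uint64_t ts_ms[RING_MAX]; /* 0 marks an empty slot, as in the patch */
	uint32_t head;            /* next slot to overwrite */
};

static void fault_ring_init(struct fault_ring *r)
{
	memset(r, 0, sizeof(*r));
}

/*
 * Record a fault at now_ms, drop entries older than the window and
 * return how many faults (including this one) remain inside it.
 */
static uint32_t fault_ring_record(struct fault_ring *r, uint64_t now_ms,
				  uint64_t window_ms)
{
	uint64_t cutoff = now_ms > window_ms ? now_ms - window_ms : 0;
	uint32_t i, count = 0;

	assert(now_ms != 0); /* 0 is reserved as the empty-slot marker */

	/* Insert first, then purge expired entries and count survivors. */
	r->ts_ms[r->head] = now_ms;
	r->head = (r->head + 1) % RING_MAX;

	for (i = 0; i < RING_MAX; i++) {
		if (r->ts_ms[i] == 0)
			continue;
		if (r->ts_ms[i] < cutoff)
			r->ts_ms[i] = 0; /* expired */
		else
			count++;
	}
	return count;
}

/* A threshold of 0 disables the check, like fifo_wedge_count=0. */
static bool fault_ring_should_wedge(uint32_t count, uint32_t threshold)
{
	return threshold != 0 && count >= threshold;
}
```

With a 500 ms window, faults recorded at 1000, 1200, 1600 and 1700 ms
give counts of 1, 2, 2 and 3: the 1000 ms entry expires once the cutoff
passes it, while an entry sitting exactly on the cutoff still counts,
which matches the ktime_before() comparison used in the patch.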