From: Shuai Xue
To: alexander.deucher@amd.com, christian.koenig@amd.com, Xinhui.Pan@amd.com,
    airlied@gmail.com, simona@ffwll.ch, lijo.lazar@amd.com, le.ma@amd.com,
    hamza.mahfooz@amd.com, tzimmermann@suse.de, shaoyun.liu@amd.com,
    Jun.Ma2@amd.com
Cc: xueshuai@linux.alibaba.com, amd-gfx@lists.freedesktop.org,
    dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org
Subject: [PATCH] drm/amdgpu: Enable runtime modification of gpu_recovery parameter with validation
Date: Sat, 28 Dec 2024 14:32:45 +0800
Message-ID: <20241228063245.61874-1-xueshuai@linux.alibaba.com>

It is observed that most GPU jobs use less than one full server,
typically with each GPU being used by an independent job. If a job
consumes poisoned data, a SIGBUS signal is sent to terminate it.
Meanwhile, because the gpu_recovery parameter is set to -1 by default,
the amdgpu driver resets all GPUs on the server. As a result, all jobs
are terminated. Setting gpu_recovery to 0 provides an opportunity to
preemptively evacuate the other jobs and then manually reset all GPUs.

However, this parameter is read-only, so it must be set correctly at
driver load. Reloading the GPU driver in a production environment can
be challenging because of the reference counts held by various
monitoring services.

Make the gpu_recovery parameter read-write to enable runtime modification.
It enables users to dynamically manage GPU recovery mechanisms based on
real-time requirements or conditions.

Signed-off-by: Shuai Xue
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 26 ++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 38686203bea6..03dd902e1cec 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -563,12 +563,36 @@ module_param_named(lbpw, amdgpu_lbpw, int, 0444);
 MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be spread across pipes (1 = enable, 0 = disable, -1 = auto)");
 module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
 
+static int amdgpu_set_gpu_recovery(const char *buf,
+				   const struct kernel_param *kp)
+{
+	long val;
+	int ret;
+
+	ret = kstrtol(buf, 10, &val);
+	if (ret < 0)
+		return ret;
+
+	if (val != 1 && val != 0 && val != -1) {
+		pr_err("Invalid value for gpu_recovery: %ld, expected 0,1,-1\n",
+		       val);
+		return -EINVAL;
+	}
+
+	return param_set_int(buf, kp);
+}
+
+static const struct kernel_param_ops amdgpu_gpu_recovery_ops = {
+	.set = amdgpu_set_gpu_recovery,
+	.get = param_get_int,
+};
+
 /**
  * DOC: gpu_recovery (int)
  * Set to enable GPU recovery mechanism (1 = enable, 0 = disable). The default is -1 (auto, disabled except SRIOV).
  */
 MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)");
-module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
+module_param_cb(gpu_recovery, &amdgpu_gpu_recovery_ops, &amdgpu_gpu_recovery, 0644);
 
 /**
  * DOC: emu_mode (int)
-- 
2.39.3
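
For readers who want to check the accept/reject behaviour of the new set
callback without building a kernel, the validation can be sketched in
userspace. This is only an illustration: parse_gpu_recovery() is a
hypothetical helper that is not part of the patch, it uses strtol() in place
of the kernel's kstrtol(), and it assumes a single trailing newline should
be tolerated, as in a write done with echo.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical userspace stand-in for the patch's set callback.
 * Accepts only -1, 0 or 1 (optionally followed by one newline) and
 * stores the value in *out; returns 0 on success, -EINVAL otherwise.
 * *out is left untouched on failure.
 */
static int parse_gpu_recovery(const char *buf, int *out)
{
	char *end;
	long val;

	assert(buf != NULL && out != NULL);

	errno = 0;
	val = strtol(buf, &end, 10);
	if (errno != 0 || end == buf)
		return -EINVAL;		/* overflow or no digits at all */

	if (*end == '\n')		/* assumption: allow one trailing newline */
		end++;
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage */

	if (val != 1 && val != 0 && val != -1)
		return -EINVAL;		/* outside the documented set */

	*out = (int)val;
	return 0;
}
```

With the patch applied, the parameter would be expected to show up as
writable under /sys/module/amdgpu/parameters/gpu_recovery, the usual sysfs
location for a module parameter registered with mode 0644. Writes of values
other than -1, 0 or 1 should then fail with -EINVAL.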