From: Lv Ying
Subject: [RFC PATCH v1 1/2] ACPI: APEI: Make memory_failure() triggered by synchronous errors execute in the current context
Date: Wed, 7 Dec 2022 17:39:34 +0800
Message-ID: <20221207093935.1972530-2-lvying6@huawei.com>
In-Reply-To: <20221207093935.1972530-1-lvying6@huawei.com>
References: <20221207093935.1972530-1-lvying6@huawei.com>
List-ID: linux-kernel@vger.kernel.org

A memory uncorrected error detected by an external component and notified
via an IRQ can be called an asynchronous error. If an error is instead
detected when a user-space process accesses a corrupt memory location, the
CPU may take an abort. On arm64 this is a 'synchronous external abort'; on
a firmware-first system it is notified via NOTIFY_SEA, and can be called a
synchronous error.

Currently, synchronous and asynchronous errors both use
memory_failure_queue() to schedule memory_failure() to execute in kworker
context. Commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue
for synchronous errors") makes a task_work pending to flush out the queue,
but cancel_work_sync() in memory_failure_queue_kick() causes
memory_failure() to execute in kworker context first, consuming the
synchronous error info from the kfifo, so the later task_work finds
nothing in the kfifo and does not work as expected. Even worse, a
synchronous error notification has NMI-like properties (it can interrupt
IRQ-masked code), so the task_work may fetch the wrong kfifo entry, one
belonging to an interrupted asynchronous error notified via IRQ.

Because memory_failure() triggered by a synchronous exception executes in
kworker context, the early_kill mode of memory_failure() sends the wrong
si_code with the SIGBUS signal: the current process is the kworker thread,
so the actual user-space process that accessed the corrupt memory location
is collected by find_early_kill_thread() and then receives SIGBUS with
si_code BUS_MCEERR_AO instead of BUS_MCEERR_AR. Machine managers (e.g.
kvm) use BUS_MCEERR_AO for 'action optional' early notifications and
BUS_MCEERR_AR for 'action required' synchronous/late notifications.

Make memory_failure() triggered by synchronous errors execute in the
current context: the workqueue is no longer needed for synchronous errors,
and task_work handles them directly.
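The queue/drain split described above can be modeled in user-space. This is an illustrative sketch, not kernel code: mf_queue(), mf_drain(), the plain array standing in for the per-CPU kfifo, and the flag value are all assumptions of this sketch; only the decision logic mirrors the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* MF_ACTION_REQUIRED mirrors the kernel flag name; the value is arbitrary. */
#define MF_ACTION_REQUIRED 0x01
#define FIFO_CAP 16

struct mf_entry {
	unsigned long pfn;
	int flags;
};

static struct mf_entry fifo[FIFO_CAP];
static int fifo_len;

/* Queue an error record. Only asynchronous errors (no MF_ACTION_REQUIRED)
 * kick the kworker; synchronous ones are left for task_work. Returns
 * whether a kworker would be scheduled. */
static bool mf_queue(unsigned long pfn, int flags)
{
	if (fifo_len == FIFO_CAP)
		return false;	/* overflow: the kernel logs and drops it */
	fifo[fifo_len].pfn = pfn;
	fifo[fifo_len].flags = flags;
	fifo_len++;
	return !(flags & MF_ACTION_REQUIRED);
}

/* Drain the fifo the way the patched __memory_failure_work_func() does:
 * every entry is popped, but on the sync (task_work) path only
 * MF_ACTION_REQUIRED entries are acted on. Returns how many were handled. */
static int mf_drain(bool sync)
{
	int handled = 0;

	for (int i = 0; i < fifo_len; i++)
		if (!sync || (fifo[i].flags & MF_ACTION_REQUIRED))
			handled++;	/* memory_failure(pfn, flags) would run here */
	fifo_len = 0;
	return handled;
}
```

Note that in this model (as in the patch's drain loop) a sync drain also pops any asynchronous entries it skips, which is one of the sharp edges of sharing a single kfifo between the two paths.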
Since synchronous and asynchronous errors share the same kfifo, use the
MF_ACTION_REQUIRED flag to distinguish them. Asynchronous error handling
stays the same as before.

Signed-off-by: Lv Ying
---
 drivers/acpi/apei/ghes.c | 36 +++++++++++++++++++++++-------------
 mm/memory-failure.c      | 34 ++++++++++++++++++++++------------
 2 files changed, 45 insertions(+), 25 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 9952f3a792ba..1ff3756e70d4 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -423,8 +423,8 @@ static void ghes_clear_estatus(struct ghes *ghes,
 
 /*
  * Called as task_work before returning to user-space.
- * Ensure any queued work has been done before we return to the context that
- * triggered the notification.
+ * Ensure any queued corrupt page in synchronous errors has been handled before
+ * we return to the user context that triggered the notification.
  */
 static void ghes_kick_task_work(struct callback_head *head)
 {
@@ -461,7 +461,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
 }
 
 static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
-				       int sev)
+				       int sev, bool sync)
 {
 	int flags = -1;
 	int sec_sev = ghes_severity(gdata->error_severity);
@@ -475,7 +475,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
 	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
 		flags = MF_SOFT_OFFLINE;
 	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
-		flags = 0;
+		flags = sync ? MF_ACTION_REQUIRED : 0;
 
 	if (flags != -1)
 		return ghes_do_memory_failure(mem_err->physical_addr, flags);
@@ -483,7 +483,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
 	return false;
 }
 
-static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
+static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev, bool sync)
 {
 	struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
 	bool queued = false;
@@ -510,7 +510,8 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
 		 * and don't filter out 'corrected' error here.
 		 */
 		if (is_cache && has_pa) {
-			queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
+			queued = ghes_do_memory_failure(err_info->physical_fault_addr,
+							sync ? MF_ACTION_REQUIRED : 0);
 			p += err_info->length;
 			continue;
 		}
@@ -623,7 +624,7 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
 }
 
 static bool ghes_do_proc(struct ghes *ghes,
-			 const struct acpi_hest_generic_status *estatus)
+			 const struct acpi_hest_generic_status *estatus, bool sync)
 {
 	int sev, sec_sev;
 	struct acpi_hest_generic_data *gdata;
@@ -648,13 +649,13 @@ static bool ghes_do_proc(struct ghes *ghes,
 			ghes_edac_report_mem_error(sev, mem_err);
 
 			arch_apei_report_mem_error(sev, mem_err);
-			queued = ghes_handle_memory_failure(gdata, sev);
+			queued = ghes_handle_memory_failure(gdata, sev, sync);
 		} else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
 			ghes_handle_aer(gdata);
 		} else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
-			queued = ghes_handle_arm_hw_error(gdata, sev);
+			queued = ghes_handle_arm_hw_error(gdata, sev, sync);
 		} else {
 			void *err = acpi_hest_get_payload(gdata);
 
@@ -868,7 +869,7 @@ static int ghes_proc(struct ghes *ghes)
 		if (ghes_print_estatus(NULL, ghes->generic, estatus))
 			ghes_estatus_cache_add(ghes->generic, estatus);
 	}
-	ghes_do_proc(ghes, estatus);
+	ghes_do_proc(ghes, estatus, false);
 
 out:
 	ghes_clear_estatus(ghes, estatus, buf_paddr, FIX_APEI_GHES_IRQ);
@@ -955,13 +956,22 @@ static struct notifier_block ghes_notifier_hed = {
 static struct llist_head ghes_estatus_llist;
 static struct irq_work ghes_proc_irq_work;
 
+/*
+ * TODO: ghes_proc_in_irq() can currently be called for the SDEI, SEA and
+ * NMI notify types, and it runs some work that may not be NMI safe in IRQ
+ * context. However, there is no good way to tell whether an NMI-like
+ * notify type is synchronous or not. The current implementation of
+ * ghes_proc_in_irq() (ghes_kick_task_work etc.) already implies use in
+ * synchronous scenarios, so stay consistent with that.
+ */
 static void ghes_proc_in_irq(struct irq_work *irq_work)
 {
 	struct llist_node *llnode, *next;
 	struct ghes_estatus_node *estatus_node;
 	struct acpi_hest_generic *generic;
 	struct acpi_hest_generic_status *estatus;
-	bool task_work_pending;
+	bool corruption_page_pending;
 	u32 len, node_len;
 	int ret;
 
@@ -978,14 +988,14 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
 		estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
 		len = cper_estatus_len(estatus);
 		node_len = GHES_ESTATUS_NODE_LEN(len);
-		task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
+		corruption_page_pending = ghes_do_proc(estatus_node->ghes, estatus, true);
 		if (!ghes_estatus_cached(estatus)) {
 			generic = estatus_node->generic;
 			if (ghes_print_estatus(NULL, generic, estatus))
 				ghes_estatus_cache_add(generic, estatus);
 		}
 
-		if (task_work_pending && current->mm) {
+		if (corruption_page_pending && current->mm) {
 			estatus_node->task_work.func = ghes_kick_task_work;
 			estatus_node->task_work_cpu = smp_processor_id();
 			ret = task_work_add(current, &estatus_node->task_work,
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index bead6bccc7f2..3b6ac3694b8d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2204,7 +2204,11 @@ struct memory_failure_cpu {
 static DEFINE_PER_CPU(struct memory_failure_cpu, memory_failure_cpu);
 
 /**
- * memory_failure_queue - Schedule handling memory failure of a page.
+ * memory_failure_queue
+ * - Schedule handling memory failure of a page for an asynchronous error;
+ *   the failed page will be handled in a kworker thread
+ * - Put corrupt memory info into the kfifo for a synchronous error;
+ *   task_work will handle it before returning to user-space
  * @pfn: Page Number of the corrupted page
  * @flags: Flags for memory failure handling
  *
@@ -2217,6 +2221,11 @@ static DEFINE_PER_CPU(struct memory_failure_cpu, memory_failure_cpu);
  *		happen outside the current execution context (e.g. when
  *		detected by a background scrubber)
  *
+ * This function can also be used for a synchronous error, one detected as
+ * a result of user-space accessing a corrupt memory location: the memory
+ * error info is just put into the kfifo, and task_work then fetches and
+ * handles it in the current execution context instead of scheduling a
+ * kworker to handle it
+ *
  * Can run in IRQ context.
  */
 void memory_failure_queue(unsigned long pfn, int flags)
@@ -2230,9 +2239,10 @@ void memory_failure_queue(unsigned long pfn, int flags)
 
 	mf_cpu = &get_cpu_var(memory_failure_cpu);
 	spin_lock_irqsave(&mf_cpu->lock, proc_flags);
-	if (kfifo_put(&mf_cpu->fifo, entry))
-		schedule_work_on(smp_processor_id(), &mf_cpu->work);
-	else
+	if (kfifo_put(&mf_cpu->fifo, entry)) {
+		if (!(entry.flags & MF_ACTION_REQUIRED))
+			schedule_work_on(smp_processor_id(), &mf_cpu->work);
+	} else
 		pr_err("buffer overflow when queuing memory failure at %#lx\n",
 		       pfn);
 	spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
@@ -2240,7 +2250,7 @@ void memory_failure_queue(unsigned long pfn, int flags)
 }
 EXPORT_SYMBOL_GPL(memory_failure_queue);
 
-static void memory_failure_work_func(struct work_struct *work)
+static void __memory_failure_work_func(struct work_struct *work, bool sync)
 {
 	struct memory_failure_cpu *mf_cpu;
 	struct memory_failure_entry entry = { 0, };
@@ -2256,22 +2266,22 @@ static void memory_failure_work_func(struct work_struct *work)
 			break;
 		if (entry.flags & MF_SOFT_OFFLINE)
 			soft_offline_page(entry.pfn, entry.flags);
-		else
+		else if (!sync || (entry.flags & MF_ACTION_REQUIRED))
 			memory_failure(entry.pfn, entry.flags);
 	}
 }
 
-/*
- * Process memory_failure work queued on the specified CPU.
- * Used to avoid return-to-userspace racing with the memory_failure workqueue.
- */
+static void memory_failure_work_func(struct work_struct *work)
+{
+	__memory_failure_work_func(work, false);
+}
+
 void memory_failure_queue_kick(int cpu)
 {
 	struct memory_failure_cpu *mf_cpu;
 
 	mf_cpu = &per_cpu(memory_failure_cpu, cpu);
-	cancel_work_sync(&mf_cpu->work);
-	memory_failure_work_func(&mf_cpu->work);
+	__memory_failure_work_func(&mf_cpu->work, true);
 }
 
 static int __init memory_failure_init(void)
-- 
2.36.1

From: Lv Ying
Subject: [RFC PATCH v1 2/2] ACPI: APEI: fix reboot caused by synchronous error loop because memory_failure() failed
Date: Wed, 7 Dec 2022 17:39:35 +0800
Message-ID: <20221207093935.1972530-3-lvying6@huawei.com>
In-Reply-To:
<20221207093935.1972530-1-lvying6@huawei.com>
References: <20221207093935.1972530-1-lvying6@huawei.com>
List-ID: linux-kernel@vger.kernel.org

A synchronous error is detected as a result of user-space accessing a
corrupt memory location; the CPU takes an abort instead. On arm64 this is
a 'synchronous external abort', which can be notified via SEA. If
memory_failure() fails, returning to user-space triggers the SEA again,
and such a loop may cause the platform firmware to exceed some threshold
and reboot, when Linux could have recovered from the error.

Not every memory_failure() processing failure causes such a reboot: the
VM_FAULT_HWPOISON[_LARGE] handling in the arm64 page fault path sends a
SIGBUS signal to the accessing user-space process, terminating the loop.
But if the process has the faulting page mapped while memory_failure()
returns abnormally before try_to_unmap() (for example, when the faulting
page is mapped as a KSM page), arm64 cannot use the page fault path to
terminate the loop.

Add a check of the memory_failure() result in the task_work before
returning to user-space. If memory_failure() failed, send a SIGBUS signal
to the current process to avoid the SEA loop.
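The return-value check this patch adds can be sketched as a small user-space model. must_force_sigbus() and the hard-coded errno constants are assumptions of this sketch (EHWPOISON is 133 and EOPNOTSUPP is 95 on Linux arm64/x86-64); only the decision logic mirrors the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Errno values as defined on Linux arm64/x86-64. */
#define EHWPOISON  133
#define EOPNOTSUPP 95

/* Per the patch as written, only -EHWPOISON (page already poisoned and
 * handled) and -EOPNOTSUPP (handler deliberately opted out) let the task
 * return to user-space quietly; any other memory_failure() result,
 * including 0, falls through to pr_err() + force_sig(SIGBUS) so the SEA
 * cannot loop. */
static bool must_force_sigbus(int mf_ret)
{
	return mf_ret != -EHWPOISON && mf_ret != -EOPNOTSUPP;
}
```

Worth noting during review: a plain success (return 0) also trips the SIGBUS path under this check, since only the two listed error codes are exempted.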
Signed-off-by: Lv Ying
---
 mm/memory-failure.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3b6ac3694b8d..07ec7b62f330 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2255,7 +2255,7 @@ static void __memory_failure_work_func(struct work_struct *work, bool sync)
 	struct memory_failure_cpu *mf_cpu;
 	struct memory_failure_entry entry = { 0, };
 	unsigned long proc_flags;
-	int gotten;
+	int gotten, ret;
 
 	mf_cpu = container_of(work, struct memory_failure_cpu, work);
 	for (;;) {
@@ -2266,7 +2266,16 @@ static void __memory_failure_work_func(struct work_struct *work, bool sync)
 			break;
 		if (entry.flags & MF_SOFT_OFFLINE)
 			soft_offline_page(entry.pfn, entry.flags);
-		else if (!sync || (entry.flags & MF_ACTION_REQUIRED))
+		else if (sync) {
+			if (entry.flags & MF_ACTION_REQUIRED) {
+				ret = memory_failure(entry.pfn, entry.flags);
+				if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
+					return;
+
+				pr_err("Memory error not recovered");
+				force_sig(SIGBUS);
+			}
+		} else
 			memory_failure(entry.pfn, entry.flags);
 	}
 }
-- 
2.36.1