From: Lv Ying
Subject: [RFC PATCH v1 2/2] ACPI: APEI: fix reboot caused by synchronous error loop because of memory_failure() failed
Date: Wed, 7 Dec 2022 17:39:35 +0800
Message-ID: <20221207093935.1972530-3-lvying6@huawei.com>
In-Reply-To: <20221207093935.1972530-1-lvying6@huawei.com>
References: <20221207093935.1972530-1-lvying6@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

When user space accesses a corrupt memory location, the error is
detected synchronously and the CPU may take an abort. On arm64 this is
a 'synchronous external abort', which can be notified by SEA.

If memory_failure() fails, returning to user space triggers the SEA
again. Such a loop may cause platform firmware to exceed some threshold
and reboot, even though Linux could have recovered from the error.

Not every memory_failure() failure leads to the reboot: the
VM_FAULT_HWPOISON[_LARGE] handling in the arm64 page fault path sends
SIGBUS to the accessing user-space process, which terminates the loop.
However, if the process still maps the faulty page because
memory_failure() returned abnormally before try_to_unmap(), for example
when the faulty page is mapped as a KSM page, arm64 cannot rely on the
page fault path to terminate the loop.

Check the result of memory_failure() in task_work before returning to
user space. If memory_failure() failed, send SIGBUS to the current
process to avoid the SEA loop.

Signed-off-by: Lv Ying
---
 mm/memory-failure.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3b6ac3694b8d..07ec7b62f330 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2255,7 +2255,7 @@ static void __memory_failure_work_func(struct work_struct *work, bool sync)
 	struct memory_failure_cpu *mf_cpu;
 	struct memory_failure_entry entry = { 0, };
 	unsigned long proc_flags;
-	int gotten;
+	int gotten, ret;
 
 	mf_cpu = container_of(work, struct memory_failure_cpu, work);
 	for (;;) {
@@ -2266,7 +2266,16 @@ static void __memory_failure_work_func(struct work_struct *work, bool sync)
 			break;
 		if (entry.flags & MF_SOFT_OFFLINE)
 			soft_offline_page(entry.pfn, entry.flags);
-		else if (!sync || (entry.flags & MF_ACTION_REQUIRED))
+		else if (sync) {
+			if (entry.flags & MF_ACTION_REQUIRED) {
+				ret = memory_failure(entry.pfn, entry.flags);
+				if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
+					return;
+
+				pr_err("Memory error not recovered");
+				force_sig(SIGBUS);
+			}
+		} else
 			memory_failure(entry.pfn, entry.flags);
 	}
 }
-- 
2.36.1
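
For readers following the branch structure of the hunk above, here is a small user-space sketch of the per-entry decision as this patch writes it. It is not kernel code and not part of the patch: the MODEL_* names, struct model_entry and model_handle_entry() are hypothetical, the flag values are arbitrary, and the return value of memory_failure() is passed in as data rather than computed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* EHWPOISON is Linux-specific; the fallback value is only a placeholder. */
#ifndef EHWPOISON
#define EHWPOISON 133
#endif

/* Hypothetical stand-ins for the kernel's MF_* flags; values are arbitrary. */
#define MODEL_MF_ACTION_REQUIRED (1 << 0)
#define MODEL_MF_SOFT_OFFLINE    (1 << 1)

/* What the modeled loop body does with one queued entry. */
enum model_action {
	MODEL_SOFT_OFFLINE,	/* soft_offline_page() would be called */
	MODEL_MF_ONLY,		/* memory_failure() called, result ignored */
	MODEL_NOTHING,		/* sync context without MF_ACTION_REQUIRED */
	MODEL_RETURN,		/* early return from the work function, no signal */
	MODEL_SIGBUS,		/* pr_err() followed by force_sig(SIGBUS) */
};

/* Hypothetical entry: flags plus the value memory_failure() would return. */
struct model_entry {
	int flags;
	int mf_ret;
};

/* Mirrors the if/else chain of the patched loop body for a single entry. */
enum model_action model_handle_entry(const struct model_entry *e, bool sync)
{
	assert(e != NULL);

	if (e->flags & MODEL_MF_SOFT_OFFLINE)
		return MODEL_SOFT_OFFLINE;

	/* Asynchronous path: unchanged, the result is not inspected. */
	if (!sync)
		return MODEL_MF_ONLY;

	/* Synchronous path only acts on MF_ACTION_REQUIRED entries. */
	if (!(e->flags & MODEL_MF_ACTION_REQUIRED))
		return MODEL_NOTHING;

	/* These two results return without signalling the task. */
	if (e->mf_ret == -EHWPOISON || e->mf_ret == -EOPNOTSUPP)
		return MODEL_RETURN;

	/* Any other return value falls through to the SIGBUS path. */
	return MODEL_SIGBUS;
}
```

For example, an entry with MODEL_MF_ACTION_REQUIRED set and mf_ret of -EBUSY yields MODEL_SIGBUS when sync is true, and MODEL_MF_ONLY when sync is false.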