From: Lizhi Hou
CC: Lizhi Hou
Subject: [PATCH V8 09/10] accel/amdxdna: Add error handling
Date: Mon, 11 Nov 2024 09:32:29 -0800
Message-ID: <20241111173230.655325-10-lizhi.hou@amd.com>
In-Reply-To: <20241111173230.655325-1-lizhi.hou@amd.com>
References: <20241111173230.655325-1-lizhi.hou@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org

When there is a hardware error, the NPU firmware notifies the host through
a mailbox message. The message includes details of the error, such as the
tile and column indexes where the error occurred.

The driver starts a thread to handle the NPU error message. The thread
stops the clients which are using the column where the error occurred. Then
the driver resets that column.

Co-developed-by: Min Ma
Signed-off-by: Min Ma
Reviewed-by: Jeffrey Hugo
Signed-off-by: Lizhi Hou
---
 drivers/accel/amdxdna/Makefile       |   1 +
 drivers/accel/amdxdna/aie2_error.c   | 360 +++++++++++++++++++++++++++
 drivers/accel/amdxdna/aie2_message.c |  19 ++
 drivers/accel/amdxdna/aie2_pci.c     |  32 +++
 drivers/accel/amdxdna/aie2_pci.h     |   9 +
 5 files changed, 421 insertions(+)
 create mode 100644 drivers/accel/amdxdna/aie2_error.c

diff --git a/drivers/accel/amdxdna/Makefile b/drivers/accel/amdxdna/Makefile
index a688c378761f..ed6f87910880 100644
--- a/drivers/accel/amdxdna/Makefile
+++ b/drivers/accel/amdxdna/Makefile
@@ -2,6 +2,7 @@
 
 amdxdna-y := \
 	aie2_ctx.o \
+	aie2_error.o \
 	aie2_message.o \
 	aie2_pci.o \
 	aie2_psp.o \
diff --git a/drivers/accel/amdxdna/aie2_error.c b/drivers/accel/amdxdna/aie2_error.c
new file mode 100644
index 000000000000..b1defaa8513b
--- /dev/null
+++ b/drivers/accel/amdxdna/aie2_error.c
@@ -0,0 +1,360 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2023-2024, Advanced Micro Devices, Inc.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "aie2_msg_priv.h"
+#include "aie2_pci.h"
+#include "amdxdna_mailbox.h"
+#include "amdxdna_pci_drv.h"
+
+struct async_event {
+	struct amdxdna_dev_hdl *ndev;
+	struct async_event_msg_resp resp;
+	struct workqueue_struct *wq;
+	struct work_struct work;
+	u8 *buf;
+	dma_addr_t addr;
+	u32 size;
+};
+
+struct async_events {
+	struct workqueue_struct *wq;
+	u8 *buf;
+	dma_addr_t addr;
+	u32 size;
+	u32 event_cnt;
+	struct async_event event[] __counted_by(event_cnt);
+};
+
+/*
+ * The enum, struct and lookup tables below are ported from the XAIE util header file.
+ *
+ * The data below is defined by the AIE device and is used to decode error messages
+ * from the device.
+ */
+
+enum aie_module_type {
+	AIE_MEM_MOD = 0,
+	AIE_CORE_MOD,
+	AIE_PL_MOD,
+};
+
+enum aie_error_category {
+	AIE_ERROR_SATURATION = 0,
+	AIE_ERROR_FP,
+	AIE_ERROR_STREAM,
+	AIE_ERROR_ACCESS,
+	AIE_ERROR_BUS,
+	AIE_ERROR_INSTRUCTION,
+	AIE_ERROR_ECC,
+	AIE_ERROR_LOCK,
+	AIE_ERROR_DMA,
+	AIE_ERROR_MEM_PARITY,
+	/* Unknown is not from XAIE, added for better category */
+	AIE_ERROR_UNKNOWN,
+};
+
+/* Don't pack, unless XAIE side changed */
+struct aie_error {
+	__u8 row;
+	__u8 col;
+	__u32 mod_type;
+	__u8 event_id;
+};
+
+struct aie_err_info {
+	u32 err_cnt;
+	u32 ret_code;
+	u32 rsvd;
+	struct aie_error payload[] __counted_by(err_cnt);
+};
+
+struct aie_event_category {
+	u8 event_id;
+	enum aie_error_category category;
+};
+
+#define EVENT_CATEGORY(id, cat) { id, cat }
+static const struct aie_event_category aie_ml_mem_event_cat[] = {
+	EVENT_CATEGORY(88U, AIE_ERROR_ECC),
+	EVENT_CATEGORY(90U, AIE_ERROR_ECC),
+	EVENT_CATEGORY(91U, AIE_ERROR_MEM_PARITY),
+	EVENT_CATEGORY(92U, AIE_ERROR_MEM_PARITY),
+	EVENT_CATEGORY(93U, AIE_ERROR_MEM_PARITY),
+	EVENT_CATEGORY(94U, AIE_ERROR_MEM_PARITY),
+	EVENT_CATEGORY(95U, AIE_ERROR_MEM_PARITY),
+	EVENT_CATEGORY(96U, AIE_ERROR_MEM_PARITY),
+	EVENT_CATEGORY(97U, AIE_ERROR_DMA),
+	EVENT_CATEGORY(98U, AIE_ERROR_DMA),
+	EVENT_CATEGORY(99U, AIE_ERROR_DMA),
+	EVENT_CATEGORY(100U, AIE_ERROR_DMA),
+	EVENT_CATEGORY(101U, AIE_ERROR_LOCK),
+};
+
+static const struct aie_event_category aie_ml_core_event_cat[] = {
+	EVENT_CATEGORY(55U, AIE_ERROR_ACCESS),
+	EVENT_CATEGORY(56U, AIE_ERROR_STREAM),
+	EVENT_CATEGORY(57U, AIE_ERROR_STREAM),
+	EVENT_CATEGORY(58U, AIE_ERROR_BUS),
+	EVENT_CATEGORY(59U, AIE_ERROR_INSTRUCTION),
+	EVENT_CATEGORY(60U, AIE_ERROR_ACCESS),
+	EVENT_CATEGORY(62U, AIE_ERROR_ECC),
+	EVENT_CATEGORY(64U, AIE_ERROR_ECC),
+	EVENT_CATEGORY(65U, AIE_ERROR_ACCESS),
+	EVENT_CATEGORY(66U, AIE_ERROR_ACCESS),
+	EVENT_CATEGORY(67U, AIE_ERROR_LOCK),
+	EVENT_CATEGORY(70U, AIE_ERROR_INSTRUCTION),
+	EVENT_CATEGORY(71U, AIE_ERROR_STREAM),
+	EVENT_CATEGORY(72U, AIE_ERROR_BUS),
+};
+
+static const struct aie_event_category aie_ml_mem_tile_event_cat[] = {
+	EVENT_CATEGORY(130U, AIE_ERROR_ECC),
+	EVENT_CATEGORY(132U, AIE_ERROR_ECC),
+	EVENT_CATEGORY(133U, AIE_ERROR_DMA),
+	EVENT_CATEGORY(134U, AIE_ERROR_DMA),
+	EVENT_CATEGORY(135U, AIE_ERROR_STREAM),
+	EVENT_CATEGORY(136U, AIE_ERROR_STREAM),
+	EVENT_CATEGORY(137U, AIE_ERROR_STREAM),
+	EVENT_CATEGORY(138U, AIE_ERROR_BUS),
+	EVENT_CATEGORY(139U, AIE_ERROR_LOCK),
+};
+
+static const struct aie_event_category aie_ml_shim_tile_event_cat[] = {
+	EVENT_CATEGORY(64U, AIE_ERROR_BUS),
+	EVENT_CATEGORY(65U, AIE_ERROR_STREAM),
+	EVENT_CATEGORY(66U, AIE_ERROR_STREAM),
+	EVENT_CATEGORY(67U, AIE_ERROR_BUS),
+	EVENT_CATEGORY(68U, AIE_ERROR_BUS),
+	EVENT_CATEGORY(69U, AIE_ERROR_BUS),
+	EVENT_CATEGORY(70U, AIE_ERROR_BUS),
+	EVENT_CATEGORY(71U, AIE_ERROR_BUS),
+	EVENT_CATEGORY(72U, AIE_ERROR_DMA),
+	EVENT_CATEGORY(73U, AIE_ERROR_DMA),
+	EVENT_CATEGORY(74U, AIE_ERROR_LOCK),
+};
+
+static enum aie_error_category
+aie_get_error_category(u8 row, u8 event_id, enum aie_module_type mod_type)
+{
+	const struct aie_event_category *lut;
+	int num_entry;
+	int i;
+
+	switch (mod_type) {
+	case AIE_PL_MOD:
+		lut = aie_ml_shim_tile_event_cat;
+		num_entry = ARRAY_SIZE(aie_ml_shim_tile_event_cat);
+		break;
+	case AIE_CORE_MOD:
+		lut = aie_ml_core_event_cat;
+		num_entry = ARRAY_SIZE(aie_ml_core_event_cat);
+		break;
+	case AIE_MEM_MOD:
+		if (row == 1) {
+			lut = aie_ml_mem_tile_event_cat;
+			num_entry = ARRAY_SIZE(aie_ml_mem_tile_event_cat);
+		} else {
+			lut = aie_ml_mem_event_cat;
+			num_entry = ARRAY_SIZE(aie_ml_mem_event_cat);
+		}
+		break;
+	default:
+		return AIE_ERROR_UNKNOWN;
+	}
+
+	for (i = 0; i < num_entry; i++) {
+		if (event_id != lut[i].event_id)
+			continue;
+
+		return lut[i].category;
+	}
+
+	return AIE_ERROR_UNKNOWN;
+}
+
+static u32 aie2_error_backtrack(struct amdxdna_dev_hdl *ndev, void *err_info, u32 num_err)
+{
+	struct aie_error *errs = err_info;
+	u32 err_col = 0; /* assume that AIE has less than 32 columns */
+	int i;
+
+	/* Get err column bitmap */
+	for (i = 0; i < num_err; i++) {
+		struct aie_error *err = &errs[i];
+		enum aie_error_category cat;
+
+		cat = aie_get_error_category(err->row, err->event_id, err->mod_type);
+		XDNA_ERR(ndev->xdna, "Row: %d, Col: %d, module %d, event ID %d, category %d",
+			 err->row, err->col, err->mod_type,
+			 err->event_id, cat);
+
+		if (err->col >= 32) {
+			XDNA_WARN(ndev->xdna, "Invalid column number");
+			break;
+		}
+
+		err_col |= (1U << err->col);
+	}
+
+	return err_col;
+}
+
+static int aie2_error_async_cb(void *handle, const u32 *data, size_t size)
+{
+	struct async_event_msg_resp *resp;
+	struct async_event *e = handle;
+
+	if (data) {
+		resp = (struct async_event_msg_resp *)data;
+		e->resp.type = resp->type;
+		wmb(); /* Update status last, so that no lock is needed here */
+		e->resp.status = resp->status;
+	}
+	queue_work(e->wq, &e->work);
+	return 0;
+}
+
+static int aie2_error_event_send(struct async_event *e)
+{
+	drm_clflush_virt_range(e->buf, e->size); /* device can access */
+	return aie2_register_asyn_event_msg(e->ndev, e->addr, e->size, e,
+					    aie2_error_async_cb);
+}
+
+static void aie2_error_worker(struct work_struct *err_work)
+{
+	struct aie_err_info *info;
+	struct amdxdna_dev *xdna;
+	struct async_event *e;
+	u32 max_err;
+	u32 err_col;
+
+	e = container_of(err_work, struct async_event, work);
+
+	xdna = e->ndev->xdna;
+
+	if (e->resp.status == MAX_AIE2_STATUS_CODE)
+		return;
+
+	e->resp.status = MAX_AIE2_STATUS_CODE;
+
+	print_hex_dump_debug("AIE error: ", DUMP_PREFIX_OFFSET, 16, 4,
+			     e->buf, 0x100, false);
+
+	info = (struct aie_err_info *)e->buf;
+	XDNA_DBG(xdna, "Error count %d return code %d", info->err_cnt, info->ret_code);
+
+	max_err = (e->size - sizeof(*info)) / sizeof(struct aie_error);
+	if (unlikely(info->err_cnt > max_err)) {
+		WARN_ONCE(1, "Error count too large %d\n", info->err_cnt);
+		return;
+	}
+	err_col = aie2_error_backtrack(e->ndev, info->payload, info->err_cnt);
+	if (!err_col) {
+		XDNA_WARN(xdna, "Did not get error column");
+		return;
+	}
+
+	mutex_lock(&xdna->dev_lock);
+	/* Re-send this event to firmware */
+	if (aie2_error_event_send(e))
+		XDNA_WARN(xdna, "Unable to register async event");
+	mutex_unlock(&xdna->dev_lock);
+}
+
+int aie2_error_async_events_send(struct amdxdna_dev_hdl *ndev)
+{
+	struct amdxdna_dev *xdna = ndev->xdna;
+	struct async_event *e;
+	int i, ret;
+
+	drm_WARN_ON(&xdna->ddev, !mutex_is_locked(&xdna->dev_lock));
+	for (i = 0; i < ndev->async_events->event_cnt; i++) {
+		e = &ndev->async_events->event[i];
+		ret = aie2_error_event_send(e);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+void aie2_error_async_events_free(struct amdxdna_dev_hdl *ndev)
+{
+	struct amdxdna_dev *xdna = ndev->xdna;
+	struct async_events *events;
+
+	events = ndev->async_events;
+
+	mutex_unlock(&xdna->dev_lock);
+	destroy_workqueue(events->wq);
+	mutex_lock(&xdna->dev_lock);
+
+	dma_free_noncoherent(xdna->ddev.dev, events->size, events->buf,
+			     events->addr, DMA_FROM_DEVICE);
+	kfree(events);
+}
+
+int aie2_error_async_events_alloc(struct amdxdna_dev_hdl *ndev)
+{
+	struct amdxdna_dev *xdna = ndev->xdna;
+	u32 total_col = ndev->total_col;
+	u32 total_size = ASYNC_BUF_SIZE * total_col;
+	struct async_events *events;
+	int i, ret;
+
+	events = kzalloc(struct_size(events, event, total_col), GFP_KERNEL);
+	if (!events)
+		return -ENOMEM;
+
+	events->buf = dma_alloc_noncoherent(xdna->ddev.dev, total_size, &events->addr,
+					    DMA_FROM_DEVICE, GFP_KERNEL);
+	if (!events->buf) {
+		ret = -ENOMEM;
+		goto free_events;
+	}
+	events->size = total_size;
+	events->event_cnt = total_col;
+
+	events->wq = alloc_ordered_workqueue("async_wq", 0);
+	if (!events->wq) {
+		ret = -ENOMEM;
+		goto free_buf;
+	}
+
+	for (i = 0; i < events->event_cnt; i++) {
+		struct async_event *e = &events->event[i];
+		u32 offset = i * ASYNC_BUF_SIZE;
+
+		e->ndev = ndev;
+		e->wq = events->wq;
+		e->buf = &events->buf[offset];
+		e->addr = events->addr + offset;
+		e->size = ASYNC_BUF_SIZE;
+		e->resp.status = MAX_AIE2_STATUS_CODE;
+		INIT_WORK(&e->work, aie2_error_worker);
+	}
+
+	ndev->async_events = events;
+
+	XDNA_DBG(xdna, "Async event count %d, buf total size 0x%x",
+		 events->event_cnt, events->size);
+	return 0;
+
+free_buf:
+	dma_free_noncoherent(xdna->ddev.dev, events->size, events->buf,
+			     events->addr, DMA_FROM_DEVICE);
+free_events:
+	kfree(events);
+	return ret;
+}
diff --git a/drivers/accel/amdxdna/aie2_message.c b/drivers/accel/amdxdna/aie2_message.c
index db62954eb378..eb7e27045213 100644
--- a/drivers/accel/amdxdna/aie2_message.c
+++ b/drivers/accel/amdxdna/aie2_message.c
@@ -308,6 +308,25 @@ int aie2_map_host_buf(struct amdxdna_dev_hdl *ndev, u32 context_id, u64 addr, u6
 	return 0;
 }
 
+int aie2_register_asyn_event_msg(struct amdxdna_dev_hdl *ndev, dma_addr_t addr, u32 size,
+				 void *handle, int (*cb)(void*, const u32 *, size_t))
+{
+	struct async_event_msg_req req = { 0 };
+	struct xdna_mailbox_msg msg = {
+		.send_data = (u8 *)&req,
+		.send_size = sizeof(req),
+		.handle = handle,
+		.opcode = MSG_OP_REGISTER_ASYNC_EVENT_MSG,
+		.notify_cb = cb,
+	};
+
+	req.buf_addr = addr;
+	req.buf_size = size;
+
+	XDNA_DBG(ndev->xdna, "Register addr 0x%llx size 0x%x", addr, size);
+	return xdna_mailbox_send_msg(ndev->mgmt_chann, &msg, TX_TIMEOUT);
+}
+
 int aie2_config_cu(struct amdxdna_hwctx *hwctx)
 {
 	struct mailbox_channel *chann = hwctx->priv->mbox_chann;
diff --git a/drivers/accel/amdxdna/aie2_pci.c b/drivers/accel/amdxdna/aie2_pci.c
index 803ebb237971..5467aabe7308 100644
--- a/drivers/accel/amdxdna/aie2_pci.c
+++ b/drivers/accel/amdxdna/aie2_pci.c
@@ -180,6 +180,15 @@ static int aie2_mgmt_fw_init(struct amdxdna_dev_hdl *ndev)
 		return ret;
 	}
 
+	if (!ndev->async_events)
+		return 0;
+
+	ret = aie2_error_async_events_send(ndev);
+	if (ret) {
+		XDNA_ERR(ndev->xdna, "Send async events failed");
+		return ret;
+	}
+
 	return 0;
 }
 
@@ -469,9 +478,30 @@ static int aie2_init(struct amdxdna_dev *xdna)
 		goto stop_hw;
 	}
 
+	ret = aie2_error_async_events_alloc(ndev);
+	if (ret) {
+		XDNA_ERR(xdna, "Allocate async events failed, ret %d", ret);
+		goto stop_hw;
+	}
+
+	ret = aie2_error_async_events_send(ndev);
+	if (ret) {
+		XDNA_ERR(xdna, "Send async events failed, ret %d", ret);
+		goto async_event_free;
+	}
+
+	/* Issue a command to make sure firmware handled async events */
+	ret = aie2_query_firmware_version(ndev, &ndev->xdna->fw_ver);
+	if (ret) {
+		XDNA_ERR(xdna, "Re-query firmware version failed");
+		goto async_event_free;
+	}
+
 	release_firmware(fw);
 	return 0;
 
+async_event_free:
+	aie2_error_async_events_free(ndev);
 stop_hw:
 	aie2_hw_stop(xdna);
 disable_sva:
@@ -487,8 +517,10 @@ static int aie2_init(struct amdxdna_dev *xdna)
 static void aie2_fini(struct amdxdna_dev *xdna)
 {
 	struct pci_dev *pdev = to_pci_dev(xdna->ddev.dev);
+	struct amdxdna_dev_hdl *ndev = xdna->dev_handle;
 
 	aie2_hw_stop(xdna);
+	aie2_error_async_events_free(ndev);
 	iommu_dev_disable_feature(&pdev->dev, IOMMU_DEV_FEAT_SVA);
 	pci_free_irq_vectors(pdev);
 }
diff --git a/drivers/accel/amdxdna/aie2_pci.h b/drivers/accel/amdxdna/aie2_pci.h
index 72691c9eebd7..4422dd6c985e 100644
--- a/drivers/accel/amdxdna/aie2_pci.h
+++ b/drivers/accel/amdxdna/aie2_pci.h
@@ -170,6 +170,7 @@ struct amdxdna_dev_hdl {
 	/* Mailbox and the management channel */
 	struct mailbox *mbox;
 	struct mailbox_channel *mgmt_chann;
+	struct async_events *async_events;
 };
 
 #define DEFINE_BAR_OFFSET(reg_name, bar, reg_addr) \
@@ -210,6 +211,12 @@ struct psp_device *aie2m_psp_create(struct drm_device *ddev, struct psp_config *
 int aie2_psp_start(struct psp_device *psp);
 void aie2_psp_stop(struct psp_device *psp);
 
+/* aie2_error.c */
+int aie2_error_async_events_alloc(struct amdxdna_dev_hdl *ndev);
+void aie2_error_async_events_free(struct amdxdna_dev_hdl *ndev);
+int aie2_error_async_events_send(struct amdxdna_dev_hdl *ndev);
+int aie2_error_async_msg_thread(void *data);
+
 /* aie2_message.c */
 int aie2_suspend_fw(struct amdxdna_dev_hdl *ndev);
 int aie2_resume_fw(struct amdxdna_dev_hdl *ndev);
@@ -224,6 +231,8 @@ int aie2_query_firmware_version(struct amdxdna_dev_hdl *ndev,
 int aie2_create_context(struct amdxdna_dev_hdl *ndev, struct amdxdna_hwctx *hwctx);
 int aie2_destroy_context(struct amdxdna_dev_hdl *ndev, struct amdxdna_hwctx *hwctx);
 int aie2_map_host_buf(struct amdxdna_dev_hdl *ndev, u32 context_id, u64 addr, u64 size);
+int aie2_register_asyn_event_msg(struct amdxdna_dev_hdl *ndev, dma_addr_t addr, u32 size,
+				 void *handle, int (*cb)(void*, const u32 *, size_t));
 int aie2_config_cu(struct amdxdna_hwctx *hwctx);
 int aie2_execbuf(struct amdxdna_hwctx *hwctx, struct amdxdna_sched_job *job,
 		 int (*notify_cb)(void *, const u32 *, size_t));
-- 
2.34.1