From: Ross Cawston
To: dri-devel@lists.freedesktop.org
Cc: linux-kernel@vger.kernel.org, tomeu@tomeuvizoso.net, ogabbay@kernel.org,
 airlied@gmail.com, simona@ffwll.ch, maarten.lankhorst@linux.intel.com,
 mripard@kernel.org, tzimmermann@suse.de, jeff.hugo@oss.qualcomm.com,
 jani.nikula@intel.com, me@brighamcampbell.com, heiko@sntech.de,
 Ross Cawston
Subject: [PATCH] accel/rocket: Add per-task flags and interrupt mask for flexible job handling
Date: Mon, 16 Feb 2026 10:38:19 -0800
Message-ID: <20260216183819.99991-1-ross@r-sc.ca>

The Rocket NPU supports multiple task types:
- Convolutional workloads that use the CNA, Core, and DPU blocks
- Standalone post-processing (PPU) tasks such as pooling and element-wise
  operations
- Pipelined DPU→PPU workloads

The current driver has several limitations that prevent correct execution
of non-convolutional workloads and multi-core operation:
- The CNA and Core S_POINTER registers are always initialized, which
  re-arms them with stale state from previous jobs and corrupts standalone
  DPU/PPU tasks.
- Completion is hard-coded to wait only for DPU interrupts, so PPU-only
  and DPU→PPU pipeline jobs time out.
- Ping-pong mode is unconditionally enabled, which is unnecessary for
  single-task jobs.
- Non-zero cores hang because the vendor-specific "extra bit" in S_POINTER
  (0x10000000 × core index, i.e. the core index placed at bit 28) is not
  set; the BSP sets this via MMIO because userspace cannot know which core
  the scheduler will select.
- Timeout and IRQ debugging information is minimal.

Add two per-task fields to struct drm_rocket_task, mirrored in
struct rocket_task:
- u32 int_mask: specifies which block completion interrupts signal that
  the task is done (DPU_0|DPU_1 for convolutional and standalone DPU
  tasks, PPU_0|PPU_1 for PPU tasks). Zero defaults to DPU_0|DPU_1 for
  backward compatibility.
- u32 flags: currently carries only ROCKET_TASK_NO_CNA_CORE, which marks
  standalone DPU/PPU tasks that must not touch CNA/Core state.

Additional changes:
- Only initialize CNA and Core S_POINTER (with the required per-core
  extra bit) when ROCKET_TASK_NO_CNA_CORE is not set.
- Set the per-core extra bit via MMIO to fix hangs on non-zero cores.
- Enable ping-pong mode only when the job contains multiple tasks.
- Program INTERRUPT_MASK from the task's int_mask and clear all pending
  interrupts (0x1ffff) at submit time.
- Accept both DPU and PPU completion interrupts in the IRQ handler.
- Log PC, CNA, and Core status registers when a job times out.
- Minor error-path fix in GEM object creation (check the error after
  unlocking mm_lock).

These changes, derived from vendor BSP behavior, enable correct execution
of PPU-only tasks, pipelined workloads, and reliable multi-core operation
while preserving backward compatibility.
---
 drivers/accel/rocket/rocket_gem.c |  2 +
 drivers/accel/rocket/rocket_job.c | 99 +++++++++++++++++++++++++------
 drivers/accel/rocket/rocket_job.h |  2 +
 include/uapi/drm/rocket_accel.h   | 30 ++++++++++
 4 files changed, 115 insertions(+), 18 deletions(-)

diff --git a/drivers/accel/rocket/rocket_gem.c b/drivers/accel/rocket/rocket_gem.c
index 624c4ecf5a34..db1ff3544af2 100644
--- a/drivers/accel/rocket/rocket_gem.c
+++ b/drivers/accel/rocket/rocket_gem.c
@@ -95,6 +95,8 @@ int rocket_ioctl_create_bo(struct drm_device *dev, void *data, struct drm_file *
 					 rkt_obj->size, PAGE_SIZE,
 					 0, 0);
 	mutex_unlock(&rocket_priv->mm_lock);
+	if (ret)
+		goto err;
 
 	ret = iommu_map_sgtable(rocket_priv->domain->domain,
 				rkt_obj->mm.start,
diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
index acd606160dc9..dd69b195d0e6 100644
--- a/drivers/accel/rocket/rocket_job.c
+++ b/drivers/accel/rocket/rocket_job.c
@@ -96,6 +96,13 @@ rocket_copy_tasks(struct drm_device *dev,
 
 		rjob->tasks[i].regcmd = task.regcmd;
 		rjob->tasks[i].regcmd_count = task.regcmd_count;
+		rjob->tasks[i].int_mask = task.int_mask;
+		rjob->tasks[i].flags = task.flags;
+
+		/* Default to DPU completion if no mask specified */
+		if (!rjob->tasks[i].int_mask)
+			rjob->tasks[i].int_mask = PC_INTERRUPT_MASK_DPU_0 |
+						  PC_INTERRUPT_MASK_DPU_1;
 	}
 
 	return 0;
@@ -108,7 +115,6 @@ rocket_copy_tasks(struct drm_device *dev,
 static void rocket_job_hw_submit(struct rocket_core *core, struct rocket_job *job)
 {
 	struct rocket_task *task;
-	unsigned int extra_bit;
 
 	/* Don't queue the job if a reset is in progress */
 	if (atomic_read(&core->reset.pending))
@@ -121,29 +127,61 @@ static void rocket_job_hw_submit(struct rocket_core *core, struct rocket_job *jo
 
 	rocket_pc_writel(core, BASE_ADDRESS, 0x1);
 
-	/* From rknpu, in the TRM this bit is marked as reserved */
-	extra_bit = 0x10000000 * core->index;
-	rocket_cna_writel(core, S_POINTER, CNA_S_POINTER_POINTER_PP_EN(1) |
-					    CNA_S_POINTER_EXECUTER_PP_EN(1) |
-					    CNA_S_POINTER_POINTER_PP_MODE(1) |
-					    extra_bit);
-
-	rocket_core_writel(core, S_POINTER, CORE_S_POINTER_POINTER_PP_EN(1) |
-					     CORE_S_POINTER_EXECUTER_PP_EN(1) |
-					     CORE_S_POINTER_POINTER_PP_MODE(1) |
-					     extra_bit);
+	/*
+	 * Initialize CNA and Core S_POINTER for ping-pong mode via MMIO.
+	 *
+	 * Each core needs a per-core extra_bit (bit 28 * core_index) which
+	 * the TRM marks as reserved but the BSP rknpu driver sets. Without
+	 * it, non-zero cores hang. This MUST be done via MMIO (not regcmd)
+	 * because userspace doesn't know which core the scheduler picks.
+	 *
+	 * DPU/DPU_RDMA and PPU/PPU_RDMA S_POINTERs are set by the regcmd
+	 * itself — they don't need the per-core extra_bit.
+	 *
+	 * For standalone DPU/PPU tasks (element-wise ops, pooling), CNA
+	 * and Core have no work. Writing their S_POINTERs would re-arm
+	 * them with stale state from the previous conv task, corrupting
+	 * the DPU/PPU output. Userspace signals this via the
+	 * ROCKET_TASK_NO_CNA_CORE flag.
+	 */
+	if (!(task->flags & ROCKET_TASK_NO_CNA_CORE)) {
+		unsigned int extra_bit = 0x10000000 * core->index;
+		rocket_cna_writel(core, S_POINTER,
+				  CNA_S_POINTER_POINTER_PP_EN(1) |
+				  CNA_S_POINTER_EXECUTER_PP_EN(1) |
+				  CNA_S_POINTER_POINTER_PP_MODE(1) |
+				  extra_bit);
+
+		rocket_core_writel(core, S_POINTER,
+				   CORE_S_POINTER_POINTER_PP_EN(1) |
+				   CORE_S_POINTER_EXECUTER_PP_EN(1) |
+				   CORE_S_POINTER_POINTER_PP_MODE(1) |
+				   extra_bit);
+	}
 
 	rocket_pc_writel(core, BASE_ADDRESS, task->regcmd);
 	rocket_pc_writel(core, REGISTER_AMOUNTS,
 			 PC_REGISTER_AMOUNTS_PC_DATA_AMOUNT((task->regcmd_count + 1) / 2 - 1));
 
-	rocket_pc_writel(core, INTERRUPT_MASK, PC_INTERRUPT_MASK_DPU_0 | PC_INTERRUPT_MASK_DPU_1);
-	rocket_pc_writel(core, INTERRUPT_CLEAR, PC_INTERRUPT_CLEAR_DPU_0 | PC_INTERRUPT_CLEAR_DPU_1);
+	/*
+	 * Enable interrupts for the last block in this task's pipeline.
+	 *
+	 * The int_mask field from userspace specifies which block completion
+	 * signals that this task is done:
+	 * - Conv/DPU tasks: DPU_0 | DPU_1
+	 * - PPU tasks (DPU→PPU pipeline): PPU_0 | PPU_1
+	 *
+	 * Only enabling the terminal block's interrupt prevents the kernel
+	 * from stopping the pipeline early (e.g. DPU fires before PPU has
+	 * finished writing its output).
+	 */
+	rocket_pc_writel(core, INTERRUPT_MASK, task->int_mask);
+	rocket_pc_writel(core, INTERRUPT_CLEAR, 0x1ffff);
 
 	rocket_pc_writel(core, TASK_CON, PC_TASK_CON_RESERVED_0(1) |
 					 PC_TASK_CON_TASK_COUNT_CLEAR(1) |
 					 PC_TASK_CON_TASK_NUMBER(1) |
-					 PC_TASK_CON_TASK_PP_EN(1));
+					 PC_TASK_CON_TASK_PP_EN(job->task_count > 1 ? 1 : 0));
 
 	rocket_pc_writel(core, TASK_DMA_BASE_ADDR, PC_TASK_DMA_BASE_ADDR_DMA_BASE_ADDR(0x0));
 
@@ -385,7 +423,23 @@ static enum drm_gpu_sched_stat rocket_job_timedout(struct drm_sched_job *sched_j
 	struct rocket_device *rdev = job->rdev;
 	struct rocket_core *core = sched_to_core(rdev, sched_job->sched);
 
-	dev_err(core->dev, "NPU job timed out");
+	{
+		u32 raw = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
+		u32 status = rocket_pc_readl(core, INTERRUPT_STATUS);
+		u32 mask = rocket_pc_readl(core, INTERRUPT_MASK);
+		u32 op_en = rocket_pc_readl(core, OPERATION_ENABLE);
+		u32 task_status = rocket_pc_readl(core, TASK_STATUS);
+		u32 cna_s_status = rocket_cna_readl(core, S_STATUS);
+		u32 core_s_status = rocket_core_readl(core, S_STATUS);
+		u32 core_misc = readl(core->core_iomem + 0x10); /* MISC_CFG */
+		u32 core_op_en = readl(core->core_iomem + 0x08); /* OPERATION_ENABLE */
+
+		dev_err(core->dev,
+			"NPU job timed out: raw=0x%08x mask=0x%08x op_en=0x%x task_status=0x%x cna_s=0x%x core_s=0x%x core_misc=0x%x core_op_en=0x%x task=%u/%u",
+			raw, mask, op_en, task_status,
+			cna_s_status, core_s_status, core_misc, core_op_en,
+			job->next_task_idx, job->task_count);
+	}
 
 	atomic_set(&core->reset.pending, 1);
 	rocket_reset(core, sched_job);
@@ -424,8 +478,17 @@ static irqreturn_t rocket_job_irq_handler(int irq, void *data)
 	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_READ_ERROR);
 	WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_WRITE_ERROR);
 
-	if (!(raw_status & PC_INTERRUPT_RAW_STATUS_DPU_0 ||
-	      raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1))
+	/*
+	 * Check for any job completion interrupt: DPU or PPU.
+	 *
+	 * Conv and standalone DPU jobs signal via DPU_0/DPU_1.
+	 * PPU pooling jobs signal via PPU_0/PPU_1.
+	 * We must recognize both to avoid PPU job timeouts.
+	 */
+	if (!(raw_status & (PC_INTERRUPT_RAW_STATUS_DPU_0 |
+			    PC_INTERRUPT_RAW_STATUS_DPU_1 |
+			    PC_INTERRUPT_RAW_STATUS_PPU_0 |
+			    PC_INTERRUPT_RAW_STATUS_PPU_1)))
 		return IRQ_NONE;
 
 	rocket_pc_writel(core, INTERRUPT_MASK, 0x0);
diff --git a/drivers/accel/rocket/rocket_job.h b/drivers/accel/rocket/rocket_job.h
index 4ae00feec3b9..6931dfed8615 100644
--- a/drivers/accel/rocket/rocket_job.h
+++ b/drivers/accel/rocket/rocket_job.h
@@ -13,6 +13,8 @@
 struct rocket_task {
 	u64 regcmd;
 	u32 regcmd_count;
+	u32 int_mask;
+	u32 flags;
 };
 
 struct rocket_job {
diff --git a/include/uapi/drm/rocket_accel.h b/include/uapi/drm/rocket_accel.h
index 14b2e12b7c49..b041bcb05e27 100644
--- a/include/uapi/drm/rocket_accel.h
+++ b/include/uapi/drm/rocket_accel.h
@@ -73,6 +73,11 @@ struct drm_rocket_fini_bo {
 	__u32 reserved;
 };
 
+/**
+ * Flags for drm_rocket_task.flags
+ */
+#define ROCKET_TASK_NO_CNA_CORE 0x1
+
 /**
  * struct drm_rocket_task - A task to be run on the NPU
  *
@@ -84,6 +89,31 @@ struct drm_rocket_task {
 
 	/** Input: Number of commands in the register command buffer */
 	__u32 regcmd_count;
+
+	/**
+	 * Input: Interrupt mask specifying which block completion signals
+	 * that this task is done. Uses PC_INTERRUPT_MASK_* bits.
+	 *
+	 * For conv/DPU tasks: DPU_0 | DPU_1 (0x0300)
+	 * For PPU tasks: PPU_0 | PPU_1 (0x0C00)
+	 *
+	 * If zero, defaults to DPU_0 | DPU_1 for backwards compatibility.
+	 */
+	__u32 int_mask;
+
+	/**
+	 * Input: Task flags.
+	 *
+	 * ROCKET_TASK_NO_CNA_CORE: Skip CNA and Core S_POINTER MMIO
+	 * writes for this task. Used for standalone DPU element-wise
+	 * and PPU pooling tasks that don't use CNA/Core. Without this
+	 * flag, CNA/Core get re-armed with stale state from the
+	 * previous conv task, corrupting the DPU/PPU output.
+	 *
+	 * Zero means write CNA/Core S_POINTER (default for conv tasks,
+	 * backwards compatible with old userspace).
+	 */
+	__u32 flags;
 };
 
 /**
-- 
2.52.0
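
For reviewers who want the per-task decisions in one place, here is a minimal standalone sketch in plain C. It is not driver code and does not use the uapi header: every demo_/DEMO_ name is hypothetical, struct demo_task only mirrors the fields of struct rocket_task shown in the diff, and the 0x0300 / 0x0C00 mask values are taken from the uapi comment above rather than from the register definitions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustration only. Names prefixed with demo_/DEMO_ are hypothetical
 * and exist neither in the driver nor in the uapi header.
 */
#define DEMO_INT_MASK_DPU     0x0300u /* DPU_0 | DPU_1, per the uapi comment */
#define DEMO_INT_MASK_PPU     0x0C00u /* PPU_0 | PPU_1, per the uapi comment */
#define DEMO_TASK_NO_CNA_CORE 0x1u    /* same value as ROCKET_TASK_NO_CNA_CORE */

static_assert((DEMO_INT_MASK_DPU & DEMO_INT_MASK_PPU) == 0,
	      "DPU and PPU completion bits must not overlap");

/* Local mirror of the fields in struct rocket_task after this patch. */
struct demo_task {
	uint64_t regcmd;
	uint32_t regcmd_count;
	uint32_t int_mask;
	uint32_t flags;
};

/* A zero mask (old userspace) falls back to DPU completion. */
uint32_t demo_effective_int_mask(const struct demo_task *t)
{
	return t->int_mask ? t->int_mask : DEMO_INT_MASK_DPU;
}

/* CNA/Core S_POINTER is written unless the task opts out via flags. */
int demo_writes_cna_core(const struct demo_task *t)
{
	return !(t->flags & DEMO_TASK_NO_CNA_CORE);
}

/* Per-core value OR'ed into S_POINTER: the core index placed at bit 28. */
uint32_t demo_extra_bit(uint32_t core_index)
{
	return UINT32_C(0x10000000) * core_index;
}

/* Ping-pong is enabled only for jobs with more than one task. */
uint32_t demo_task_pp_en(uint32_t task_count)
{
	return task_count > 1 ? 1 : 0;
}
```

Under this model, a PPU pooling task submitted with int_mask = 0x0C00 and flags = ROCKET_TASK_NO_CNA_CORE keeps 0x0C00 as its completion mask and skips the CNA/Core writes, while an all-zero task from older userspace resolves to 0x0300 and still writes both S_POINTERs.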