From nobody Tue Dec 16 14:36:45 2025
From: Caleb Sander Mateos
To: Jens Axboe, io-uring@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Joanne Koong, Caleb Sander Mateos
Subject: [PATCH v5 1/6] io_uring: use release-acquire ordering for IORING_SETUP_R_DISABLED
Date: Mon, 15 Dec 2025 13:09:04 -0700
Message-ID: <20251215200909.3505001-2-csander@purestorage.com>
In-Reply-To: <20251215200909.3505001-1-csander@purestorage.com>
References: <20251215200909.3505001-1-csander@purestorage.com>

io_uring_enter() and io_msg_ring() read ctx->flags and ctx->submitter_task
without holding the ctx's uring_lock. This means they may race with the
assignment to ctx->submitter_task and the clearing of IORING_SETUP_R_DISABLED
from ctx->flags in io_register_enable_rings().

Ensure the correct ordering of the ctx->flags and ctx->submitter_task memory
accesses by storing to ctx->flags using release ordering and loading it using
acquire ordering.

Signed-off-by: Caleb Sander Mateos
Fixes: 4add705e4eeb ("io_uring: remove io_register_submitter")
Reviewed-by: Joanne Koong
---
 io_uring/io_uring.c | 2 +-
 io_uring/msg_ring.c | 4 ++--
 io_uring/register.c | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 6cb24cdf8e68..761b9612c5b6 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3249,11 +3249,11 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 		goto out;
 	}
 
 	ctx = file->private_data;
 	ret = -EBADFD;
-	if (unlikely(ctx->flags & IORING_SETUP_R_DISABLED))
+	if (unlikely(smp_load_acquire(&ctx->flags) & IORING_SETUP_R_DISABLED))
 		goto out;
 
 	/*
 	 * For SQ polling, the thread will do all submissions and completions.
 	 * Just return the requested submit count, and wake the thread if
diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index 7063ea7964e7..c48588e06bfb 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -123,11 +123,11 @@ static int __io_msg_ring_data(struct io_ring_ctx *target_ctx,
 
 	if (msg->src_fd || msg->flags & ~IORING_MSG_RING_FLAGS_PASS)
 		return -EINVAL;
 	if (!(msg->flags & IORING_MSG_RING_FLAGS_PASS) && msg->dst_fd)
 		return -EINVAL;
-	if (target_ctx->flags & IORING_SETUP_R_DISABLED)
+	if (smp_load_acquire(&target_ctx->flags) & IORING_SETUP_R_DISABLED)
 		return -EBADFD;
 
 	if (io_msg_need_remote(target_ctx))
 		return io_msg_data_remote(target_ctx, msg);
 
@@ -243,11 +243,11 @@ static int io_msg_send_fd(struct io_kiocb *req, unsigned int issue_flags)
 
 	if (msg->len)
 		return -EINVAL;
 	if (target_ctx == ctx)
 		return -EINVAL;
-	if (target_ctx->flags & IORING_SETUP_R_DISABLED)
+	if (smp_load_acquire(&target_ctx->flags) & IORING_SETUP_R_DISABLED)
 		return -EBADFD;
 	if (!msg->src_file) {
 		int ret = io_msg_grab_file(req, issue_flags);
 		if (unlikely(ret))
 			return ret;
diff --git a/io_uring/register.c b/io_uring/register.c
index 62d39b3ff317..9e473c244041 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -191,11 +191,11 @@ static int io_register_enable_rings(struct io_ring_ctx *ctx)
 	}
 
 	if (ctx->restrictions.registered)
 		ctx->restricted = 1;
 
-	ctx->flags &= ~IORING_SETUP_R_DISABLED;
+	smp_store_release(&ctx->flags, ctx->flags & ~IORING_SETUP_R_DISABLED);
 	if (ctx->sq_data && wq_has_sleeper(&ctx->sq_data->wait))
 		wake_up(&ctx->sq_data->wait);
 	return 0;
 }
 
-- 
2.45.2
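
The ordering argument in this patch can be reduced to a small user-space
illustration. The following is only a sketch with made-up names, using C11
atomics in place of the kernel's smp_store_release()/smp_load_acquire(); it is
not the io_uring code itself. A reader that observes the "disabled" flag
cleared via an acquire load is also guaranteed to observe the pointer that was
published before the paired release store.

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

#define SETUP_R_DISABLED 0x1u

static _Atomic unsigned int flags = SETUP_R_DISABLED;
static struct task { const char *name; } *submitter_task; /* plain pointer */

static int enable_rings(void *arg)
{
	static struct task t = { .name = "submitter" };

	submitter_task = &t;	/* plain store, ordered by the release below */
	/* release: readers that see the flag cleared also see submitter_task */
	atomic_store_explicit(&flags, 0, memory_order_release);
	return 0;
}

static int enter(void *arg)
{
	/* acquire: pairs with the release store in enable_rings() */
	if (atomic_load_explicit(&flags, memory_order_acquire) & SETUP_R_DISABLED)
		return -1;	/* ring still disabled, akin to -EBADFD */
	printf("submitter: %s\n", submitter_task->name);
	return 0;
}

int main(void)
{
	thrd_t a, b;

	thrd_create(&a, enable_rings, NULL);
	thrd_create(&b, enter, NULL);
	thrd_join(a, NULL);
	thrd_join(b, NULL);
	return 0;
}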
From nobody Tue Dec 16 14:36:45 2025
From: Caleb Sander Mateos
To: Jens Axboe, io-uring@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Joanne Koong, Caleb Sander Mateos
Subject: [PATCH v5 2/6] io_uring: clear IORING_SETUP_SINGLE_ISSUER for IORING_SETUP_SQPOLL
Date: Mon, 15 Dec 2025 13:09:05 -0700
Message-ID: <20251215200909.3505001-3-csander@purestorage.com>
In-Reply-To: <20251215200909.3505001-1-csander@purestorage.com>
References: <20251215200909.3505001-1-csander@purestorage.com>

IORING_SETUP_SINGLE_ISSUER doesn't currently enable any optimizations, but it
will soon be used to avoid taking io_ring_ctx's uring_lock when submitting
from the single issuer task.

If the IORING_SETUP_SQPOLL flag is set, the SQ thread is the sole task issuing
SQEs. However, other tasks may make io_uring_register() syscalls, which must
be synchronized with SQE submission. So it wouldn't be safe to skip the
uring_lock around the SQ thread's submission even if
IORING_SETUP_SINGLE_ISSUER is set. Therefore, clear
IORING_SETUP_SINGLE_ISSUER from the io_ring_ctx flags if IORING_SETUP_SQPOLL
is set.

Signed-off-by: Caleb Sander Mateos
---
 io_uring/io_uring.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 761b9612c5b6..44ff5756b328 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3478,10 +3478,19 @@ static int io_uring_sanitise_params(struct io_uring_params *p)
 	 */
 	if ((flags & (IORING_SETUP_SQE128|IORING_SETUP_SQE_MIXED)) ==
 	    (IORING_SETUP_SQE128|IORING_SETUP_SQE_MIXED))
 		return -EINVAL;
 
+	/*
+	 * If IORING_SETUP_SQPOLL is set, only the SQ thread issues SQEs,
+	 * but other threads may call io_uring_register() concurrently.
+	 * We still need ctx uring lock to synchronize these io_ring_ctx
+	 * accesses, so disable the single issuer optimizations.
+	 */
+	if (flags & IORING_SETUP_SQPOLL)
+		p->flags &= ~IORING_SETUP_SINGLE_ISSUER;
+
 	return 0;
 }
 
 static int io_uring_fill_params(struct io_uring_params *p)
 {
-- 
2.45.2
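
To see the flag combination this patch is about from the user-space side, here
is a minimal, illustrative setup call (a sketch assuming headers that define
IORING_SETUP_SINGLE_ISSUER; error handling trimmed). With the change above,
a ring created this way still uses SQPOLL, but the kernel drops the
single-issuer optimization internally, since io_uring_register() from other
tasks still needs the ring lock.

#include <linux/io_uring.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	/* ask for both flags; the kernel sanitises the combination */
	p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SINGLE_ISSUER;
	p.sq_thread_idle = 1000;	/* ms before the SQ thread idles */

	int fd = (int)syscall(__NR_io_uring_setup, 8, &p);
	if (fd < 0) {
		perror("io_uring_setup");
		return 1;
	}
	printf("ring fd %d created\n", fd);
	close(fd);
	return 0;
}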
From nobody Tue Dec 16 14:36:45 2025
From: Caleb Sander Mateos
To: Jens Axboe, io-uring@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Joanne Koong, Caleb Sander Mateos, kernel test robot
Subject: [PATCH v5 3/6] io_uring: ensure io_uring_create() initializes submitter_task
Date: Mon, 15 Dec 2025 13:09:06 -0700
Message-ID: <20251215200909.3505001-4-csander@purestorage.com>
In-Reply-To: <20251215200909.3505001-1-csander@purestorage.com>
References: <20251215200909.3505001-1-csander@purestorage.com>

If io_uring_create() fails after allocating the struct io_ring_ctx, it may
call io_ring_ctx_wait_and_kill() before submitter_task has been assigned.
This is currently harmless, as the submit and register paths that check
submitter_task aren't reachable until the io_ring_ctx has been successfully
created. However, a subsequent commit will expect submitter_task to be set
for every IORING_SETUP_SINGLE_ISSUER && !IORING_SETUP_R_DISABLED ctx. So
assign ctx->submitter_task prior to any call to io_ring_ctx_wait_and_kill()
in io_uring_create().

Reported-by: kernel test robot
Closes: https://lore.kernel.org/oe-lkp/202512101405.a7a2bdb2-lkp@intel.com
Signed-off-by: Caleb Sander Mateos
---
 io_uring/io_uring.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 44ff5756b328..6d6fe5bdebda 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3647,10 +3647,19 @@ static __cold int io_uring_create(struct io_ctx_config *config)
 	 * memory (locked/pinned vm). It's not used for anything else.
 	 */
 	mmgrab(current->mm);
 	ctx->mm_account = current->mm;
 
+	if (ctx->flags & IORING_SETUP_SINGLE_ISSUER
+	    && !(ctx->flags & IORING_SETUP_R_DISABLED)) {
+		/*
+		 * Unlike io_register_enable_rings(), don't need WRITE_ONCE()
+		 * since ctx isn't yet accessible from other tasks
+		 */
+		ctx->submitter_task = get_task_struct(current);
+	}
+
 	ret = io_allocate_scq_urings(ctx, config);
 	if (ret)
 		goto err;
 
 	ret = io_sq_offload_create(ctx, p);
@@ -3662,19 +3671,10 @@ static __cold int io_uring_create(struct io_ctx_config *config)
 	if (copy_to_user(config->uptr, p, sizeof(*p))) {
 		ret = -EFAULT;
 		goto err;
 	}
 
-	if (ctx->flags & IORING_SETUP_SINGLE_ISSUER
-	    && !(ctx->flags & IORING_SETUP_R_DISABLED)) {
-		/*
-		 * Unlike io_register_enable_rings(), don't need WRITE_ONCE()
-		 * since ctx isn't yet accessible from other tasks
-		 */
-		ctx->submitter_task = get_task_struct(current);
-	}
-
 	file = io_uring_get_file(ctx);
 	if (IS_ERR(file)) {
 		ret = PTR_ERR(file);
 		goto err;
 	}
-- 
2.45.2
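
The invariant this patch establishes (any field the teardown path may consult
must be assigned before the first error exit) is easiest to see in a reduced
sketch. The names below are hypothetical and stand in for the io_uring code;
teardown() plays the role of a cleanup path that may look at the owner field.

#include <stdlib.h>

struct task {
	int refs;
};

struct ring {
	struct task *owner;	/* teardown() expects this to be set or NULL */
	void *buf;
};

static void teardown(struct ring *r)
{
	/* consults r->owner, so it must already be initialized */
	if (r->owner)
		r->owner->refs--;
	free(r->buf);
	free(r);
}

static struct ring *ring_create(struct task *owner, size_t len)
{
	struct ring *r = calloc(1, sizeof(*r));

	if (!r)
		return NULL;
	/* assign before any failure path that reaches teardown() */
	r->owner = owner;
	owner->refs++;
	r->buf = malloc(len);
	if (!r->buf)
		goto err;
	return r;
err:
	teardown(r);
	return NULL;
}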
From nobody Tue Dec 16 14:36:45 2025
From: Caleb Sander Mateos
To: Jens Axboe, io-uring@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Joanne Koong, Caleb Sander Mateos
Subject: [PATCH v5 4/6] io_uring: use io_ring_submit_lock() in io_iopoll_req_issued()
Date: Mon, 15 Dec 2025 13:09:07 -0700
Message-ID: <20251215200909.3505001-5-csander@purestorage.com>
In-Reply-To: <20251215200909.3505001-1-csander@purestorage.com>
References: <20251215200909.3505001-1-csander@purestorage.com>

Use the io_ring_submit_lock() helper in io_iopoll_req_issued() instead of
open-coding the logic. io_ring_submit_unlock() can't be used for the unlock,
though, due to the extra logic before releasing the mutex.

Signed-off-by: Caleb Sander Mateos
Reviewed-by: Joanne Koong
---
 io_uring/io_uring.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 6d6fe5bdebda..40582121c6a7 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1670,15 +1670,13 @@ void io_req_task_complete(struct io_tw_req tw_req, io_tw_token_t tw)
  * accessing the kiocb cookie.
  */
 static void io_iopoll_req_issued(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
-	const bool needs_lock = issue_flags & IO_URING_F_UNLOCKED;
 
 	/* workqueue context doesn't hold uring_lock, grab it now */
-	if (unlikely(needs_lock))
-		mutex_lock(&ctx->uring_lock);
+	io_ring_submit_lock(ctx, issue_flags);
 
 	/*
 	 * Track whether we have multiple files in our lists. This will impact
 	 * how we do polling eventually, not spinning if we're on potentially
 	 * different devices.
@@ -1701,11 +1699,11 @@ static void io_iopoll_req_issued(struct io_kiocb *req, unsigned int issue_flags)
 	if (READ_ONCE(req->iopoll_completed))
 		wq_list_add_head(&req->comp_list, &ctx->iopoll_list);
 	else
 		wq_list_add_tail(&req->comp_list, &ctx->iopoll_list);
 
-	if (unlikely(needs_lock)) {
+	if (unlikely(issue_flags & IO_URING_F_UNLOCKED)) {
 		/*
 		 * If IORING_SETUP_SQPOLL is enabled, sqes are either handle
 		 * in sq thread task context or in io worker task context. If
 		 * current task context is sq thread, we don't need to check
 		 * whether should wake up sq thread.
-- 
2.45.2
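
For context, the io_ring_submit_lock()/io_ring_submit_unlock() helpers used
above follow a conditional-lock pattern: take the mutex only when the calling
context does not already hold it. A rough user-space sketch of that pattern,
with hypothetical names rather than the kernel API:

#include <pthread.h>

#define F_UNLOCKED	(1u << 0)	/* caller does NOT already hold the lock */

struct ring_ctx {
	pthread_mutex_t lock;
};

static void ring_submit_lock(struct ring_ctx *ctx, unsigned int issue_flags)
{
	/* workqueue-style callers pass F_UNLOCKED and need the lock taken */
	if (issue_flags & F_UNLOCKED)
		pthread_mutex_lock(&ctx->lock);
}

static void ring_submit_unlock(struct ring_ctx *ctx, unsigned int issue_flags)
{
	if (issue_flags & F_UNLOCKED)
		pthread_mutex_unlock(&ctx->lock);
}

The patch keeps an open-coded unlock in io_iopoll_req_issued() because, as the
commit message notes, extra work has to happen before the mutex is released.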
From nobody Tue Dec 16 14:36:45 2025
From: Caleb Sander Mateos
To: Jens Axboe, io-uring@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Joanne Koong, Caleb Sander Mateos
Subject: [PATCH v5 5/6] io_uring: factor out uring_lock helpers
Date: Mon, 15 Dec 2025 13:09:08 -0700
Message-ID: <20251215200909.3505001-6-csander@purestorage.com>
In-Reply-To: <20251215200909.3505001-1-csander@purestorage.com>
References: <20251215200909.3505001-1-csander@purestorage.com>

A subsequent commit will skip acquiring the io_ring_ctx uring_lock in
io_uring_enter() and io_handle_tw_list() for IORING_SETUP_SINGLE_ISSUER.
Prepare for this change by factoring out the uring_lock accesses in these
functions into helpers. Aside from the helpers, the only remaining access of
uring_lock is its mutex_init() call.

Define a struct io_ring_ctx_lock_state to pass state from io_ring_ctx_lock()
to io_ring_ctx_unlock(). It's currently empty, but a subsequent commit will
add fields.
Helpers: - io_ring_ctx_lock() for mutex_lock() - io_ring_ctx_lock_nested() for mutex_lock_nested() - io_ring_ctx_trylock() for mutex_trylock() - io_ring_ctx_unlock() for mutex_unlock() - io_ring_ctx_lock_held() for lockdep_is_held() - io_ring_ctx_assert_locked() for lockdep_assert_held() Signed-off-by: Caleb Sander Mateos --- include/linux/io_uring_types.h | 12 +-- io_uring/cancel.c | 40 ++++---- io_uring/cancel.h | 5 +- io_uring/eventfd.c | 5 +- io_uring/fdinfo.c | 8 +- io_uring/filetable.c | 8 +- io_uring/futex.c | 14 +-- io_uring/io_uring.c | 181 +++++++++++++++++++-------------- io_uring/io_uring.h | 75 +++++++++++--- io_uring/kbuf.c | 32 +++--- io_uring/memmap.h | 2 +- io_uring/msg_ring.c | 29 ++++-- io_uring/notif.c | 5 +- io_uring/notif.h | 3 +- io_uring/openclose.c | 14 +-- io_uring/poll.c | 21 ++-- io_uring/register.c | 79 +++++++------- io_uring/rsrc.c | 51 ++++++---- io_uring/rsrc.h | 6 +- io_uring/rw.c | 2 +- io_uring/splice.c | 5 +- io_uring/sqpoll.c | 5 +- io_uring/tctx.c | 27 +++-- io_uring/tctx.h | 5 +- io_uring/uring_cmd.c | 13 ++- io_uring/waitid.c | 13 +-- io_uring/zcrx.c | 2 +- 27 files changed, 404 insertions(+), 258 deletions(-) diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h index e1adb0d20a0a..74d202394b20 100644 --- a/include/linux/io_uring_types.h +++ b/include/linux/io_uring_types.h @@ -86,11 +86,11 @@ struct io_mapped_region { =20 /* * Return value from io_buffer_list selection, to avoid stashing it in * struct io_kiocb. For legacy/classic provided buffers, keeping a referen= ce * across execution contexts are fine. But for ring provided buffers, the - * list may go away as soon as ->uring_lock is dropped. As the io_kiocb + * list may go away as soon as the ctx uring lock is dropped. As the io_ki= ocb * persists, it's better to just keep the buffer local for those cases. */ struct io_br_sel { struct io_buffer_list *buf_list; /* @@ -231,11 +231,11 @@ struct io_submit_link { struct io_kiocb *head; struct io_kiocb *last; }; =20 struct io_submit_state { - /* inline/task_work completion list, under ->uring_lock */ + /* inline/task_work completion list, under ctx uring lock */ struct io_wq_work_node free_list; /* batch completion logic */ struct io_wq_work_list compl_reqs; struct io_submit_link link; =20 @@ -303,16 +303,16 @@ struct io_ring_ctx { unsigned cached_sq_head; unsigned sq_entries; =20 /* * Fixed resources fast path, should be accessed only under - * uring_lock, and updated through io_uring_register(2) + * ctx uring lock, and updated through io_uring_register(2) */ atomic_t cancel_seq; =20 /* - * ->iopoll_list is protected by the ctx->uring_lock for + * ->iopoll_list is protected by the ctx uring lock for * io_uring instances that don't use IORING_SETUP_SQPOLL. * For SQPOLL, only the single threaded io_sq_thread() will * manipulate the list, hence no extra locking is needed there. */ bool poll_multi_queue; @@ -324,11 +324,11 @@ struct io_ring_ctx { struct io_alloc_cache imu_cache; =20 struct io_submit_state submit_state; =20 /* - * Modifications are protected by ->uring_lock and ->mmap_lock. + * Modifications protected by ctx uring lock and ->mmap_lock. * The buffer list's io mapped region should be stable once * published. */ struct xarray io_bl_xa; =20 @@ -467,11 +467,11 @@ struct io_ring_ctx { struct io_mapped_region param_region; }; =20 /* * Token indicating function is called in task work context: - * ctx->uring_lock is held and any completions generated will be flushed. 
+ * ctx uring lock is held and any completions generated will be flushed. * ONLY core io_uring.c should instantiate this struct. */ struct io_tw_state { bool cancel; }; diff --git a/io_uring/cancel.c b/io_uring/cancel.c index ca12ac10c0ae..68b58c7765ef 100644 --- a/io_uring/cancel.c +++ b/io_uring/cancel.c @@ -168,10 +168,11 @@ int io_async_cancel_prep(struct io_kiocb *req, const = struct io_uring_sqe *sqe) static int __io_async_cancel(struct io_cancel_data *cd, struct io_uring_task *tctx, unsigned int issue_flags) { bool all =3D cd->flags & (IORING_ASYNC_CANCEL_ALL|IORING_ASYNC_CANCEL_ANY= ); + struct io_ring_ctx_lock_state lock_state; struct io_ring_ctx *ctx =3D cd->ctx; struct io_tctx_node *node; int ret, nr =3D 0; =20 do { @@ -182,21 +183,21 @@ static int __io_async_cancel(struct io_cancel_data *c= d, return ret; nr++; } while (1); =20 /* slow path, try all io-wq's */ - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); ret =3D -ENOENT; list_for_each_entry(node, &ctx->tctx_list, ctx_node) { ret =3D io_async_cancel_one(node->task->io_uring, cd); if (ret !=3D -ENOENT) { if (!all) break; nr++; } } - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); return all ? nr : ret; } =20 int io_async_cancel(struct io_kiocb *req, unsigned int issue_flags) { @@ -238,11 +239,11 @@ int io_async_cancel(struct io_kiocb *req, unsigned in= t issue_flags) static int __io_sync_cancel(struct io_uring_task *tctx, struct io_cancel_data *cd, int fd) { struct io_ring_ctx *ctx =3D cd->ctx; =20 - /* fixed must be grabbed every time since we drop the uring_lock */ + /* fixed must be grabbed every time since we drop the ctx uring lock */ if ((cd->flags & IORING_ASYNC_CANCEL_FD) && (cd->flags & IORING_ASYNC_CANCEL_FD_FIXED)) { struct io_rsrc_node *node; =20 node =3D io_rsrc_node_lookup(&ctx->file_table.data, fd); @@ -254,12 +255,12 @@ static int __io_sync_cancel(struct io_uring_task *tct= x, } =20 return __io_async_cancel(cd, tctx, 0); } =20 -int io_sync_cancel(struct io_ring_ctx *ctx, void __user *arg) - __must_hold(&ctx->uring_lock) +int io_sync_cancel(struct io_ring_ctx *ctx, void __user *arg, + struct io_ring_ctx_lock_state *lock_state) { struct io_cancel_data cd =3D { .ctx =3D ctx, .seq =3D atomic_inc_return(&ctx->cancel_seq), }; @@ -267,10 +268,12 @@ int io_sync_cancel(struct io_ring_ctx *ctx, void __us= er *arg) struct io_uring_sync_cancel_reg sc; struct file *file =3D NULL; DEFINE_WAIT(wait); int ret, i; =20 + io_ring_ctx_assert_locked(ctx); + if (copy_from_user(&sc, arg, sizeof(sc))) return -EFAULT; if (sc.flags & ~CANCEL_FLAGS) return -EINVAL; for (i =3D 0; i < ARRAY_SIZE(sc.pad); i++) @@ -317,11 +320,11 @@ int io_sync_cancel(struct io_ring_ctx *ctx, void __us= er *arg) =20 prepare_to_wait(&ctx->cq_wait, &wait, TASK_INTERRUPTIBLE); =20 ret =3D __io_sync_cancel(current->io_uring, &cd, sc.fd); =20 - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, lock_state); if (ret !=3D -EALREADY) break; =20 ret =3D io_run_task_work_sig(ctx); if (ret < 0) @@ -329,15 +332,15 @@ int io_sync_cancel(struct io_ring_ctx *ctx, void __us= er *arg) ret =3D schedule_hrtimeout(&timeout, HRTIMER_MODE_ABS); if (!ret) { ret =3D -ETIME; break; } - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, lock_state); } while (1); =20 finish_wait(&ctx->cq_wait, &wait); - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, lock_state); =20 if (ret =3D=3D -ENOENT || ret > 0) ret =3D 0; out: if (file) @@ -351,11 +354,11 @@ bool 
io_cancel_remove_all(struct io_ring_ctx *ctx, st= ruct io_uring_task *tctx, { struct hlist_node *tmp; struct io_kiocb *req; bool found =3D false; =20 - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); =20 hlist_for_each_entry_safe(req, tmp, list, hash_node) { if (!io_match_task_safe(req, tctx, cancel_all)) continue; hlist_del_init(&req->hash_node); @@ -368,24 +371,25 @@ bool io_cancel_remove_all(struct io_ring_ctx *ctx, st= ruct io_uring_task *tctx, =20 int io_cancel_remove(struct io_ring_ctx *ctx, struct io_cancel_data *cd, unsigned int issue_flags, struct hlist_head *list, bool (*cancel)(struct io_kiocb *)) { + struct io_ring_ctx_lock_state lock_state; struct hlist_node *tmp; struct io_kiocb *req; int nr =3D 0; =20 - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); hlist_for_each_entry_safe(req, tmp, list, hash_node) { if (!io_cancel_req_match(req, cd)) continue; if (cancel(req)) nr++; if (!(cd->flags & IORING_ASYNC_CANCEL_ALL)) break; } - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); return nr ?: -ENOENT; } =20 static bool io_match_linked(struct io_kiocb *head) { @@ -477,37 +481,39 @@ __cold bool io_cancel_ctx_cb(struct io_wq_work *work,= void *data) return req->ctx =3D=3D data; } =20 static __cold bool io_uring_try_cancel_iowq(struct io_ring_ctx *ctx) { + struct io_ring_ctx_lock_state lock_state; struct io_tctx_node *node; enum io_wq_cancel cret; bool ret =3D false; =20 - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, &lock_state); list_for_each_entry(node, &ctx->tctx_list, ctx_node) { struct io_uring_task *tctx =3D node->task->io_uring; =20 /* - * io_wq will stay alive while we hold uring_lock, because it's - * killed after ctx nodes, which requires to take the lock. + * io_wq will stay alive while we hold ctx uring lock, because + * it's killed after ctx nodes, which requires to take the lock. 
*/ if (!tctx || !tctx->io_wq) continue; cret =3D io_wq_cancel_cb(tctx->io_wq, io_cancel_ctx_cb, ctx, true); ret |=3D (cret !=3D IO_WQ_CANCEL_NOTFOUND); } - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, &lock_state); =20 return ret; } =20 __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx, struct io_uring_task *tctx, bool cancel_all, bool is_sqpoll_thread) { struct io_task_cancel cancel =3D { .tctx =3D tctx, .all =3D cancel_all, }; + struct io_ring_ctx_lock_state lock_state; enum io_wq_cancel cret; bool ret =3D false; =20 /* set it so io_req_local_work_add() would wake us up */ if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) { @@ -542,17 +548,17 @@ __cold bool io_uring_try_cancel_requests(struct io_ri= ng_ctx *ctx, } =20 if ((ctx->flags & IORING_SETUP_DEFER_TASKRUN) && io_allowed_defer_tw_run(ctx)) ret |=3D io_run_local_work(ctx, INT_MAX, INT_MAX) > 0; - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, &lock_state); ret |=3D io_cancel_defer_files(ctx, tctx, cancel_all); ret |=3D io_poll_remove_all(ctx, tctx, cancel_all); ret |=3D io_waitid_remove_all(ctx, tctx, cancel_all); ret |=3D io_futex_remove_all(ctx, tctx, cancel_all); ret |=3D io_uring_try_cancel_uring_cmd(ctx, tctx, cancel_all); - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, &lock_state); ret |=3D io_kill_timeouts(ctx, tctx, cancel_all); if (tctx) ret |=3D io_run_task_work() > 0; else ret |=3D flush_delayed_work(&ctx->fallback_work); diff --git a/io_uring/cancel.h b/io_uring/cancel.h index 6783961ede1b..ce4f6b69218e 100644 --- a/io_uring/cancel.h +++ b/io_uring/cancel.h @@ -2,10 +2,12 @@ #ifndef IORING_CANCEL_H #define IORING_CANCEL_H =20 #include =20 +#include "io_uring.h" + struct io_cancel_data { struct io_ring_ctx *ctx; union { u64 data; struct file *file; @@ -19,11 +21,12 @@ int io_async_cancel_prep(struct io_kiocb *req, const st= ruct io_uring_sqe *sqe); int io_async_cancel(struct io_kiocb *req, unsigned int issue_flags); =20 int io_try_cancel(struct io_uring_task *tctx, struct io_cancel_data *cd, unsigned int issue_flags); =20 -int io_sync_cancel(struct io_ring_ctx *ctx, void __user *arg); +int io_sync_cancel(struct io_ring_ctx *ctx, void __user *arg, + struct io_ring_ctx_lock_state *lock_state); bool io_cancel_req_match(struct io_kiocb *req, struct io_cancel_data *cd); bool io_match_task_safe(struct io_kiocb *head, struct io_uring_task *tctx, bool cancel_all); =20 bool io_cancel_remove_all(struct io_ring_ctx *ctx, struct io_uring_task *t= ctx, diff --git a/io_uring/eventfd.c b/io_uring/eventfd.c index 78f8ab7db104..0c615be71edf 100644 --- a/io_uring/eventfd.c +++ b/io_uring/eventfd.c @@ -6,10 +6,11 @@ #include #include #include #include =20 +#include "io_uring.h" #include "io-wq.h" #include "eventfd.h" =20 struct io_ev_fd { struct eventfd_ctx *cq_ev_fd; @@ -118,11 +119,11 @@ int io_eventfd_register(struct io_ring_ctx *ctx, void= __user *arg, struct io_ev_fd *ev_fd; __s32 __user *fds =3D arg; int fd; =20 ev_fd =3D rcu_dereference_protected(ctx->io_ev_fd, - lockdep_is_held(&ctx->uring_lock)); + io_ring_ctx_lock_held(ctx)); if (ev_fd) return -EBUSY; =20 if (copy_from_user(&fd, fds, sizeof(*fds))) return -EFAULT; @@ -154,11 +155,11 @@ int io_eventfd_register(struct io_ring_ctx *ctx, void= __user *arg, int io_eventfd_unregister(struct io_ring_ctx *ctx) { struct io_ev_fd *ev_fd; =20 ev_fd =3D rcu_dereference_protected(ctx->io_ev_fd, - lockdep_is_held(&ctx->uring_lock)); + io_ring_ctx_lock_held(ctx)); if (ev_fd) { ctx->has_evfd =3D false; rcu_assign_pointer(ctx->io_ev_fd, NULL); 
io_eventfd_put(ev_fd); return 0; diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c index a87d4e26eee8..886c06278a9b 100644 --- a/io_uring/fdinfo.c +++ b/io_uring/fdinfo.c @@ -9,10 +9,11 @@ #include =20 #include =20 #include "filetable.h" +#include "io_uring.h" #include "sqpoll.h" #include "fdinfo.h" #include "cancel.h" #include "rsrc.h" #include "opdef.h" @@ -75,11 +76,11 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx *= ctx, struct seq_file *m) if (ctx->flags & IORING_SETUP_SQE128) sq_shift =3D 1; =20 /* * we may get imprecise sqe and cqe info if uring is actively running - * since we get cached_sq_head and cached_cq_tail without uring_lock + * since we get cached_sq_head and cached_cq_tail without ctx uring lock * and sq_tail and cq_head are changed by userspace. But it's ok since * we usually use these info when it is stuck. */ seq_printf(m, "SqMask:\t0x%x\n", sq_mask); seq_printf(m, "SqHead:\t%u\n", sq_head); @@ -249,16 +250,17 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx= *ctx, struct seq_file *m) * anything else to get an extra reference. */ __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *file) { struct io_ring_ctx *ctx =3D file->private_data; + struct io_ring_ctx_lock_state lock_state; =20 /* * Avoid ABBA deadlock between the seq lock and the io_uring mutex, * since fdinfo case grabs it in the opposite direction of normal use * cases. */ - if (mutex_trylock(&ctx->uring_lock)) { + if (io_ring_ctx_trylock(ctx, &lock_state)) { __io_uring_show_fdinfo(ctx, m); - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, &lock_state); } } diff --git a/io_uring/filetable.c b/io_uring/filetable.c index 794ef95df293..40ad4a08dc89 100644 --- a/io_uring/filetable.c +++ b/io_uring/filetable.c @@ -55,14 +55,15 @@ void io_free_file_tables(struct io_ring_ctx *ctx, struc= t io_file_table *table) table->bitmap =3D NULL; } =20 static int io_install_fixed_file(struct io_ring_ctx *ctx, struct file *fil= e, u32 slot_index) - __must_hold(&ctx->uring_lock) { struct io_rsrc_node *node; =20 + io_ring_ctx_assert_locked(ctx); + if (io_is_uring_fops(file)) return -EBADF; if (!ctx->file_table.data.nr) return -ENXIO; if (slot_index >=3D ctx->file_table.data.nr) @@ -105,16 +106,17 @@ int __io_fixed_fd_install(struct io_ring_ctx *ctx, st= ruct file *file, * fput() is called correspondingly. 
*/ int io_fixed_fd_install(struct io_kiocb *req, unsigned int issue_flags, struct file *file, unsigned int file_slot) { + struct io_ring_ctx_lock_state lock_state; struct io_ring_ctx *ctx =3D req->ctx; int ret; =20 - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); ret =3D __io_fixed_fd_install(ctx, file, file_slot); - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); =20 if (unlikely(ret < 0)) fput(file); return ret; } diff --git a/io_uring/futex.c b/io_uring/futex.c index 11bfff5a80df..aeda00981c7a 100644 --- a/io_uring/futex.c +++ b/io_uring/futex.c @@ -220,22 +220,23 @@ static void io_futex_wake_fn(struct wake_q_head *wake= _q, struct futex_q *q) =20 int io_futexv_wait(struct io_kiocb *req, unsigned int issue_flags) { struct io_futex *iof =3D io_kiocb_to_cmd(req, struct io_futex); struct io_futexv_data *ifd =3D req->async_data; + struct io_ring_ctx_lock_state lock_state; struct io_ring_ctx *ctx =3D req->ctx; int ret, woken =3D -1; =20 - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); =20 ret =3D futex_wait_multiple_setup(ifd->futexv, iof->futex_nr, &woken); =20 /* * Error case, ret is < 0. Mark the request as failed. */ if (unlikely(ret < 0)) { - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); req_set_fail(req); io_req_set_res(req, ret, 0); io_req_async_data_free(req); return IOU_COMPLETE; } @@ -265,27 +266,28 @@ int io_futexv_wait(struct io_kiocb *req, unsigned int= issue_flags) iof->futexv_unqueued =3D 1; if (woken !=3D -1) io_req_set_res(req, woken, 0); } =20 - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); return IOU_ISSUE_SKIP_COMPLETE; } =20 int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags) { struct io_futex *iof =3D io_kiocb_to_cmd(req, struct io_futex); + struct io_ring_ctx_lock_state lock_state; struct io_ring_ctx *ctx =3D req->ctx; struct io_futex_data *ifd =3D NULL; int ret; =20 if (!iof->futex_mask) { ret =3D -EINVAL; goto done; } =20 - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); ifd =3D io_cache_alloc(&ctx->futex_cache, GFP_NOWAIT); if (!ifd) { ret =3D -ENOMEM; goto done_unlock; } @@ -299,17 +301,17 @@ int io_futex_wait(struct io_kiocb *req, unsigned int = issue_flags) =20 ret =3D futex_wait_setup(iof->uaddr, iof->futex_val, iof->futex_flags, &ifd->q, NULL, NULL); if (!ret) { hlist_add_head(&req->hash_node, &ctx->futex_list); - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); =20 return IOU_ISSUE_SKIP_COMPLETE; } =20 done_unlock: - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); done: if (ret < 0) req_set_fail(req); io_req_set_res(req, ret, 0); io_req_async_data_free(req); diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 40582121c6a7..ac71350285d7 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -234,20 +234,21 @@ static inline bool io_should_terminate_tw(struct io_r= ing_ctx *ctx) static __cold void io_fallback_req_func(struct work_struct *work) { struct io_ring_ctx *ctx =3D container_of(work, struct io_ring_ctx, fallback_work.work); struct llist_node *node =3D llist_del_all(&ctx->fallback_llist); + struct io_ring_ctx_lock_state lock_state; struct io_kiocb *req, *tmp; struct io_tw_state ts =3D {}; =20 percpu_ref_get(&ctx->refs); - 
mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, &lock_state); ts.cancel =3D io_should_terminate_tw(ctx); llist_for_each_entry_safe(req, tmp, node, io_task_work.node) req->io_task_work.func((struct io_tw_req){req}, ts); io_submit_flush_completions(ctx); - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, &lock_state); percpu_ref_put(&ctx->refs); } =20 static int io_alloc_hash_table(struct io_hash_table *table, unsigned bits) { @@ -514,11 +515,11 @@ unsigned io_linked_nr(struct io_kiocb *req) =20 static __cold noinline void io_queue_deferred(struct io_ring_ctx *ctx) { bool drain_seen =3D false, first =3D true; =20 - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); __io_req_caches_free(ctx); =20 while (!list_empty(&ctx->defer_list)) { struct io_defer_entry *de =3D list_first_entry(&ctx->defer_list, struct io_defer_entry, list); @@ -577,13 +578,15 @@ static void io_cq_unlock_post(struct io_ring_ctx *ctx) spin_unlock(&ctx->completion_lock); io_cqring_wake(ctx); io_commit_cqring_flush(ctx); } =20 -static void __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool dying) +static void +__io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool dying, + struct io_ring_ctx_lock_state *lock_state) { - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); =20 /* don't abort if we're dying, entries must get freed */ if (!dying && __io_cqring_events(ctx) =3D=3D ctx->cq_entries) return; =20 @@ -620,13 +623,13 @@ static void __io_cqring_overflow_flush(struct io_ring= _ctx *ctx, bool dying) * to care for a non-real case. */ if (need_resched()) { ctx->cqe_sentinel =3D ctx->cqe_cached; io_cq_unlock_post(ctx); - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, lock_state); cond_resched(); - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, lock_state); io_cq_lock(ctx); } } =20 if (list_empty(&ctx->cq_overflow_list)) { @@ -634,21 +637,24 @@ static void __io_cqring_overflow_flush(struct io_ring= _ctx *ctx, bool dying) atomic_andnot(IORING_SQ_CQ_OVERFLOW, &ctx->rings->sq_flags); } io_cq_unlock_post(ctx); } =20 -static void io_cqring_overflow_kill(struct io_ring_ctx *ctx) +static void io_cqring_overflow_kill(struct io_ring_ctx *ctx, + struct io_ring_ctx_lock_state *lock_state) { if (ctx->rings) - __io_cqring_overflow_flush(ctx, true); + __io_cqring_overflow_flush(ctx, true, lock_state); } =20 static void io_cqring_do_overflow_flush(struct io_ring_ctx *ctx) { - mutex_lock(&ctx->uring_lock); - __io_cqring_overflow_flush(ctx, false); - mutex_unlock(&ctx->uring_lock); + struct io_ring_ctx_lock_state lock_state; + + io_ring_ctx_lock(ctx, &lock_state); + __io_cqring_overflow_flush(ctx, false, &lock_state); + io_ring_ctx_unlock(ctx, &lock_state); } =20 /* must to be called somewhat shortly after putting a request */ static inline void io_put_task(struct io_kiocb *req) { @@ -883,15 +889,15 @@ bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 use= r_data, s32 res, u32 cflags return filled; } =20 /* * Must be called from inline task_work so we know a flush will happen lat= er, - * and obviously with ctx->uring_lock held (tw always has that). + * and obviously with ctx uring lock held (tw always has that). 
*/ void io_add_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 c= flags) { - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); lockdep_assert(ctx->lockless_cq); =20 if (!io_fill_cqe_aux(ctx, user_data, res, cflags)) { struct io_cqe cqe =3D io_init_cqe(user_data, res, cflags); =20 @@ -916,11 +922,11 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u= 32 cflags) */ if (!wq_list_empty(&ctx->submit_state.compl_reqs)) __io_submit_flush_completions(ctx); =20 lockdep_assert(!io_wq_current_is_worker()); - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); =20 if (!ctx->lockless_cq) { spin_lock(&ctx->completion_lock); posted =3D io_fill_cqe_aux(ctx, req->cqe.user_data, res, cflags); spin_unlock(&ctx->completion_lock); @@ -940,11 +946,11 @@ bool io_req_post_cqe32(struct io_kiocb *req, struct i= o_uring_cqe cqe[2]) { struct io_ring_ctx *ctx =3D req->ctx; bool posted; =20 lockdep_assert(!io_wq_current_is_worker()); - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); =20 cqe[0].user_data =3D req->cqe.user_data; if (!ctx->lockless_cq) { spin_lock(&ctx->completion_lock); posted =3D io_fill_cqe_aux32(ctx, cqe); @@ -969,11 +975,11 @@ static void io_req_complete_post(struct io_kiocb *req= , unsigned issue_flags) if (WARN_ON_ONCE(!(issue_flags & IO_URING_F_IOWQ))) return; =20 /* * Handle special CQ sync cases via task_work. DEFER_TASKRUN requires - * the submitter task context, IOPOLL protects with uring_lock. + * the submitter task context, IOPOLL protects with ctx uring lock. */ if (ctx->lockless_cq || (req->flags & REQ_F_REISSUE)) { defer_complete: req->io_task_work.func =3D io_req_task_complete; io_req_task_work_add(req); @@ -994,15 +1000,14 @@ static void io_req_complete_post(struct io_kiocb *re= q, unsigned issue_flags) */ req_ref_put(req); } =20 void io_req_defer_failed(struct io_kiocb *req, s32 res) - __must_hold(&ctx->uring_lock) { const struct io_cold_def *def =3D &io_cold_defs[req->opcode]; =20 - lockdep_assert_held(&req->ctx->uring_lock); + io_ring_ctx_assert_locked(req->ctx); =20 req_set_fail(req); io_req_set_res(req, res, io_put_kbuf(req, res, NULL)); if (def->fail) def->fail(req); @@ -1010,20 +1015,21 @@ void io_req_defer_failed(struct io_kiocb *req, s32 = res) } =20 /* * A request might get retired back into the request caches even before op= code * handlers and io_issue_sqe() are done with it, e.g. inline completion pa= th. - * Because of that, io_alloc_req() should be called only under ->uring_lock + * Because of that, io_alloc_req() should be called only under ctx uring l= ock * and with extra caution to not get a request that is still worked on. */ __cold bool __io_alloc_req_refill(struct io_ring_ctx *ctx) - __must_hold(&ctx->uring_lock) { gfp_t gfp =3D GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO; void *reqs[IO_REQ_ALLOC_BATCH]; int ret; =20 + io_ring_ctx_assert_locked(ctx); + ret =3D kmem_cache_alloc_bulk(req_cachep, gfp, ARRAY_SIZE(reqs), reqs); =20 /* * Bulk alloc is all-or-nothing. If we fail to get a batch, * retry single alloc to be on the safe side. 
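The conversions in the surrounding hunks all follow the same mechanical shape. As a reference only, and not part of the patch itself, a hypothetical opcode handler converted to the new helpers would look like the sketch below; io_foo_issue() and do_foo() are invented names, while struct io_ring_ctx_lock_state, io_ring_submit_lock() and io_ring_submit_unlock() are the helpers this series introduces:

static int io_foo_issue(struct io_kiocb *req, unsigned int issue_flags)
{
	/* lock_state must be handed to the matching unlock call */
	struct io_ring_ctx_lock_state lock_state;
	struct io_ring_ctx *ctx = req->ctx;
	int ret;

	/* takes the ctx uring lock only for IO_URING_F_UNLOCKED callers */
	io_ring_submit_lock(ctx, issue_flags, &lock_state);
	ret = do_foo(ctx);	/* stand-in for the handler's real work */
	io_ring_submit_unlock(ctx, issue_flags, &lock_state);
	return ret;
}

Paths that always need the lock, such as io_fallback_req_func() above, call io_ring_ctx_lock()/io_ring_ctx_unlock() directly, and lockdep_assert_held(&ctx->uring_lock) sites become io_ring_ctx_assert_locked(ctx).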
@@ -1080,19 +1086,20 @@ static inline struct io_kiocb *io_req_find_next(str= uct io_kiocb *req) nxt =3D req->link; req->link =3D NULL; return nxt; } =20 -static void ctx_flush_and_put(struct io_ring_ctx *ctx, io_tw_token_t tw) +static void ctx_flush_and_put(struct io_ring_ctx *ctx, io_tw_token_t tw, + struct io_ring_ctx_lock_state *lock_state) { if (!ctx) return; if (ctx->flags & IORING_SETUP_TASKRUN_FLAG) atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags); =20 io_submit_flush_completions(ctx); - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, lock_state); percpu_ref_put(&ctx->refs); } =20 /* * Run queued task_work, returning the number of entries processed in *cou= nt. @@ -1101,38 +1108,39 @@ static void ctx_flush_and_put(struct io_ring_ctx *c= tx, io_tw_token_t tw) */ struct llist_node *io_handle_tw_list(struct llist_node *node, unsigned int *count, unsigned int max_entries) { + struct io_ring_ctx_lock_state lock_state; struct io_ring_ctx *ctx =3D NULL; struct io_tw_state ts =3D { }; =20 do { struct llist_node *next =3D node->next; struct io_kiocb *req =3D container_of(node, struct io_kiocb, io_task_work.node); =20 if (req->ctx !=3D ctx) { - ctx_flush_and_put(ctx, ts); + ctx_flush_and_put(ctx, ts, &lock_state); ctx =3D req->ctx; - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, &lock_state); percpu_ref_get(&ctx->refs); ts.cancel =3D io_should_terminate_tw(ctx); } INDIRECT_CALL_2(req->io_task_work.func, io_poll_task_func, io_req_rw_complete, (struct io_tw_req){req}, ts); node =3D next; (*count)++; if (unlikely(need_resched())) { - ctx_flush_and_put(ctx, ts); + ctx_flush_and_put(ctx, ts, &lock_state); ctx =3D NULL; cond_resched(); } } while (node && *count < max_entries); =20 - ctx_flush_and_put(ctx, ts); + ctx_flush_and_put(ctx, ts, &lock_state); return node; } =20 static __cold void __io_fallback_tw(struct llist_node *node, bool sync) { @@ -1401,16 +1409,17 @@ static inline int io_run_local_work_locked(struct i= o_ring_ctx *ctx, max(IO_LOCAL_TW_DEFAULT_MAX, min_events)); } =20 int io_run_local_work(struct io_ring_ctx *ctx, int min_events, int max_eve= nts) { + struct io_ring_ctx_lock_state lock_state; struct io_tw_state ts =3D {}; int ret; =20 - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, &lock_state); ret =3D __io_run_local_work(ctx, ts, min_events, max_events); - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, &lock_state); return ret; } =20 static void io_req_task_cancel(struct io_tw_req tw_req, io_tw_token_t tw) { @@ -1465,12 +1474,13 @@ static inline void io_req_put_rsrc_nodes(struct io_= kiocb *req) io_put_rsrc_node(req->ctx, req->buf_node); } =20 static void io_free_batch_list(struct io_ring_ctx *ctx, struct io_wq_work_node *node) - __must_hold(&ctx->uring_lock) { + io_ring_ctx_assert_locked(ctx); + do { struct io_kiocb *req =3D container_of(node, struct io_kiocb, comp_list); =20 if (unlikely(req->flags & IO_REQ_CLEAN_SLOW_FLAGS)) { @@ -1506,15 +1516,16 @@ static void io_free_batch_list(struct io_ring_ctx *= ctx, io_req_add_to_cache(req, ctx); } while (node); } =20 void __io_submit_flush_completions(struct io_ring_ctx *ctx) - __must_hold(&ctx->uring_lock) { struct io_submit_state *state =3D &ctx->submit_state; struct io_wq_work_node *node; =20 + io_ring_ctx_assert_locked(ctx); + __io_cq_lock(ctx); __wq_list_for_each(node, &state->compl_reqs) { struct io_kiocb *req =3D container_of(node, struct io_kiocb, comp_list); =20 @@ -1555,51 +1566,54 @@ static unsigned io_cqring_events(struct io_ring_ctx= *ctx) * We can't just wait for polled events 
to come to us, we have to actively * find and complete them. */ __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx) { + struct io_ring_ctx_lock_state lock_state; + if (!(ctx->flags & IORING_SETUP_IOPOLL)) return; =20 - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, &lock_state); while (!wq_list_empty(&ctx->iopoll_list)) { /* let it sleep and repeat later if can't complete a request */ if (io_do_iopoll(ctx, true) =3D=3D 0) break; /* * Ensure we allow local-to-the-cpu processing to take place, * in this case we need to ensure that we reap all events. * Also let task_work, etc. to progress by releasing the mutex */ if (need_resched()) { - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, &lock_state); cond_resched(); - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, &lock_state); } } - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, &lock_state); =20 if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) io_move_task_work_from_local(ctx); } =20 -static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_event= s) +static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_event= s, + struct io_ring_ctx_lock_state *lock_state) { unsigned int nr_events =3D 0; unsigned long check_cq; =20 min_events =3D min(min_events, ctx->cq_entries); =20 - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); =20 if (!io_allowed_run_tw(ctx)) return -EEXIST; =20 check_cq =3D READ_ONCE(ctx->check_cq); if (unlikely(check_cq)) { if (check_cq & BIT(IO_CHECK_CQ_OVERFLOW_BIT)) - __io_cqring_overflow_flush(ctx, false); + __io_cqring_overflow_flush(ctx, false, lock_state); /* * Similarly do not spin if we have not informed the user of any * dropped CQE. */ if (check_cq & BIT(IO_CHECK_CQ_DROPPED_BIT)) @@ -1617,11 +1631,11 @@ static int io_iopoll_check(struct io_ring_ctx *ctx,= unsigned int min_events) int ret =3D 0; =20 /* * If a submit got punted to a workqueue, we can have the * application entering polling for a command before it gets - * issued. That app will hold the uring_lock for the duration + * issued. That app holds the ctx uring lock for the duration * of the poll right here, so we need to take a breather every * now and then to ensure that the issue has a chance to add * the poll to the issued list. Otherwise we can spin here * forever, while the workqueue is stuck trying to acquire the * very same mutex. @@ -1632,13 +1646,13 @@ static int io_iopoll_check(struct io_ring_ctx *ctx,= unsigned int min_events) =20 (void) io_run_local_work_locked(ctx, min_events); =20 if (task_work_pending(current) || wq_list_empty(&ctx->iopoll_list)) { - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, lock_state); io_run_task_work(); - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, lock_state); } /* some requests don't go through iopoll_list */ if (tail !=3D ctx->cached_cq_tail || wq_list_empty(&ctx->iopoll_list)) break; @@ -1669,14 +1683,15 @@ void io_req_task_complete(struct io_tw_req tw_req, = io_tw_token_t tw) * find it from a io_do_iopoll() thread before the issuer is done * accessing the kiocb cookie. 
*/ static void io_iopoll_req_issued(struct io_kiocb *req, unsigned int issue_= flags) { + struct io_ring_ctx_lock_state lock_state; struct io_ring_ctx *ctx =3D req->ctx; =20 - /* workqueue context doesn't hold uring_lock, grab it now */ - io_ring_submit_lock(ctx, issue_flags); + /* workqueue context doesn't hold ctx uring lock, grab it now */ + io_ring_submit_lock(ctx, issue_flags, &lock_state); =20 /* * Track whether we have multiple files in our lists. This will impact * how we do polling eventually, not spinning if we're on potentially * different devices. @@ -1710,11 +1725,11 @@ static void io_iopoll_req_issued(struct io_kiocb *r= eq, unsigned int issue_flags) */ if ((ctx->flags & IORING_SETUP_SQPOLL) && wq_has_sleeper(&ctx->sq_data->wait)) wake_up(&ctx->sq_data->wait); =20 - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, &lock_state); } } =20 io_req_flags_t io_file_get_flags(struct file *file) { @@ -1728,16 +1743,17 @@ io_req_flags_t io_file_get_flags(struct file *file) res |=3D REQ_F_SUPPORT_NOWAIT; return res; } =20 static __cold void io_drain_req(struct io_kiocb *req) - __must_hold(&ctx->uring_lock) { struct io_ring_ctx *ctx =3D req->ctx; bool drain =3D req->flags & IOSQE_IO_DRAIN; struct io_defer_entry *de; =20 + io_ring_ctx_assert_locked(ctx); + de =3D kmalloc(sizeof(*de), GFP_KERNEL_ACCOUNT); if (!de) { io_req_defer_failed(req, -ENOMEM); return; } @@ -1960,23 +1976,24 @@ void io_wq_submit_work(struct io_wq_work *work) } =20 inline struct file *io_file_get_fixed(struct io_kiocb *req, int fd, unsigned int issue_flags) { + struct io_ring_ctx_lock_state lock_state; struct io_ring_ctx *ctx =3D req->ctx; struct io_rsrc_node *node; struct file *file =3D NULL; =20 - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); node =3D io_rsrc_node_lookup(&ctx->file_table.data, fd); if (node) { node->refs++; req->file_node =3D node; req->flags |=3D io_slot_flags(node); file =3D io_slot_file(node); } - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); return file; } =20 struct file *io_file_get_normal(struct io_kiocb *req, int fd) { @@ -2004,12 +2021,13 @@ static int io_req_sqe_copy(struct io_kiocb *req, un= signed int issue_flags) def->sqe_copy(req); return 0; } =20 static void io_queue_async(struct io_kiocb *req, unsigned int issue_flags,= int ret) - __must_hold(&req->ctx->uring_lock) { + io_ring_ctx_assert_locked(req->ctx); + if (ret !=3D -EAGAIN || (req->flags & REQ_F_NOWAIT)) { fail: io_req_defer_failed(req, ret); return; } @@ -2029,16 +2047,17 @@ static void io_queue_async(struct io_kiocb *req, un= signed int issue_flags, int r break; } } =20 static inline void io_queue_sqe(struct io_kiocb *req, unsigned int extra_f= lags) - __must_hold(&req->ctx->uring_lock) { unsigned int issue_flags =3D IO_URING_F_NONBLOCK | IO_URING_F_COMPLETE_DEFER | extra_flags; int ret; =20 + io_ring_ctx_assert_locked(req->ctx); + ret =3D io_issue_sqe(req, issue_flags); =20 /* * We async punt it if the file wasn't marked NOWAIT, or if the file * doesn't support non-blocking read/write attempts @@ -2046,12 +2065,13 @@ static inline void io_queue_sqe(struct io_kiocb *re= q, unsigned int extra_flags) if (unlikely(ret)) io_queue_async(req, issue_flags, ret); } =20 static void io_queue_sqe_fallback(struct io_kiocb *req) - __must_hold(&req->ctx->uring_lock) { + io_ring_ctx_assert_locked(req->ctx); + if (unlikely(req->flags & REQ_F_FAIL)) { /* * We don't submit, fail them all, for that replace hardlinks * with normal links. 
Extra REQ_F_LINK is tolerated. */ @@ -2116,17 +2136,18 @@ static __cold int io_init_fail_req(struct io_kiocb = *req, int err) return err; } =20 static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req, const struct io_uring_sqe *sqe, unsigned int *left) - __must_hold(&ctx->uring_lock) { const struct io_issue_def *def; unsigned int sqe_flags; int personality; u8 opcode; =20 + io_ring_ctx_assert_locked(ctx); + req->ctx =3D ctx; req->opcode =3D opcode =3D READ_ONCE(sqe->opcode); /* same numerical values with corresponding REQ_F_*, safe to copy */ sqe_flags =3D READ_ONCE(sqe->flags); req->flags =3D (__force io_req_flags_t) sqe_flags; @@ -2269,15 +2290,16 @@ static __cold int io_submit_fail_init(const struct = io_uring_sqe *sqe, return 0; } =20 static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *= req, const struct io_uring_sqe *sqe, unsigned int *left) - __must_hold(&ctx->uring_lock) { struct io_submit_link *link =3D &ctx->submit_state.link; int ret; =20 + io_ring_ctx_assert_locked(ctx); + ret =3D io_init_req(ctx, req, sqe, left); if (unlikely(ret)) return io_submit_fail_init(sqe, req, ret); =20 trace_io_uring_submit_req(req); @@ -2398,16 +2420,17 @@ static bool io_get_sqe(struct io_ring_ctx *ctx, con= st struct io_uring_sqe **sqe) *sqe =3D &ctx->sq_sqes[head]; return true; } =20 int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr) - __must_hold(&ctx->uring_lock) { unsigned int entries =3D io_sqring_entries(ctx); unsigned int left; int ret; =20 + io_ring_ctx_assert_locked(ctx); + entries =3D min(nr, entries); if (unlikely(!entries)) return 0; =20 ret =3D left =3D entries; @@ -2830,28 +2853,33 @@ static __cold void __io_req_caches_free(struct io_r= ing_ctx *ctx) } } =20 static __cold void io_req_caches_free(struct io_ring_ctx *ctx) { - guard(mutex)(&ctx->uring_lock); + struct io_ring_ctx_lock_state lock_state; + + io_ring_ctx_lock(ctx, &lock_state); __io_req_caches_free(ctx); + io_ring_ctx_unlock(ctx, &lock_state); } =20 static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx) { + struct io_ring_ctx_lock_state lock_state; + io_sq_thread_finish(ctx); =20 - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, &lock_state); io_sqe_buffers_unregister(ctx); io_sqe_files_unregister(ctx); io_unregister_zcrx_ifqs(ctx); - io_cqring_overflow_kill(ctx); + io_cqring_overflow_kill(ctx, &lock_state); io_eventfd_unregister(ctx); io_free_alloc_caches(ctx); io_destroy_buffers(ctx); io_free_region(ctx->user, &ctx->param_region); - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, &lock_state); if (ctx->sq_creds) put_cred(ctx->sq_creds); if (ctx->submitter_task) put_task_struct(ctx->submitter_task); =20 @@ -2882,14 +2910,15 @@ static __cold void io_ring_ctx_free(struct io_ring_= ctx *ctx) =20 static __cold void io_activate_pollwq_cb(struct callback_head *cb) { struct io_ring_ctx *ctx =3D container_of(cb, struct io_ring_ctx, poll_wq_task_work); + struct io_ring_ctx_lock_state lock_state; =20 - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, &lock_state); ctx->poll_activated =3D true; - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, &lock_state); =20 /* * Wake ups for some events between start of polling and activation * might've been lost due to loose synchronisation. 
*/ @@ -2979,10 +3008,11 @@ static __cold void io_tctx_exit_cb(struct callback_= head *cb) } =20 static __cold void io_ring_exit_work(struct work_struct *work) { struct io_ring_ctx *ctx =3D container_of(work, struct io_ring_ctx, exit_w= ork); + struct io_ring_ctx_lock_state lock_state; unsigned long timeout =3D jiffies + HZ * 60 * 5; unsigned long interval =3D HZ / 20; struct io_tctx_exit exit; struct io_tctx_node *node; int ret; @@ -2993,13 +3023,13 @@ static __cold void io_ring_exit_work(struct work_st= ruct *work) * we're waiting for refs to drop. We need to reap these manually, * as nobody else will be looking for them. */ do { if (test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq)) { - mutex_lock(&ctx->uring_lock); - io_cqring_overflow_kill(ctx); - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, &lock_state); + io_cqring_overflow_kill(ctx, &lock_state); + io_ring_ctx_unlock(ctx, &lock_state); } =20 if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) io_move_task_work_from_local(ctx); =20 @@ -3040,11 +3070,11 @@ static __cold void io_ring_exit_work(struct work_st= ruct *work) =20 init_completion(&exit.completion); init_task_work(&exit.task_work, io_tctx_exit_cb); exit.ctx =3D ctx; =20 - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, &lock_state); while (!list_empty(&ctx->tctx_list)) { WARN_ON_ONCE(time_after(jiffies, timeout)); =20 node =3D list_first_entry(&ctx->tctx_list, struct io_tctx_node, ctx_node); @@ -3052,20 +3082,20 @@ static __cold void io_ring_exit_work(struct work_st= ruct *work) list_rotate_left(&ctx->tctx_list); ret =3D task_work_add(node->task, &exit.task_work, TWA_SIGNAL); if (WARN_ON_ONCE(ret)) continue; =20 - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, &lock_state); /* * See comment above for * wait_for_completion_interruptible_timeout() on why this * wait is marked as interruptible. 
*/ wait_for_completion_interruptible(&exit.completion); - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, &lock_state); } - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, &lock_state); spin_lock(&ctx->completion_lock); spin_unlock(&ctx->completion_lock); =20 /* pairs with RCU read section in io_req_local_work_add() */ if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) @@ -3074,18 +3104,19 @@ static __cold void io_ring_exit_work(struct work_st= ruct *work) io_ring_ctx_free(ctx); } =20 static __cold void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) { + struct io_ring_ctx_lock_state lock_state; unsigned long index; struct creds *creds; =20 - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, &lock_state); percpu_ref_kill(&ctx->refs); xa_for_each(&ctx->personalities, index, creds) io_unregister_personality(ctx, index); - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, &lock_state); =20 flush_delayed_work(&ctx->fallback_work); =20 INIT_WORK(&ctx->exit_work, io_ring_exit_work); /* @@ -3216,10 +3247,11 @@ static int io_get_ext_arg(struct io_ring_ctx *ctx, = unsigned flags, =20 SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, u32, min_complete, u32, flags, const void __user *, argp, size_t, argsz) { + struct io_ring_ctx_lock_state lock_state; struct io_ring_ctx *ctx; struct file *file; long ret; =20 if (unlikely(flags & ~IORING_ENTER_FLAGS)) @@ -3272,14 +3304,14 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u= 32, to_submit, } else if (to_submit) { ret =3D io_uring_add_tctx_node(ctx); if (unlikely(ret)) goto out; =20 - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, &lock_state); ret =3D io_submit_sqes(ctx, to_submit); if (ret !=3D to_submit) { - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, &lock_state); goto out; } if (flags & IORING_ENTER_GETEVENTS) { if (ctx->syscall_iopoll) goto iopoll_locked; @@ -3288,11 +3320,11 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u= 32, to_submit, * it should handle ownership problems if any. */ if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) (void)io_run_local_work_locked(ctx, min_complete); } - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, &lock_state); } =20 if (flags & IORING_ENTER_GETEVENTS) { int ret2; =20 @@ -3301,16 +3333,17 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u= 32, to_submit, * We disallow the app entering submit/complete with * polling, but we still need to lock the ring to * prevent racing with polled issue that got punted to * a workqueue. 
*/ - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, &lock_state); iopoll_locked: ret2 =3D io_validate_ext_arg(ctx, flags, argp, argsz); if (likely(!ret2)) - ret2 =3D io_iopoll_check(ctx, min_complete); - mutex_unlock(&ctx->uring_lock); + ret2 =3D io_iopoll_check(ctx, min_complete, + &lock_state); + io_ring_ctx_unlock(ctx, &lock_state); } else { struct ext_arg ext_arg =3D { .argsz =3D argsz }; =20 ret2 =3D io_get_ext_arg(ctx, flags, argp, &ext_arg); if (likely(!ret2)) diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h index a790c16854d3..57c3eef26a88 100644 --- a/io_uring/io_uring.h +++ b/io_uring/io_uring.h @@ -195,20 +195,64 @@ void io_queue_next(struct io_kiocb *req); void io_task_refs_refill(struct io_uring_task *tctx); bool __io_alloc_req_refill(struct io_ring_ctx *ctx); =20 void io_activate_pollwq(struct io_ring_ctx *ctx); =20 +struct io_ring_ctx_lock_state { +}; + +/* Acquire the ctx uring lock with the given nesting level */ +static inline void io_ring_ctx_lock_nested(struct io_ring_ctx *ctx, + unsigned int subclass, + struct io_ring_ctx_lock_state *state) +{ + mutex_lock_nested(&ctx->uring_lock, subclass); +} + +/* Acquire the ctx uring lock */ +static inline void io_ring_ctx_lock(struct io_ring_ctx *ctx, + struct io_ring_ctx_lock_state *state) +{ + io_ring_ctx_lock_nested(ctx, 0, state); +} + +/* Attempt to acquire the ctx uring lock without blocking */ +static inline bool io_ring_ctx_trylock(struct io_ring_ctx *ctx, + struct io_ring_ctx_lock_state *state) +{ + return mutex_trylock(&ctx->uring_lock); +} + +/* Release the ctx uring lock */ +static inline void io_ring_ctx_unlock(struct io_ring_ctx *ctx, + struct io_ring_ctx_lock_state *state) +{ + mutex_unlock(&ctx->uring_lock); +} + +/* Return (if CONFIG_LOCKDEP) whether the ctx uring lock is held */ +static inline bool io_ring_ctx_lock_held(const struct io_ring_ctx *ctx) +{ + return lockdep_is_held(&ctx->uring_lock); +} + +/* Assert (if CONFIG_LOCKDEP) that the ctx uring lock is held */ +static inline void io_ring_ctx_assert_locked(const struct io_ring_ctx *ctx) +{ + lockdep_assert_held(&ctx->uring_lock); +} + static inline void io_lockdep_assert_cq_locked(struct io_ring_ctx *ctx) { #if defined(CONFIG_PROVE_LOCKING) lockdep_assert(in_task()); =20 if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); =20 if (ctx->flags & IORING_SETUP_IOPOLL) { - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); } else if (!ctx->task_complete) { lockdep_assert_held(&ctx->completion_lock); } else if (ctx->submitter_task) { /* * ->submitter_task may be NULL and we can still post a CQE, @@ -373,30 +417,32 @@ static inline void io_put_file(struct io_kiocb *req) { if (!(req->flags & REQ_F_FIXED_FILE) && req->file) fput(req->file); } =20 -static inline void io_ring_submit_unlock(struct io_ring_ctx *ctx, - unsigned issue_flags) +static inline void +io_ring_submit_unlock(struct io_ring_ctx *ctx, unsigned issue_flags, + struct io_ring_ctx_lock_state *lock_state) { - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); if (unlikely(issue_flags & IO_URING_F_UNLOCKED)) - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, lock_state); } =20 -static inline void io_ring_submit_lock(struct io_ring_ctx *ctx, - unsigned issue_flags) +static inline void +io_ring_submit_lock(struct io_ring_ctx *ctx, unsigned issue_flags, + struct io_ring_ctx_lock_state *lock_state) { /* - * "Normal" inline submissions always hold the uring_lock, since we + * 
"Normal" inline submissions always hold the ctx uring lock, since we * grab it from the system call. Same is true for the SQPOLL offload. * The only exception is when we've detached the request and issue it * from an async worker thread, grab the lock for that case. */ if (unlikely(issue_flags & IO_URING_F_UNLOCKED)) - mutex_lock(&ctx->uring_lock); - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_lock(ctx, lock_state); + io_ring_ctx_assert_locked(ctx); } =20 static inline void io_commit_cqring(struct io_ring_ctx *ctx) { /* order cqe stores with ring update */ @@ -504,24 +550,23 @@ static inline bool io_task_work_pending(struct io_rin= g_ctx *ctx) return task_work_pending(current) || io_local_work_pending(ctx); } =20 static inline void io_tw_lock(struct io_ring_ctx *ctx, io_tw_token_t tw) { - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); } =20 /* * Don't complete immediately but use deferred completion infrastructure. - * Protected by ->uring_lock and can only be used either with + * Protected by ctx uring lock and can only be used either with * IO_URING_F_COMPLETE_DEFER or inside a tw handler holding the mutex. */ static inline void io_req_complete_defer(struct io_kiocb *req) - __must_hold(&req->ctx->uring_lock) { struct io_submit_state *state =3D &req->ctx->submit_state; =20 - lockdep_assert_held(&req->ctx->uring_lock); + io_ring_ctx_assert_locked(req->ctx); =20 wq_list_add_tail(&req->comp_list, &state->compl_reqs); } =20 static inline void io_commit_cqring_flush(struct io_ring_ctx *ctx) diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c index 796d131107dd..0fb9b22171d4 100644 --- a/io_uring/kbuf.c +++ b/io_uring/kbuf.c @@ -72,22 +72,22 @@ bool io_kbuf_commit(struct io_kiocb *req, } =20 static inline struct io_buffer_list *io_buffer_get_list(struct io_ring_ctx= *ctx, unsigned int bgid) { - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); =20 return xa_load(&ctx->io_bl_xa, bgid); } =20 static int io_buffer_add_list(struct io_ring_ctx *ctx, struct io_buffer_list *bl, unsigned int bgid) { /* * Store buffer group ID and finally mark the list as visible. * The normal lookup doesn't care about the visibility as we're - * always under the ->uring_lock, but lookups from mmap do. + * always under the ctx uring lock, but lookups from mmap do. 
*/ bl->bgid =3D bgid; guard(mutex)(&ctx->mmap_lock); return xa_err(xa_store(&ctx->io_bl_xa, bgid, bl, GFP_KERNEL)); } @@ -101,23 +101,24 @@ void io_kbuf_drop_legacy(struct io_kiocb *req) req->kbuf =3D NULL; } =20 bool io_kbuf_recycle_legacy(struct io_kiocb *req, unsigned issue_flags) { + struct io_ring_ctx_lock_state lock_state; struct io_ring_ctx *ctx =3D req->ctx; struct io_buffer_list *bl; struct io_buffer *buf; =20 - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); =20 buf =3D req->kbuf; bl =3D io_buffer_get_list(ctx, buf->bgid); list_add(&buf->list, &bl->buf_list); bl->nbufs++; req->flags &=3D ~REQ_F_BUFFER_SELECTED; =20 - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); return true; } =20 static void __user *io_provided_buffer_select(struct io_kiocb *req, size_t= *len, struct io_buffer_list *bl) @@ -210,24 +211,25 @@ static struct io_br_sel io_ring_buffer_select(struct = io_kiocb *req, size_t *len, } =20 struct io_br_sel io_buffer_select(struct io_kiocb *req, size_t *len, unsigned buf_group, unsigned int issue_flags) { + struct io_ring_ctx_lock_state lock_state; struct io_ring_ctx *ctx =3D req->ctx; struct io_br_sel sel =3D { }; struct io_buffer_list *bl; =20 - io_ring_submit_lock(req->ctx, issue_flags); + io_ring_submit_lock(req->ctx, issue_flags, &lock_state); =20 bl =3D io_buffer_get_list(ctx, buf_group); if (likely(bl)) { if (bl->flags & IOBL_BUF_RING) sel =3D io_ring_buffer_select(req, len, bl, issue_flags); else sel.addr =3D io_provided_buffer_select(req, len, bl); } - io_ring_submit_unlock(req->ctx, issue_flags); + io_ring_submit_unlock(req->ctx, issue_flags, &lock_state); return sel; } =20 /* cap it at a reasonable 256, will be one page even for 4K */ #define PEEK_MAX_IMPORT 256 @@ -315,14 +317,15 @@ static int io_ring_buffers_peek(struct io_kiocb *req,= struct buf_sel_arg *arg, } =20 int io_buffers_select(struct io_kiocb *req, struct buf_sel_arg *arg, struct io_br_sel *sel, unsigned int issue_flags) { + struct io_ring_ctx_lock_state lock_state; struct io_ring_ctx *ctx =3D req->ctx; int ret =3D -ENOENT; =20 - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); sel->buf_list =3D io_buffer_get_list(ctx, arg->buf_group); if (unlikely(!sel->buf_list)) goto out_unlock; =20 if (sel->buf_list->flags & IOBL_BUF_RING) { @@ -342,11 +345,11 @@ int io_buffers_select(struct io_kiocb *req, struct bu= f_sel_arg *arg, ret =3D io_provided_buffers_select(req, &arg->out_len, sel->buf_list, ar= g->iovs); } out_unlock: if (issue_flags & IO_URING_F_UNLOCKED) { sel->buf_list =3D NULL; - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, &lock_state); } return ret; } =20 int io_buffers_peek(struct io_kiocb *req, struct buf_sel_arg *arg, @@ -354,11 +357,11 @@ int io_buffers_peek(struct io_kiocb *req, struct buf_= sel_arg *arg, { struct io_ring_ctx *ctx =3D req->ctx; struct io_buffer_list *bl; int ret; =20 - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); =20 bl =3D io_buffer_get_list(ctx, arg->buf_group); if (unlikely(!bl)) return -ENOENT; =20 @@ -410,11 +413,11 @@ static int io_remove_buffers_legacy(struct io_ring_ct= x *ctx, { unsigned long i =3D 0; struct io_buffer *nxt; =20 /* protects io_buffers_cache */ - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); WARN_ON_ONCE(bl->flags & IOBL_BUF_RING); =20 for (i =3D 0; i < nbufs && !list_empty(&bl->buf_list); i++) { nxt =3D list_first_entry(&bl->buf_list, struct 
io_buffer, list); list_del(&nxt->list); @@ -579,18 +582,19 @@ static int __io_manage_buffers_legacy(struct io_kiocb= *req, } =20 int io_manage_buffers_legacy(struct io_kiocb *req, unsigned int issue_flag= s) { struct io_provide_buf *p =3D io_kiocb_to_cmd(req, struct io_provide_buf); + struct io_ring_ctx_lock_state lock_state; struct io_ring_ctx *ctx =3D req->ctx; struct io_buffer_list *bl; int ret; =20 - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); bl =3D io_buffer_get_list(ctx, p->bgid); ret =3D __io_manage_buffers_legacy(req, bl); - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); =20 if (ret < 0) req_set_fail(req); io_req_set_res(req, ret, 0); return IOU_COMPLETE; @@ -604,11 +608,11 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, vo= id __user *arg) struct io_uring_buf_ring *br; unsigned long mmap_offset; unsigned long ring_size; int ret; =20 - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); =20 if (copy_from_user(®, arg, sizeof(reg))) return -EFAULT; if (!mem_is_zero(reg.resv, sizeof(reg.resv))) return -EINVAL; @@ -680,11 +684,11 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, vo= id __user *arg) int io_unregister_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg) { struct io_uring_buf_reg reg; struct io_buffer_list *bl; =20 - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); =20 if (copy_from_user(®, arg, sizeof(reg))) return -EFAULT; if (!mem_is_zero(reg.resv, sizeof(reg.resv)) || reg.flags) return -EINVAL; diff --git a/io_uring/memmap.h b/io_uring/memmap.h index a39d9e518905..080285686a05 100644 --- a/io_uring/memmap.h +++ b/io_uring/memmap.h @@ -35,11 +35,11 @@ static inline void io_region_publish(struct io_ring_ctx= *ctx, struct io_mapped_region *src_region, struct io_mapped_region *dst_region) { /* * Once published mmap can find it without holding only the ->mmap_lock - * and not ->uring_lock. + * and not the ctx uring lock. */ guard(mutex)(&ctx->mmap_lock); *dst_region =3D *src_region; } =20 diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c index c48588e06bfb..47c7cc56782d 100644 --- a/io_uring/msg_ring.c +++ b/io_uring/msg_ring.c @@ -30,29 +30,31 @@ struct io_msg { u32 cqe_flags; }; u32 flags; }; =20 -static void io_double_unlock_ctx(struct io_ring_ctx *octx) +static void io_double_unlock_ctx(struct io_ring_ctx *octx, + struct io_ring_ctx_lock_state *lock_state) { - mutex_unlock(&octx->uring_lock); + io_ring_ctx_unlock(octx, lock_state); } =20 static int io_lock_external_ctx(struct io_ring_ctx *octx, - unsigned int issue_flags) + unsigned int issue_flags, + struct io_ring_ctx_lock_state *lock_state) { /* * To ensure proper ordering between the two ctxs, we can only * attempt a trylock on the target. If that fails and we already have * the source ctx lock, punt to io-wq. 
*/ if (!(issue_flags & IO_URING_F_UNLOCKED)) { - if (!mutex_trylock(&octx->uring_lock)) + if (!io_ring_ctx_trylock(octx, lock_state)) return -EAGAIN; return 0; } - mutex_lock(&octx->uring_lock); + io_ring_ctx_lock(octx, lock_state); return 0; } =20 void io_msg_ring_cleanup(struct io_kiocb *req) { @@ -116,10 +118,11 @@ static int io_msg_data_remote(struct io_ring_ctx *tar= get_ctx, } =20 static int __io_msg_ring_data(struct io_ring_ctx *target_ctx, struct io_msg *msg, unsigned int issue_flags) { + struct io_ring_ctx_lock_state lock_state; u32 flags =3D 0; int ret; =20 if (msg->src_fd || msg->flags & ~IORING_MSG_RING_FLAGS_PASS) return -EINVAL; @@ -134,17 +137,18 @@ static int __io_msg_ring_data(struct io_ring_ctx *tar= get_ctx, if (msg->flags & IORING_MSG_RING_FLAGS_PASS) flags =3D msg->cqe_flags; =20 ret =3D -EOVERFLOW; if (target_ctx->flags & IORING_SETUP_IOPOLL) { - if (unlikely(io_lock_external_ctx(target_ctx, issue_flags))) + if (unlikely(io_lock_external_ctx(target_ctx, issue_flags, + &lock_state))) return -EAGAIN; } if (io_post_aux_cqe(target_ctx, msg->user_data, msg->len, flags)) ret =3D 0; if (target_ctx->flags & IORING_SETUP_IOPOLL) - io_double_unlock_ctx(target_ctx); + io_double_unlock_ctx(target_ctx, &lock_state); return ret; } =20 static int io_msg_ring_data(struct io_kiocb *req, unsigned int issue_flags) { @@ -155,35 +159,38 @@ static int io_msg_ring_data(struct io_kiocb *req, uns= igned int issue_flags) } =20 static int io_msg_grab_file(struct io_kiocb *req, unsigned int issue_flags) { struct io_msg *msg =3D io_kiocb_to_cmd(req, struct io_msg); + struct io_ring_ctx_lock_state lock_state; struct io_ring_ctx *ctx =3D req->ctx; struct io_rsrc_node *node; int ret =3D -EBADF; =20 - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); node =3D io_rsrc_node_lookup(&ctx->file_table.data, msg->src_fd); if (node) { msg->src_file =3D io_slot_file(node); if (msg->src_file) get_file(msg->src_file); req->flags |=3D REQ_F_NEED_CLEANUP; ret =3D 0; } - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); return ret; } =20 static int io_msg_install_complete(struct io_kiocb *req, unsigned int issu= e_flags) { struct io_ring_ctx *target_ctx =3D req->file->private_data; struct io_msg *msg =3D io_kiocb_to_cmd(req, struct io_msg); + struct io_ring_ctx_lock_state lock_state; struct file *src_file =3D msg->src_file; int ret; =20 - if (unlikely(io_lock_external_ctx(target_ctx, issue_flags))) + if (unlikely(io_lock_external_ctx(target_ctx, issue_flags, + &lock_state))) return -EAGAIN; =20 ret =3D __io_fixed_fd_install(target_ctx, src_file, msg->dst_fd); if (ret < 0) goto out_unlock; @@ -200,11 +207,11 @@ static int io_msg_install_complete(struct io_kiocb *r= eq, unsigned int issue_flag * later IORING_OP_MSG_RING delivers the message. 
*/ if (!io_post_aux_cqe(target_ctx, msg->user_data, ret, 0)) ret =3D -EOVERFLOW; out_unlock: - io_double_unlock_ctx(target_ctx); + io_double_unlock_ctx(target_ctx, &lock_state); return ret; } =20 static void io_msg_tw_fd_complete(struct callback_head *head) { diff --git a/io_uring/notif.c b/io_uring/notif.c index f476775ba44b..8099b87af588 100644 --- a/io_uring/notif.c +++ b/io_uring/notif.c @@ -15,11 +15,11 @@ static void io_notif_tw_complete(struct io_tw_req tw_re= q, io_tw_token_t tw) { struct io_kiocb *notif =3D tw_req.req; struct io_notif_data *nd =3D io_notif_to_data(notif); struct io_ring_ctx *ctx =3D notif->ctx; =20 - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); =20 do { notif =3D cmd_to_io_kiocb(nd); =20 if (WARN_ON_ONCE(ctx !=3D notif->ctx)) @@ -109,15 +109,16 @@ static const struct ubuf_info_ops io_ubuf_ops =3D { .complete =3D io_tx_ubuf_complete, .link_skb =3D io_link_skb, }; =20 struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx) - __must_hold(&ctx->uring_lock) { struct io_kiocb *notif; struct io_notif_data *nd; =20 + io_ring_ctx_assert_locked(ctx); + if (unlikely(!io_alloc_req(ctx, ¬if))) return NULL; notif->ctx =3D ctx; notif->opcode =3D IORING_OP_NOP; notif->flags =3D 0; diff --git a/io_uring/notif.h b/io_uring/notif.h index f3589cfef4a9..c33c9a1179c9 100644 --- a/io_uring/notif.h +++ b/io_uring/notif.h @@ -31,14 +31,15 @@ static inline struct io_notif_data *io_notif_to_data(st= ruct io_kiocb *notif) { return io_kiocb_to_cmd(notif, struct io_notif_data); } =20 static inline void io_notif_flush(struct io_kiocb *notif) - __must_hold(¬if->ctx->uring_lock) { struct io_notif_data *nd =3D io_notif_to_data(notif); =20 + io_ring_ctx_assert_locked(notif->ctx); + io_tx_ubuf_complete(NULL, &nd->uarg, true); } =20 static inline int io_notif_account_mem(struct io_kiocb *notif, unsigned le= n) { diff --git a/io_uring/openclose.c b/io_uring/openclose.c index bfeb91b31bba..432a7a68eec1 100644 --- a/io_uring/openclose.c +++ b/io_uring/openclose.c @@ -189,15 +189,16 @@ void io_open_cleanup(struct io_kiocb *req) } =20 int __io_close_fixed(struct io_ring_ctx *ctx, unsigned int issue_flags, unsigned int offset) { + struct io_ring_ctx_lock_state lock_state; int ret; =20 - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); ret =3D io_fixed_fd_remove(ctx, offset); - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); =20 return ret; } =20 static inline int io_close_fixed(struct io_kiocb *req, unsigned int issue_= flags) @@ -333,18 +334,19 @@ int io_pipe_prep(struct io_kiocb *req, const struct i= o_uring_sqe *sqe) =20 static int io_pipe_fixed(struct io_kiocb *req, struct file **files, unsigned int issue_flags) { struct io_pipe *p =3D io_kiocb_to_cmd(req, struct io_pipe); + struct io_ring_ctx_lock_state lock_state; struct io_ring_ctx *ctx =3D req->ctx; int ret, fds[2] =3D { -1, -1 }; int slot =3D p->file_slot; =20 if (p->flags & O_CLOEXEC) return -EINVAL; =20 - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); =20 ret =3D __io_fixed_fd_install(ctx, files[0], slot); if (ret < 0) goto err; fds[0] =3D ret; @@ -361,23 +363,23 @@ static int io_pipe_fixed(struct io_kiocb *req, struct= file **files, if (ret < 0) goto err; fds[1] =3D ret; files[1] =3D NULL; =20 - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); =20 if (!copy_to_user(p->fds, fds, sizeof(fds))) return 0; =20 ret =3D -EFAULT; - 
io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); err: if (fds[0] !=3D -1) io_fixed_fd_remove(ctx, fds[0]); if (fds[1] !=3D -1) io_fixed_fd_remove(ctx, fds[1]); - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); return ret; } =20 static int io_pipe_fd(struct io_kiocb *req, struct file **files) { diff --git a/io_uring/poll.c b/io_uring/poll.c index aac4b3b881fb..9e82315f977b 100644 --- a/io_uring/poll.c +++ b/io_uring/poll.c @@ -121,11 +121,11 @@ static struct io_poll *io_poll_get_single(struct io_k= iocb *req) static void io_poll_req_insert(struct io_kiocb *req) { struct io_hash_table *table =3D &req->ctx->cancel_table; u32 index =3D hash_long(req->cqe.user_data, table->hash_bits); =20 - lockdep_assert_held(&req->ctx->uring_lock); + io_ring_ctx_assert_locked(req->ctx); =20 hlist_add_head(&req->hash_node, &table->hbs[index].list); } =20 static void io_init_poll_iocb(struct io_poll *poll, __poll_t events) @@ -339,11 +339,11 @@ void io_poll_task_func(struct io_tw_req tw_req, io_tw= _token_t tw) } else if (ret =3D=3D IOU_POLL_REQUEUE) { __io_poll_execute(req, 0); return; } io_poll_remove_entries(req); - /* task_work always has ->uring_lock held */ + /* task_work always holds ctx uring lock */ hash_del(&req->hash_node); =20 if (req->opcode =3D=3D IORING_OP_POLL_ADD) { if (ret =3D=3D IOU_POLL_DONE) { struct io_poll *poll; @@ -525,15 +525,16 @@ static bool io_poll_can_finish_inline(struct io_kiocb= *req, return pt->owning || io_poll_get_ownership(req); } =20 static void io_poll_add_hash(struct io_kiocb *req, unsigned int issue_flag= s) { + struct io_ring_ctx_lock_state lock_state; struct io_ring_ctx *ctx =3D req->ctx; =20 - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); io_poll_req_insert(req); - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); } =20 /* * Returns 0 when it's handed over for polling. The caller owns the reques= ts if * it returns non-zero, but otherwise should not touch it. 
Negative values @@ -728,11 +729,11 @@ __cold bool io_poll_remove_all(struct io_ring_ctx *ct= x, struct io_uring_task *tc struct hlist_node *tmp; struct io_kiocb *req; bool found =3D false; int i; =20 - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); =20 for (i =3D 0; i < nr_buckets; i++) { struct io_hash_bucket *hb =3D &ctx->cancel_table.hbs[i]; =20 hlist_for_each_entry_safe(req, tmp, &hb->list, hash_node) { @@ -814,15 +815,16 @@ static int __io_poll_cancel(struct io_ring_ctx *ctx, = struct io_cancel_data *cd) } =20 int io_poll_cancel(struct io_ring_ctx *ctx, struct io_cancel_data *cd, unsigned issue_flags) { + struct io_ring_ctx_lock_state lock_state; int ret; =20 - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); ret =3D __io_poll_cancel(ctx, cd); - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); return ret; } =20 static __poll_t io_poll_parse_events(const struct io_uring_sqe *sqe, unsigned int flags) @@ -905,16 +907,17 @@ int io_poll_add(struct io_kiocb *req, unsigned int is= sue_flags) } =20 int io_poll_remove(struct io_kiocb *req, unsigned int issue_flags) { struct io_poll_update *poll_update =3D io_kiocb_to_cmd(req, struct io_pol= l_update); + struct io_ring_ctx_lock_state lock_state; struct io_ring_ctx *ctx =3D req->ctx; struct io_cancel_data cd =3D { .ctx =3D ctx, .data =3D poll_update->old_u= ser_data, }; struct io_kiocb *preq; int ret2, ret =3D 0; =20 - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); preq =3D io_poll_find(ctx, true, &cd); ret2 =3D io_poll_disarm(preq); if (ret2) { ret =3D ret2; goto out; @@ -950,11 +953,11 @@ int io_poll_remove(struct io_kiocb *req, unsigned int= issue_flags) if (preq->cqe.res < 0) req_set_fail(preq); preq->io_task_work.func =3D io_req_task_complete; io_req_task_work_add(preq); out: - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); if (ret < 0) { req_set_fail(req); return ret; } /* complete update request, we're done with it */ diff --git a/io_uring/register.c b/io_uring/register.c index 9e473c244041..da5030bcae2f 100644 --- a/io_uring/register.c +++ b/io_uring/register.c @@ -197,28 +197,30 @@ static int io_register_enable_rings(struct io_ring_ct= x *ctx) if (ctx->sq_data && wq_has_sleeper(&ctx->sq_data->wait)) wake_up(&ctx->sq_data->wait); return 0; } =20 -static __cold int __io_register_iowq_aff(struct io_ring_ctx *ctx, - cpumask_var_t new_mask) +static __cold int +__io_register_iowq_aff(struct io_ring_ctx *ctx, cpumask_var_t new_mask, + struct io_ring_ctx_lock_state *lock_state) { int ret; =20 if (!(ctx->flags & IORING_SETUP_SQPOLL)) { ret =3D io_wq_cpu_affinity(current->io_uring, new_mask); } else { - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, lock_state); ret =3D io_sqpoll_wq_cpu_affinity(ctx, new_mask); - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, lock_state); } =20 return ret; } =20 -static __cold int io_register_iowq_aff(struct io_ring_ctx *ctx, - void __user *arg, unsigned len) +static __cold int +io_register_iowq_aff(struct io_ring_ctx *ctx, void __user *arg, unsigned l= en, + struct io_ring_ctx_lock_state *lock_state) { cpumask_var_t new_mask; int ret; =20 if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) @@ -240,30 +242,34 @@ static __cold int io_register_iowq_aff(struct io_ring= _ctx *ctx, if (ret) { free_cpumask_var(new_mask); return -EFAULT; } =20 - ret =3D __io_register_iowq_aff(ctx, new_mask); + ret 
=3D __io_register_iowq_aff(ctx, new_mask, lock_state); free_cpumask_var(new_mask); return ret; } =20 -static __cold int io_unregister_iowq_aff(struct io_ring_ctx *ctx) +static __cold int +io_unregister_iowq_aff(struct io_ring_ctx *ctx, + struct io_ring_ctx_lock_state *lock_state) { - return __io_register_iowq_aff(ctx, NULL); + return __io_register_iowq_aff(ctx, NULL, lock_state); } =20 -static __cold int io_register_iowq_max_workers(struct io_ring_ctx *ctx, - void __user *arg) - __must_hold(&ctx->uring_lock) +static __cold int +io_register_iowq_max_workers(struct io_ring_ctx *ctx, void __user *arg, + struct io_ring_ctx_lock_state *lock_state) { struct io_tctx_node *node; struct io_uring_task *tctx =3D NULL; struct io_sq_data *sqd =3D NULL; __u32 new_count[2]; int i, ret; =20 + io_ring_ctx_assert_locked(ctx); + if (copy_from_user(new_count, arg, sizeof(new_count))) return -EFAULT; for (i =3D 0; i < ARRAY_SIZE(new_count); i++) if (new_count[i] > INT_MAX) return -EINVAL; @@ -272,18 +278,18 @@ static __cold int io_register_iowq_max_workers(struct= io_ring_ctx *ctx, sqd =3D ctx->sq_data; if (sqd) { struct task_struct *tsk; =20 /* - * Observe the correct sqd->lock -> ctx->uring_lock - * ordering. Fine to drop uring_lock here, we hold + * Observe the correct sqd->lock -> ctx uring lock + * ordering. Fine to drop ctx uring lock here, we hold * a ref to the ctx. */ refcount_inc(&sqd->refs); - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, lock_state); mutex_lock(&sqd->lock); - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, lock_state); tsk =3D sqpoll_task_locked(sqd); if (tsk) tctx =3D tsk->io_uring; } } else { @@ -304,14 +310,14 @@ static __cold int io_register_iowq_max_workers(struct= io_ring_ctx *ctx, } else { memset(new_count, 0, sizeof(new_count)); } =20 if (sqd) { - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, lock_state); mutex_unlock(&sqd->lock); io_put_sq_data(sqd); - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, lock_state); } =20 if (copy_to_user(arg, new_count, sizeof(new_count))) return -EFAULT; =20 @@ -331,14 +337,14 @@ static __cold int io_register_iowq_max_workers(struct= io_ring_ctx *ctx, (void)io_wq_max_workers(tctx->io_wq, new_count); } return 0; err: if (sqd) { - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, lock_state); mutex_unlock(&sqd->lock); io_put_sq_data(sqd); - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, lock_state); } return ret; } =20 static int io_register_clock(struct io_ring_ctx *ctx, @@ -394,11 +400,12 @@ static void io_register_free_rings(struct io_ring_ctx= *ctx, #define RESIZE_FLAGS (IORING_SETUP_CQSIZE | IORING_SETUP_CLAMP) #define COPY_FLAGS (IORING_SETUP_NO_SQARRAY | IORING_SETUP_SQE128 | \ IORING_SETUP_CQE32 | IORING_SETUP_NO_MMAP | \ IORING_SETUP_CQE_MIXED | IORING_SETUP_SQE_MIXED) =20 -static int io_register_resize_rings(struct io_ring_ctx *ctx, void __user *= arg) +static int io_register_resize_rings(struct io_ring_ctx *ctx, void __user *= arg, + struct io_ring_ctx_lock_state *lock_state) { struct io_ctx_config config; struct io_uring_region_desc rd; struct io_ring_ctx_rings o =3D { }, n =3D { }, *to_free =3D NULL; unsigned i, tail, old_head; @@ -468,13 +475,13 @@ static int io_register_resize_rings(struct io_ring_ct= x *ctx, void __user *arg) =20 /* * If using SQPOLL, park the thread */ if (ctx->sq_data) { - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, lock_state); io_sq_thread_park(ctx->sq_data); - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, lock_state); } =20 /* * We'll 
do the swap. Grab the ctx->mmap_lock, which will exclude * any new mmap's on the ring fd. Clear out existing mappings to prevent @@ -605,13 +612,12 @@ static int io_register_mem_region(struct io_ring_ctx = *ctx, void __user *uarg) io_region_publish(ctx, ®ion, &ctx->param_region); return 0; } =20 static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, - void __user *arg, unsigned nr_args) - __releases(ctx->uring_lock) - __acquires(ctx->uring_lock) + void __user *arg, unsigned nr_args, + struct io_ring_ctx_lock_state *lock_state) { int ret; =20 /* * We don't quiesce the refs for register anymore and so it can't be @@ -718,26 +724,26 @@ static int __io_uring_register(struct io_ring_ctx *ct= x, unsigned opcode, break; case IORING_REGISTER_IOWQ_AFF: ret =3D -EINVAL; if (!arg || !nr_args) break; - ret =3D io_register_iowq_aff(ctx, arg, nr_args); + ret =3D io_register_iowq_aff(ctx, arg, nr_args, lock_state); break; case IORING_UNREGISTER_IOWQ_AFF: ret =3D -EINVAL; if (arg || nr_args) break; - ret =3D io_unregister_iowq_aff(ctx); + ret =3D io_unregister_iowq_aff(ctx, lock_state); break; case IORING_REGISTER_IOWQ_MAX_WORKERS: ret =3D -EINVAL; if (!arg || nr_args !=3D 2) break; - ret =3D io_register_iowq_max_workers(ctx, arg); + ret =3D io_register_iowq_max_workers(ctx, arg, lock_state); break; case IORING_REGISTER_RING_FDS: - ret =3D io_ringfd_register(ctx, arg, nr_args); + ret =3D io_ringfd_register(ctx, arg, nr_args, lock_state); break; case IORING_UNREGISTER_RING_FDS: ret =3D io_ringfd_unregister(ctx, arg, nr_args); break; case IORING_REGISTER_PBUF_RING: @@ -754,11 +760,11 @@ static int __io_uring_register(struct io_ring_ctx *ct= x, unsigned opcode, break; case IORING_REGISTER_SYNC_CANCEL: ret =3D -EINVAL; if (!arg || nr_args !=3D 1) break; - ret =3D io_sync_cancel(ctx, arg); + ret =3D io_sync_cancel(ctx, arg, lock_state); break; case IORING_REGISTER_FILE_ALLOC_RANGE: ret =3D -EINVAL; if (!arg || nr_args) break; @@ -790,11 +796,11 @@ static int __io_uring_register(struct io_ring_ctx *ct= x, unsigned opcode, break; case IORING_REGISTER_CLONE_BUFFERS: ret =3D -EINVAL; if (!arg || nr_args !=3D 1) break; - ret =3D io_register_clone_buffers(ctx, arg); + ret =3D io_register_clone_buffers(ctx, arg, lock_state); break; case IORING_REGISTER_ZCRX_IFQ: ret =3D -EINVAL; if (!arg || nr_args !=3D 1) break; @@ -802,11 +808,11 @@ static int __io_uring_register(struct io_ring_ctx *ct= x, unsigned opcode, break; case IORING_REGISTER_RESIZE_RINGS: ret =3D -EINVAL; if (!arg || nr_args !=3D 1) break; - ret =3D io_register_resize_rings(ctx, arg); + ret =3D io_register_resize_rings(ctx, arg, lock_state); break; case IORING_REGISTER_MEM_REGION: ret =3D -EINVAL; if (!arg || nr_args !=3D 1) break; @@ -894,10 +900,11 @@ static int io_uring_register_blind(unsigned int opcod= e, void __user *arg, } =20 SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode, void __user *, arg, unsigned int, nr_args) { + struct io_ring_ctx_lock_state lock_state; struct io_ring_ctx *ctx; long ret =3D -EBADF; struct file *file; bool use_registered_ring; =20 @@ -913,15 +920,15 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, = unsigned int, opcode, file =3D io_uring_register_get_file(fd, use_registered_ring); if (IS_ERR(file)) return PTR_ERR(file); ctx =3D file->private_data; =20 - mutex_lock(&ctx->uring_lock); - ret =3D __io_uring_register(ctx, opcode, arg, nr_args); + io_ring_ctx_lock(ctx, &lock_state); + ret =3D __io_uring_register(ctx, opcode, arg, nr_args, &lock_state); =20 
trace_io_uring_register(ctx, opcode, ctx->file_table.data.nr, ctx->buf_table.nr, ret); - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, &lock_state); =20 fput(file); return ret; } diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c index a63474b331bf..93bebedf86eb 100644 --- a/io_uring/rsrc.c +++ b/io_uring/rsrc.c @@ -349,11 +349,11 @@ static int __io_register_rsrc_update(struct io_ring_c= tx *ctx, unsigned type, struct io_uring_rsrc_update2 *up, unsigned nr_args) { __u32 tmp; =20 - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); =20 if (check_add_overflow(up->offset, nr_args, &tmp)) return -EOVERFLOW; =20 switch (type) { @@ -497,14 +497,16 @@ int io_files_update(struct io_kiocb *req, unsigned in= t issue_flags) up2.resv2 =3D 0; =20 if (up->offset =3D=3D IORING_FILE_INDEX_ALLOC) { ret =3D io_files_update_with_index_alloc(req, issue_flags); } else { - io_ring_submit_lock(ctx, issue_flags); + struct io_ring_ctx_lock_state lock_state; + + io_ring_submit_lock(ctx, issue_flags, &lock_state); ret =3D __io_register_rsrc_update(ctx, IORING_RSRC_FILE, &up2, up->nr_args); - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); } =20 if (ret < 0) req_set_fail(req); io_req_set_res(req, ret, 0); @@ -940,18 +942,19 @@ int io_buffer_register_bvec(struct io_uring_cmd *cmd,= struct request *rq, void (*release)(void *), unsigned int index, unsigned int issue_flags) { struct io_ring_ctx *ctx =3D cmd_to_io_kiocb(cmd)->ctx; struct io_rsrc_data *data =3D &ctx->buf_table; + struct io_ring_ctx_lock_state lock_state; struct req_iterator rq_iter; struct io_mapped_ubuf *imu; struct io_rsrc_node *node; struct bio_vec bv; unsigned int nr_bvecs =3D 0; int ret =3D 0; =20 - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); if (index >=3D data->nr) { ret =3D -EINVAL; goto unlock; } index =3D array_index_nospec(index, data->nr); @@ -993,24 +996,25 @@ int io_buffer_register_bvec(struct io_uring_cmd *cmd,= struct request *rq, imu->nr_bvecs =3D nr_bvecs; =20 node->buf =3D imu; data->nodes[index] =3D node; unlock: - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); return ret; } EXPORT_SYMBOL_GPL(io_buffer_register_bvec); =20 int io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index, unsigned int issue_flags) { struct io_ring_ctx *ctx =3D cmd_to_io_kiocb(cmd)->ctx; struct io_rsrc_data *data =3D &ctx->buf_table; + struct io_ring_ctx_lock_state lock_state; struct io_rsrc_node *node; int ret =3D 0; =20 - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); if (index >=3D data->nr) { ret =3D -EINVAL; goto unlock; } index =3D array_index_nospec(index, data->nr); @@ -1026,11 +1030,11 @@ int io_buffer_unregister_bvec(struct io_uring_cmd *= cmd, unsigned int index, } =20 io_put_rsrc_node(ctx, node); data->nodes[index] =3D NULL; unlock: - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); return ret; } EXPORT_SYMBOL_GPL(io_buffer_unregister_bvec); =20 static int validate_fixed_range(u64 buf_addr, size_t len, @@ -1117,27 +1121,28 @@ static int io_import_fixed(int ddir, struct iov_ite= r *iter, } =20 inline struct io_rsrc_node *io_find_buf_node(struct io_kiocb *req, unsigned issue_flags) { + struct io_ring_ctx_lock_state lock_state; struct io_ring_ctx *ctx =3D req->ctx; struct io_rsrc_node *node; =20 if (req->flags & REQ_F_BUF_NODE) return req->buf_node; req->flags |=3D 
REQ_F_BUF_NODE; =20 - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); node =3D io_rsrc_node_lookup(&ctx->buf_table, req->buf_index); if (node) { node->refs++; req->buf_node =3D node; - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); return node; } req->flags &=3D ~REQ_F_BUF_NODE; - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); return NULL; } =20 int io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter, u64 buf_addr, size_t len, int ddir, @@ -1150,28 +1155,32 @@ int io_import_reg_buf(struct io_kiocb *req, struct = iov_iter *iter, return -EFAULT; return io_import_fixed(ddir, iter, node->buf, buf_addr, len); } =20 /* Lock two rings at once. The rings must be different! */ -static void lock_two_rings(struct io_ring_ctx *ctx1, struct io_ring_ctx *c= tx2) +static void lock_two_rings(struct io_ring_ctx *ctx1, struct io_ring_ctx *c= tx2, + struct io_ring_ctx_lock_state *lock_state1, + struct io_ring_ctx_lock_state *lock_state2) { - if (ctx1 > ctx2) + if (ctx1 > ctx2) { swap(ctx1, ctx2); - mutex_lock(&ctx1->uring_lock); - mutex_lock_nested(&ctx2->uring_lock, SINGLE_DEPTH_NESTING); + swap(lock_state1, lock_state2); + } + io_ring_ctx_lock(ctx1, lock_state1); + io_ring_ctx_lock_nested(ctx2, SINGLE_DEPTH_NESTING, lock_state2); } =20 /* Both rings are locked by the caller. */ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *s= rc_ctx, struct io_uring_clone_buffers *arg) { struct io_rsrc_data data; int i, ret, off, nr; unsigned int nbufs; =20 - lockdep_assert_held(&ctx->uring_lock); - lockdep_assert_held(&src_ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); + io_ring_ctx_assert_locked(src_ctx); =20 /* * Accounting state is shared between the two rings; that only works if * both rings are accounted towards the same counters. */ @@ -1271,12 +1280,14 @@ static int io_clone_buffers(struct io_ring_ctx *ctx= , struct io_ring_ctx *src_ctx * is given in the src_fd to the current ring. This is identical to regist= ering * the buffers with ctx, except faster as mappings already exist. * * Since the memory is already accounted once, don't account it again. 
*/ -int io_register_clone_buffers(struct io_ring_ctx *ctx, void __user *arg) +int io_register_clone_buffers(struct io_ring_ctx *ctx, void __user *arg, + struct io_ring_ctx_lock_state *lock_state) { + struct io_ring_ctx_lock_state lock_state2; struct io_uring_clone_buffers buf; struct io_ring_ctx *src_ctx; bool registered_src; struct file *file; int ret; @@ -1295,12 +1306,12 @@ int io_register_clone_buffers(struct io_ring_ctx *c= tx, void __user *arg) if (IS_ERR(file)) return PTR_ERR(file); =20 src_ctx =3D file->private_data; if (src_ctx !=3D ctx) { - mutex_unlock(&ctx->uring_lock); - lock_two_rings(ctx, src_ctx); + io_ring_ctx_unlock(ctx, lock_state); + lock_two_rings(ctx, src_ctx, lock_state, &lock_state2); =20 if (src_ctx->submitter_task && src_ctx->submitter_task !=3D current) { ret =3D -EEXIST; goto out; @@ -1309,11 +1320,11 @@ int io_register_clone_buffers(struct io_ring_ctx *c= tx, void __user *arg) =20 ret =3D io_clone_buffers(ctx, src_ctx, &buf); =20 out: if (src_ctx !=3D ctx) - mutex_unlock(&src_ctx->uring_lock); + io_ring_ctx_unlock(src_ctx, &lock_state2); =20 fput(file); return ret; } =20 diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h index d603f6a47f5e..388a0508ec59 100644 --- a/io_uring/rsrc.h +++ b/io_uring/rsrc.h @@ -2,10 +2,11 @@ #ifndef IOU_RSRC_H #define IOU_RSRC_H =20 #include #include +#include "io_uring.h" =20 #define IO_VEC_CACHE_SOFT_CAP 256 =20 enum { IORING_RSRC_FILE =3D 0, @@ -68,11 +69,12 @@ int io_import_reg_vec(int ddir, struct iov_iter *iter, struct io_kiocb *req, struct iou_vec *vec, unsigned nr_iovs, unsigned issue_flags); int io_prep_reg_iovec(struct io_kiocb *req, struct iou_vec *iv, const struct iovec __user *uvec, size_t uvec_segs); =20 -int io_register_clone_buffers(struct io_ring_ctx *ctx, void __user *arg); +int io_register_clone_buffers(struct io_ring_ctx *ctx, void __user *arg, + struct io_ring_ctx_lock_state *lock_state); int io_sqe_buffers_unregister(struct io_ring_ctx *ctx); int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg, unsigned int nr_args, u64 __user *tags); int io_sqe_files_unregister(struct io_ring_ctx *ctx); int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, @@ -97,11 +99,11 @@ static inline struct io_rsrc_node *io_rsrc_node_lookup(= struct io_rsrc_data *data return NULL; } =20 static inline void io_put_rsrc_node(struct io_ring_ctx *ctx, struct io_rsr= c_node *node) { - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); if (!--node->refs) io_free_rsrc_node(ctx, node); } =20 static inline bool io_reset_rsrc_node(struct io_ring_ctx *ctx, diff --git a/io_uring/rw.c b/io_uring/rw.c index 331af6bf4234..4688b210cff8 100644 --- a/io_uring/rw.c +++ b/io_uring/rw.c @@ -462,11 +462,11 @@ int io_read_mshot_prep(struct io_kiocb *req, const st= ruct io_uring_sqe *sqe) =20 void io_readv_writev_cleanup(struct io_kiocb *req) { struct io_async_rw *rw =3D req->async_data; =20 - lockdep_assert_held(&req->ctx->uring_lock); + io_ring_ctx_assert_locked(req->ctx); io_vec_free(&rw->vec); io_rw_recycle(req, 0); } =20 static inline loff_t *io_kiocb_update_pos(struct io_kiocb *req) diff --git a/io_uring/splice.c b/io_uring/splice.c index e81ebbb91925..567695c39091 100644 --- a/io_uring/splice.c +++ b/io_uring/splice.c @@ -58,26 +58,27 @@ void io_splice_cleanup(struct io_kiocb *req) =20 static struct file *io_splice_get_file(struct io_kiocb *req, unsigned int issue_flags) { struct io_splice *sp =3D io_kiocb_to_cmd(req, struct io_splice); + struct io_ring_ctx_lock_state lock_state; struct io_ring_ctx *ctx 
=3D req->ctx; struct io_rsrc_node *node; struct file *file =3D NULL; =20 if (!(sp->flags & SPLICE_F_FD_IN_FIXED)) return io_file_get_normal(req, sp->splice_fd_in); =20 - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); node =3D io_rsrc_node_lookup(&ctx->file_table.data, sp->splice_fd_in); if (node) { node->refs++; sp->rsrc_node =3D node; file =3D io_slot_file(node); req->flags |=3D REQ_F_NEED_CLEANUP; } - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); return file; } =20 int io_tee(struct io_kiocb *req, unsigned int issue_flags) { diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c index 74c1a130cd87..0b4573b53cf3 100644 --- a/io_uring/sqpoll.c +++ b/io_uring/sqpoll.c @@ -211,29 +211,30 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, st= ruct io_sq_data *sqd, /* if we're handling multiple rings, cap submit size for fairness */ if (cap_entries && to_submit > IORING_SQPOLL_CAP_ENTRIES_VALUE) to_submit =3D IORING_SQPOLL_CAP_ENTRIES_VALUE; =20 if (to_submit || !wq_list_empty(&ctx->iopoll_list)) { + struct io_ring_ctx_lock_state lock_state; const struct cred *creds =3D NULL; =20 io_sq_start_worktime(ist); =20 if (ctx->sq_creds !=3D current_cred()) creds =3D override_creds(ctx->sq_creds); =20 - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, &lock_state); if (!wq_list_empty(&ctx->iopoll_list)) io_do_iopoll(ctx, true); =20 /* * Don't submit if refs are dying, good for io_uring_register(), * but also it is relied upon by io_ring_exit_work() */ if (to_submit && likely(!percpu_ref_is_dying(&ctx->refs)) && !(ctx->flags & IORING_SETUP_R_DISABLED)) ret =3D io_submit_sqes(ctx, to_submit); - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, &lock_state); =20 if (to_submit && wq_has_sleeper(&ctx->sqo_sq_wait)) wake_up(&ctx->sqo_sq_wait); if (creds) revert_creds(creds); diff --git a/io_uring/tctx.c b/io_uring/tctx.c index 5b66755579c0..add6134e934d 100644 --- a/io_uring/tctx.c +++ b/io_uring/tctx.c @@ -13,27 +13,28 @@ #include "tctx.h" =20 static struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx, struct task_struct *task) { + struct io_ring_ctx_lock_state lock_state; struct io_wq_hash *hash; struct io_wq_data data; unsigned int concurrency; =20 - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, &lock_state); hash =3D ctx->hash_map; if (!hash) { hash =3D kzalloc(sizeof(*hash), GFP_KERNEL); if (!hash) { - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, &lock_state); return ERR_PTR(-ENOMEM); } refcount_set(&hash->refs, 1); init_waitqueue_head(&hash->wait); ctx->hash_map =3D hash; } - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, &lock_state); =20 data.hash =3D hash; data.task =3D task; =20 /* Do QD, or 4 * CPUS, whatever is smallest */ @@ -121,10 +122,12 @@ int __io_uring_add_tctx_node(struct io_ring_ctx *ctx) if (ret) return ret; } } if (!xa_load(&tctx->xa, (unsigned long)ctx)) { + struct io_ring_ctx_lock_state lock_state; + node =3D kmalloc(sizeof(*node), GFP_KERNEL); if (!node) return -ENOMEM; node->ctx =3D ctx; node->task =3D current; @@ -134,13 +137,13 @@ int __io_uring_add_tctx_node(struct io_ring_ctx *ctx) if (ret) { kfree(node); return ret; } =20 - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, &lock_state); list_add(&node->ctx_node, &ctx->tctx_list); - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, &lock_state); } return 0; } =20 int __io_uring_add_tctx_node_from_submit(struct io_ring_ctx *ctx) @@ -163,10 +166,11 @@ int 
__io_uring_add_tctx_node_from_submit(struct io_ri= ng_ctx *ctx) * Remove this io_uring_file -> task mapping. */ __cold void io_uring_del_tctx_node(unsigned long index) { struct io_uring_task *tctx =3D current->io_uring; + struct io_ring_ctx_lock_state lock_state; struct io_tctx_node *node; =20 if (!tctx) return; node =3D xa_erase(&tctx->xa, index); @@ -174,13 +178,13 @@ __cold void io_uring_del_tctx_node(unsigned long inde= x) return; =20 WARN_ON_ONCE(current !=3D node->task); WARN_ON_ONCE(list_empty(&node->ctx_node)); =20 - mutex_lock(&node->ctx->uring_lock); + io_ring_ctx_lock(node->ctx, &lock_state); list_del(&node->ctx_node); - mutex_unlock(&node->ctx->uring_lock); + io_ring_ctx_unlock(node->ctx, &lock_state); =20 if (tctx->last =3D=3D node->ctx) tctx->last =3D NULL; kfree(node); } @@ -196,11 +200,11 @@ __cold void io_uring_clean_tctx(struct io_uring_task = *tctx) cond_resched(); } if (wq) { /* * Must be after io_uring_del_tctx_node() (removes nodes under - * uring_lock) to avoid race with io_uring_try_cancel_iowq(). + * ctx uring lock) to avoid race with io_uring_try_cancel_iowq() */ io_wq_put_and_exit(wq); tctx->io_wq =3D NULL; } } @@ -259,23 +263,24 @@ static int io_ring_add_registered_fd(struct io_uring_= task *tctx, int fd, * index. If no index is desired, application may set ->offset =3D=3D -1U * and we'll find an available index. Returns number of entries * successfully processed, or < 0 on error if none were processed. */ int io_ringfd_register(struct io_ring_ctx *ctx, void __user *__arg, - unsigned nr_args) + unsigned nr_args, + struct io_ring_ctx_lock_state *lock_state) { struct io_uring_rsrc_update __user *arg =3D __arg; struct io_uring_rsrc_update reg; struct io_uring_task *tctx; int ret, i; =20 if (!nr_args || nr_args > IO_RINGFD_REG_MAX) return -EINVAL; =20 - mutex_unlock(&ctx->uring_lock); + io_ring_ctx_unlock(ctx, lock_state); ret =3D __io_uring_add_tctx_node(ctx); - mutex_lock(&ctx->uring_lock); + io_ring_ctx_lock(ctx, lock_state); if (ret) return ret; =20 tctx =3D current->io_uring; for (i =3D 0; i < nr_args; i++) { diff --git a/io_uring/tctx.h b/io_uring/tctx.h index 608e96de70a2..f35dbf19bb80 100644 --- a/io_uring/tctx.h +++ b/io_uring/tctx.h @@ -1,7 +1,9 @@ // SPDX-License-Identifier: GPL-2.0 =20 +#include "io_uring.h" + struct io_tctx_node { struct list_head ctx_node; struct task_struct *task; struct io_ring_ctx *ctx; }; @@ -13,11 +15,12 @@ int __io_uring_add_tctx_node(struct io_ring_ctx *ctx); int __io_uring_add_tctx_node_from_submit(struct io_ring_ctx *ctx); void io_uring_clean_tctx(struct io_uring_task *tctx); =20 void io_uring_unreg_ringfd(void); int io_ringfd_register(struct io_ring_ctx *ctx, void __user *__arg, - unsigned nr_args); + unsigned nr_args, + struct io_ring_ctx_lock_state *lock_state); int io_ringfd_unregister(struct io_ring_ctx *ctx, void __user *__arg, unsigned nr_args); =20 /* * Note that this task has used io_uring. We use it for cancelation purpos= es. 
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c index 197474911f04..a8a128a3f0a2 100644 --- a/io_uring/uring_cmd.c +++ b/io_uring/uring_cmd.c @@ -51,11 +51,11 @@ bool io_uring_try_cancel_uring_cmd(struct io_ring_ctx *= ctx, { struct hlist_node *tmp; struct io_kiocb *req; bool ret =3D false; =20 - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); =20 hlist_for_each_entry_safe(req, tmp, &ctx->cancelable_uring_cmd, hash_node) { struct io_uring_cmd *cmd =3D io_kiocb_to_cmd(req, struct io_uring_cmd); @@ -76,19 +76,20 @@ bool io_uring_try_cancel_uring_cmd(struct io_ring_ctx *= ctx, =20 static void io_uring_cmd_del_cancelable(struct io_uring_cmd *cmd, unsigned int issue_flags) { struct io_kiocb *req =3D cmd_to_io_kiocb(cmd); + struct io_ring_ctx_lock_state lock_state; struct io_ring_ctx *ctx =3D req->ctx; =20 if (!(cmd->flags & IORING_URING_CMD_CANCELABLE)) return; =20 cmd->flags &=3D ~IORING_URING_CMD_CANCELABLE; - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); hlist_del(&req->hash_node); - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); } =20 /* * Mark this command as concelable, then io_uring_try_cancel_uring_cmd() * will try to cancel this issued command by sending ->uring_cmd() with @@ -103,14 +104,16 @@ void io_uring_cmd_mark_cancelable(struct io_uring_cmd= *cmd, { struct io_kiocb *req =3D cmd_to_io_kiocb(cmd); struct io_ring_ctx *ctx =3D req->ctx; =20 if (!(cmd->flags & IORING_URING_CMD_CANCELABLE)) { + struct io_ring_ctx_lock_state lock_state; + cmd->flags |=3D IORING_URING_CMD_CANCELABLE; - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); hlist_add_head(&req->hash_node, &ctx->cancelable_uring_cmd); - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); } } EXPORT_SYMBOL_GPL(io_uring_cmd_mark_cancelable); =20 void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd, diff --git a/io_uring/waitid.c b/io_uring/waitid.c index 2d4cbd47c67c..a69eb1b30b89 100644 --- a/io_uring/waitid.c +++ b/io_uring/waitid.c @@ -130,11 +130,11 @@ static void io_waitid_complete(struct io_kiocb *req, = int ret) struct io_waitid *iw =3D io_kiocb_to_cmd(req, struct io_waitid); =20 /* anyone completing better be holding a reference */ WARN_ON_ONCE(!(atomic_read(&iw->refs) & IO_WAITID_REF_MASK)); =20 - lockdep_assert_held(&req->ctx->uring_lock); + io_ring_ctx_assert_locked(req->ctx); =20 hlist_del_init(&req->hash_node); io_waitid_remove_wq(req); =20 ret =3D io_waitid_finish(req, ret); @@ -145,11 +145,11 @@ static void io_waitid_complete(struct io_kiocb *req, = int ret) =20 static bool __io_waitid_cancel(struct io_kiocb *req) { struct io_waitid *iw =3D io_kiocb_to_cmd(req, struct io_waitid); =20 - lockdep_assert_held(&req->ctx->uring_lock); + io_ring_ctx_assert_locked(req->ctx); =20 /* * Mark us canceled regardless of ownership. This will prevent a * potential retry from a spurious wakeup. 
*/ @@ -280,10 +280,11 @@ int io_waitid_prep(struct io_kiocb *req, const struct= io_uring_sqe *sqe) =20 int io_waitid(struct io_kiocb *req, unsigned int issue_flags) { struct io_waitid *iw =3D io_kiocb_to_cmd(req, struct io_waitid); struct io_waitid_async *iwa =3D req->async_data; + struct io_ring_ctx_lock_state lock_state; struct io_ring_ctx *ctx =3D req->ctx; int ret; =20 ret =3D kernel_waitid_prepare(&iwa->wo, iw->which, iw->upid, &iw->info, iw->options, NULL); @@ -301,11 +302,11 @@ int io_waitid(struct io_kiocb *req, unsigned int issu= e_flags) * Cancel must hold the ctx lock, so there's no risk of cancelation * finding us until a) we remain on the list, and b) the lock is * dropped. We only need to worry about racing with the wakeup * callback. */ - io_ring_submit_lock(ctx, issue_flags); + io_ring_submit_lock(ctx, issue_flags, &lock_state); =20 /* * iw->head is valid under the ring lock, and as long as the request * is on the waitid_list where cancelations may find it. */ @@ -321,27 +322,27 @@ int io_waitid(struct io_kiocb *req, unsigned int issu= e_flags) /* * Nobody else grabbed a reference, it'll complete when we get * a waitqueue callback, or if someone cancels it. */ if (!io_waitid_drop_issue_ref(req)) { - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); return IOU_ISSUE_SKIP_COMPLETE; } =20 /* * Wakeup triggered, racing with us. It was prevented from * completing because of that, queue up the tw to do that. */ - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); return IOU_ISSUE_SKIP_COMPLETE; } =20 hlist_del_init(&req->hash_node); io_waitid_remove_wq(req); ret =3D io_waitid_finish(req, ret); =20 - io_ring_submit_unlock(ctx, issue_flags); + io_ring_submit_unlock(ctx, issue_flags, &lock_state); done: if (ret < 0) req_set_fail(req); io_req_set_res(req, ret, 0); return IOU_COMPLETE; diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c index b99cf2c6670a..f2ed49bbad63 100644 --- a/io_uring/zcrx.c +++ b/io_uring/zcrx.c @@ -851,11 +851,11 @@ static struct net_iov *__io_zcrx_get_free_niov(struct= io_zcrx_area *area) =20 void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx) { struct io_zcrx_ifq *ifq; =20 - lockdep_assert_held(&ctx->uring_lock); + io_ring_ctx_assert_locked(ctx); =20 while (1) { scoped_guard(mutex, &ctx->mmap_lock) { unsigned long id =3D 0; =20 --=20 2.45.2 From nobody Tue Dec 16 14:36:45 2025 Received: from mail-pl1-f227.google.com (mail-pl1-f227.google.com [209.85.214.227]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 93F39308F07 for ; Mon, 15 Dec 2025 20:10:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.227 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1765829431; cv=none; b=PXXMBiuWslX++huzukBpcXv2ASRECiSmqo5cygQiy8NK+8aZ0KIgSkJm8szu2tNz4wQ8BYUyBTFGEGPwgceUobRN9b/18UCRNNkGuCOLCmxThA1OGpE7qPd7/RnyYMnSsYSLji8dDC6+ZA8zFaTWn8LQgZ9OXjIBL3mtckr77s4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1765829431; c=relaxed/simple; bh=Th3x1GnVLvyWsIs7o2mH3MO7rmVc9JYDl4IXg7VFffc=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=NzXT/Fc48YJHi1CGkvuADPQQnHggIXKY3gKu+Ckc7ecYMKqzDkxzh+f+QzNEbTwAm7E5OTbwAYiM/Q912eX+hzbvCGt1l/IydoJfFFBy8N7BIUGCkJnJNwptHqUHk8NxJ7kelL/9EdR/huN0r2P27pJfM+IqMAWy8fh2LjBnTUU= ARC-Authentication-Results: i=1; 
smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=purestorage.com; spf=fail smtp.mailfrom=purestorage.com; dkim=pass (2048-bit key) header.d=purestorage.com header.i=@purestorage.com header.b=CQgFxq9+; arc=none smtp.client-ip=209.85.214.227 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=purestorage.com Authentication-Results: smtp.subspace.kernel.org; spf=fail smtp.mailfrom=purestorage.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=purestorage.com header.i=@purestorage.com header.b="CQgFxq9+" Received: by mail-pl1-f227.google.com with SMTP id d9443c01a7336-2a08c65fceeso4504715ad.2 for ; Mon, 15 Dec 2025 12:10:29 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=purestorage.com; s=google2022; t=1765829429; x=1766434229; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=Ko/LClSHTdkG76HWn8DkHdX7BBXIyp6DFUvzQNarTeQ=; b=CQgFxq9+TSpOX3XyaEdLPrdGb4q+/TRFyBcJOh2GuCWiN3TVfcVp8uEW0+O1uLHrMg TrX0pd2HxJRamiu/FhvF7gndfsM6IFqJo4Zd6wyqslB8Q1KOCuHrcBiXCYg9YXddWEMZ eJOnmLdKVk++75Ldyjw93r6d+HiWbwxctWNjjfm2Ft1iGvVjrzlvmGWPg9+B9c5cvzn3 GH8H9OpdLOzTZVQiKyQqRx4OCek/E3W49EPo2+NAT0pJ8pFdoWJt3Lh8aOyAYxWcV+4p kZmNUIemh1jysAp6ZE77ksb65XNZ/QngpHRwApfFceQvbxx9+xJ9X34YNeeywo9vJp51 Qbww== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1765829429; x=1766434229; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=Ko/LClSHTdkG76HWn8DkHdX7BBXIyp6DFUvzQNarTeQ=; b=EQ0nGsbhx54muB24muIK+k3aUD4AKHB4+ggSpCwCCICq181/TEURU3NPTj1miXHzSK BgsmBYYsOfR5a1xafBCLtGwBgK50ev6vt6P4rzYxGWjCt0RwM2H4Q6eAuMo4SCW/5U4S oIr+sZQw72deHx277pbo+6giV7+aBawD/onJ6QDtiWCxrWEMkby8aQrnrPE47mxXhvti fotv+Y2Rk1ucse7vlfdevhH/e09pYx1Gzd6FeIg1+KFgarsB/8Zx02aJz4bWmFU2s9OS Y87uUyW1o/M9wrFLBlKCkSEt/L6EewxG4QJyQDfpN+dDCg7X6AqHsau3cbtOrKvkqGar P1Mg== X-Forwarded-Encrypted: i=1; AJvYcCUJNn+P5/ty3iPiLJiG9qLqxXqqNJ2XjH18rDUtKRk5afQYeS2Z72VHQkVmQkcfRwjGYIJdrtcJDlDSA5I=@vger.kernel.org X-Gm-Message-State: AOJu0YxEAfTrO8BY2NBulZWH4mzHSDfBC8yX9Y+4TeT18DtuGRnwQWVd VvRxwBO83eXxxa7iiXeQGGJ1kWKoteArnD4ilENGf4anBCfyyFqC9FHm0NEv7RMzj4BpMGvu9tF OGuEgkwOtCSH20H4oFxT4q8P2UAO06AKSmKvnutq5lV2tpMW40tIU X-Gm-Gg: AY/fxX5X0CJtHA92q3rwWOEddtA5rqB6iy21UdG4CYiGaw4r0+O7V026Rg6dHnZSVw5 Mh3L+aKeHxbMMyc+ixER4MKXVCLICLHMJCVX8g32MKUdWVnGcd7JFl3hLNDoz8BJbv9PzY02AEA rir+WVvGtv2RgPgfaFamz6S+h5RotdJClNubo6NHDzaXPA08JdOSPnmcR6pqj0548XhzKV6IDbg c8+4+osREWH+3pWis4GDsdWvFX8OnoYitI5nUqDGYGUckZRf63DfB+vgyKh7MhxdzBmcQ6vzOYc AALR8qLIowFjHZoGkttiRee1jD27ZE8nYcmwBdPXXd78d890GDQ5vNwxQ0f54y5GnCAnwhPDucO LcjUrZFxn1ED3otGHGy25nKtUuZ0= X-Google-Smtp-Source: AGHT+IEZQLFh/AgKtPTPGty6OJt5+mqM9TvDA8qf8mzAk9jAKV7/9jzHrUr8dG3UFHhpYJiCs2Be0je2xaqc X-Received: by 2002:a17:903:1510:b0:2a0:b7cd:d9c6 with SMTP id d9443c01a7336-2a0b7cddbb1mr54185655ad.6.1765829428629; Mon, 15 Dec 2025 12:10:28 -0800 (PST) Received: from c7-smtp-2023.dev.purestorage.com ([2620:125:9017:12:36:3:5:0]) by smtp-relay.gmail.com with ESMTPS id d9443c01a7336-2a0d72620f2sm7448815ad.6.2025.12.15.12.10.28 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 15 Dec 2025 12:10:28 -0800 (PST) X-Relaying-Domain: purestorage.com Received: from dev-csander.dev.purestorage.com (dev-csander.dev.purestorage.com [10.7.70.37]) by 
c7-smtp-2023.dev.purestorage.com (Postfix) with ESMTP id F1EB4340644; Mon, 15 Dec 2025 13:10:27 -0700 (MST)
Received: by dev-csander.dev.purestorage.com (Postfix, from userid 1557716354) id EE9EFE41D23; Mon, 15 Dec 2025 13:10:27 -0700 (MST)
From: Caleb Sander Mateos
To: Jens Axboe , io-uring@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Joanne Koong , Caleb Sander Mateos , syzbot@syzkaller.appspotmail.com
Subject: [PATCH v5 6/6] io_uring: avoid uring_lock for IORING_SETUP_SINGLE_ISSUER
Date: Mon, 15 Dec 2025 13:09:09 -0700
Message-ID: <20251215200909.3505001-7-csander@purestorage.com>
X-Mailer: git-send-email 2.45.2
In-Reply-To: <20251215200909.3505001-1-csander@purestorage.com>
References: <20251215200909.3505001-1-csander@purestorage.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

io_ring_ctx's mutex uring_lock can be quite expensive in high-IOPS
workloads. Even when only one thread pinned to a single CPU is accessing
the io_ring_ctx, the atomic CASes required to lock and unlock the mutex
are very hot instructions.

The mutex's primary purpose is to prevent concurrent io_uring system
calls on the same io_ring_ctx. However, there is already a flag
IORING_SETUP_SINGLE_ISSUER that promises only one task will make
io_uring_enter() and io_uring_register() system calls on the io_ring_ctx
once it's enabled. So if the io_ring_ctx is set up with
IORING_SETUP_SINGLE_ISSUER, skip the uring_lock mutex_lock() and
mutex_unlock() on the submitter_task. When another task needs to acquire
the ctx uring lock, it uses a task work item to suspend the
submitter_task for the duration of the critical section.

If the io_ring_ctx is IORING_SETUP_R_DISABLED (possible during
io_uring_setup(), io_uring_register(), or io_uring exit), submitter_task
may be set concurrently, so acquire the uring_lock before checking it.
If submitter_task isn't set yet, the uring_lock suffices to provide
mutual exclusion.
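For reference, the fast path this change targets is a ring created with
IORING_SETUP_SINGLE_ISSUER and driven entirely from one thread. A minimal
userspace sketch of such a setup (not part of this patch; it assumes
liburing's io_uring_queue_init(), io_uring_enable_rings(), and
io_uring_queue_exit() helpers):

#include <liburing.h>
#include <stdio.h>

int main(void)
{
	struct io_uring ring;
	/*
	 * Only this task will ever call io_uring_enter() or
	 * io_uring_register() on the ring, which is the invariant that
	 * lets the kernel skip the uring_lock on the submitter task.
	 * IORING_SETUP_R_DISABLED keeps the ring disabled until this
	 * task enables it below.
	 */
	unsigned int flags = IORING_SETUP_SINGLE_ISSUER |
			     IORING_SETUP_R_DISABLED;
	int ret = io_uring_queue_init(256, &ring, flags);

	if (ret < 0) {
		fprintf(stderr, "queue_init: %d\n", ret);
		return 1;
	}

	/* Clears IORING_SETUP_R_DISABLED; this task becomes the submitter. */
	ret = io_uring_enable_rings(&ring);
	if (ret < 0) {
		fprintf(stderr, "enable_rings: %d\n", ret);
		io_uring_queue_exit(&ring);
		return 1;
	}

	/* ... submit and reap I/O from this thread only ... */

	io_uring_queue_exit(&ring);
	return 0;
}

With this setup, only the thread that enabled the ring may issue the
io_uring system calls, which is exactly the promise the lock elision
relies on.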
Signed-off-by: Caleb Sander Mateos Tested-by: syzbot@syzkaller.appspotmail.com --- io_uring/io_uring.c | 12 +++++ io_uring/io_uring.h | 114 ++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 123 insertions(+), 3 deletions(-) diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index ac71350285d7..9a9dfcb0476e 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -363,10 +363,22 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(s= truct io_uring_params *p) xa_destroy(&ctx->io_bl_xa); kfree(ctx); return NULL; } =20 +void io_ring_suspend_work(struct callback_head *cb_head) +{ + struct io_ring_suspend_work *suspend_work =3D + container_of(cb_head, struct io_ring_suspend_work, cb_head); + DECLARE_COMPLETION_ONSTACK(suspend_end); + + *suspend_work->suspend_end =3D &suspend_end; + complete(&suspend_work->suspend_start); + + wait_for_completion(&suspend_end); +} + static void io_clean_op(struct io_kiocb *req) { if (unlikely(req->flags & REQ_F_BUFFER_SELECTED)) io_kbuf_drop_legacy(req); =20 diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h index 57c3eef26a88..2b08d0ddab30 100644 --- a/io_uring/io_uring.h +++ b/io_uring/io_uring.h @@ -1,8 +1,9 @@ #ifndef IOU_CORE_H #define IOU_CORE_H =20 +#include #include #include #include #include #include @@ -195,19 +196,85 @@ void io_queue_next(struct io_kiocb *req); void io_task_refs_refill(struct io_uring_task *tctx); bool __io_alloc_req_refill(struct io_ring_ctx *ctx); =20 void io_activate_pollwq(struct io_ring_ctx *ctx); =20 +/* + * The ctx uring lock protects most of the mutable struct io_ring_ctx state + * accessed in the struct io_kiocb issue path. In the I/O path, it is typi= cally + * acquired in the io_uring_enter() syscall and in io_handle_tw_list(). For + * IORING_SETUP_SQPOLL, it's acquired by io_sq_thread() instead. io_kiocb's + * issued with IO_URING_F_UNLOCKED in issue_flags (e.g. by io_wq_submit_wo= rk()) + * acquire and release the ctx uring lock whenever they must touch io_ring= _ctx + * state. io_uring_register() also acquires the ctx uring lock because most + * opcodes mutate io_ring_ctx state accessed in the issue path. + * + * For !IORING_SETUP_SINGLE_ISSUER io_ring_ctx's, acquiring the ctx uring = lock + * is done via mutex_(try)lock(&ctx->uring_lock). + * + * However, for IORING_SETUP_SINGLE_ISSUER, we can avoid the mutex_lock() + + * mutex_unlock() overhead on submitter_task because a single thread can't= race + * with itself. In the uncommon case where the ctx uring lock is needed on + * another thread, it must suspend submitter_task by scheduling a task wor= k item + * on it. io_ring_ctx_lock() returns once the task work item has started. + * io_ring_ctx_unlock() allows the task work item to complete. + * If io_ring_ctx_lock() is called while the ctx is IORING_SETUP_R_DISABLED + * (e.g. during ctx create or exit), io_ring_ctx_lock() must acquire uring= _lock + * because submitter_task isn't set yet. submitter_task can be accessed on= ce + * uring_lock is held. If submitter_task exists, we do the same thing as i= n the + * non-IORING_SETUP_R_DISABLED case (except with uring_lock also held). If + * submitter_task isn't set, all other io_ring_ctx_lock() callers will also + * acquire uring_lock, so it suffices for mutual exclusion. 
+ */ + +struct io_ring_suspend_work { + struct callback_head cb_head; + struct completion suspend_start; + struct completion **suspend_end; +}; + +void io_ring_suspend_work(struct callback_head *cb_head); + struct io_ring_ctx_lock_state { + bool need_mutex; + struct completion *suspend_end; }; =20 /* Acquire the ctx uring lock with the given nesting level */ static inline void io_ring_ctx_lock_nested(struct io_ring_ctx *ctx, unsigned int subclass, struct io_ring_ctx_lock_state *state) { - mutex_lock_nested(&ctx->uring_lock, subclass); + struct io_ring_suspend_work suspend_work; + + if (!(ctx->flags & IORING_SETUP_SINGLE_ISSUER)) { + mutex_lock_nested(&ctx->uring_lock, subclass); + return; + } + + state->suspend_end =3D NULL; + state->need_mutex =3D + !!(smp_load_acquire(&ctx->flags) & IORING_SETUP_R_DISABLED); + if (unlikely(state->need_mutex)) { + mutex_lock_nested(&ctx->uring_lock, subclass); + if (likely(!ctx->submitter_task)) + return; + } + + if (likely(current =3D=3D ctx->submitter_task)) + return; + + /* Use task work to suspend submitter_task */ + init_task_work(&suspend_work.cb_head, io_ring_suspend_work); + init_completion(&suspend_work.suspend_start); + suspend_work.suspend_end =3D &state->suspend_end; + /* If task_work_add() fails, task is exiting, so no need to suspend */ + if (unlikely(task_work_add(ctx->submitter_task, &suspend_work.cb_head, + TWA_SIGNAL))) + return; + + wait_for_completion(&suspend_work.suspend_start); } =20 /* Acquire the ctx uring lock */ static inline void io_ring_ctx_lock(struct io_ring_ctx *ctx, struct io_ring_ctx_lock_state *state) @@ -217,29 +284,70 @@ static inline void io_ring_ctx_lock(struct io_ring_ct= x *ctx, =20 /* Attempt to acquire the ctx uring lock without blocking */ static inline bool io_ring_ctx_trylock(struct io_ring_ctx *ctx, struct io_ring_ctx_lock_state *state) { - return mutex_trylock(&ctx->uring_lock); + if (!(ctx->flags & IORING_SETUP_SINGLE_ISSUER)) + return mutex_trylock(&ctx->uring_lock); + + state->suspend_end =3D NULL; + state->need_mutex =3D + !!(smp_load_acquire(&ctx->flags) & IORING_SETUP_R_DISABLED); + if (unlikely(state->need_mutex)) { + if (!mutex_trylock(&ctx->uring_lock)) + return false; + if (likely(!ctx->submitter_task)) + return true; + } + + if (unlikely(current !=3D ctx->submitter_task)) + goto unlock; + + return true; + +unlock: + if (unlikely(state->need_mutex)) + mutex_unlock(&ctx->uring_lock); + return false; } =20 /* Release the ctx uring lock */ static inline void io_ring_ctx_unlock(struct io_ring_ctx *ctx, struct io_ring_ctx_lock_state *state) { - mutex_unlock(&ctx->uring_lock); + if (!(ctx->flags & IORING_SETUP_SINGLE_ISSUER)) { + mutex_unlock(&ctx->uring_lock); + return; + } + + if (unlikely(state->need_mutex)) + mutex_unlock(&ctx->uring_lock); + if (unlikely(state->suspend_end)) + complete(state->suspend_end); } =20 /* Return (if CONFIG_LOCKDEP) whether the ctx uring lock is held */ static inline bool io_ring_ctx_lock_held(const struct io_ring_ctx *ctx) { + /* + * No straightforward way to check that submitter_task is suspended + * without access to struct io_ring_ctx_lock_state + */ + if (ctx->flags & IORING_SETUP_SINGLE_ISSUER && + !(ctx->flags & IORING_SETUP_R_DISABLED)) + return true; + return lockdep_is_held(&ctx->uring_lock); } =20 /* Assert (if CONFIG_LOCKDEP) that the ctx uring lock is held */ static inline void io_ring_ctx_assert_locked(const struct io_ring_ctx *ctx) { + if (ctx->flags & IORING_SETUP_SINGLE_ISSUER && + !(ctx->flags & IORING_SETUP_R_DISABLED)) + return; + 
lockdep_assert_held(&ctx->uring_lock); } =20 static inline void io_lockdep_assert_cq_locked(struct io_ring_ctx *ctx) { --=20 2.45.2
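The suspend handshake implemented by io_ring_suspend_work() and
io_ring_ctx_lock_nested() above boils down to: the locking task queues a
callback on the submitter, waits for the callback to report that the
submitter is parked, runs its critical section, and then lets the
callback return. A standalone userspace analogue, under simplifying
assumptions (the submitter polls a flag rather than receiving task work,
and POSIX semaphores stand in for struct completion and task_work_add();
all names below are illustrative, none are kernel APIs):

#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical shared state; the kernel analogue is the io_ring_ctx. */
static long shared_counter;

static atomic_bool suspend_requested = false;
static atomic_bool done = false;
static sem_t suspend_started;	/* submitter -> locker: I am parked */
static sem_t suspend_finished;	/* locker -> submitter: resume */

/* Stand-in for submitter_task: touches state with no lock on the fast path. */
static void *submitter(void *arg)
{
	(void)arg;
	while (!atomic_load(&done)) {
		shared_counter++;	/* lock-free fast path */

		/* Analogue of running the queued suspend callback. */
		if (atomic_load(&suspend_requested)) {
			sem_post(&suspend_started);
			sem_wait(&suspend_finished);
			atomic_store(&suspend_requested, false);
		}
	}
	return NULL;
}

/* Analogue of io_ring_ctx_lock() on a non-submitter task. */
static void lock_by_suspending_submitter(void)
{
	atomic_store(&suspend_requested, true);
	sem_wait(&suspend_started);	/* submitter is now parked */
}

/* Analogue of io_ring_ctx_unlock(). */
static void unlock_by_resuming_submitter(void)
{
	sem_post(&suspend_finished);
}

int main(void)
{
	pthread_t thr;

	sem_init(&suspend_started, 0, 0);
	sem_init(&suspend_finished, 0, 0);
	pthread_create(&thr, NULL, submitter, NULL);

	lock_by_suspending_submitter();
	/* Critical section: submitter is parked, shared state is stable. */
	printf("counter while submitter parked: %ld\n", shared_counter);
	unlock_by_resuming_submitter();

	atomic_store(&done, true);
	pthread_join(thr, NULL);
	return 0;
}

The ordering mirrors the kernel version: waiting on suspend_started here
corresponds to wait_for_completion(&suspend_work.suspend_start) in
io_ring_ctx_lock_nested(), and posting suspend_finished corresponds to
complete(state->suspend_end) in io_ring_ctx_unlock().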