From nobody Wed Feb 5 12:51:28 2025 Received: from mail-wm1-f47.google.com (mail-wm1-f47.google.com [209.85.128.47]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 01EF18F77 for ; Wed, 15 Jan 2025 20:26:19 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.47 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736972781; cv=none; b=XqtJCmjvYp/iLD0gwJsN6yoM7+c6IIKdcTsM9tMYbayX4iLOBir+8bu8ThGUZi3oUdOR9R1nF0x3kGJ17PT5tK/GjpLVpJAuTEy66toJc6kvIV8LBq2E+1txedIZMDG/5KMKZMRfiJP+zpW7FRMh1+hIYcTeAKB2dc5PaUi4bhs= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736972781; c=relaxed/simple; bh=Wbj8wHc9hI+sc5M7Z80jkx1gKa1NoydY6mW7SjRm1PM=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:To:Cc; b=TxMf8J+dBFVKa58veHOEeoJpqVMm0l1RbkiUcpKaBPLZAbNI+tDPcl5564rAkMeKRIHUu7RU+tkDUN42kX1ABDOnESNDhAEQKLyd2eZvrcntPxrH1pHpAn4NaoYjK0N0t2T7FxEZYJoqymM9jYW+yRZNdNE47+/SICktU+p1lQg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=hOzzcs/Y; arc=none smtp.client-ip=209.85.128.47 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="hOzzcs/Y" Received: by mail-wm1-f47.google.com with SMTP id 5b1f17b1804b1-43621d2dd4cso11695e9.0 for ; Wed, 15 Jan 2025 12:26:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1736972778; x=1737577578; darn=vger.kernel.org; h=cc:to:message-id:content-transfer-encoding:mime-version:subject :date:from:from:to:cc:subject:date:message-id:reply-to; bh=Ky6sqF/DxV3HW/KuX+jGIKyXzTOtCRQ/H4Z1APLNpdQ=; b=hOzzcs/YqARgINOqDBF+SuVzTsthYpZyunZaQ6qB/EjCb06gkTB6Bh4HddKAd+4ifh 66fod9i77PV7caAotWwpuaCENsboXVxX5CRDPEswLZp6j+sGGRlxAM2DUFVvAgyUB34L VUr9eJ6AdZOT3/q6GgZ4A2I/2Pt9GlATOABuhW3eI5vaYPWE40N9BOSSpiVWkstA7rId 7uLgXZ1l5vcSPjvgQFYZ/fWOXO+nHM6wpo1UmCHzdRiSSDpHDvFuwL0md/BT9SkqtboV iy+tBBltGjUGZJ2nm/x05lrOrVkV6PtXpWyM2BppWsKm7jQ5bvnEFxTmQh51rQFefPiK PFmg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1736972778; x=1737577578; h=cc:to:message-id:content-transfer-encoding:mime-version:subject :date:from:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=Ky6sqF/DxV3HW/KuX+jGIKyXzTOtCRQ/H4Z1APLNpdQ=; b=DYaePfkFU3yssxRTrcO7qw/FU7Nmnb6abtFRuKDrTADl9lPgafaJ4GHPVe5uswSNrr 4Ye8jNJNdQ2CZREFqubhcR0CNERdu2/hLYmiy6a4JtwJoo/UzwOIByzFWl8YMRauGP8M UyiV6HXGPPRKnR6f1JBMGvfha8Ck1MsbTo5DC2XGTQP0ysCGRf+ZQY1PRdbNh0YROQ2S NNplOZ7rGYNbaUszlemgDIwZe5tc52lvAzpH7IkijIYd74BiCjtAMHr+YixAZV5bQdm2 Hg1NKk955dL6GgR/rsXOkkT8FRJx+CcoQKvWUd+JJonbyN7JvXd67TSn3/GhitClwJQJ 3ZHA== X-Forwarded-Encrypted: i=1; AJvYcCXKq4a4Jux7CpKvKQoXRaTZaYBh/Xd/nBH85Qs6r9VYQ8dlNptnKr7djv04nhyykNxjn3h2pECr7eSbL94=@vger.kernel.org X-Gm-Message-State: AOJu0Yx88jOR5tQsFs953eBr/m9eH022XV6bMJEs1ednd7qNz59ulwfC vD46vtjp03lw+di2PaIbhN5DJXBECLCjLMKYAmOwPE3h9OWcN5RqUlN9/MG3fPTK7EB9uksYOZO hnmgo X-Gm-Gg: ASbGncvxNBwUdif7VyeUtYhsqYl8K3LGyewToSrwgWXScOvIkfBgmAKYU9PKc0V1RLv bFymKdJ+B7GUjeCPxmZWDP6LG3c/mv+q8O5vm/94xbYDMwtkMSWhW3uz0/5EHvHYx7itK429AC+ WF6qvBCpJt/mGbyCkbUOMXh9W0qjJXPH7eK3+QcuxK0qVU6I5anL5Gtc30jgaYbL6B7ht6hC4NE a+1XWa1/ZMDcuZKKUud5RumPJuE8AD51KkNcwQjwPpyOQ== X-Google-Smtp-Source: AGHT+IEMPaeI5PrSe2O9UFoYl0YyTRUodMrljDEYcPZZyFnTQLxQNA0YUBLH+y4f9UoINBr6AvQBOQ== X-Received: by 2002:a05:600c:19c9:b0:42c:9e35:cde6 with SMTP id 5b1f17b1804b1-4388b2ec3a2mr165965e9.2.1736972777649; Wed, 15 Jan 2025 12:26:17 -0800 (PST) Received: from localhost ([2a00:79e0:9d:4:7175:4098:36d3:bfd7]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-437c74accc3sm35317545e9.14.2025.01.15.12.26.16 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 15 Jan 2025 12:26:17 -0800 (PST) From: Jann Horn Date: Wed, 15 Jan 2025 21:26:03 +0100 Subject: [PATCH v2] io_uring/rsrc: Simplify buffer cloning by locking both rings Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20250115-uring-clone-refactor-v2-1-7289ba50776d@google.com> X-B4-Tracking: v=1; b=H4sIANoZiGcC/4WNSw7CIBCGr9LM2jEMCTa68h6mixZGOkkFA5VoG u4u9gIuv/+5QeYknOHSbZC4SJYYGuhDB3Yeg2cU1xi00kYRGXwlCR7tEgNj4vto15hQnZyz7Bx PSkGrPpsj7332NjSeJbfYZ38p9FP/DBZCwkm7syHT91rR1cfoFz7a+ICh1voFYi19CbkAAAA= X-Change-ID: 20250115-uring-clone-refactor-06ddceddeb00 To: Jens Axboe , Pavel Begunkov Cc: io-uring@vger.kernel.org, linux-kernel@vger.kernel.org, Jann Horn X-Mailer: b4 0.15-dev X-Developer-Signature: v=1; a=ed25519-sha256; t=1736972773; l=5934; i=jannh@google.com; s=20240730; h=from:subject:message-id; bh=Wbj8wHc9hI+sc5M7Z80jkx1gKa1NoydY6mW7SjRm1PM=; b=hLPL5px54/4jPG4P/qQJvPSY2WunB5VTCui3wPThbJE7Ac8zGEQY8rw1g8Pt91am+H4Ffr10+ np8xN99ZQ8wCZLMfnUI4lRi/ILRMr8SDvmMXhVhQRzwqkLgyhnrIT7H X-Developer-Key: i=jannh@google.com; a=ed25519; pk=AljNtGOzXeF6khBXDJVVvwSEkVDGnnZZYqfWhP1V+C8= The locking in the buffer cloning code is somewhat complex because it goes back and forth between locking the source ring and the destination ring. Make it easier to reason about by locking both rings at the same time. To avoid ABBA deadlocks, lock the rings in ascending kernel address order, just like in lock_two_nondirectories(). Signed-off-by: Jann Horn --- (patch is based on top of Jens' for-next tree) --- Changes in v2: - fold in suggested changes from Jens (https://lore.kernel.org/r/2439336d-b= 6ae-4d08-a1e8-2372fc6df383@kernel.dk) - Link to v1: https://lore.kernel.org/r/20250115-uring-clone-refactor-v1-1-= b2d951577201@google.com --- io_uring/rsrc.c | 73 +++++++++++++++++++++++++++++++----------------------= ---- 1 file changed, 40 insertions(+), 33 deletions(-) diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c index e32ac58533914706c8e031754432ed634d7c0354..a1c7c8db55455e6db862a17ed61= 1f3d6ce885b3d 100644 --- a/io_uring/rsrc.c +++ b/io_uring/rsrc.c @@ -921,6 +921,16 @@ int io_import_fixed(int ddir, struct iov_iter *iter, return 0; } =20 +/* Lock two rings at once. The rings must be different! */ +static void lock_two_rings(struct io_ring_ctx *ctx1, struct io_ring_ctx *c= tx2) +{ + if (ctx1 > ctx2) + swap(ctx1, ctx2); + mutex_lock(&ctx1->uring_lock); + mutex_lock_nested(&ctx2->uring_lock, SINGLE_DEPTH_NESTING); +} + +/* Both rings are locked by the caller. */ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *s= rc_ctx, struct io_uring_clone_buffers *arg) { @@ -928,6 +938,9 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, st= ruct io_ring_ctx *src_ctx int i, ret, off, nr; unsigned int nbufs; =20 + lockdep_assert_held(&ctx->uring_lock); + lockdep_assert_held(&src_ctx->uring_lock); + /* * Accounting state is shared between the two rings; that only works if * both rings are accounted towards the same counters. @@ -942,7 +955,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, st= ruct io_ring_ctx *src_ctx if (ctx->buf_table.nr && !(arg->flags & IORING_REGISTER_DST_REPLACE)) return -EBUSY; =20 - nbufs =3D READ_ONCE(src_ctx->buf_table.nr); + nbufs =3D src_ctx->buf_table.nr; if (!arg->nr) arg->nr =3D nbufs; else if (arg->nr > nbufs) @@ -966,27 +979,20 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, = struct io_ring_ctx *src_ctx } } =20 - /* - * Drop our own lock here. We'll setup the data we need and reference - * the source buffers, then re-grab, check, and assign at the end. - */ - mutex_unlock(&ctx->uring_lock); - - mutex_lock(&src_ctx->uring_lock); ret =3D -ENXIO; nbufs =3D src_ctx->buf_table.nr; if (!nbufs) - goto out_unlock; + goto out_free; ret =3D -EINVAL; if (!arg->nr) arg->nr =3D nbufs; else if (arg->nr > nbufs) - goto out_unlock; + goto out_free; ret =3D -EOVERFLOW; if (check_add_overflow(arg->nr, arg->src_off, &off)) - goto out_unlock; + goto out_free; if (off > nbufs) - goto out_unlock; + goto out_free; =20 off =3D arg->dst_off; i =3D arg->src_off; @@ -1001,7 +1007,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, = struct io_ring_ctx *src_ctx dst_node =3D io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER); if (!dst_node) { ret =3D -ENOMEM; - goto out_unlock; + goto out_free; } =20 refcount_inc(&src_node->buf->refs); @@ -1011,10 +1017,6 @@ static int io_clone_buffers(struct io_ring_ctx *ctx,= struct io_ring_ctx *src_ctx i++; } =20 - /* Have a ref on the bufs now, drop src lock and re-grab our own lock */ - mutex_unlock(&src_ctx->uring_lock); - mutex_lock(&ctx->uring_lock); - /* * If asked for replace, put the old table. data->nodes[] holds both * old and new nodes at this point. @@ -1023,24 +1025,17 @@ static int io_clone_buffers(struct io_ring_ctx *ctx= , struct io_ring_ctx *src_ctx io_rsrc_data_free(ctx, &ctx->buf_table); =20 /* - * ctx->buf_table should be empty now - either the contents are being - * replaced and we just freed the table, or someone raced setting up - * a buffer table while the clone was happening. If not empty, fall - * through to failure handling. + * ctx->buf_table must be empty now - either the contents are being + * replaced and we just freed the table, or the contents are being + * copied to a ring that does not have buffers yet (checked at function + * entry). */ - if (!ctx->buf_table.nr) { - ctx->buf_table =3D data; - return 0; - } + WARN_ON_ONCE(ctx->buf_table.nr); + ctx->buf_table =3D data; + return 0; =20 - mutex_unlock(&ctx->uring_lock); - mutex_lock(&src_ctx->uring_lock); - /* someone raced setting up buffers, dump ours */ - ret =3D -EBUSY; -out_unlock: +out_free: io_rsrc_data_free(ctx, &data); - mutex_unlock(&src_ctx->uring_lock); - mutex_lock(&ctx->uring_lock); return ret; } =20 @@ -1054,6 +1049,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, = struct io_ring_ctx *src_ctx int io_register_clone_buffers(struct io_ring_ctx *ctx, void __user *arg) { struct io_uring_clone_buffers buf; + struct io_ring_ctx *src_ctx; bool registered_src; struct file *file; int ret; @@ -1071,7 +1067,18 @@ int io_register_clone_buffers(struct io_ring_ctx *ct= x, void __user *arg) file =3D io_uring_register_get_file(buf.src_fd, registered_src); if (IS_ERR(file)) return PTR_ERR(file); - ret =3D io_clone_buffers(ctx, file->private_data, &buf); + + src_ctx =3D file->private_data; + if (src_ctx !=3D ctx) { + mutex_unlock(&ctx->uring_lock); + lock_two_rings(ctx, src_ctx); + } + + ret =3D io_clone_buffers(ctx, src_ctx, &buf); + + if (src_ctx !=3D ctx) + mutex_unlock(&src_ctx->uring_lock); + if (!registered_src) fput(file); return ret; --- base-commit: 1ac3ba2b3cc41645a30e49f93d3e09bd05e6d2c7 change-id: 20250115-uring-clone-refactor-06ddceddeb00 --=20 Jann Horn