From: Ming Lei <ming.lei@redhat.com>
To: Jens Axboe, linux-block@vger.kernel.org
Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
 linux-kernel@vger.kernel.org, Ming Lei
Subject: [PATCH V4 27/27] selftests: ublk: support arbitrary threads/queues combination
Date: Fri, 21 Nov 2025 09:58:49 +0800
Message-ID: <20251121015851.3672073-28-ming.lei@redhat.com>
In-Reply-To: <20251121015851.3672073-1-ming.lei@redhat.com>
References: <20251121015851.3672073-1-ming.lei@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Enable flexible thread-to-queue mapping in batch I/O mode to support
arbitrary combinations of threads and queues, improving resource
utilization and scalability.

Key improvements:

- Support N:M thread-to-queue mapping (previously limited to 1:1)
- Dynamic buffer allocation based on actual queue assignment per thread
- Thread-safe queue preparation with spinlock protection
- Intelligent buffer index calculation for multi-queue scenarios
- Enhanced validation for thread/queue combination constraints

Implementation details:

- Add q_thread_map matrix to track queue-to-thread assignments
- Dynamic allocation of commit and fetch buffers per thread
- Round-robin queue assignment algorithm for load balancing
- Per-queue spinlock to prevent race conditions during prep
- Updated buffer index calculation using queue position within thread

This enables efficient configurations such as arbitrary N:M
combinations for optimal resource matching.

Testing:

- Added test_generic_15.sh: 4 threads vs 1 queue
- Added test_generic_16.sh: 1 thread vs 4 queues
- Validates correctness across different mapping scenarios

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 tools/testing/selftests/ublk/Makefile         |   2 +
 tools/testing/selftests/ublk/batch.c          | 199 +++++++++++++++---
 tools/testing/selftests/ublk/kublk.c          |  49 ++++-
 tools/testing/selftests/ublk/kublk.h          |  40 +++-
 .../testing/selftests/ublk/test_generic_15.sh |  30 +++
 .../testing/selftests/ublk/test_generic_16.sh |  30 +++
 6 files changed, 302 insertions(+), 48 deletions(-)
 create mode 100755 tools/testing/selftests/ublk/test_generic_15.sh
 create mode 100755 tools/testing/selftests/ublk/test_generic_16.sh

diff --git a/tools/testing/selftests/ublk/Makefile b/tools/testing/selftests/ublk/Makefile
index cbf57113b1a6..3dbd9a857716 100644
--- a/tools/testing/selftests/ublk/Makefile
+++ b/tools/testing/selftests/ublk/Makefile
@@ -22,6 +22,8 @@ TEST_PROGS += test_generic_11.sh
 TEST_PROGS += test_generic_12.sh
 TEST_PROGS += test_generic_13.sh
 TEST_PROGS += test_generic_14.sh
+TEST_PROGS += test_generic_15.sh
+TEST_PROGS += test_generic_16.sh
 
 TEST_PROGS += test_null_01.sh
 TEST_PROGS += test_null_02.sh
diff --git a/tools/testing/selftests/ublk/batch.c b/tools/testing/selftests/ublk/batch.c
index 7db91f910944..db0747f13768 100644
--- a/tools/testing/selftests/ublk/batch.c
+++ b/tools/testing/selftests/ublk/batch.c
@@ -76,6 +76,7 @@ static void free_batch_commit_buf(struct ublk_thread *t)
 		free(t->commit_buf);
 	}
 	allocator_deinit(&t->commit_buf_alloc);
+	free(t->commit);
 }
 
 static int alloc_batch_commit_buf(struct ublk_thread *t)
@@ -84,7 +85,13 @@ static int alloc_batch_commit_buf(struct ublk_thread *t)
 	unsigned int total = buf_size * t->nr_commit_buf;
 	unsigned int page_sz = getpagesize();
 	void *buf = NULL;
-	int ret;
+	int i, ret, j = 0;
+
+	t->commit = calloc(t->nr_queues, sizeof(*t->commit));
+	for (i = 0; i < t->dev->dev_info.nr_hw_queues; i++) {
+		if (t->q_map[i])
+			t->commit[j++].q_id = i;
+	}
 
 	allocator_init(&t->commit_buf_alloc, t->nr_commit_buf);
 
@@ -107,6 +114,17 @@ static int alloc_batch_commit_buf(struct ublk_thread *t)
 	return ret;
 }
 
+static unsigned int ublk_thread_nr_queues(const struct ublk_thread *t)
+{
+	int i;
+	int ret = 0;
+
+	for (i = 0; i < t->dev->dev_info.nr_hw_queues; i++)
+		ret += !!t->q_map[i];
+
+	return ret;
+}
+
 void ublk_batch_prepare(struct ublk_thread *t)
 {
 	/*
@@ -119,10 +137,13 @@ void ublk_batch_prepare(struct ublk_thread *t)
 	 */
 	struct ublk_queue *q = &t->dev->q[0];
 
+	/* cache nr_queues because we don't support dynamic load-balance yet */
+	t->nr_queues = ublk_thread_nr_queues(t);
+
 	t->commit_buf_elem_size = ublk_commit_elem_buf_size(t->dev);
 	t->commit_buf_size = ublk_commit_buf_size(t);
 	t->commit_buf_start = t->nr_bufs;
-	t->nr_commit_buf = 2;
+	t->nr_commit_buf = 2 * t->nr_queues;
 	t->nr_bufs += t->nr_commit_buf;
 
 	t->cmd_flags = 0;
@@ -144,11 +165,12 @@ static void free_batch_fetch_buf(struct ublk_thread *t)
 {
 	int i;
 
-	for (i = 0; i < UBLKS_T_NR_FETCH_BUF; i++) {
+	for (i = 0; i < t->nr_fetch_bufs; i++) {
 		io_uring_free_buf_ring(&t->ring, t->fetch[i].br, 1, i);
 		munlock(t->fetch[i].fetch_buf, t->fetch[i].fetch_buf_size);
 		free(t->fetch[i].fetch_buf);
 	}
+	free(t->fetch);
 }
 
 static int alloc_batch_fetch_buf(struct ublk_thread *t)
@@ -159,7 +181,12 @@ static int alloc_batch_fetch_buf(struct ublk_thread *t)
 	int ret;
 	int i = 0;
 
-	for (i = 0; i < UBLKS_T_NR_FETCH_BUF; i++) {
+	/* double fetch buffer for each queue */
+	t->nr_fetch_bufs = t->nr_queues * 2;
+	t->fetch = calloc(t->nr_fetch_bufs, sizeof(*t->fetch));
+
+	/* allocate one buffer for each queue */
+	for (i = 0; i < t->nr_fetch_bufs; i++) {
 		t->fetch[i].fetch_buf_size = buf_size;
 
 		if (posix_memalign((void **)&t->fetch[i].fetch_buf, pg_sz,
@@ -185,7 +212,7 @@ int ublk_batch_alloc_buf(struct ublk_thread *t)
 {
 	int ret;
 
-	ublk_assert(t->nr_commit_buf < 16);
+	ublk_assert(t->nr_commit_buf < 2 * UBLK_MAX_QUEUES);
 
 	ret = alloc_batch_commit_buf(t);
 	if (ret)
@@ -271,13 +298,20 @@ static void ublk_batch_queue_fetch(struct ublk_thread *t,
 	t->fetch[buf_idx].fetch_buf_off = 0;
 }
 
-void ublk_batch_start_fetch(struct ublk_thread *t,
-			    struct ublk_queue *q)
+void ublk_batch_start_fetch(struct ublk_thread *t)
 {
 	int i;
+	int j = 0;
+
+	for (i = 0; i < t->dev->dev_info.nr_hw_queues; i++) {
+		if (t->q_map[i]) {
+			struct ublk_queue *q = &t->dev->q[i];
 
-	for (i = 0; i < UBLKS_T_NR_FETCH_BUF; i++)
-		ublk_batch_queue_fetch(t, q, i);
+			/* submit two fetch commands for each queue */
+			ublk_batch_queue_fetch(t, q, j++);
+			ublk_batch_queue_fetch(t, q, j++);
+		}
+	}
 }
 
 static unsigned short ublk_compl_batch_fetch(struct ublk_thread *t,
@@ -317,7 +351,7 @@ static unsigned short ublk_compl_batch_fetch(struct ublk_thread *t,
 	return buf_idx;
 }
 
-int ublk_batch_queue_prep_io_cmds(struct ublk_thread *t, struct ublk_queue *q)
+static int __ublk_batch_queue_prep_io_cmds(struct ublk_thread *t, struct ublk_queue *q)
 {
 	unsigned short nr_elem = q->q_depth;
 	unsigned short buf_idx = ublk_alloc_commit_buf(t);
@@ -354,6 +388,22 @@ int ublk_batch_queue_prep_io_cmds(struct ublk_thread *t, struct ublk_queue *q)
 	return 0;
 }
 
+int ublk_batch_queue_prep_io_cmds(struct ublk_thread *t, struct ublk_queue *q)
+{
+	int ret = 0;
+
+	pthread_spin_lock(&q->lock);
+	if (q->flags & UBLKS_Q_PREPARED)
+		goto unlock;
+	ret = __ublk_batch_queue_prep_io_cmds(t, q);
+	if (!ret)
+		q->flags |= UBLKS_Q_PREPARED;
+unlock:
+	pthread_spin_unlock(&q->lock);
+
+	return ret;
+}
+
 static void ublk_batch_compl_commit_cmd(struct ublk_thread *t,
 					const struct io_uring_cqe *cqe,
 					unsigned op)
@@ -398,59 +448,89 @@ void ublk_batch_compl_cmd(struct ublk_thread *t,
 	}
 }
 
-void ublk_batch_commit_io_cmds(struct ublk_thread *t)
+static void __ublk_batch_commit_io_cmds(struct ublk_thread *t,
+					struct batch_commit_buf *cb)
 {
 	struct io_uring_sqe *sqe;
 	unsigned short buf_idx;
-	unsigned short nr_elem = t->commit.done;
+	unsigned short nr_elem = cb->done;
 
 	/* nothing to commit */
 	if (!nr_elem) {
-		ublk_free_commit_buf(t, t->commit.buf_idx);
+		ublk_free_commit_buf(t, cb->buf_idx);
 		return;
 	}
 
 	ublk_io_alloc_sqes(t, &sqe, 1);
-	buf_idx = t->commit.buf_idx;
-	sqe->addr = (__u64)t->commit.elem;
+	buf_idx = cb->buf_idx;
+	sqe->addr = (__u64)cb->elem;
 	sqe->len = nr_elem * t->commit_buf_elem_size;
 
 	/* commit isn't per-queue command */
-	ublk_init_batch_cmd(t, t->commit.q_id, sqe, UBLK_U_IO_COMMIT_IO_CMDS,
+	ublk_init_batch_cmd(t, cb->q_id, sqe, UBLK_U_IO_COMMIT_IO_CMDS,
			    t->commit_buf_elem_size, nr_elem, buf_idx);
 	ublk_setup_commit_sqe(t, sqe, buf_idx);
 }
 
-static void ublk_batch_init_commit(struct ublk_thread *t,
-				   unsigned short buf_idx)
+void ublk_batch_commit_io_cmds(struct ublk_thread *t)
+{
+	int i;
+
+	for (i = 0; i < t->nr_queues; i++) {
+		struct batch_commit_buf *cb = &t->commit[i];
+
+		if (cb->buf_idx != UBLKS_T_COMMIT_BUF_INV_IDX)
+			__ublk_batch_commit_io_cmds(t, cb);
+	}
+
+}
+
+static void __ublk_batch_init_commit(struct ublk_thread *t,
+				     struct batch_commit_buf *cb,
+				     unsigned short buf_idx)
 {
 	/* so far only support 1:1 queue/thread mapping */
-	t->commit.q_id = t->idx;
-	t->commit.buf_idx = buf_idx;
-	t->commit.elem = ublk_get_commit_buf(t, buf_idx);
-	t->commit.done = 0;
-	t->commit.count = t->commit_buf_size /
+	cb->buf_idx = buf_idx;
+	cb->elem = ublk_get_commit_buf(t, buf_idx);
+	cb->done = 0;
+	cb->count = t->commit_buf_size /
		t->commit_buf_elem_size;
 }
 
-void ublk_batch_prep_commit(struct ublk_thread *t)
+/* COMMIT_IO_CMDS is per-queue command, so use its own commit buffer */
+static void ublk_batch_init_commit(struct ublk_thread *t,
+				   struct batch_commit_buf *cb)
 {
 	unsigned short buf_idx = ublk_alloc_commit_buf(t);
 
 	ublk_assert(buf_idx != UBLKS_T_COMMIT_BUF_INV_IDX);
-	ublk_batch_init_commit(t, buf_idx);
+	ublk_assert(!ublk_batch_commit_prepared(cb));
+
+	__ublk_batch_init_commit(t, cb, buf_idx);
+}
+
+void ublk_batch_prep_commit(struct ublk_thread *t)
+{
+	int i;
+
+	for (i = 0; i < t->nr_queues; i++)
+		t->commit[i].buf_idx = UBLKS_T_COMMIT_BUF_INV_IDX;
 }
 
 void ublk_batch_complete_io(struct ublk_thread *t, struct ublk_queue *q,
			    unsigned tag, int res)
 {
-	struct batch_commit_buf *cb = &t->commit;
-	struct ublk_batch_elem *elem = (struct ublk_batch_elem *)(cb->elem +
-			cb->done * t->commit_buf_elem_size);
+	unsigned q_t_idx = ublk_queue_idx_in_thread(t, q);
+	struct batch_commit_buf *cb = &t->commit[q_t_idx];
+	struct ublk_batch_elem *elem;
 	struct ublk_io *io = &q->ios[tag];
 
-	ublk_assert(q->q_id == t->commit.q_id);
+	if (!ublk_batch_commit_prepared(cb))
+		ublk_batch_init_commit(t, cb);
+
+	ublk_assert(q->q_id == cb->q_id);
 
+	elem = (struct ublk_batch_elem *)(cb->elem + cb->done * t->commit_buf_elem_size);
 	elem->tag = tag;
 	elem->buf_index = ublk_batch_io_buf_idx(t, q, tag);
 	elem->result = res;
@@ -461,3 +541,64 @@ void ublk_batch_complete_io(struct ublk_thread *t, struct ublk_queue *q,
 	cb->done += 1;
 	ublk_assert(cb->done <= cb->count);
 }
+
+void ublk_batch_setup_map(unsigned char (*q_thread_map)[UBLK_MAX_QUEUES],
+			  int nthreads, int queues)
+{
+	int i, j;
+
+	/*
+	 * Setup round-robin queue-to-thread mapping for arbitrary N:M combinations.
+	 *
+	 * This algorithm distributes queues across threads (and threads across queues)
+	 * in a balanced round-robin fashion to ensure even load distribution.
+	 *
+	 * Examples:
+	 * - 2 threads, 4 queues: T0=[Q0,Q2], T1=[Q1,Q3]
+	 * - 4 threads, 2 queues: T0=[Q0], T1=[Q1], T2=[Q0], T3=[Q1]
+	 * - 3 threads, 3 queues: T0=[Q0], T1=[Q1], T2=[Q2] (1:1 mapping)
+	 *
+	 * Phase 1: Mark which queues each thread handles (boolean mapping)
+	 */
+	for (i = 0, j = 0; i < queues || j < nthreads; i++, j++) {
+		q_thread_map[j % nthreads][i % queues] = 1;
+	}
+
+	/*
+	 * Phase 2: Convert boolean mapping to sequential indices within each thread.
+	 *
+	 * Transform from: q_thread_map[thread][queue] = 1 (handles queue)
+	 * To:             q_thread_map[thread][queue] = N (queue index within thread)
+	 *
+	 * This allows each thread to know the local index of each queue it handles,
+	 * which is essential for buffer allocation and management. For example:
+	 * - Thread 0 handling queues [0,2] becomes: q_thread_map[0][0]=1, q_thread_map[0][2]=2
+	 * - Thread 1 handling queues [1,3] becomes: q_thread_map[1][1]=1, q_thread_map[1][3]=2
+	 */
+	for (j = 0; j < nthreads; j++) {
+		unsigned char seq = 1;
+
+		for (i = 0; i < queues; i++) {
+			if (q_thread_map[j][i])
+				q_thread_map[j][i] = seq++;
+		}
+	}
+
+#if 0
+	for (j = 0; j < nthreads; j++) {
+		printf("thread %0d: ", j);
+		for (i = 0; i < queues; i++) {
+			if (q_thread_map[j][i])
+				printf("%03u ", i);
+		}
+		printf("\n");
+	}
+	printf("\n");
+	for (j = 0; j < nthreads; j++) {
+		for (i = 0; i < queues; i++) {
+			printf("%03u ", q_thread_map[j][i]);
+		}
+		printf("\n");
+	}
+#endif
+}
diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
index 4c45482a847c..f88a12b5a368 100644
--- a/tools/testing/selftests/ublk/kublk.c
+++ b/tools/testing/selftests/ublk/kublk.c
@@ -442,6 +442,7 @@ static int ublk_queue_init(struct ublk_queue *q, unsigned long long extra_flags)
 	int cmd_buf_size, io_buf_size;
 	unsigned long off;
 
+	pthread_spin_init(&q->lock, PTHREAD_PROCESS_PRIVATE);
 	q->tgt_ops = dev->tgt.ops;
 	q->flags = 0;
 	q->q_depth = depth;
@@ -495,7 +496,7 @@ static int ublk_thread_init(struct ublk_thread *t, unsigned long long extra_flag
 
 	/* FETCH_IO_CMDS is multishot, so increase cq depth for BATCH_IO */
 	if (ublk_dev_batch_io(dev))
-		cq_depth += dev->dev_info.queue_depth;
+		cq_depth += dev->dev_info.queue_depth * 2;
 
 	ret = ublk_setup_ring(&t->ring, ring_depth, cq_depth,
			IORING_SETUP_COOP_TASKRUN |
@@ -878,6 +879,7 @@ struct ublk_thread_info {
 	sem_t *ready;
 	cpu_set_t *affinity;
 	unsigned long long extra_flags;
+	unsigned char (*q_thread_map)[UBLK_MAX_QUEUES];
 };
 
 static void ublk_thread_set_sched_affinity(const struct ublk_thread_info *info)
@@ -891,14 +893,18 @@ static void ublk_batch_setup_queues(struct ublk_thread *t)
 {
 	int i;
 
-	/* setup all queues in the 1st thread */
 	for (i = 0; i < t->dev->dev_info.nr_hw_queues; i++) {
 		struct ublk_queue *q = &t->dev->q[i];
 		int ret;
 
+		/*
+		 * Only prepare io commands in the mapped thread context,
+		 * otherwise io command buffer index may not work as expected
+		 */
+		if (t->q_map[i] == 0)
+			continue;
+
 		ret = ublk_batch_queue_prep_io_cmds(t, q);
-		ublk_assert(ret == 0);
-
 		ret = ublk_process_io(t);
 		ublk_assert(ret >= 0);
 	}
 }
@@ -912,6 +918,10 @@ static __attribute__((noinline)) int __ublk_io_handler_fn(struct ublk_thread_inf
 	int dev_id = info->dev->dev_info.dev_id;
 	int ret;
 
+	/* Copy per-thread queue mapping into thread-local variable */
+	if (info->q_thread_map)
+		memcpy(t.q_map, info->q_thread_map[info->idx], sizeof(t.q_map));
+
 	ret = ublk_thread_init(&t, info->extra_flags);
 	if (ret) {
 		ublk_err("ublk dev %d thread %u init failed\n",
@@ -927,12 +937,8 @@ static __attribute__((noinline)) int __ublk_io_handler_fn(struct ublk_thread_inf
 		/* submit all io commands to ublk driver */
 		ublk_submit_fetch_commands(&t);
 	} else {
-		struct ublk_queue *q = &t.dev->q[t.idx];
-
-		/* prepare all io commands in the 1st thread context */
-		if (!t.idx)
-			ublk_batch_setup_queues(&t);
-		ublk_batch_start_fetch(&t, q);
+		ublk_batch_setup_queues(&t);
+		ublk_batch_start_fetch(&t);
 	}
 
 	do {
@@ -1006,6 +1012,7 @@ static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev)
 	struct ublk_thread_info *tinfo;
 	unsigned long long extra_flags = 0;
 	cpu_set_t *affinity_buf;
+	unsigned char (*q_thread_map)[UBLK_MAX_QUEUES] = NULL;
 	void *thread_ret;
 	sem_t ready;
 	int ret, i;
@@ -1025,6 +1032,16 @@ static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev)
 	if (ret)
 		return ret;
 
+	if (ublk_dev_batch_io(dev)) {
+		q_thread_map = calloc(dev->nthreads, sizeof(*q_thread_map));
+		if (!q_thread_map) {
+			ret = -ENOMEM;
+			goto fail;
+		}
+		ublk_batch_setup_map(q_thread_map, dev->nthreads,
+				     dinfo->nr_hw_queues);
+	}
+
 	if (ctx->auto_zc_fallback)
 		extra_flags = UBLKS_Q_AUTO_BUF_REG_FALLBACK;
 	if (ctx->no_ublk_fixed_fd)
@@ -1047,6 +1064,7 @@ static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev)
 		tinfo[i].idx = i;
 		tinfo[i].ready = &ready;
 		tinfo[i].extra_flags = extra_flags;
+		tinfo[i].q_thread_map = q_thread_map;
 
 		/*
		 * If threads are not tied 1:1 to queues, setting thread
@@ -1066,6 +1084,7 @@ static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev)
 	for (i = 0; i < dev->nthreads; i++)
 		sem_wait(&ready);
 	free(affinity_buf);
+	free(q_thread_map);
 
 	/* everything is fine now, start us */
 	if (ctx->recovery)
@@ -1234,7 +1253,8 @@ static int __cmd_dev_add(const struct dev_ctx *ctx)
 		goto fail;
 	}
 
-	if (nthreads != nr_queues && !ctx->per_io_tasks) {
+	if (nthreads != nr_queues && (!ctx->per_io_tasks &&
+				!(ctx->flags & UBLK_F_BATCH_IO))) {
 		ublk_err("%s: threads %u must be same as queues %u if "
			 "not using per_io_tasks\n",
			 __func__, nthreads, nr_queues);
@@ -1758,6 +1778,13 @@ int main(int argc, char *argv[])
 		return -EINVAL;
 	}
 
+	if ((ctx.flags & UBLK_F_AUTO_BUF_REG) &&
+	    (ctx.flags & UBLK_F_BATCH_IO) &&
+	    (ctx.nthreads > ctx.nr_hw_queues)) {
+		ublk_err("too many threads for F_AUTO_BUF_REG & F_BATCH_IO\n");
+		return -EINVAL;
+	}
+
 	i = optind;
 	while (i < argc && ctx.nr_files < MAX_BACK_FILES) {
 		ctx.files[ctx.nr_files++] = argv[i++];
diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
index 222501048c24..565819cf2dfe 100644
--- a/tools/testing/selftests/ublk/kublk.h
+++ b/tools/testing/selftests/ublk/kublk.h
@@ -166,12 +166,16 @@ struct ublk_queue {
 	const struct ublk_tgt_ops *tgt_ops;
 	struct ublksrv_io_desc *io_cmd_buf;
 
-/* borrow one bit of ublk uapi flags, which may never be used */
+/* borrow three bits of ublk uapi flags, which may never be used */
 #define UBLKS_Q_AUTO_BUF_REG_FALLBACK	(1ULL << 63)
 #define UBLKS_Q_NO_UBLK_FIXED_FD	(1ULL << 62)
+#define UBLKS_Q_PREPARED		(1ULL << 61)
 	__u64 flags;
 	int ublk_fd;	/* cached ublk char device fd */
 	struct ublk_io ios[UBLK_QUEUE_DEPTH];
+
+	/* used for prep io commands */
+	pthread_spinlock_t lock;
 };
 
 /* align with `ublk_elem_header` */
@@ -198,8 +202,12 @@ struct batch_fetch_buf {
 };
 
 struct ublk_thread {
+	/* Thread-local copy of queue-to-thread mapping for this thread */
+	unsigned char q_map[UBLK_MAX_QUEUES];
+
 	struct ublk_dev *dev;
-	unsigned idx;
+	unsigned short idx;
+	unsigned short nr_queues;
 
 #define UBLKS_T_STOPPING	(1U << 0)
 #define UBLKS_T_IDLE	(1U << 1)
@@ -222,10 +230,10 @@ struct ublk_thread {
 	void *commit_buf;
 #define UBLKS_T_COMMIT_BUF_INV_IDX	((unsigned short)-1)
 	struct allocator commit_buf_alloc;
-	struct batch_commit_buf commit;
+	struct batch_commit_buf *commit;
 	/* FETCH_IO_CMDS buffer */
-#define UBLKS_T_NR_FETCH_BUF  2
-	struct batch_fetch_buf fetch[UBLKS_T_NR_FETCH_BUF];
+	unsigned short nr_fetch_bufs;
+	struct batch_fetch_buf *fetch;
 
 	struct io_uring ring;
 };
@@ -465,6 +473,21 @@ static inline int ublk_queue_no_buf(const struct ublk_queue *q)
 	return ublk_queue_use_zc(q) || ublk_queue_use_auto_zc(q);
 }
 
+static inline int ublk_batch_commit_prepared(struct batch_commit_buf *cb)
+{
+	return cb->buf_idx != UBLKS_T_COMMIT_BUF_INV_IDX;
+}
+
+static inline unsigned ublk_queue_idx_in_thread(const struct ublk_thread *t,
+						const struct ublk_queue *q)
+{
+	unsigned char idx;
+
+	idx = t->q_map[q->q_id];
+	ublk_assert(idx != 0);
+	return idx - 1;
+}
+
 /*
  * Each IO's buffer index has to be calculated by this helper for
  * UBLKS_T_BATCH_IO
@@ -473,14 +496,13 @@ static inline unsigned short ublk_batch_io_buf_idx(
		const struct ublk_thread *t, const struct ublk_queue *q,
		unsigned tag)
 {
-	return tag;
+	return ublk_queue_idx_in_thread(t, q) * q->q_depth + tag;
 }
 
 /* Queue UBLK_U_IO_PREP_IO_CMDS for a specific queue with batch elements */
 int ublk_batch_queue_prep_io_cmds(struct ublk_thread *t, struct ublk_queue *q);
 /* Start fetching I/O commands using multishot UBLK_U_IO_FETCH_IO_CMDS */
-void ublk_batch_start_fetch(struct ublk_thread *t,
-			    struct ublk_queue *q);
+void ublk_batch_start_fetch(struct ublk_thread *t);
 /* Handle completion of batch I/O commands (prep/commit) */
 void ublk_batch_compl_cmd(struct ublk_thread *t,
			  const struct io_uring_cqe *cqe);
@@ -498,6 +520,8 @@ void ublk_batch_commit_io_cmds(struct ublk_thread *t);
 /* Add a completed I/O operation to the current batch commit buffer */
 void ublk_batch_complete_io(struct ublk_thread *t, struct ublk_queue *q,
			    unsigned tag, int res);
+void ublk_batch_setup_map(unsigned char (*q_thread_map)[UBLK_MAX_QUEUES],
+			  int nthreads, int queues);
 
 static inline int ublk_complete_io(struct ublk_thread *t, struct ublk_queue *q,
				   unsigned tag, int res)
diff --git a/tools/testing/selftests/ublk/test_generic_15.sh b/tools/testing/selftests/ublk/test_generic_15.sh
new file mode 100755
index 000000000000..0afd037235cf
--- /dev/null
+++ b/tools/testing/selftests/ublk/test_generic_15.sh
@@ -0,0 +1,30 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+
+TID="generic_15"
+ERR_CODE=0
+
+if ! _have_feature "BATCH_IO"; then
+	exit "$UBLK_SKIP_CODE"
+fi
+
+if ! _have_program fio; then
+	exit "$UBLK_SKIP_CODE"
+fi
+
+_prep_test "generic" "test UBLK_F_BATCH_IO with 4_threads vs. 1_queues"
+
+_create_backfile 0 512M
+
+dev_id=$(_add_ublk_dev -t loop -q 1 --nthreads 4 -b "${UBLK_BACKFILES[0]}")
+_check_add_dev $TID $?
+
+# run fio over the ublk disk
+fio --name=job1 --filename=/dev/ublkb"${dev_id}" --ioengine=libaio --rw=readwrite \
+	--iodepth=32 --size=100M --numjobs=4 > /dev/null 2>&1
+ERR_CODE=$?
+
+_cleanup_test "generic"
+_show_result $TID $ERR_CODE
diff --git a/tools/testing/selftests/ublk/test_generic_16.sh b/tools/testing/selftests/ublk/test_generic_16.sh
new file mode 100755
index 000000000000..32bcf4a3d0b4
--- /dev/null
+++ b/tools/testing/selftests/ublk/test_generic_16.sh
@@ -0,0 +1,30 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+
+TID="generic_16"
+ERR_CODE=0
+
+if ! _have_feature "BATCH_IO"; then
+	exit "$UBLK_SKIP_CODE"
+fi
+
+if ! _have_program fio; then
+	exit "$UBLK_SKIP_CODE"
+fi
+
+_prep_test "generic" "test UBLK_F_BATCH_IO with 1_threads vs. 4_queues"
+
+_create_backfile 0 512M
+
+dev_id=$(_add_ublk_dev -t loop -q 4 --nthreads 1 -b "${UBLK_BACKFILES[0]}")
+_check_add_dev $TID $?
+
+# run fio over the ublk disk
+fio --name=job1 --filename=/dev/ublkb"${dev_id}" --ioengine=libaio --rw=readwrite \
+	--iodepth=32 --size=100M --numjobs=4 > /dev/null 2>&1
+ERR_CODE=$?
+
+_cleanup_test "generic"
+_show_result $TID $ERR_CODE
-- 
2.47.0