[PATCH] ceph: bound num_split_inos and num_split_realms in ceph_handle_snap()

Michael Bommarito posted 1 patch 5 days, 13 hours ago
fs/ceph/snap.c | 31 +++++++++++++++++++++++++------
1 file changed, 25 insertions(+), 6 deletions(-)
[PATCH] ceph: bound num_split_inos and num_split_realms in ceph_handle_snap()
Posted by Michael Bommarito 5 days, 13 hours ago
A peer that can deliver a CEPH_MSG_CLIENT_SNAP to the kernel CephFS
client (a compromised or malicious MDS, or an attacker who has
forged/replayed a cephx session on the cluster network) can cause an
out-of-bounds slab read in ceph_update_snap_trace() by sending
num_split_inos or num_split_realms as a small negative __le32.
ceph_handle_snap() parses both counts into signed int and then
advances the decode pointer with `p += sizeof(u64) * num_split_inos`;
the multiplication is in size_t, so the signed operand is widened
modulo 2**64 and a wire value like -32 produces an attacker-chosen
byte offset that walks p backwards into the slab. The subsequent
ceph_decode_need(&p, e, sizeof(*ri), bad) passes (end - p is huge),
ri = p, and the next 4-byte read inside ceph_update_snap_trace() is
performed from attacker-positioned memory. The same arithmetic and
the same pointer hand-off exist in the non-split branch.

Promote num_split_inos and num_split_realms to u32 to match the
on-wire __le32 fields, compute each array's byte length with
array_size() so a size_t overflow saturates to SIZE_MAX instead of
wrapping, sum the two lengths with check_add_overflow(), and verify
the total against the remaining front-buffer length before any
pointer bump. Re-use the validated byte counts for the bumps in both
the split and non-split branches.

The MDS is an authenticated peer under cephx, but the kernel client
is still expected to validate metadata it accepts over the wire;
this hardens the input-validation boundary that snap-message decode
crosses.

Fixes: 963b61eb041e8 ("ceph: snapshot management")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
---
Reproduced on x86_64 QEMU/KVM, KASAN_INLINE generic, two ways:

  - In-tree harness that allocates an upstream struct ceph_msg via
    ceph_msg_new(), writes num_split_inos = (u32)-32,
    num_split_realms = 0, op = CEPH_SNAP_OP_UPDATE into the front
    buffer, and calls ceph_handle_snap(&mdsc, &session, msg)
    directly.

  - End-to-end over a real TCP connection from a real ceph-mds
    daemon to the kernel CephFS client, with num_split_inos
    rewritten to (u32)-32 in the front buffer and the messenger
    v1 footer.front_crc recomputed so the kernel libceph receive
    path accepts the message. The KASAN report fires from the
    tcp_recvmsg softirq path through ceph_handle_snap+0x345 into
    ceph_update_snap_trace+0x23bf, confirming the bug is reached
    via the normal MDS->client receive path and not only by
    direct harness invocation.

A stock v7.1-rc3 kernel produces:

  BUG: KASAN: slab-out-of-bounds in ceph_update_snap_trace+0x23bf/0x31a0
  Read of size 4 at addr ffff8880012be1f8 by task init/1
    ceph_update_snap_trace+0x23bf/0x31a0
    ? ceph_handle_snap+0x312/0x900
    ceph_handle_snap+0x345/0x900
  The buggy address is located 248 bytes to the right of
   allocated 256-byte region [ffff8880012be000, ffff8880012be100)

With this patch applied, the same trigger (both via the harness
and via the wire path) hits the new validator's goto bad path,
logs "corrupt snap message from mds0", calls ceph_msg_dump(), and
returns cleanly with no KASAN report. Harness and wire-injection
scripts available on request.

The kernel ships no fs/ceph selftests and no ceph KUnit module that
exercises ceph_handle_snap, so no in-tree selftest delta to report.

fs/ceph/snap.c | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index 52b4c2684f922..7c4487eb2708a 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -1027,9 +1027,10 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
 	void *p = msg->front.iov_base;
 	void *e = p + msg->front.iov_len;
 	struct ceph_mds_snap_head *h;
-	int num_split_inos, num_split_realms;
+	u32 num_split_inos, num_split_realms;
 	__le64 *split_inos = NULL, *split_realms = NULL;
-	int i;
+	size_t split_inos_bytes, split_realms_bytes, split_bytes;
+	u32 i;
 	int locked_rwsem = 0;
 	bool close_sessions = false;

@@ -1048,6 +1049,24 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
 	trace_len = le32_to_cpu(h->trace_len);
 	p += sizeof(*h);

+	/*
+	 * Validate that the two MDS-supplied counts cannot wrap when
+	 * multiplied by sizeof(u64), and that the two arrays together
+	 * fit in the remaining front buffer before any of the pointer
+	 * bumps below.  Without this, a malformed (or malicious) snap
+	 * message can cause 'p += sizeof(u64) * num_split_inos' to land
+	 * at an attacker-chosen offset via the size_t * int widening,
+	 * bypassing ceph_decode_need() and making the subsequent
+	 * 'ri = p; ri->created' read out of bounds.
+	 */
+	split_inos_bytes   = array_size(num_split_inos,   sizeof(u64));
+	split_realms_bytes = array_size(num_split_realms, sizeof(u64));
+	if (split_inos_bytes == SIZE_MAX || split_realms_bytes == SIZE_MAX ||
+	    check_add_overflow(split_inos_bytes, split_realms_bytes,
+			       &split_bytes) ||
+	    (size_t)(e - p) < split_bytes)
+		goto bad;
+
 	doutc(cl, "from mds%d op %s split %llx tracelen %d\n", mds,
 	      ceph_snap_op_name(op), split, trace_len);

@@ -1064,9 +1083,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
 		 * child.
 		 */
 		split_inos = p;
-		p += sizeof(u64) * num_split_inos;
+		p += split_inos_bytes;
 		split_realms = p;
-		p += sizeof(u64) * num_split_realms;
+		p += split_realms_bytes;
 		ceph_decode_need(&p, e, sizeof(*ri), bad);
 		/* we will peek at realm info here, but will _not_
 		 * advance p, as the realm update will occur below in
@@ -1144,8 +1163,8 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
 		 * positioned at the start of realm info, as expected by
 		 * ceph_update_snap_trace().
 		 */
-		p += sizeof(u64) * num_split_inos;
-		p += sizeof(u64) * num_split_realms;
+		p += split_inos_bytes;
+		p += split_realms_bytes;
 	}

 	/*
--
2.53.0
Re: [PATCH] ceph: bound num_split_inos and num_split_realms in ceph_handle_snap()
Posted by Viacheslav Dubeyko 5 days, 5 hours ago
On Tue, 2026-05-19 at 07:30 -0400, Michael Bommarito wrote:
> A peer that can deliver a CEPH_MSG_CLIENT_SNAP to the kernel CephFS
> client (a compromised or malicious MDS, or an attacker who has
> forged/replayed a cephx session on the cluster network) can cause an
> out-of-bounds slab read in ceph_update_snap_trace() by sending
> num_split_inos or num_split_realms as a small negative __le32.
> ceph_handle_snap() parses both counts into signed int and then
> advances the decode pointer with `p += sizeof(u64) * num_split_inos`;
> the multiplication is in size_t, so the signed operand is widened
> modulo 2**64 and a wire value like -32 produces an attacker-chosen
> byte offset that walks p backwards into the slab. The subsequent
> ceph_decode_need(&p, e, sizeof(*ri), bad) passes (end - p is huge),
> ri = p, and the next 4-byte read inside ceph_update_snap_trace() is
> performed from attacker-positioned memory. The same arithmetic and
> the same pointer hand-off exist in the non-split branch.
> 
> Promote num_split_inos and num_split_realms to u32 to match the
> on-wire __le32 fields, compute each array's byte length with
> array_size() so a size_t overflow saturates to SIZE_MAX instead of
> wrapping, sum the two lengths with check_add_overflow(), and verify
> the total against the remaining front-buffer length before any
> pointer bump. Re-use the validated byte counts for the bumps in both
> the split and non-split branches.
> 
> The MDS is an authenticated peer under cephx, but the kernel client
> is still expected to validate metadata it accepts over the wire;
> this hardens the input-validation boundary that snap-message decode
> crosses.
> 
> Fixes: 963b61eb041e8 ("ceph: snapshot management")
> Cc: stable@vger.kernel.org
> Assisted-by: Claude:claude-opus-4-7

I am not completely sure that it's good to mention it. Do we have final policy
accepted?

> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> ---
> Reproduced on x86_64 QEMU/KVM, KASAN_INLINE generic, two ways:
> 
>   - In-tree harness that allocates an upstream struct ceph_msg via
>     ceph_msg_new(), writes num_split_inos = (u32)-32,
>     num_split_realms = 0, op = CEPH_SNAP_OP_UPDATE into the front
>     buffer, and calls ceph_handle_snap(&mdsc, &session, msg)
>     directly.
> 
>   - End-to-end over a real TCP connection from a real ceph-mds
>     daemon to the kernel CephFS client, with num_split_inos
>     rewritten to (u32)-32 in the front buffer and the messenger
>     v1 footer.front_crc recomputed so the kernel libceph receive
>     path accepts the message. The KASAN report fires from the
>     tcp_recvmsg softirq path through ceph_handle_snap+0x345 into
>     ceph_update_snap_trace+0x23bf, confirming the bug is reached
>     via the normal MDS->client receive path and not only by
>     direct harness invocation.
> 
> A stock v7.1-rc3 kernel produces:
> 
>   BUG: KASAN: slab-out-of-bounds in ceph_update_snap_trace+0x23bf/0x31a0
>   Read of size 4 at addr ffff8880012be1f8 by task init/1
>     ceph_update_snap_trace+0x23bf/0x31a0
>     ? ceph_handle_snap+0x312/0x900
>     ceph_handle_snap+0x345/0x900
>   The buggy address is located 248 bytes to the right of
>    allocated 256-byte region [ffff8880012be000, ffff8880012be100)
> 
> With this patch applied, the same trigger (both via the harness
> and via the wire path) hits the new validator's goto bad path,
> logs "corrupt snap message from mds0", calls ceph_msg_dump(), and
> returns cleanly with no KASAN report. Harness and wire-injection
> scripts available on request.
> 
> The kernel ships no fs/ceph selftests and no ceph KUnit module that
> exercises ceph_handle_snap, so no in-tree selftest delta to report.

We introduced the self-tests recently. And you are welcomed to add KUnit based
unit-tests.

> 
> fs/ceph/snap.c | 31 +++++++++++++++++++++++++------
>  1 file changed, 25 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> index 52b4c2684f922..7c4487eb2708a 100644
> --- a/fs/ceph/snap.c
> +++ b/fs/ceph/snap.c
> @@ -1027,9 +1027,10 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>  	void *p = msg->front.iov_base;
>  	void *e = p + msg->front.iov_len;
>  	struct ceph_mds_snap_head *h;
> -	int num_split_inos, num_split_realms;
> +	u32 num_split_inos, num_split_realms;
>  	__le64 *split_inos = NULL, *split_realms = NULL;
> -	int i;
> +	size_t split_inos_bytes, split_realms_bytes, split_bytes;
> +	u32 i;

I don't where patch uses i variable. What is the point of this change?

>  	int locked_rwsem = 0;
>  	bool close_sessions = false;
> 
> @@ -1048,6 +1049,24 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>  	trace_len = le32_to_cpu(h->trace_len);
>  	p += sizeof(*h);
> 
> +	/*
> +	 * Validate that the two MDS-supplied counts cannot wrap when
> +	 * multiplied by sizeof(u64), and that the two arrays together
> +	 * fit in the remaining front buffer before any of the pointer
> +	 * bumps below.  Without this, a malformed (or malicious) snap
> +	 * message can cause 'p += sizeof(u64) * num_split_inos' to land
> +	 * at an attacker-chosen offset via the size_t * int widening,
> +	 * bypassing ceph_decode_need() and making the subsequent
> +	 * 'ri = p; ri->created' read out of bounds.
> +	 */

I am not sure that we really need Claude AI generated comment here. Maybe, some
short comment is written by human being makes sense. But, currently, I prefer
completely remove this comment.

> +	split_inos_bytes   = array_size(num_split_inos,   sizeof(u64));

What is the point of this alignment? Claude AI like this? Please, double check
the generated code and follow to Linux kernel style. Have you run the
checkpatch.pl script for the patch?

> +	split_realms_bytes = array_size(num_split_realms, sizeof(u64));
> +	if (split_inos_bytes == SIZE_MAX || split_realms_bytes == SIZE_MAX ||

Could it be possible that split_inos_bytes or split_realms_bytes are lesser than
SIZE_MAX but we still could have overflow?

> +	    check_add_overflow(split_inos_bytes, split_realms_bytes,
> +			       &split_bytes) ||

All this check looks like a good candidate for static inline function.

> +	    (size_t)(e - p) < split_bytes)

The whole check looks complicated and confusing. It's really easy to miss
something in the logic. I believe that this code requires some refactoring. I am
not very like the pattern of calculating the split_bytes in the previous
condition check.

What about this?

  split_bytes = size_add(split_inos_bytes, split_realms_bytes);
  if (split_bytes == SIZE_MAX || (size_t)(e - p) < split_bytes)
      goto bad;

Thanks,
Slava.

> +		goto bad;
> +
>  	doutc(cl, "from mds%d op %s split %llx tracelen %d\n", mds,
>  	      ceph_snap_op_name(op), split, trace_len);
> 
> @@ -1064,9 +1083,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>  		 * child.
>  		 */
>  		split_inos = p;
> -		p += sizeof(u64) * num_split_inos;
> +		p += split_inos_bytes;
>  		split_realms = p;
> -		p += sizeof(u64) * num_split_realms;
> +		p += split_realms_bytes;
>  		ceph_decode_need(&p, e, sizeof(*ri), bad);
>  		/* we will peek at realm info here, but will _not_
>  		 * advance p, as the realm update will occur below in
> @@ -1144,8 +1163,8 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>  		 * positioned at the start of realm info, as expected by
>  		 * ceph_update_snap_trace().
>  		 */
> -		p += sizeof(u64) * num_split_inos;
> -		p += sizeof(u64) * num_split_realms;
> +		p += split_inos_bytes;
> +		p += split_realms_bytes;
>  	}
> 
>  	/*
> --
> 2.53.0
> 
Re: [PATCH] ceph: bound num_split_inos and num_split_realms in ceph_handle_snap()
Posted by Michael Bommarito 4 days ago
On Tue, May 19, 2026 at 3:47 PM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
> I am not completely sure that it's good to mention it. Do we have final policy
> accepted?

Yup, it's now required to disclose:
https://docs.kernel.org/process/coding-assistants.html

> We introduced the self-tests recently. And you are welcomed to add KUnit based
> unit-tests.

My bad.  I can add one with a patch set in v2.

> I don't where patch uses i variable. What is the point of this change?
> ...
> I am not sure that we really need Claude AI generated comment here. Maybe, some
> short comment is written by human being makes sense. But, currently, I prefer
> completely remove this comment.

Sure, agree.

> > +     split_inos_bytes   = array_size(num_split_inos,   sizeof(u64));
>
> What is the point of this alignment? Claude AI like this? Please, double check
> the generated code and follow to Linux kernel style. Have you run the
> checkpatch.pl script for the patch?
>

That's all my whitespace, not Claude :)

And checkpatch.pl does ret 0 with no warnings/errors.  But you're
right that it's not house style, I'll fix it.

>
> Could it be possible that split_inos_bytes or split_realms_bytes are lesser than
> SIZE_MAX but we still could have overflow?
>
> > +         check_add_overflow(split_inos_bytes, split_realms_bytes,
> > +                            &split_bytes) ||
>
> All this check looks like a good candidate for static inline function.
>
> > +         (size_t)(e - p) < split_bytes)
>
> The whole check looks complicated and confusing. It's really easy to miss
> something in the logic. I believe that this code requires some refactoring. I am
> not very like the pattern of calculating the split_bytes in the previous
> condition check.
>
> What about this?
>
>   split_bytes = size_add(split_inos_bytes, split_realms_bytes);
>   if (split_bytes == SIZE_MAX || (size_t)(e - p) < split_bytes)
>       goto bad;

Yours does looks more intuitive.  I'll refactor those and inline the
check in v2 tomorrow morning.

Thanks,
Mike