libceph: Handle sparse-read replies lacking data length

[RFC PATCH] libceph: Handle sparse-read replies lacking data length

Posted by Sam Edwards 3 weeks, 4 days ago

When the OSD replies to a sparse-read request, but no extents matched
the read (because the object is empty, the read requested a region
backed by no extents, ...) it is expected to reply with two 32-bit
zeroes: one indicating that there are no extents, the other that the
total bytes read is zero.

In certain circumstances (e.g. on Ceph 19.2.3, when the requested object
is in an EC pool), the OSD sends back only one 32-bit zero. The
sparse-read state machine will end up reading something else (such as
the data CRC in the footer) and get stuck in a retry loop like:

  libceph:  [0] got 0 extents
  libceph: data len 142248331 != extent len 0
  libceph: osd0 (1)...:6801 socket error on read
  libceph: data len 142248331 != extent len 0
  libceph: osd0 (1)...:6801 socket error on read

This is probably a bug in the OSD, but even so, the kernel must handle
it to avoid misinterpreting replies and entering a retry loop.

Detect this condition when the extent count is zero by checking the
`payload_len` field of the op reply. If it is only big enough for the
extent count, conclude that the data length is omitted and skip to the
next op (which is what the state machine would have done immediately
upon reading and validating the data length, if it were present).

---

Hi list,

RFC: This patch is submitted for comment only. I've tested it for about
2 weeks now and am satisfied that it prevents the hang, but the current
approach decodes the entire op reply body while still in the
data-gathering step, which is suboptimal; feedback on cleaner
alternatives is welcome!

I have not searched for nor opened a report with Ceph proper; I'd like a
second pair of eyes to confirm that this is indeed an OSD bug before I
proceed with that.

Reproducer (Ceph 19.2.3, CephFS with an EC pool already created):
  mount -o sparseread ... /mnt/cephfs
  cd /mnt/cephfs
  mkdir ec/
  setfattr -n ceph.dir.layout.pool -v 'cephfs-data-ecpool' ec/
  echo 'Hello world' > ec/sparsely-packed
  truncate -s 1048576 ec/sparsely-packed
  # Read from a hole-backed region via sparse read
  dd if=ec/sparsely-packed bs=16 skip=10000 count=1 iflag=direct | xxd
  # The read hangs and triggers the retry loop described in the patch

Hope this works,
Sam

PS: I would also like to write a pair of patches to our messenger v1/v2
clients to check explicitly that sparse reads consume exactly the number
of bytes in the data section, as I see there have already been previous
bugs (including CVE-2023-52636) where the sparse-read machinery gets out
of sync with the incoming TCP stream. Has this already been proposed?
---
 net/ceph/osd_client.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 1a7be2f615dc..e9e898a2415f 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -5840,7 +5840,25 @@ static int osd_sparse_read(struct ceph_connection *con,
 			sr->sr_state = CEPH_SPARSE_READ_DATA_LEN;
 			break;
 		}
-		/* No extents? Read data len */
+
+		/*
+		 * No extents? Read data len (which we expect is 0) if present.
+		 *
+		 * Sometimes the OSD will omit this for zero-extent replies
+		 * (e.g. in Ceph 19.2.3 when the object is in an EC pool) which
+		 * is likely a bug in the OSD, but nonetheless we must handle
+		 * it to avoid misinterpreting the reply.
+		 */
+		struct MOSDOpReply m;
+		ret = decode_MOSDOpReply(con->in_msg, &m);
+		if (ret)
+			return ret;
+		if (m.outdata_len[o->o_sparse_op_idx] == sizeof(sr->sr_count)) {
+			dout("[%d] missing data length\n", o->o_osd);
+			sr->sr_state = CEPH_SPARSE_READ_HDR;
+			goto next_op;
+		}
+
 		fallthrough;
 	case CEPH_SPARSE_READ_DATA_LEN:
 		convert_extent_map(sr);
-- 
2.51.2
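
The zero-extent case described in the patch above can be sketched in
user space as follows. This is a minimal illustration, not the kernel
code: the enum and function names are hypothetical, and it assumes the
layout the commit message describes (a little-endian 32-bit extent
count, followed by a little-endian 32-bit data length when there are no
extents). The op's payload length is passed in by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch (not kernel identifiers). */
enum sparse_reply_kind {
	SPARSE_REPLY_HAS_EXTENTS,	/* extent count is non-zero; not validated further here */
	SPARSE_REPLY_EMPTY_OK,		/* count == 0 followed by data len == 0 */
	SPARSE_REPLY_EMPTY_NO_DATALEN,	/* count == 0 and nothing after it */
	SPARSE_REPLY_MALFORMED,		/* anything else */
};

/* Read a little-endian 32-bit value from a byte buffer. */
static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Classify one op's sparse-read payload using only its announced length. */
static enum sparse_reply_kind
classify_sparse_reply(const uint8_t *payload, size_t payload_len)
{
	uint32_t count;

	assert(payload != NULL);
	if (payload_len < sizeof(uint32_t))
		return SPARSE_REPLY_MALFORMED;

	count = get_le32(payload);
	if (count != 0)
		return SPARSE_REPLY_HAS_EXTENTS;

	/* Only room for the extent count: the data length was omitted. */
	if (payload_len == sizeof(uint32_t))
		return SPARSE_REPLY_EMPTY_NO_DATALEN;

	/* The expected reply: two 32-bit zeroes. */
	if (payload_len == 2 * sizeof(uint32_t) && get_le32(payload + 4) == 0)
		return SPARSE_REPLY_EMPTY_OK;

	return SPARSE_REPLY_MALFORMED;
}
```

An 8-byte all-zero payload classifies as SPARSE_REPLY_EMPTY_OK, while a
4-byte zero payload classifies as SPARSE_REPLY_EMPTY_NO_DATALEN, which
is the condition the patch detects before skipping to the next op.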

Re: [RFC PATCH] libceph: Handle sparse-read replies lacking data length

Posted by Ilya Dryomov 3 weeks, 3 days ago

On Tue, Jan 13, 2026 at 4:31 AM Sam Edwards <cfsworks@gmail.com> wrote:
>
> When the OSD replies to a sparse-read request, but no extents matched
> the read (because the object is empty, the read requested a region
> backed by no extents, ...) it is expected to reply with two 32-bit
> zeroes: one indicating that there are no extents, the other that the
> total bytes read is zero.
>
> In certain circumstances (e.g. on Ceph 19.2.3, when the requested object
> is in an EC pool), the OSD sends back only one 32-bit zero. The
> sparse-read state machine will end up reading something else (such as
> the data CRC in the footer) and get stuck in a retry loop like:
>
>   libceph:  [0] got 0 extents
>   libceph: data len 142248331 != extent len 0
>   libceph: osd0 (1)...:6801 socket error on read
>   libceph: data len 142248331 != extent len 0
>   libceph: osd0 (1)...:6801 socket error on read
>
> This is probably a bug in the OSD, but even so, the kernel must handle
> it to avoid misinterpreting replies and entering a retry loop.

Hi Sam,

Yes, this is definitely a bug in the OSD (and I also see another
related bug in the userspace client code above the OSD...).  The
triggering condition is a sparse read beyond the end of an existing
object on an EC pool.  19.2.3 isn't the problem -- main branch is
affected as well.

If this was one of the common paths, I'd support adding some sort of
a workaround to "handle" this in the kernel client.  However, sparse
reads are pretty useless on EC pools because they just get converted
into regular thick reads.  Sparse reads offer potential benefits only
on replicated pools, but the kernel client doesn't use them by default
there either.  The sparseread mount option that is necessary for the
reproducer to work isn't documented and was added purely for testing
purposes.

>
> Detect this condition when the extent count is zero by checking the
> `payload_len` field of the op reply. If it is only big enough for the
> extent count, conclude that the data length is omitted and skip to the
> next op (which is what the state machine would have done immediately
> upon reading and validating the data length, if it were present).
>
> ---
>
> Hi list,
>
> RFC: This patch is submitted for comment only. I've tested it for about
> 2 weeks now and am satisfied that it prevents the hang, but the current
> approach decodes the entire op reply body while still in the
> data-gathering step, which is suboptimal; feedback on cleaner
> alternatives is welcome!
>
> I have not searched for nor opened a report with Ceph proper; I'd like a
> second pair of eyes to confirm that this is indeed an OSD bug before I
> proceed with that.

Let me know if you want me to file a Ceph tracker ticket on your
behalf.  I have a draft patch for the bug in the OSD and would link it
in the PR, crediting you as a reporter.

>
> Reproducer (Ceph 19.2.3, CephFS with an EC pool already created):
>   mount -o sparseread ... /mnt/cephfs
>   cd /mnt/cephfs
>   mkdir ec/
>   setfattr -n ceph.dir.layout.pool -v 'cephfs-data-ecpool' ec/
>   echo 'Hello world' > ec/sparsely-packed
>   truncate -s 1048576 ec/sparsely-packed
>   # Read from a hole-backed region via sparse read
>   dd if=ec/sparsely-packed bs=16 skip=10000 count=1 iflag=direct | xxd
>   # The read hangs and triggers the retry loop described in the patch
>
> Hope this works,
> Sam
>
> PS: I would also like to write a pair of patches to our messenger v1/v2
> clients to check explicitly that sparse reads consume exactly the number
> of bytes in the data section, as I see there have already been previous
> bugs (including CVE-2023-52636) where the sparse-read machinery gets out
> of sync with the incoming TCP stream. Has this already been proposed?

Not that I'm aware of.  An additional safety net would be welcome as
long as it doesn't end up too invasive of course.

Thanks,

                Ilya

Re: [RFC PATCH] libceph: Handle sparse-read replies lacking data length

Posted by Sam Edwards 3 weeks, 3 days ago

On Tue, Jan 13, 2026 at 9:27 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>
> On Tue, Jan 13, 2026 at 4:31 AM Sam Edwards <cfsworks@gmail.com> wrote:
> >
> > When the OSD replies to a sparse-read request, but no extents matched
> > the read (because the object is empty, the read requested a region
> > backed by no extents, ...) it is expected to reply with two 32-bit
> > zeroes: one indicating that there are no extents, the other that the
> > total bytes read is zero.
> >
> > In certain circumstances (e.g. on Ceph 19.2.3, when the requested object
> > is in an EC pool), the OSD sends back only one 32-bit zero. The
> > sparse-read state machine will end up reading something else (such as
> > the data CRC in the footer) and get stuck in a retry loop like:
> >
> >   libceph:  [0] got 0 extents
> >   libceph: data len 142248331 != extent len 0
> >   libceph: osd0 (1)...:6801 socket error on read
> >   libceph: data len 142248331 != extent len 0
> >   libceph: osd0 (1)...:6801 socket error on read
> >
> > This is probably a bug in the OSD, but even so, the kernel must handle
> > it to avoid misinterpreting replies and entering a retry loop.
>
> Hi Sam,
>

Hey Ilya,

> Yes, this is definitely a bug in the OSD (and I also see another
> related bug in the userspace client code above the OSD...).  The
> triggering condition is a sparse read beyond the end of an existing
> object on an EC pool.  19.2.3 isn't the problem -- main branch is
> affected as well.
>
> If this was one of the common paths, I'd support adding some sort of
> a workaround to "handle" this in the kernel client.  However, sparse
> reads are pretty useless on EC pools because they just get converted
> into regular thick reads.  Sparse reads offer potential benefits only
> on replicated pools, but the kernel client doesn't use them by default
> there either.  The sparseread mount option that is necessary for the
> reproducer to work isn't documented and was added purely for testing
> purposes.

Note that the kernel client forces sparse reads when using fscrypt
(see linux-6.18/fs/ceph/addr.c:361) and I encountered this problem
organically as a result. It may still make sense to apply a kernel
workaround.

On the other hand, it sounds like fscrypt+EC is a niche corner case,
we've now established that the OSD is definitely not following the
protocol, and working around this client-side is more involved than
just fixing this in the OSD. So I think simply telling affected users
to update their OSDs is also a reasonable way to handle this.

I'll defer to you.

>
> >
> > Detect this condition when the extent count is zero by checking the
> > `payload_len` field of the op reply. If it is only big enough for the
> > extent count, conclude that the data length is omitted and skip to the
> > next op (which is what the state machine would have done immediately
> > upon reading and validating the data length, if it were present).
> >
> > ---
> >
> > Hi list,
> >
> > RFC: This patch is submitted for comment only. I've tested it for about
> > 2 weeks now and am satisfied that it prevents the hang, but the current
> > approach decodes the entire op reply body while still in the
> > data-gathering step, which is suboptimal; feedback on cleaner
> > alternatives is welcome!
> >
> > I have not searched for nor opened a report with Ceph proper; I'd like a
> > second pair of eyes to confirm that this is indeed an OSD bug before I
> > proceed with that.
>
> Let me know if you want me to file a Ceph tracker ticket on your
> behalf.  I have a draft patch for the bug in the OSD and would link it
> in the PR, crediting you as a reporter.

Please do! I'm also interested in seeing the patch -- the OSD code is
pretty dense and I couldn't find the EC sparse read handler.

>
> >
> > Reproducer (Ceph 19.2.3, CephFS with an EC pool already created):
> >   mount -o sparseread ... /mnt/cephfs
> >   cd /mnt/cephfs
> >   mkdir ec/
> >   setfattr -n ceph.dir.layout.pool -v 'cephfs-data-ecpool' ec/
> >   echo 'Hello world' > ec/sparsely-packed
> >   truncate -s 1048576 ec/sparsely-packed
> >   # Read from a hole-backed region via sparse read
> >   dd if=ec/sparsely-packed bs=16 skip=10000 count=1 iflag=direct | xxd
> >   # The read hangs and triggers the retry loop described in the patch
> >
> > Hope this works,
> > Sam
> >
> > PS: I would also like to write a pair of patches to our messenger v1/v2
> > clients to check explicitly that sparse reads consume exactly the number
> > of bytes in the data section, as I see there have already been previous
> > bugs (including CVE-2023-52636) where the sparse-read machinery gets out
> > of sync with the incoming TCP stream. Has this already been proposed?
>
> Not that I'm aware of.  An additional safety net would be welcome as
> long as it doesn't end up too invasive of course.

Time permitting, I'll see about fixing read_partial_message() to use
con->v1.in_base_pos consistently, use that to count data bytes
consumed in sparse reads, and fail with a more specific error_msg when
a length mismatch is detected. (I do not have a plan for messenger v2
yet.)

Regards,
Sam

>
> Thanks,
>
>                 Ilya
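
The byte-accounting safety net discussed above could look roughly like
the sketch below. It is only an illustration of the idea (count what
the sparse-read parser consumes and compare it against the length of
the data section); the struct and function names are hypothetical and
do not correspond to existing messenger code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one incoming message's data section. */
struct data_accounting {
	uint32_t data_len;	/* data section length announced for the message */
	uint32_t consumed;	/* bytes the sparse-read parser has taken so far */
};

/* Returns 0 if the parser may take `want` more bytes, -1 if it would overrun. */
static int account_consume(struct data_accounting *acct, uint32_t want)
{
	assert(acct->consumed <= acct->data_len);
	if (want > acct->data_len - acct->consumed)
		return -1;	/* parser is about to read past the data section */
	acct->consumed += want;
	return 0;
}

/* Returns 0 only if the parser used exactly the announced number of bytes. */
static int account_finish(const struct data_accounting *acct)
{
	return acct->consumed == acct->data_len ? 0 : -1;
}
```

With a 4-byte data section (the reply that lacks the data length), the
extent count is accounted for successfully, but the subsequent request
for the 4-byte data length is rejected instead of being satisfied from
whatever follows in the stream.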

Re: [RFC PATCH] libceph: Handle sparse-read replies lacking data length

Posted by Ilya Dryomov 3 weeks, 3 days ago

On Tue, Jan 13, 2026 at 8:04 PM Sam Edwards <cfsworks@gmail.com> wrote:
>
> On Tue, Jan 13, 2026 at 9:27 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> >
> > On Tue, Jan 13, 2026 at 4:31 AM Sam Edwards <cfsworks@gmail.com> wrote:
> > >
> > > When the OSD replies to a sparse-read request, but no extents matched
> > > the read (because the object is empty, the read requested a region
> > > backed by no extents, ...) it is expected to reply with two 32-bit
> > > zeroes: one indicating that there are no extents, the other that the
> > > total bytes read is zero.
> > >
> > > In certain circumstances (e.g. on Ceph 19.2.3, when the requested object
> > > is in an EC pool), the OSD sends back only one 32-bit zero. The
> > > sparse-read state machine will end up reading something else (such as
> > > the data CRC in the footer) and get stuck in a retry loop like:
> > >
> > >   libceph:  [0] got 0 extents
> > >   libceph: data len 142248331 != extent len 0
> > >   libceph: osd0 (1)...:6801 socket error on read
> > >   libceph: data len 142248331 != extent len 0
> > >   libceph: osd0 (1)...:6801 socket error on read
> > >
> > > This is probably a bug in the OSD, but even so, the kernel must handle
> > > it to avoid misinterpreting replies and entering a retry loop.
> >
> > Hi Sam,
> >
>
> Hey Ilya,
>
> > Yes, this is definitely a bug in the OSD (and I also see another
> > related bug in the userspace client code above the OSD...).  The
> > triggering condition is a sparse read beyond the end of an existing
> > object on an EC pool.  19.2.3 isn't the problem -- main branch is
> > affected as well.
> >
> > If this was one of the common paths, I'd support adding some sort of
> > a workaround to "handle" this in the kernel client.  However, sparse
> > reads are pretty useless on EC pools because they just get converted
> > into regular thick reads.  Sparse reads offer potential benefits only
> > on replicated pools, but the kernel client doesn't use them by default
> > there either.  The sparseread mount option that is necessary for the
> > reproducer to work isn't documented and was added purely for testing
> > purposes.
>
> Note that the kernel client forces sparse reads when using fscrypt
> (see linux-6.18/fs/ceph/addr.c:361) and I encountered this problem
> organically as a result. It may still make sense to apply a kernel
> workaround.
>
> On the other hand, it sounds like fscrypt+EC is a niche corner case,
> we've now established that the OSD is definitely not following the
> protocol, and working around this client-side is more involved than
> just fixing this in the OSD. So I think simply telling affected users
> to update their OSDs is also a reasonable way to handle this.

fscrypt and EC can't be mixed -- fscrypt+EC doesn't really work.  The
reason sparse reads are forced for fscrypt is that the client relies on
the sparseness metadata to be able to tell if a given 4K block in the
encrypted file is a hole (in the PUNCH_HOLE sense) or not.  If it's
a hole, POSIX dictates that a read should return zeroes.  On an EC pool
where sparse reads are degraded into regular thick reads by the OSD,
a hole in the middle of an object wouldn't ever be signaled.  Instead,
the OSD would synthesize a bunch of zeroes and pass them to the client.
The client would then run them through the crypto engine (believing
it's a bona fide ciphertext) and return the resulting gibberish to the
user, thus violating POSIX and widespread assumptions about generic
filesystem behavior.

>
> I'll defer to you.
>
> >
> > >
> > > Detect this condition when the extent count is zero by checking the
> > > `payload_len` field of the op reply. If it is only big enough for the
> > > extent count, conclude that the data length is omitted and skip to the
> > > next op (which is what the state machine would have done immediately
> > > upon reading and validating the data length, if it were present).
> > >
> > > ---
> > >
> > > Hi list,
> > >
> > > RFC: This patch is submitted for comment only. I've tested it for about
> > > 2 weeks now and am satisfied that it prevents the hang, but the current
> > > approach decodes the entire op reply body while still in the
> > > data-gathering step, which is suboptimal; feedback on cleaner
> > > alternatives is welcome!
> > >
> > > I have not searched for nor opened a report with Ceph proper; I'd like a
> > > second pair of eyes to confirm that this is indeed an OSD bug before I
> > > proceed with that.
> >
> > Let me know if you want me to file a Ceph tracker ticket on your
> > behalf.  I have a draft patch for the bug in the OSD and would link it
> > in the PR, crediting you as a reporter.
>
> Please do! I'm also interested in seeing the patch -- the OSD code is
> pretty dense and I couldn't find the EC sparse read handler.

https://github.com/ceph/ceph/pull/66912

Thanks,

                Ilya
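
The hole detection described above can be sketched as a check of each
4K block against the extent list from a sparse-read reply. This is a
simplified user-space illustration with hypothetical names, not the
CephFS code. It assumes a block with no overlapping extent is a hole
(to be returned as zeroes) and any other block is ciphertext (to be
decrypted).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CRYPT_BLOCK_SIZE 4096u	/* the 4K block size mentioned above */

/* One extent from a sparse-read reply: a byte range that holds real data. */
struct sparse_extent {
	uint64_t off;
	uint64_t len;
};

/* True if any extent overlaps the 4K block starting at `block_off`. */
static bool block_has_data(const struct sparse_extent *ext, size_t n,
			   uint64_t block_off)
{
	uint64_t block_end = block_off + CRYPT_BLOCK_SIZE;
	size_t i;

	assert(block_off % CRYPT_BLOCK_SIZE == 0);
	for (i = 0; i < n; i++) {
		uint64_t ext_end = ext[i].off + ext[i].len;

		if (ext[i].len != 0 && ext[i].off < block_end &&
		    ext_end > block_off)
			return true;	/* ciphertext: decrypt this block */
	}
	return false;			/* hole: return zeroes, skip decryption */
}
```

If a thick read is modelled as a single extent covering the whole
requested range (an assumption made for this sketch), every block is
classified as data, so synthesized zeroes would be sent through
decryption, which is the failure mode described above.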

Re: [RFC PATCH] libceph: Handle sparse-read replies lacking data length

Posted by Sam Edwards 3 weeks, 3 days ago

On Tue, Jan 13, 2026 at 12:15 PM Ilya Dryomov <idryomov@gmail.com> wrote:
>
> On Tue, Jan 13, 2026 at 8:04 PM Sam Edwards <cfsworks@gmail.com> wrote:
> >
> > On Tue, Jan 13, 2026 at 9:27 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> > >
> > > On Tue, Jan 13, 2026 at 4:31 AM Sam Edwards <cfsworks@gmail.com> wrote:
> > > >
> > > > When the OSD replies to a sparse-read request, but no extents matched
> > > > the read (because the object is empty, the read requested a region
> > > > backed by no extents, ...) it is expected to reply with two 32-bit
> > > > zeroes: one indicating that there are no extents, the other that the
> > > > total bytes read is zero.
> > > >
> > > > In certain circumstances (e.g. on Ceph 19.2.3, when the requested object
> > > > is in an EC pool), the OSD sends back only one 32-bit zero. The
> > > > sparse-read state machine will end up reading something else (such as
> > > > the data CRC in the footer) and get stuck in a retry loop like:
> > > >
> > > >   libceph:  [0] got 0 extents
> > > >   libceph: data len 142248331 != extent len 0
> > > >   libceph: osd0 (1)...:6801 socket error on read
> > > >   libceph: data len 142248331 != extent len 0
> > > >   libceph: osd0 (1)...:6801 socket error on read
> > > >
> > > > This is probably a bug in the OSD, but even so, the kernel must handle
> > > > it to avoid misinterpreting replies and entering a retry loop.
> > >
> > > Hi Sam,
> > >
> >
> > Hey Ilya,
> >
> > > Yes, this is definitely a bug in the OSD (and I also see another
> > > related bug in the userspace client code above the OSD...).  The
> > > triggering condition is a sparse read beyond the end of an existing
> > > object on an EC pool.  19.2.3 isn't the problem -- main branch is
> > > affected as well.
> > >
> > > If this was one of the common paths, I'd support adding some sort of
> > > a workaround to "handle" this in the kernel client.  However, sparse
> > > reads are pretty useless on EC pools because they just get converted
> > > into regular thick reads.  Sparse reads offer potential benefits only
> > > on replicated pools, but the kernel client doesn't use them by default
> > > there either.  The sparseread mount option that is necessary for the
> > > reproducer to work isn't documented and was added purely for testing
> > > purposes.
> >
> > Note that the kernel client forces sparse reads when using fscrypt
> > (see linux-6.18/fs/ceph/addr.c:361) and I encountered this problem
> > organically as a result. It may still make sense to apply a kernel
> > workaround.
> >
> > On the other hand, it sounds like fscrypt+EC is a niche corner case,
> > we've now established that the OSD is definitely not following the
> > protocol, and working around this client-side is more involved than
> > just fixing this in the OSD. So I think simply telling affected users
> > to update their OSDs is also a reasonable way to handle this.
>
> fscrypt and EC can't be mixed -- fscrypt+EC doesn't really work.  The
> reason sparse reads are forced for fscrypt is that the client relies on
> the sparseness metadata to be able to tell if a given 4K block in the
> encrypted file is a hole (in the PUNCH_HOLE sense) or not.  If it's
> a hole, POSIX dictates that a read should return zeroes.  On an EC pool
> where sparse reads are degraded into regular thick reads by the OSD,
> a hole in the middle of an object wouldn't ever be signaled.  Instead,
> the OSD would synthesize a bunch of zeroes and pass them to the client.
> The client would then run them through the crypto engine (believing
> it's a bona fide ciphertext) and return the resulting gibberish to the
> user, thus violating POSIX and widespread assumptions about generic
> filesystem behavior.

Oof, thanks for the heads-up! Fortunately my workload tolerates
garbage in holes... with the occasional (now-explained) warning, that
is. :)

I don't see the fscrypt+EC limitation mentioned in the kernel or Ceph
docs, so I'm guessing this is more a "known major limitation" than an
out-of-scope use case. The CephFS client already blocks PUNCH_HOLE for
encrypted inodes, but by writing into the middle of an empty object, I
was able to form a hole organically and reproduce the garbage you
describe.

EC is complex, so I wouldn't have been surprised if it simply didn't
have a way to store objects with holes at all. But I was caught off
guard to learn that the hard part of this problem is communicating the
hole to the client. My intuition was that the read path must already
be detecting "no data here" in order to synthesize filler zeroes, but
it sounds like that information doesn't survive as explicit metadata.
Clearly I have more to learn about the EC read pipeline.

Cheers,
Sam

>
> >
> > I'll defer to you.
> >
> > >
> > > >
> > > > Detect this condition when the extent count is zero by checking the
> > > > `payload_len` field of the op reply. If it is only big enough for the
> > > > extent count, conclude that the data length is omitted and skip to the
> > > > next op (which is what the state machine would have done immediately
> > > > upon reading and validating the data length, if it were present).
> > > >
> > > > ---
> > > >
> > > > Hi list,
> > > >
> > > > RFC: This patch is submitted for comment only. I've tested it for about
> > > > 2 weeks now and am satisfied that it prevents the hang, but the current
> > > > approach decodes the entire op reply body while still in the
> > > > data-gathering step, which is suboptimal; feedback on cleaner
> > > > alternatives is welcome!
> > > >
> > > > I have not searched for nor opened a report with Ceph proper; I'd like a
> > > > second pair of eyes to confirm that this is indeed an OSD bug before I
> > > > proceed with that.
> > >
> > > Let me know if you want me to file a Ceph tracker ticket on your
> > > behalf.  I have a draft patch for the bug in the OSD and would link it
> > > in the PR, crediting you as a reporter.
> >
> > Please do! I'm also interested in seeing the patch -- the OSD code is
> > pretty dense and I couldn't find the EC sparse read handler.
>
> https://github.com/ceph/ceph/pull/66912
>
> Thanks,
>
>                 Ilya

Re: [RFC PATCH] libceph: Handle sparse-read replies lacking data length

Posted by Ilya Dryomov 3 weeks, 3 days ago

On Wed, Jan 14, 2026 at 2:28 AM Sam Edwards <cfsworks@gmail.com> wrote:
>
> On Tue, Jan 13, 2026 at 12:15 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> >
> > On Tue, Jan 13, 2026 at 8:04 PM Sam Edwards <cfsworks@gmail.com> wrote:
> > >
> > > On Tue, Jan 13, 2026 at 9:27 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> > > >
> > > > On Tue, Jan 13, 2026 at 4:31 AM Sam Edwards <cfsworks@gmail.com> wrote:
> > > > >
> > > > > When the OSD replies to a sparse-read request, but no extents matched
> > > > > the read (because the object is empty, the read requested a region
> > > > > backed by no extents, ...) it is expected to reply with two 32-bit
> > > > > zeroes: one indicating that there are no extents, the other that the
> > > > > total bytes read is zero.
> > > > >
> > > > > In certain circumstances (e.g. on Ceph 19.2.3, when the requested object
> > > > > is in an EC pool), the OSD sends back only one 32-bit zero. The
> > > > > sparse-read state machine will end up reading something else (such as
> > > > > the data CRC in the footer) and get stuck in a retry loop like:
> > > > >
> > > > >   libceph:  [0] got 0 extents
> > > > >   libceph: data len 142248331 != extent len 0
> > > > >   libceph: osd0 (1)...:6801 socket error on read
> > > > >   libceph: data len 142248331 != extent len 0
> > > > >   libceph: osd0 (1)...:6801 socket error on read
> > > > >
> > > > > This is probably a bug in the OSD, but even so, the kernel must handle
> > > > > it to avoid misinterpreting replies and entering a retry loop.
> > > >
> > > > Hi Sam,
> > > >
> > >
> > > Hey Ilya,
> > >
> > > > Yes, this is definitely a bug in the OSD (and I also see another
> > > > related bug in the userspace client code above the OSD...).  The
> > > > triggering condition is a sparse read beyond the end of an existing
> > > > object on an EC pool.  19.2.3 isn't the problem -- main branch is
> > > > affected as well.
> > > >
> > > > If this was one of the common paths, I'd support adding some sort of
> > > > a workaround to "handle" this in the kernel client.  However, sparse
> > > > reads are pretty useless on EC pools because they just get converted
> > > > into regular thick reads.  Sparse reads offer potential benefits only
> > > > on replicated pools, but the kernel client doesn't use them by default
> > > > there either.  The sparseread mount option that is necessary for the
> > > > reproducer to work isn't documented and was added purely for testing
> > > > purposes.
> > >
> > > Note that the kernel client forces sparse reads when using fscrypt
> > > (see linux-6.18/fs/ceph/addr.c:361) and I encountered this problem
> > > organically as a result. It may still make sense to apply a kernel
> > > workaround.
> > >
> > > On the other hand, it sounds like fscrypt+EC is a niche corner case,
> > > we've now established that the OSD is definitely not following the
> > > protocol, and working around this client-side is more involved than
> > > just fixing this in the OSD. So I think simply telling affected users
> > > to update their OSDs is also a reasonable way to handle this.
> >
> > fscrypt and EC can't be mixed -- fscrypt+EC doesn't really work.  The
> > reason sparse reads are forced for fscrypt is that the client relies on
> > the sparseness metadata to be able to tell if a given 4K block in the
> > encrypted file is a hole (in the PUNCH_HOLE sense) or not.  If it's
> > a hole, POSIX dictates that a read should return zeroes.  On an EC pool
> > where sparse reads are degraded into regular thick reads by the OSD,
> > a hole in the middle of an object wouldn't ever be signaled.  Instead,
> > the OSD would synthesize a bunch of zeroes and pass them to the client.
> > The client would then run them through the crypto engine (believing
> > it's a bona fide ciphertext) and return the resulting gibberish to the
> > user, thus violating POSIX and widespread assumptions about generic
> > filesystem behavior.
>
> Oof, thanks for the heads-up! Fortunately my workload tolerates
> garbage in holes... with the occasional (now-explained) warning, that
> is. :)
>
> I don't see the fscrypt+EC limitation mentioned in the kernel or Ceph
> docs, so I'm guessing this is more a "known major limitation" than an
> out-of-scope use case.

Correct, it's tracked under https://tracker.ceph.com/issues/67507.

Thanks,

                Ilya