[PATCH] erofs: fix incorrect symlink detection in fast symlink

Gao Xiang posted 1 patch 2 months, 3 weeks ago
There is a newer version of this series
fs/erofs/inode.c | 19 +++++--------------
1 file changed, 5 insertions(+), 14 deletions(-)
[PATCH] erofs: fix incorrect symlink detection in fast symlink
Posted by Gao Xiang 2 months, 3 weeks ago
Fast symlink can be used if the on-disk symlink data is stored
in the same block as the on-disk inode, so we don’t need to trigger
another I/O for symlink data.  However, correctly fs correction could be
reported _incorrectly_ if inode xattrs are too large.

In fact, these should be valid images although they cannot be handled as
fast symlinks.

Many thanks to Colin for reporting this!

Reported-by: Colin Walters <walters@verbum.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
 fs/erofs/inode.c | 19 +++++--------------
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
index 5f6439a63af7..79a29841ae1c 100644
--- a/fs/erofs/inode.c
+++ b/fs/erofs/inode.c
@@ -178,12 +178,13 @@ static int erofs_fill_symlink(struct inode *inode, void *kaddr,
 			      unsigned int m_pofs)
 {
 	struct erofs_inode *vi = EROFS_I(inode);
-	unsigned int bsz = i_blocksize(inode);
+	loff_t off;
 	char *lnk;
 
-	/* if it cannot be handled with fast symlink scheme */
-	if (vi->datalayout != EROFS_INODE_FLAT_INLINE ||
-	    inode->i_size >= bsz || inode->i_size < 0) {
+	/* check if it cannot be handled with fast symlink scheme */
+	if (vi->datalayout != EROFS_INODE_FLAT_INLINE || inode->i_size < 0 ||
+	    check_add_overflow(m_pofs + vi->xattr_isize, inode->i_size, &off) ||
+	    off > i_blocksize(inode)) {
 		inode->i_op = &erofs_symlink_iops;
 		return 0;
 	}
@@ -192,16 +193,6 @@ static int erofs_fill_symlink(struct inode *inode, void *kaddr,
 	if (!lnk)
 		return -ENOMEM;
 
-	m_pofs += vi->xattr_isize;
-	/* inline symlink data shouldn't cross block boundary */
-	if (m_pofs + inode->i_size > bsz) {
-		kfree(lnk);
-		erofs_err(inode->i_sb,
-			  "inline data cross block boundary @ nid %llu",
-			  vi->nid);
-		DBG_BUGON(1);
-		return -EFSCORRUPTED;
-	}
 	memcpy(lnk, kaddr + m_pofs, inode->i_size);
 	lnk[inode->i_size] = '\0';
 
-- 
2.43.5

Re: [PATCH] erofs: fix incorrect symlink detection in fast symlink
Posted by Gao Xiang 2 months, 3 weeks ago

On 2024/9/9 10:28, Gao Xiang wrote:
> Fast symlink can be used if the on-disk symlink data is stored
> in the same block as the on-disk inode, so we don’t need to trigger
> another I/O for symlink data.  However, correctly fs correction could be
> reported _incorrectly_ if inode xattrs are too large.
> 
> In fact, these should be valid images although they cannot be handled as
> fast symlinks.
> 
> Many thanks to Colin for reporting this!
> 
> Reported-by: Colin Walters <walters@verbum.org>
> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

Fixes: 431339ba9042 ("staging: erofs: add inode operations")

Thanks,
Gao Xiang
[PATCH v2] erofs: fix incorrect symlink detection in fast symlink
Posted by Gao Xiang 2 months, 3 weeks ago
Fast symlink can be used if the on-disk symlink data is stored
in the same block as the on-disk inode, so we don’t need to trigger
another I/O for symlink data.  However, correctly fs correction could be
reported _incorrectly_ if inode xattrs are too large.

In fact, these should be valid images although they cannot be handled as
fast symlinks.

Many thanks to Colin for reporting this!

Reported-by: Colin Walters <walters@verbum.org>
Fixes: 431339ba9042 ("staging: erofs: add inode operations")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
v2:
 - sent out a wrong version (`m_pofs += vi->xattr_isize` was missed).

 fs/erofs/inode.c | 20 ++++++--------------
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
index 5f6439a63af7..f2cab9e4f3bc 100644
--- a/fs/erofs/inode.c
+++ b/fs/erofs/inode.c
@@ -178,12 +178,14 @@ static int erofs_fill_symlink(struct inode *inode, void *kaddr,
 			      unsigned int m_pofs)
 {
 	struct erofs_inode *vi = EROFS_I(inode);
-	unsigned int bsz = i_blocksize(inode);
+	loff_t off;
 	char *lnk;
 
-	/* if it cannot be handled with fast symlink scheme */
-	if (vi->datalayout != EROFS_INODE_FLAT_INLINE ||
-	    inode->i_size >= bsz || inode->i_size < 0) {
+	m_pofs += vi->xattr_isize;
+	/* check if it cannot be handled with fast symlink scheme */
+	if (vi->datalayout != EROFS_INODE_FLAT_INLINE || inode->i_size < 0 ||
+	    check_add_overflow(m_pofs, inode->i_size, &off) ||
+	    off > i_blocksize(inode)) {
 		inode->i_op = &erofs_symlink_iops;
 		return 0;
 	}
@@ -192,16 +194,6 @@ static int erofs_fill_symlink(struct inode *inode, void *kaddr,
 	if (!lnk)
 		return -ENOMEM;
 
-	m_pofs += vi->xattr_isize;
-	/* inline symlink data shouldn't cross block boundary */
-	if (m_pofs + inode->i_size > bsz) {
-		kfree(lnk);
-		erofs_err(inode->i_sb,
-			  "inline data cross block boundary @ nid %llu",
-			  vi->nid);
-		DBG_BUGON(1);
-		return -EFSCORRUPTED;
-	}
 	memcpy(lnk, kaddr + m_pofs, inode->i_size);
 	lnk[inode->i_size] = '\0';
 
-- 
2.43.5

Re: [PATCH v2] erofs: fix incorrect symlink detection in fast symlink
Posted by Colin Walters 2 months, 3 weeks ago

On Sun, Sep 8, 2024, at 11:19 PM, Gao Xiang wrote:
> Fast symlink can be used if the on-disk symlink data is stored
> in the same block as the on-disk inode, so we don’t need to trigger
> another I/O for symlink data.  However, correctly fs correction could be
> reported _incorrectly_ if inode xattrs are too large.
>
> In fact, these should be valid images although they cannot be handled as
> fast symlinks.
>
> Many thanks to Colin for reporting this!

Yes, though feel free to also add
Reported-by: https://honggfuzz.dev/ 
or so...not totally sure how to credit it "kernel style" bit honggfuzz very much helped me find this bug (indirectly).



>
> Reported-by: Colin Walters <walters@verbum.org>
> Fixes: 431339ba9042 ("staging: erofs: add inode operations")
> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
> ---
> v2:
>  - sent out a wrong version (`m_pofs += vi->xattr_isize` was missed).
>
>  fs/erofs/inode.c | 20 ++++++--------------
>  1 file changed, 6 insertions(+), 14 deletions(-)
>
> diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
> index 5f6439a63af7..f2cab9e4f3bc 100644
> --- a/fs/erofs/inode.c
> +++ b/fs/erofs/inode.c
> @@ -178,12 +178,14 @@ static int erofs_fill_symlink(struct inode 
> *inode, void *kaddr,
>  			      unsigned int m_pofs)
>  {
>  	struct erofs_inode *vi = EROFS_I(inode);
> -	unsigned int bsz = i_blocksize(inode);
> +	loff_t off;
>  	char *lnk;
> 
> -	/* if it cannot be handled with fast symlink scheme */
> -	if (vi->datalayout != EROFS_INODE_FLAT_INLINE ||
> -	    inode->i_size >= bsz || inode->i_size < 0) {
> +	m_pofs += vi->xattr_isize;
> +	/* check if it cannot be handled with fast symlink scheme */
> +	if (vi->datalayout != EROFS_INODE_FLAT_INLINE || inode->i_size < 0 ||
> +	    check_add_overflow(m_pofs, inode->i_size, &off) ||
> +	    off > i_blocksize(inode)) {
>  		inode->i_op = &erofs_symlink_iops;
>  		return 0;
>  	}

This change LGTM.

> @@ -192,16 +194,6 @@ static int erofs_fill_symlink(struct inode *inode, 
> void *kaddr,
>  	if (!lnk)
>  		return -ENOMEM;
> 
> -	m_pofs += vi->xattr_isize;
> -	/* inline symlink data shouldn't cross block boundary */
> -	if (m_pofs + inode->i_size > bsz) {

It can cross a boundary but it still can't be *bigger* right? IOW do we still need to check here for inode->i_size > bsz or is that handled by something else generic?
Re: [PATCH v2] erofs: fix incorrect symlink detection in fast symlink
Posted by Gao Xiang 2 months, 3 weeks ago
Hi Colin,

On 2024/9/9 20:48, Colin Walters wrote:
> 
> 
> On Sun, Sep 8, 2024, at 11:19 PM, Gao Xiang wrote:
>> Fast symlink can be used if the on-disk symlink data is stored
>> in the same block as the on-disk inode, so we don’t need to trigger
>> another I/O for symlink data.  However, correctly fs correction could be
>> reported _incorrectly_ if inode xattrs are too large.
>>
>> In fact, these should be valid images although they cannot be handled as
>> fast symlinks.
>>
>> Many thanks to Colin for reporting this!
> 
> Yes, though feel free to also add
> Reported-by: https://honggfuzz.dev/
> or so...not totally sure how to credit it "kernel style" bit honggfuzz very much helped me find this bug (indirectly).

Ok, it sounds good to me.
I will add this later.

> 
> 
> 
>>
>> Reported-by: Colin Walters <walters@verbum.org>
>> Fixes: 431339ba9042 ("staging: erofs: add inode operations")
>> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
>> ---
>> v2:
>>   - sent out a wrong version (`m_pofs += vi->xattr_isize` was missed).
>>
>>   fs/erofs/inode.c | 20 ++++++--------------
>>   1 file changed, 6 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
>> index 5f6439a63af7..f2cab9e4f3bc 100644
>> --- a/fs/erofs/inode.c
>> +++ b/fs/erofs/inode.c
>> @@ -178,12 +178,14 @@ static int erofs_fill_symlink(struct inode
>> *inode, void *kaddr,
>>   			      unsigned int m_pofs)
>>   {
>>   	struct erofs_inode *vi = EROFS_I(inode);
>> -	unsigned int bsz = i_blocksize(inode);
>> +	loff_t off;
>>   	char *lnk;
>>
>> -	/* if it cannot be handled with fast symlink scheme */
>> -	if (vi->datalayout != EROFS_INODE_FLAT_INLINE ||
>> -	    inode->i_size >= bsz || inode->i_size < 0) {
>> +	m_pofs += vi->xattr_isize;
>> +	/* check if it cannot be handled with fast symlink scheme */
>> +	if (vi->datalayout != EROFS_INODE_FLAT_INLINE || inode->i_size < 0 ||
>> +	    check_add_overflow(m_pofs, inode->i_size, &off) ||
>> +	    off > i_blocksize(inode)) {
>>   		inode->i_op = &erofs_symlink_iops;
>>   		return 0;
>>   	}
> 
> This change LGTM.
> 
>> @@ -192,16 +194,6 @@ static int erofs_fill_symlink(struct inode *inode,
>> void *kaddr,
>>   	if (!lnk)
>>   		return -ENOMEM;
>>
>> -	m_pofs += vi->xattr_isize;
>> -	/* inline symlink data shouldn't cross block boundary */
>> -	if (m_pofs + inode->i_size > bsz) {
> 
> It can cross a boundary but it still can't be *bigger* right? IOW do we still need to check here for inode->i_size > bsz or is that handled by something else generic?

It can be bigger.

Like ext4, EROFS supports PAGE_SIZE symlink via page_get_link()
(non-fastsymlink cases), but mostly consider this as 4KiB though
regardless of on-disk block sizes.

Thanks,
Gao Xiang



Re: [PATCH v2] erofs: fix incorrect symlink detection in fast symlink
Posted by Colin Walters 2 months, 3 weeks ago

On Mon, Sep 9, 2024, at 9:21 AM, Gao Xiang wrote:
> 
> It can be bigger.
>
> Like ext4, EROFS supports PAGE_SIZE symlink via page_get_link()
> (non-fastsymlink cases), but mostly consider this as 4KiB though
> regardless of on-disk block sizes.

But symlink targets can't be bigger than PATH_MAX which has always been 4KiB right? (Does Linux support systems with sub-4KiB pages?)

I guess let me ask it a different way: Since we're removing a sanity check here I just want to be sure that the constraints are still handled.

Hmm, skimming through the vfs code from vfs_readlink() I'm not seeing anything constraining the userspace buffer length which seems surprising.

Ah interesting, XFS has 
#define XFS_SYMLINK_MAXLEN	1024
and always constrains even its kmalloc invocation to that.
Re: [PATCH v2] erofs: fix incorrect symlink detection in fast symlink
Posted by Gao Xiang 2 months, 3 weeks ago

On 2024/9/9 21:58, Colin Walters wrote:
> 
> 
> On Mon, Sep 9, 2024, at 9:21 AM, Gao Xiang wrote:
>>
>> It can be bigger.
>>
>> Like ext4, EROFS supports PAGE_SIZE symlink via page_get_link()
>> (non-fastsymlink cases), but mostly consider this as 4KiB though
>> regardless of on-disk block sizes.

Let me rephrase it more clearer...

For each EROFS logical block read (e.g 4KiB), EROFS will
only generate one filesystem physical block read instead
of two or more physical block read.

Symlink files just like regular files, so it doesn't allow
the inline tail crosses physical block boundary (which
means an 8KiB I/O is needed), see:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/erofs/data.c?h=v6.10#n103

In that case,
EROFS_INODE_FLAT_INLINE (2) shouldn't be used and
EROFS_INODE_FLAT_PLAIN(0) need to be used instead in order
to contain full data.

But the fast symlink handling has an issue: it should
fall back to non-fastsymlink path instead of erroring out
directly.


> 
> But symlink targets can't be bigger than PATH_MAX which has always been 4KiB right? (Does Linux support systems with sub-4KiB pages?)

Yes, I think 4KiB is the minimum page size of Linux, so
in practice the maximum size of generic Linux is
PATH_MAX(4KiB).

> 
> I guess let me ask it a different way: Since we're removing a sanity check here I just want to be sure that the constraints are still handled.

It seems page_get_link() doesn't check this, but it
will add trailing `\0` to the end buffer (PAGE_SIZE - 1),
so at least it won't crash the kernel.

> 
> Hmm, skimming through the vfs code from vfs_readlink() I'm not seeing anything constraining the userspace buffer length which seems surprising.
> 
> Ah interesting, XFS has
> #define XFS_SYMLINK_MAXLEN	1024
> and always constrains even its kmalloc invocation to that.


Not quite sure about hard limitation in EROFS
runtime, we could define

#define EROFS_SYMLINK_MAXLEN	4096

But since symlink i_size > 4096 only due to crafted
images (and not generated by mkfs) and not crash, so
either way (to check or not check in kernel) is okay
to me.

Thanks,
Gao Xiang

>
Re: [PATCH v2] erofs: fix incorrect symlink detection in fast symlink
Posted by Colin Walters 2 months, 3 weeks ago

On Mon, Sep 9, 2024, at 10:14 AM, Gao Xiang wrote:
>
> Not quite sure about hard limitation in EROFS
> runtime, we could define
>
> #define EROFS_SYMLINK_MAXLEN	4096

Not sure that a new define is needed versus just reusing PATH_MAX, but that's obviously just a style thing that's much more your call than mine.

> But since symlink i_size > 4096 only due to crafted
> images (and not generated by mkfs) and not crash, so
> either way (to check or not check in kernel) is okay
> to me.

Yes, but my understanding was that EROFS (in contrast to other kernel read-write filesystems which are more complicated) was aiming to be robust against potentially malicious images.

Crafted/malicious images aside, there's also the IMO obvious angle here that we should avoid crashes or worse out-of-bound read/write if there happens to be *accidental* on-disk/memory corruption and having high bit(s) flip in a symlink inode size seems like an easy one to handle. Skimming the XFS code for example it looks like it's pretty robust in this area.
Re: [PATCH v2] erofs: fix incorrect symlink detection in fast symlink
Posted by Gao Xiang 2 months, 3 weeks ago

On 2024/9/9 22:46, Colin Walters wrote:
> 
> 
> On Mon, Sep 9, 2024, at 10:14 AM, Gao Xiang wrote:
>>
>> Not quite sure about hard limitation in EROFS
>> runtime, we could define
>>
>> #define EROFS_SYMLINK_MAXLEN	4096
> 
> Not sure that a new define is needed versus just reusing PATH_MAX, but that's obviously just a style thing that's much more your call than mine.
> 
>> But since symlink i_size > 4096 only due to crafted
>> images (and not generated by mkfs) and not crash, so
>> either way (to check or not check in kernel) is okay
>> to me.
> 
> Yes, but my understanding was that EROFS (in contrast to other kernel read-write filesystems which are more complicated) was aiming to be robust against potentially malicious images.

Just my personal opinion, my understanding of rubustness
is stability and security.

But whether to check or not check this, it doesn't crash
the kernel or deadlock or livelock, so IMHO, it's already
rubustness.

Actually, I think EROFS for i_size > PAGE_SIZE, it's an
undefined or reserved behavior for now (just like CPU
reserved bits or don't care bits), just Linux
implementation treats it with PAGE_SIZE-1 trailing '\0',
but using erofs dump tool you could still dump large
symlinks.

Since PATH_MAX is a system-defined constant too, currently
Linux PATH_MAX is 4096, but how about other OSes? I've
seen some `PATH_MAX 8192` reference but I'm not sure which
OS uses this setting.

But I think it's a filesystem on-disk limitation, but if
i_size exceeds that, we return -EOPNOTSUPP or -EFSCORRUPTED?
For this symlink case, I tend to return -EFSCORRUPTED but
for other similar but complex cases, it could be hard to
decide.

Leaving them as undefined behaviors are also an option as
long as the behavior is secure.

> 
> Crafted/malicious images aside, there's also the IMO obvious angle here that we should avoid crashes or worse out-of-bound read/write if there happens to be *accidental* on-disk/memory corruption and having high bit(s) flip in a symlink inode size seems like an easy one to handle. Skimming the XFS code for example it looks like it's pretty robust in this area.

Yes, for this case it's much simple and easy so that's
fine, but I think for some other cases, leaving some
undefined or reserved behaviors are also good for later
extendability (again, like CPU register design.) as long
as it doesn't cause security issues.

Thanks,
Gao Xiang
Re: [PATCH v2] erofs: fix incorrect symlink detection in fast symlink
Posted by Colin Walters 2 months, 3 weeks ago

On Mon, Sep 9, 2024, at 11:40 AM, Gao Xiang wrote:
> 
> Just my personal opinion, my understanding of rubustness
> is stability and security.
>
> But whether to check or not check this, it doesn't crash
> the kernel or deadlock or livelock, so IMHO, it's already
> rubustness.

OK, if you're telling me this is already checked elsewhere then I think we can call this end of thread.

> Actually, I think EROFS for i_size > PAGE_SIZE, it's an
> undefined or reserved behavior for now (just like CPU
> reserved bits or don't care bits), just Linux
> implementation treats it with PAGE_SIZE-1 trailing '\0',
> but using erofs dump tool you could still dump large
> symlinks.
>
> Since PATH_MAX is a system-defined constant too, currently
> Linux PATH_MAX is 4096, but how about other OSes? I've
> seen some `PATH_MAX 8192` reference but I'm not sure which
> OS uses this setting.

Famously GNU Hurd tried to avoid having a PATH_MAX at all:
https://www.gnu.org/software/hurd/community/gsoc/project_ideas/maxpath.html

But I am pretty confident in saying that Linux isn't going to try to or really be able to meaningfully change its PATH_MAX in the forseeable future.

Now we're firmly off into the weeds but since it's kind of an interesting debate: Honestly: who would *want* to ever interact with such long file paths? And I think a much better evolutionary direction is already in flight - operate in terms of directory fds with openat() etc. While it'd be logistically complicated one could imagine a variant of a Unix filesystem which hard required using openat() to access sub-domains where the paths in that sub-domain are limited to the existing PATH_MAX. I guess that kind of already exists in subvolumes/snapshots, and we're effectively enabling this with composefs too.

> But I think it's a filesystem on-disk limitation, but if
> i_size exceeds that, we return -EOPNOTSUPP or -EFSCORRUPTED?
> For this symlink case, I tend to return -EFSCORRUPTED but
> for other similar but complex cases, it could be hard to
> decide.

-EFSCORRUPTED is what XFS does at least (in xfs_inactive_symlink()).
Re: [PATCH v2] erofs: fix incorrect symlink detection in fast symlink
Posted by Gao Xiang 2 months, 3 weeks ago

On 2024/9/10 08:12, Colin Walters wrote:
> 
> 
> On Mon, Sep 9, 2024, at 11:40 AM, Gao Xiang wrote:
>>
>> Just my personal opinion, my understanding of rubustness
>> is stability and security.
>>
>> But whether to check or not check this, it doesn't crash
>> the kernel or deadlock or livelock, so IMHO, it's already
>> rubustness.
> 
> OK, if you're telling me this is already checked elsewhere then I think we can call this end of thread.

I know you ask for an explicit check on symlink i_size, but
I've explained the current kernel behavior:
   - For symlink i_size < PAGE_SIZE (always >= 4096 on Linux),
     it behaves normally for EROFS Linux implementation;

   - For symlink i_size >= PAGE_SIZE, EROFS Linux
     implementation will mark '\0' at PAGE_SIZE - 1 in
     page_get_link() -> nd_terminate_link() so the behavior is also
     deterministic and not harmful to the system stability and security;

In order to verify this, you could also check the EROFS image
(encoded in gzip+base64) with a 16000-byte symlink file below:

H4sICPqj32YCA3Rlc3QuZXJvZnMA7dsvSwNhHMDx350IBotgt0wMwsnegDCYb8FiVbu4LAom
kygbgwURfR1Wm12Lf5JFTIpY9B58GCK2BYV9vvDhee64226sXPlFSBrXHu5f7k4u+isT9X46
GlHm8+9nt5tpHaxOxeS364+Wjh8PXtvP3bn9/nXvpjvq96fPfmpFLObjj7q07lzOxnq9Hubn
eLvaapa/3N8YruVwP59+Rb54IYqoqqqzsd3xZ0uSJGnsSy/GAAAAAAAAAAAAAADA/5bmb89P
I3aXv+YBiuzn/O3azF6zMD8AAADAHzHBP1qf7k2HRABQAAA=

Here the symlink is already recorded in the image (let's not think if
the symlink is reasonable or not for now), but Linux implementation will
just truncate it as a 4095-byte link path, that is the current release
behavior and it has no security issue inside.

In other words, currently i_size >= PAGE_SIZE is an undefined behavior
but Linux just truncates the link path.

So I understand that what you propose here is to check the size
explicitly, which means we need to introduce a formal on-disk hard
limitation, for example:

   - Introduce EROFS_SYMLINK_MAXLEN as 4095;

   - For symlink i_size < EROFS_SYMLINK_MAXLEN, it behaves normally;

   - For symlink i_size >= EROFS_SYMLINK_MAXLEN, it just return
     -EFSCORRUPTED to refuse such inodes;

So that is an added limitation (just like a "don't care" bit into
a "meaningful" bit).

For this case, to be clear I'm totally fine with the limitation,
but I need to decide whether I should make "EROFS_SYMLINK_MAXLEN"
as 4095 or "EROFS_SYMLINK_MAXLEN" as 4096 but also accepts
`link[4095] == '\0'`.

No matter which is finalled selected, it's a new hard on-disk
limitation, which means we cannot change the limitation anymore
(-EFSCORRUPTED) unless some strong reason as a bugfix.

> 
>> Actually, I think EROFS for i_size > PAGE_SIZE, it's an
>> undefined or reserved behavior for now (just like CPU
>> reserved bits or don't care bits), just Linux
>> implementation treats it with PAGE_SIZE-1 trailing '\0',
>> but using erofs dump tool you could still dump large
>> symlinks.
>>
>> Since PATH_MAX is a system-defined constant too, currently
>> Linux PATH_MAX is 4096, but how about other OSes? I've
>> seen some `PATH_MAX 8192` reference but I'm not sure which
>> OS uses this setting.
> 
> Famously GNU Hurd tried to avoid having a PATH_MAX at all:
> https://www.gnu.org/software/hurd/community/gsoc/project_ideas/maxpath.html
> 
> But I am pretty confident in saying that Linux isn't going to try to or really be able to meaningfully change its PATH_MAX in the forseeable future.
> 
> Now we're firmly off into the weeds but since it's kind of an interesting debate: Honestly: who would *want* to ever interact with such long file paths? And I think a much better evolutionary direction is already in flight - operate in terms of directory fds with openat() etc. While it'd be logistically complicated one could imagine a variant of a Unix filesystem which hard required using openat() to access sub-domains where the paths in that sub-domain are limited to the existing PATH_MAX. I guess that kind of already exists in subvolumes/snapshots, and we're effectively enabling this with composefs too.

Yes, but that is just off-topic.  I just need to confirm
that `PATH_MAX >= 4096 is absolutely nonsense for all OSes
on earth`.

If there is a path larger than 4096 in some OS, we maybe
(just maybe) run into an awkward situation:  EROFS can have
some limitation to be used as an _archive format_ on this
OS.

Similar to EXT4->XFS, if XFS is used as an _archive format_
there could be a possiability that a EXT4 symlink cannot be
represented correctly according to its on-disk format.  So
as an _archive format_, XFS is more limited now (Please
don't misunderstand, I'm not saying XFS is not a great
filesystem. I like XFS.)

Thanks,
Gao Xiang

> 
>> But I think it's a filesystem on-disk limitation, but if
>> i_size exceeds that, we return -EOPNOTSUPP or -EFSCORRUPTED?
>> For this symlink case, I tend to return -EFSCORRUPTED but
>> for other similar but complex cases, it could be hard to
>> decide.
> 
> -EFSCORRUPTED is what XFS does at least (in xfs_inactive_symlink()).
Re: [PATCH v2] erofs: fix incorrect symlink detection in fast symlink
Posted by Colin Walters 2 months, 2 weeks ago

On Mon, Sep 9, 2024, at 10:18 PM, Gao Xiang wrote:

> I know you ask for an explicit check on symlink i_size, but
> I've explained the current kernel behavior:
>    - For symlink i_size < PAGE_SIZE (always >= 4096 on Linux),
>      it behaves normally for EROFS Linux implementation;
>
>    - For symlink i_size >= PAGE_SIZE, EROFS Linux
>      implementation will mark '\0' at PAGE_SIZE - 1 in
>      page_get_link() -> nd_terminate_link() so the behavior is also
>      deterministic and not harmful to the system stability and security;

Got it, OK.

> In other words, currently i_size >= PAGE_SIZE is an undefined behavior
> but Linux just truncates the link path.

I think where we had a miscommunication is that when I see "undefined behavior" I thought you were using the formal term: https://en.wikipedia.org/wiki/Undefined_behavior

The term for what you're talking about in my experience is usually "unspecified behavior" or "implementation defined behavior" which (assuming a reasonable implementor) would include silent truncation or an explicit error, but *not* walking off the end of a buffer and writing to arbitrary other kernel memory etc.

(Hmm really given the widespread use of nd_terminate_link I guess this is kind of more of a "Linux convention" than just an EROFS one, with XFS as a notable exception?)

> For this case, to be clear I'm totally fine with the limitation,
> but I need to decide whether I should make "EROFS_SYMLINK_MAXLEN"
> as 4095 or "EROFS_SYMLINK_MAXLEN" as 4096 but also accepts
> `link[4095] == '\0'`.

Mmmm...I think PATH_MAX is conventionally taken to include the NUL; yeah see
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/namei.c?id=b40c8e7a033ff2cafd33adbe50e2a516f88fa223#n123
Re: [PATCH v2] erofs: fix incorrect symlink detection in fast symlink
Posted by Gao Xiang 2 months, 2 weeks ago

On 2024/9/11 04:51, Colin Walters wrote:
> 
> 
> On Mon, Sep 9, 2024, at 10:18 PM, Gao Xiang wrote:
> 
>> I know you ask for an explicit check on symlink i_size, but
>> I've explained the current kernel behavior:
>>     - For symlink i_size < PAGE_SIZE (always >= 4096 on Linux),
>>       it behaves normally for EROFS Linux implementation;
>>
>>     - For symlink i_size >= PAGE_SIZE, EROFS Linux
>>       implementation will mark '\0' at PAGE_SIZE - 1 in
>>       page_get_link() -> nd_terminate_link() so the behavior is also
>>       deterministic and not harmful to the system stability and security;
> 
> Got it, OK.
> 
>> In other words, currently i_size >= PAGE_SIZE is an undefined behavior
>> but Linux just truncates the link path.
> 
> I think where we had a miscommunication is that when I see "undefined behavior" I thought you were using the formal term: https://en.wikipedia.org/wiki/Undefined_behavior
> 
> The term for what you're talking about in my experience is usually "unspecified behavior" or "implementation defined behavior" which (assuming a reasonable implementor) would include silent truncation or an explicit error, but *not* walking off the end of a buffer and writing to arbitrary other kernel memory etc.

Yeah, agreed. "implementation defined behavior" sounds a better term.

Sorry about my limited English corpus, because the environment I'm
living mostly is used to professional terms translated in Chinese..

> 
> (Hmm really given the widespread use of nd_terminate_link I guess this is kind of more of a "Linux convention" than just an EROFS one, with XFS as a notable exception?)

I'm not sure if other kernel fses have their own internal issues
(so they need to check i_size > PAGE_SIZE to cover up their own
format design in advance), but I think (and tested with crafted
images) EROFS with pure only Linux VFS nd_terminate_link()
implementation (since 2.6.x era)
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ebd09abbd9699f328165aee50a070403fbf55a37

is already safe on i_size > PAGE_SIZE since EROFS symlink on-disk
format is just like its regular inode format.

As for XFS, I think it's a history on-disk behavior (1024-byte
hard limitation) so they have to follow until now, see the related
commit message:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6eb0b8df9f74f33d1a69100117630a7a87a9cc96

> 
>> For this case, to be clear I'm totally fine with the limitation,
>> but I need to decide whether I should make "EROFS_SYMLINK_MAXLEN"
>> as 4095 or "EROFS_SYMLINK_MAXLEN" as 4096 but also accepts
>> `link[4095] == '\0'`.
> 
> Mmmm...I think PATH_MAX is conventionally taken to include the NUL; yeah see
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/namei.c?id=b40c8e7a033ff2cafd33adbe50e2a516f88fa223#n123

Agreed, but honestly I have some concern if some OS or tar format
or other popular archive formats support large symlinks but EROFS
have no way to keep them due to on-disk limitation.

If you don't have some strong opinion on this, I do hope let's
hold off our decision about this to ensure compatibility.

Thanks,
Gao Xiang