fs/fuse/file.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-)
Our user space filesystem relies on fuse to provide POSIX interface.
In our test, a known string is written into a file and the content
is read back later to verify correct data returned. We observed wrong
data returned in read buffer in rare cases although correct data are
stored in our filesystem.
Fuse kernel module calls iov_iter_get_pages2() to get the physical
pages of the user-space read buffer passed in read(). The pages are
not pinned to avoid page migration. When page migration occurs, the
consequence are two-folds.
1) Applications do not receive correct data in read buffer.
2) fuse kernel writes data into a wrong place.
Using iov_iter_extract_pages() to pin pages fixes the issue in our
test.
An auxiliary variable "struct page **pt_pages" is used in the patch
to prepare the 2nd parameter for iov_iter_extract_pages() since
iov_iter_get_pages2() uses a different type for the 2nd parameter.
Signed-off-by: Lei Huang <lei.huang@linux.intel.com>
---
fs/fuse/file.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index bc41152..715de3b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -670,7 +670,7 @@ static void fuse_release_user_pages(struct fuse_args_pages *ap,
for (i = 0; i < ap->num_pages; i++) {
if (should_dirty)
set_page_dirty_lock(ap->pages[i]);
- put_page(ap->pages[i]);
+ unpin_user_page(ap->pages[i]);
}
}
@@ -1428,10 +1428,13 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
while (nbytes < *nbytesp && ap->num_pages < max_pages) {
unsigned npages;
size_t start;
- ret = iov_iter_get_pages2(ii, &ap->pages[ap->num_pages],
- *nbytesp - nbytes,
- max_pages - ap->num_pages,
- &start);
+ struct page **pt_pages;
+
+ pt_pages = &ap->pages[ap->num_pages];
+ ret = iov_iter_extract_pages(ii, &pt_pages,
+ *nbytesp - nbytes,
+ max_pages - ap->num_pages,
+ 0, &start);
if (ret < 0)
break;
--
1.8.3.1
On Tue, 29 Aug 2023 at 20:37, Lei Huang <lei.huang@linux.intel.com> wrote: > > Our user space filesystem relies on fuse to provide POSIX interface. > In our test, a known string is written into a file and the content > is read back later to verify correct data returned. We observed wrong > data returned in read buffer in rare cases although correct data are > stored in our filesystem. > > Fuse kernel module calls iov_iter_get_pages2() to get the physical > pages of the user-space read buffer passed in read(). The pages are > not pinned to avoid page migration. When page migration occurs, the > consequence are two-folds. > > 1) Applications do not receive correct data in read buffer. > 2) fuse kernel writes data into a wrong place. > > Using iov_iter_extract_pages() to pin pages fixes the issue in our > test. > > An auxiliary variable "struct page **pt_pages" is used in the patch > to prepare the 2nd parameter for iov_iter_extract_pages() since > iov_iter_get_pages2() uses a different type for the 2nd parameter. > > Signed-off-by: Lei Huang <lei.huang@linux.intel.com> Applied, with a modification to only unpin if iov_iter_extract_will_pin() returns true. Thanks, Miklos
On 3/6/24 11:01, Miklos Szeredi wrote: > On Tue, 29 Aug 2023 at 20:37, Lei Huang <lei.huang@linux.intel.com> wrote: >> >> Our user space filesystem relies on fuse to provide POSIX interface. >> In our test, a known string is written into a file and the content >> is read back later to verify correct data returned. We observed wrong >> data returned in read buffer in rare cases although correct data are >> stored in our filesystem. >> >> Fuse kernel module calls iov_iter_get_pages2() to get the physical >> pages of the user-space read buffer passed in read(). The pages are >> not pinned to avoid page migration. When page migration occurs, the >> consequence are two-folds. >> >> 1) Applications do not receive correct data in read buffer. >> 2) fuse kernel writes data into a wrong place. >> >> Using iov_iter_extract_pages() to pin pages fixes the issue in our >> test. >> >> An auxiliary variable "struct page **pt_pages" is used in the patch >> to prepare the 2nd parameter for iov_iter_extract_pages() since >> iov_iter_get_pages2() uses a different type for the 2nd parameter. >> >> Signed-off-by: Lei Huang <lei.huang@linux.intel.com> > > Applied, with a modification to only unpin if > iov_iter_extract_will_pin() returns true. Hi Miklos, do you have an idea if this needs to be back ported and to which kernel version? I had tried to reproduce data corruption with 4.18 - Lei wrote that he could see issues with older kernels as well, but I never managed to trigger anything on 4.18-RHEL. Typically I use ql-fstest (https://github.com/bsbernd/ql-fstest) and even added random DIO as an option - nothing report with weeks of run time. I could try again with more recent kernels that have folios. Thanks, Bernd
On Wed, 6 Mar 2024 at 12:16, Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > > > > On 3/6/24 11:01, Miklos Szeredi wrote: > > On Tue, 29 Aug 2023 at 20:37, Lei Huang <lei.huang@linux.intel.com> wrote: > >> > >> Our user space filesystem relies on fuse to provide POSIX interface. > >> In our test, a known string is written into a file and the content > >> is read back later to verify correct data returned. We observed wrong > >> data returned in read buffer in rare cases although correct data are > >> stored in our filesystem. > >> > >> Fuse kernel module calls iov_iter_get_pages2() to get the physical > >> pages of the user-space read buffer passed in read(). The pages are > >> not pinned to avoid page migration. When page migration occurs, the > >> consequence are two-folds. > >> > >> 1) Applications do not receive correct data in read buffer. > >> 2) fuse kernel writes data into a wrong place. > >> > >> Using iov_iter_extract_pages() to pin pages fixes the issue in our > >> test. > >> > >> An auxiliary variable "struct page **pt_pages" is used in the patch > >> to prepare the 2nd parameter for iov_iter_extract_pages() since > >> iov_iter_get_pages2() uses a different type for the 2nd parameter. > >> > >> Signed-off-by: Lei Huang <lei.huang@linux.intel.com> > > > > Applied, with a modification to only unpin if > > iov_iter_extract_will_pin() returns true. > > Hi Miklos, > > do you have an idea if this needs to be back ported and to which kernel > version? > I had tried to reproduce data corruption with 4.18 - Lei wrote that he > could see issues with older kernels as well, but I never managed to > trigger anything on 4.18-RHEL. Typically I use ql-fstest > (https://github.com/bsbernd/ql-fstest) and even added random DIO as an > option - nothing report with weeks of run time. I could try again with > more recent kernels that have folios. I don't think that corruption will happen in real life. So I'm not sure we need to bother with backporting, and definitely not before when the infrastructure was introduced. Thanks, Miklos
Thank you very much, Miklos! Yes. It is not easy to reproduce the issues in real applications. We only observed the issue in our own testing tool which runs multiple tests concurrently. We have not been able reproduce it with simple code yet. -lei On 3/6/24 07:05, Miklos Szeredi wrote: > On Wed, 6 Mar 2024 at 12:16, Bernd Schubert <bernd.schubert@fastmail.fm> wrote: >> >> >> >> On 3/6/24 11:01, Miklos Szeredi wrote: >>> On Tue, 29 Aug 2023 at 20:37, Lei Huang <lei.huang@linux.intel.com> wrote: >>>> >>>> Our user space filesystem relies on fuse to provide POSIX interface. >>>> In our test, a known string is written into a file and the content >>>> is read back later to verify correct data returned. We observed wrong >>>> data returned in read buffer in rare cases although correct data are >>>> stored in our filesystem. >>>> >>>> Fuse kernel module calls iov_iter_get_pages2() to get the physical >>>> pages of the user-space read buffer passed in read(). The pages are >>>> not pinned to avoid page migration. When page migration occurs, the >>>> consequence are two-folds. >>>> >>>> 1) Applications do not receive correct data in read buffer. >>>> 2) fuse kernel writes data into a wrong place. >>>> >>>> Using iov_iter_extract_pages() to pin pages fixes the issue in our >>>> test. >>>> >>>> An auxiliary variable "struct page **pt_pages" is used in the patch >>>> to prepare the 2nd parameter for iov_iter_extract_pages() since >>>> iov_iter_get_pages2() uses a different type for the 2nd parameter. >>>> >>>> Signed-off-by: Lei Huang <lei.huang@linux.intel.com> >>> >>> Applied, with a modification to only unpin if >>> iov_iter_extract_will_pin() returns true. >> >> Hi Miklos, >> >> do you have an idea if this needs to be back ported and to which kernel >> version? >> I had tried to reproduce data corruption with 4.18 - Lei wrote that he >> could see issues with older kernels as well, but I never managed to >> trigger anything on 4.18-RHEL. Typically I use ql-fstest >> (https://github.com/bsbernd/ql-fstest) and even added random DIO as an >> option - nothing report with weeks of run time. I could try again with >> more recent kernels that have folios. > > I don't think that corruption will happen in real life. So I'm not > sure we need to bother with backporting, and definitely not before > when the infrastructure was introduced. > > Thanks, > Miklos
On 8/29/23 20:36, Lei Huang wrote:
> Our user space filesystem relies on fuse to provide POSIX interface.
> In our test, a known string is written into a file and the content
> is read back later to verify correct data returned. We observed wrong
> data returned in read buffer in rare cases although correct data are
> stored in our filesystem.
>
> Fuse kernel module calls iov_iter_get_pages2() to get the physical
> pages of the user-space read buffer passed in read(). The pages are
> not pinned to avoid page migration. When page migration occurs, the
> consequence are two-folds.
>
> 1) Applications do not receive correct data in read buffer.
> 2) fuse kernel writes data into a wrong place.
>
> Using iov_iter_extract_pages() to pin pages fixes the issue in our
> test.
Hmm, iov_iter_extract_pages does not exists for a long time and the code
in fuse_get_user_pages didn't change much. So if you are right, there
would be a long term data corruption for page migrations? And a back
port to old kernels would not be obvious?
What confuses me further is that
commit 85dd2c8ff368 does not mention migration or corruption, although
lists several other advantages for iov_iter_extract_pages. Other commits
using iov_iter_extract_pages point to fork - i.e. would your data
corruption be possibly related that?
Thanks,
Bernd
>
> An auxiliary variable "struct page **pt_pages" is used in the patch
> to prepare the 2nd parameter for iov_iter_extract_pages() since
> iov_iter_get_pages2() uses a different type for the 2nd parameter.
>
> Signed-off-by: Lei Huang <lei.huang@linux.intel.com>
> ---
> fs/fuse/file.c | 13 ++++++++-----
> 1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index bc41152..715de3b 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -670,7 +670,7 @@ static void fuse_release_user_pages(struct fuse_args_pages *ap,
> for (i = 0; i < ap->num_pages; i++) {
> if (should_dirty)
> set_page_dirty_lock(ap->pages[i]);
> - put_page(ap->pages[i]);
> + unpin_user_page(ap->pages[i]);
> }
> }
>
> @@ -1428,10 +1428,13 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
> while (nbytes < *nbytesp && ap->num_pages < max_pages) {
> unsigned npages;
> size_t start;
> - ret = iov_iter_get_pages2(ii, &ap->pages[ap->num_pages],
> - *nbytesp - nbytes,
> - max_pages - ap->num_pages,
> - &start);
> + struct page **pt_pages;
> +
> + pt_pages = &ap->pages[ap->num_pages];
> + ret = iov_iter_extract_pages(ii, &pt_pages,
> + *nbytesp - nbytes,
> + max_pages - ap->num_pages,
> + 0, &start);
> if (ret < 0)
> break;
>
Hi Bernd,
Thank you very much for your reply!
> Hmm, iov_iter_extract_pages does not exists for a long time and the code
> in fuse_get_user_pages didn't change much. So if you are right, there
> would be a long term data corruption for page migrations? And a back
> port to old kernels would not be obvious?
Right. The issue has been reproduced under various versions of kernels,
ranging from 3.10.0 to 6.3.6 in my tests. It would be different to make
a patch under older kernels like 3.10.0. One way I tested, one can query
the physical pages associated with read buffer after data is ready
(right before writing the data into read buffer). This seems resolving
the issue in my tests.
> What confuses me further is that
> commit 85dd2c8ff368 does not mention migration or corruption, although
> lists several other advantages for iov_iter_extract_pages. Other commits
> using iov_iter_extract_pages point to fork - i.e. would your data
> corruption be possibly related that?
As I mentioned above, the issue seems resolved if we query the physical
pages as late as right before writing data into read buffer. I think the
root cause is page migration.
Best Regards,
-lei
On 8/29/23 17:57, Bernd Schubert wrote:
>
>
> On 8/29/23 20:36, Lei Huang wrote:
>> Our user space filesystem relies on fuse to provide POSIX interface.
>> In our test, a known string is written into a file and the content
>> is read back later to verify correct data returned. We observed wrong
>> data returned in read buffer in rare cases although correct data are
>> stored in our filesystem.
>>
>> Fuse kernel module calls iov_iter_get_pages2() to get the physical
>> pages of the user-space read buffer passed in read(). The pages are
>> not pinned to avoid page migration. When page migration occurs, the
>> consequence are two-folds.
>>
>> 1) Applications do not receive correct data in read buffer.
>> 2) fuse kernel writes data into a wrong place.
>>
>> Using iov_iter_extract_pages() to pin pages fixes the issue in our
>> test.
>
> Hmm, iov_iter_extract_pages does not exists for a long time and the code
> in fuse_get_user_pages didn't change much. So if you are right, there
> would be a long term data corruption for page migrations? And a back
> port to old kernels would not be obvious?
>
> What confuses me further is that
> commit 85dd2c8ff368 does not mention migration or corruption, although
> lists several other advantages for iov_iter_extract_pages. Other commits
> using iov_iter_extract_pages point to fork - i.e. would your data
> corruption be possibly related that?
>
>
> Thanks,
> Bernd
>
>
>>
>> An auxiliary variable "struct page **pt_pages" is used in the patch
>> to prepare the 2nd parameter for iov_iter_extract_pages() since
>> iov_iter_get_pages2() uses a different type for the 2nd parameter.
>>
>> Signed-off-by: Lei Huang <lei.huang@linux.intel.com>
>> ---
>> fs/fuse/file.c | 13 ++++++++-----
>> 1 file changed, 8 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index bc41152..715de3b 100644
>> --- a/fs/fuse/file.c
>> +++ b/fs/fuse/file.c
>> @@ -670,7 +670,7 @@ static void fuse_release_user_pages(struct
>> fuse_args_pages *ap,
>> for (i = 0; i < ap->num_pages; i++) {
>> if (should_dirty)
>> set_page_dirty_lock(ap->pages[i]);
>> - put_page(ap->pages[i]);
>> + unpin_user_page(ap->pages[i]);
>> }
>> }
>> @@ -1428,10 +1428,13 @@ static int fuse_get_user_pages(struct
>> fuse_args_pages *ap, struct iov_iter *ii,
>> while (nbytes < *nbytesp && ap->num_pages < max_pages) {
>> unsigned npages;
>> size_t start;
>> - ret = iov_iter_get_pages2(ii, &ap->pages[ap->num_pages],
>> - *nbytesp - nbytes,
>> - max_pages - ap->num_pages,
>> - &start);
>> + struct page **pt_pages;
>> +
>> + pt_pages = &ap->pages[ap->num_pages];
>> + ret = iov_iter_extract_pages(ii, &pt_pages,
>> + *nbytesp - nbytes,
>> + max_pages - ap->num_pages,
>> + 0, &start);
>> if (ret < 0)
>> break;
Hi Lei, On 8/30/23 03:03, Lei Huang wrote: > Hi Bernd, > > Thank you very much for your reply! > > > Hmm, iov_iter_extract_pages does not exists for a long time and the code > > in fuse_get_user_pages didn't change much. So if you are right, there > > would be a long term data corruption for page migrations? And a back > > port to old kernels would not be obvious? > > Right. The issue has been reproduced under various versions of kernels, > ranging from 3.10.0 to 6.3.6 in my tests. It would be different to make > a patch under older kernels like 3.10.0. One way I tested, one can query > the physical pages associated with read buffer after data is ready > (right before writing the data into read buffer). This seems resolving > the issue in my tests. > > > > What confuses me further is that > > commit 85dd2c8ff368 does not mention migration or corruption, although > > lists several other advantages for iov_iter_extract_pages. Other commits > > using iov_iter_extract_pages point to fork - i.e. would your data > > corruption be possibly related that? > > As I mentioned above, the issue seems resolved if we query the physical > pages as late as right before writing data into read buffer. I think the > root cause is page migration. > out of interest, what is your exact reproducer and how much time does i take? I'm just trying passthrough_hp(*) and ql-fstest (**) and don't get and issue after about 1h run time. I let it continue over the weekend. The system is an older dual socket xeon. (*) with slight modification for passthrough_hp to disable O_DIRECT for the underlying file system. It is running on xfs on an nvme. (**) https://github.com/bsbernd/ql-fstest Pinning the pages is certainly a good idea, I would just like to understand how severe the issue is. And would like to test backports/different patch on older kernels. Thanks, Bernd
Hi Bernd, Actually, we do not have a simple code to reproduce this issue yet. We observed the issues in a suite of tests which shedules multiple (6~8) tests concurrently. I could observe about 5~10 cases with wrong data in one day. I tried repeating a single test for a couple of days and did not observe the issue. I think concurrently running multiple processes which can make bad memory fragmentation could help to observe the issue. Please let me know if you need more information. Thank you very much for looking into this issue. Have a great weekend! Best Regards, -lei On 9/1/23 14:05, Bernd Schubert wrote: > Hi Lei, > > On 8/30/23 03:03, Lei Huang wrote: >> Hi Bernd, >> >> Thank you very much for your reply! >> >> > Hmm, iov_iter_extract_pages does not exists for a long time and the >> code >> > in fuse_get_user_pages didn't change much. So if you are right, there >> > would be a long term data corruption for page migrations? And a back >> > port to old kernels would not be obvious? >> >> Right. The issue has been reproduced under various versions of >> kernels, ranging from 3.10.0 to 6.3.6 in my tests. It would be >> different to make a patch under older kernels like 3.10.0. One way I >> tested, one can query >> the physical pages associated with read buffer after data is ready >> (right before writing the data into read buffer). This seems resolving >> the issue in my tests. >> >> >> > What confuses me further is that >> > commit 85dd2c8ff368 does not mention migration or corruption, although >> > lists several other advantages for iov_iter_extract_pages. Other >> commits >> > using iov_iter_extract_pages point to fork - i.e. would your data >> > corruption be possibly related that? >> >> As I mentioned above, the issue seems resolved if we query the >> physical pages as late as right before writing data into read buffer. >> I think the root cause is page migration. >> > > out of interest, what is your exact reproducer and how much time does i > take? I'm just trying passthrough_hp(*) and ql-fstest (**) and don't get > and issue after about 1h run time. I let it continue over the weekend. > The system is an older dual socket xeon. > > (*) with slight modification for passthrough_hp to disable O_DIRECT for > the underlying file system. It is running on xfs on an nvme. > > (**) https://github.com/bsbernd/ql-fstest > > > Pinning the pages is certainly a good idea, I would just like to > understand how severe the issue is. And would like to test > backports/different patch on older kernels. > > > Thanks, > Bernd > >
Hi Bernd, Actually, we do not have a simple code to reproduce this issue yet. We observed the issues in a suite of tests. There are up to 8 tests scheduled to run concurrently with many small tests. I could observe about 5~10 cases with wrong data in one day. I tried repeating a single test for very long time and did not observe the issue. I guess it would help to run simultaneously multiple processes causing bad memory fragmentation. Please let me know if you need more information. Thank you very much for looking into this issue. Have a great weekend! Best Regards, -lei On 9/1/23 14:05, Bernd Schubert wrote: > Hi Lei, > > On 8/30/23 03:03, Lei Huang wrote: >> Hi Bernd, >> >> Thank you very much for your reply! >> >> > Hmm, iov_iter_extract_pages does not exists for a long time and the >> code >> > in fuse_get_user_pages didn't change much. So if you are right, there >> > would be a long term data corruption for page migrations? And a back >> > port to old kernels would not be obvious? >> >> Right. The issue has been reproduced under various versions of >> kernels, ranging from 3.10.0 to 6.3.6 in my tests. It would be >> different to make a patch under older kernels like 3.10.0. One way I >> tested, one can query >> the physical pages associated with read buffer after data is ready >> (right before writing the data into read buffer). This seems resolving >> the issue in my tests. >> >> >> > What confuses me further is that >> > commit 85dd2c8ff368 does not mention migration or corruption, although >> > lists several other advantages for iov_iter_extract_pages. Other >> commits >> > using iov_iter_extract_pages point to fork - i.e. would your data >> > corruption be possibly related that? >> >> As I mentioned above, the issue seems resolved if we query the >> physical pages as late as right before writing data into read buffer. >> I think the root cause is page migration. >> > > out of interest, what is your exact reproducer and how much time does i > take? I'm just trying passthrough_hp(*) and ql-fstest (**) and don't get > and issue after about 1h run time. I let it continue over the weekend. > The system is an older dual socket xeon. > > (*) with slight modification for passthrough_hp to disable O_DIRECT for > the underlying file system. It is running on xfs on an nvme. > > (**) https://github.com/bsbernd/ql-fstest > > > Pinning the pages is certainly a good idea, I would just like to > understand how severe the issue is. And would like to test > backports/different patch on older kernels. > > > Thanks, > Bernd > >
Hi Lei, On 8/30/23 03:03, Lei Huang wrote: > Hi Bernd, > > Thank you very much for your reply! > > > Hmm, iov_iter_extract_pages does not exists for a long time and the code > > in fuse_get_user_pages didn't change much. So if you are right, there > > would be a long term data corruption for page migrations? And a back > > port to old kernels would not be obvious? > > Right. The issue has been reproduced under various versions of kernels, > ranging from 3.10.0 to 6.3.6 in my tests. It would be different to make > a patch under older kernels like 3.10.0. One way I tested, one can query > the physical pages associated with read buffer after data is ready > (right before writing the data into read buffer). This seems resolving > the issue in my tests. > > > > What confuses me further is that > > commit 85dd2c8ff368 does not mention migration or corruption, although > > lists several other advantages for iov_iter_extract_pages. Other commits > > using iov_iter_extract_pages point to fork - i.e. would your data > > corruption be possibly related that? > > As I mentioned above, the issue seems resolved if we query the physical > pages as late as right before writing data into read buffer. I think the > root cause is page migration. > out of interest, what is your exact reproducer and how much time does i take? I'm just trying passthrough_hp(*) and ql-fstest (**) and don't get and issue after about 1h run time. I let it continue over the weekend. The system is an older dual socket xeon. (*) with slight modification for passthrough_hp to disable O_DIRECT for the underlying file system. It is running on xfs on an nvme. (**) https://github.com/bsbernd/ql-fstest Pinning the pages is certainly a good idea, I would just like to understand how severe the issue is. And would like to test backports/different patch on older kernels. Thanks, Bernd
© 2016 - 2025 Red Hat, Inc.