[PATCH v4 0/3] FUSE: Implement atomic lookup + open/create

Dharmendra Singh posted 3 patches 4 years ago
There is a newer version of this series
fs/fuse/dir.c             | 211 +++++++++++++++++++++++++++++++++++---
fs/fuse/file.c            |  30 +++++-
fs/fuse/fuse_i.h          |  16 ++-
fs/fuse/inode.c           |   4 +-
fs/fuse/ioctl.c           |   2 +-
include/uapi/linux/fuse.h |   5 +
6 files changed, 246 insertions(+), 22 deletions(-)
[PATCH v4 0/3] FUSE: Implement atomic lookup + open/create
Posted by Dharmendra Singh 4 years ago
In FUSE, as of now, uncached lookups are expensive over the wire.
E.g additional latencies and stressing (meta data) servers from
thousands of clients. These lookup calls possibly can be avoided
in some cases. Incoming three patches address this issue.


Fist patch handles the case where we are creating a file with O_CREAT.
Before we go for file creation, we do a lookup on the file which is most
likely non-existent. After this lookup is done, we again go into libfuse
to create file. Such lookups where file is most likely non-existent, can
be avoided.

Second patch handles the case where we open first time a file/dir
but do a lookup first on it. After lookup is performed we make another
call into libfuse to open the file. Now these two separate calls into
libfuse can be combined and performed as a single call into libfuse.

Third patch handles the case when we are opening an already existing file
(positive dentry). Before this open call, we re-validate the inode and
this re-validation does a lookup on the file and verify the inode.
This separate lookup also can be avoided (for non-dir) and combined
with open call into libfuse. After open returns we can revalidate the inode.
This optimisation is performed only when we do not have default permissions
enabled.

Here is the link to performance numbers
https://lore.kernel.org/linux-fsdevel/20220322121212.5087-1-dharamhans87@gmail.com/


Dharmendra Singh (3):
  FUSE: Implement atomic lookup + create
  Implement atomic lookup + open
  Avoid lookup in d_revalidate()

 fs/fuse/dir.c             | 211 +++++++++++++++++++++++++++++++++++---
 fs/fuse/file.c            |  30 +++++-
 fs/fuse/fuse_i.h          |  16 ++-
 fs/fuse/inode.c           |   4 +-
 fs/fuse/ioctl.c           |   2 +-
 include/uapi/linux/fuse.h |   5 +
 6 files changed, 246 insertions(+), 22 deletions(-)

---
v4: Addressed all comments and refactored the code into 3 separate patches
    respectively for Atomic create, Atomic open, optimizing lookup in
    d_revalidate().
---
Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create
Posted by Vivek Goyal 4 years ago
On Mon, May 02, 2022 at 03:55:18PM +0530, Dharmendra Singh wrote:
> In FUSE, as of now, uncached lookups are expensive over the wire.
> E.g additional latencies and stressing (meta data) servers from
> thousands of clients. These lookup calls possibly can be avoided
> in some cases. Incoming three patches address this issue.

BTW, these patches are designed to improve performance by cutting down
on number of fuse commands sent. Are there any performance numbers
which demonstrate what kind of improvement you are seeing.

Say, If I do kernel build, is the performance improvement observable?

Thanks
Vivek

> 
> 
> Fist patch handles the case where we are creating a file with O_CREAT.
> Before we go for file creation, we do a lookup on the file which is most
> likely non-existent. After this lookup is done, we again go into libfuse
> to create file. Such lookups where file is most likely non-existent, can
> be avoided.
> 
> Second patch handles the case where we open first time a file/dir
> but do a lookup first on it. After lookup is performed we make another
> call into libfuse to open the file. Now these two separate calls into
> libfuse can be combined and performed as a single call into libfuse.
> 
> Third patch handles the case when we are opening an already existing file
> (positive dentry). Before this open call, we re-validate the inode and
> this re-validation does a lookup on the file and verify the inode.
> This separate lookup also can be avoided (for non-dir) and combined
> with open call into libfuse. After open returns we can revalidate the inode.
> This optimisation is performed only when we do not have default permissions
> enabled.
> 
> Here is the link to performance numbers
> https://lore.kernel.org/linux-fsdevel/20220322121212.5087-1-dharamhans87@gmail.com/
> 
> 
> Dharmendra Singh (3):
>   FUSE: Implement atomic lookup + create
>   Implement atomic lookup + open
>   Avoid lookup in d_revalidate()
> 
>  fs/fuse/dir.c             | 211 +++++++++++++++++++++++++++++++++++---
>  fs/fuse/file.c            |  30 +++++-
>  fs/fuse/fuse_i.h          |  16 ++-
>  fs/fuse/inode.c           |   4 +-
>  fs/fuse/ioctl.c           |   2 +-
>  include/uapi/linux/fuse.h |   5 +
>  6 files changed, 246 insertions(+), 22 deletions(-)
> 
> ---
> v4: Addressed all comments and refactored the code into 3 separate patches
>     respectively for Atomic create, Atomic open, optimizing lookup in
>     d_revalidate().
> ---
>
Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create
Posted by Dharmendra Hans 4 years ago
On Thu, May 5, 2022 at 12:48 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Mon, May 02, 2022 at 03:55:18PM +0530, Dharmendra Singh wrote:
> > In FUSE, as of now, uncached lookups are expensive over the wire.
> > E.g additional latencies and stressing (meta data) servers from
> > thousands of clients. These lookup calls possibly can be avoided
> > in some cases. Incoming three patches address this issue.
>
> BTW, these patches are designed to improve performance by cutting down
> on number of fuse commands sent. Are there any performance numbers
> which demonstrate what kind of improvement you are seeing.
>
> Say, If I do kernel build, is the performance improvement observable?

Here are the numbers I took last time. These were taken on tmpfs to
actually see the effect of reduced calls. On local file systems it
might not be that much visible. But we have observed that on systems
where we have thousands of clients hammering the metadata servers, it
helps a lot (We did not take numbers yet as  we are required to change
a lot of our client code but would be doing it later on).

Note that for a change in performance number due to the new version of
these patches, we have just refactored the code and functionality has
remained the same since then.

here is the link to the performance numbers
https://lore.kernel.org/linux-fsdevel/20220322121212.5087-1-dharamhans87@gmail.com/
Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create
Posted by Vivek Goyal 4 years ago
On Thu, May 05, 2022 at 11:42:51AM +0530, Dharmendra Hans wrote:
> On Thu, May 5, 2022 at 12:48 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Mon, May 02, 2022 at 03:55:18PM +0530, Dharmendra Singh wrote:
> > > In FUSE, as of now, uncached lookups are expensive over the wire.
> > > E.g additional latencies and stressing (meta data) servers from
> > > thousands of clients. These lookup calls possibly can be avoided
> > > in some cases. Incoming three patches address this issue.
> >
> > BTW, these patches are designed to improve performance by cutting down
> > on number of fuse commands sent. Are there any performance numbers
> > which demonstrate what kind of improvement you are seeing.
> >
> > Say, If I do kernel build, is the performance improvement observable?
> 
> Here are the numbers I took last time. These were taken on tmpfs to
> actually see the effect of reduced calls. On local file systems it
> might not be that much visible. But we have observed that on systems
> where we have thousands of clients hammering the metadata servers, it
> helps a lot (We did not take numbers yet as  we are required to change
> a lot of our client code but would be doing it later on).
> 
> Note that for a change in performance number due to the new version of
> these patches, we have just refactored the code and functionality has
> remained the same since then.
> 
> here is the link to the performance numbers
> https://lore.kernel.org/linux-fsdevel/20220322121212.5087-1-dharamhans87@gmail.com/

There is a lot going in that table. Trying to understand it. 

- Why care about No-Flush. I mean that's independent of these changes,
  right?  I am assuming this means that upon file close do not send
  a flush to fuse server. Not sure how bringing No-Flush into the
  mix is helpful here.

- What is "Patched Libfuse"? I am assuming that these are changes
  needed in libfuse to support atomic create + atomic open. Similarly
  assuming "Patched FuseK" means patched kernel with your changes.

  If this is correct, I would probably only be interested in 
  looking at "Patched Libfuse + Patched FuseK" numbers to figure out
  what's the effect of your changes w.r.t vanilla kernel + libfuse.
  Am I understanding it right?

- I am wondering why do we measure "Sequential" and "Random" patterns. 
  These optimizations are primarily for file creation + file opening
  and I/O pattern should not matter. 

- Also wondering why performance of Read/s improves. Assuming once
  file has been opened, I think your optimizations get out of the
  way (no create, no open) and we are just going through data path of
  reading file data and no lookups happening. If that's the case, why
  do Read/s numbers show an improvement.

- Why do we measure "Patched Libfuse". It shows performance regression
  of 4-5% in table 0B, Sequential workoad. That sounds bad. So without
  any optimization kicking in, it has a performance cost.

Thanks
Vivek
Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create
Posted by Bernd Schubert 4 years ago

On 5/5/22 14:54, Vivek Goyal wrote:
> On Thu, May 05, 2022 at 11:42:51AM +0530, Dharmendra Hans wrote:
>> Here are the numbers I took last time. These were taken on tmpfs to
>> actually see the effect of reduced calls. On local file systems it
>> might not be that much visible. But we have observed that on systems
>> where we have thousands of clients hammering the metadata servers, it
>> helps a lot (We did not take numbers yet as  we are required to change
>> a lot of our client code but would be doing it later on).
>>
>> Note that for a change in performance number due to the new version of
>> these patches, we have just refactored the code and functionality has
>> remained the same since then.
>>
>> here is the link to the performance numbers
>> https://lore.kernel.org/linux-fsdevel/20220322121212.5087-1-dharamhans87@gmail.com/
> 
> There is a lot going in that table. Trying to understand it.
> 
> - Why care about No-Flush. I mean that's independent of these changes,
>    right?  I am assuming this means that upon file close do not send
>    a flush to fuse server. Not sure how bringing No-Flush into the
>    mix is helpful here.


It is a basically removing another call from kernel to user space. The 
calls there are, the lower is the resulting percentage for atomic-open.


> 
> - What is "Patched Libfuse"? I am assuming that these are changes
>    needed in libfuse to support atomic create + atomic open. Similarly
>    assuming "Patched FuseK" means patched kernel with your changes.

Yes, I did that to ensure there is no regression with the patches, when 
the other side is not patched.

> 
>    If this is correct, I would probably only be interested in
>    looking at "Patched Libfuse + Patched FuseK" numbers to figure out
>    what's the effect of your changes w.r.t vanilla kernel + libfuse.
>    Am I understanding it right?

Yes.

> 
> - I am wondering why do we measure "Sequential" and "Random" patterns.
>    These optimizations are primarily for file creation + file opening
>    and I/O pattern should not matter.

bonnie++ does this automatically and it just convenient to take the 
bonnie++ csv value and to paste them into a table.

In our HPC world mdtest is more common, but it has MPI as requirement - 
make it harder to run. Reproducing the values with bonnie++ should be 
rather easy for you.

Only issue with bonnie++ is that bonnie++ by default does not run 
multi-threaded and the old 3rd party perl scripts I had to let it run 
with multiple processes and to sum up the values don't work anymore with 
recent perl versions. I need to find some time to fix that.


> 
> - Also wondering why performance of Read/s improves. Assuming once
>    file has been opened, I think your optimizations get out of the
>    way (no create, no open) and we are just going through data path of
>    reading file data and no lookups happening. If that's the case, why
>    do Read/s numbers show an improvement.

That is now bonnie++ works. It creates the files, closes them (which 
causes the flush) and then re-opens for stat and read - atomic open 
comes into the picture here. Also read() is totally skipped when the 
files are empty - which is why one should use something like 1B files.

If you have another metadata benchmark - please let us know.

> 
> - Why do we measure "Patched Libfuse". It shows performance regression
>    of 4-5% in table 0B, Sequential workoad. That sounds bad. So without
>    any optimization kicking in, it has a performance cost.

Yes, I'm not sure yet. There is not so much code that has changed on the 
libfuse side.
However the table needs to be redone with fixed libfuse - limiting the 
number of threads caused a permanent libfuse thread creation and destruction

https://github.com/libfuse/libfuse/pull/652

The numbers in table are also with paththrough_ll, which has its own 
issue due to linear inode search. paththrough_hp uses a C++ map and 
avoids that. I noticed too late when I started to investigate why there 
are regressions....

Also the table made me to investigate/profile all the fuse operations, 
which resulted in my waitq question. Please see that thread for more 
details 
https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@fastmail.fm/T/

Regarding atomic-open/create with avoiding lookup/revalidate, our 
primary goal is to reduce network calls. A file system that handles it 
locally only reduces the number of fuse kernel/user space crossing. A 
network file system that fully supports it needs to do the atomic open 
(or in old terms lookup-intent-open) on the server side of the network 
and needs to transfer attributes together with the open result.

Lustre does this, although I cannot easily point you to the right code. 
It all started almost two decades ago:
https://groups.google.com/g/lucky.linux.fsdevel/c/iYNFIIrkJ1s


BeeGFS does this as well
https://git.beegfs.io/pub/v7/-/blob/master/client_module/source/filesystem/FhgfsOpsInode.c
See for examaple FhgfsOps_atomicOpen() and FhgfsOps_createIntent()

(FhGFS is the old name when I was still involved in the project.)

 From my head I'm not sure if NFS does it over the wire, maybe v4.


Thanks,
Bernd
Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create
Posted by Vivek Goyal 4 years ago
On Thu, May 05, 2022 at 05:13:00PM +0200, Bernd Schubert wrote:
> 
> 
> On 5/5/22 14:54, Vivek Goyal wrote:
> > On Thu, May 05, 2022 at 11:42:51AM +0530, Dharmendra Hans wrote:
> > > Here are the numbers I took last time. These were taken on tmpfs to
> > > actually see the effect of reduced calls. On local file systems it
> > > might not be that much visible. But we have observed that on systems
> > > where we have thousands of clients hammering the metadata servers, it
> > > helps a lot (We did not take numbers yet as  we are required to change
> > > a lot of our client code but would be doing it later on).
> > > 
> > > Note that for a change in performance number due to the new version of
> > > these patches, we have just refactored the code and functionality has
> > > remained the same since then.
> > > 
> > > here is the link to the performance numbers
> > > https://lore.kernel.org/linux-fsdevel/20220322121212.5087-1-dharamhans87@gmail.com/
> > 
> > There is a lot going in that table. Trying to understand it.
> > 
> > - Why care about No-Flush. I mean that's independent of these changes,
> >    right?  I am assuming this means that upon file close do not send
> >    a flush to fuse server. Not sure how bringing No-Flush into the
> >    mix is helpful here.
> 
> 
> It is a basically removing another call from kernel to user space. The calls
> there are, the lower is the resulting percentage for atomic-open.

Ok. You want to get rid of FUSE_FLUSH call so that % of FUSE_LOOKUP calls
go up and that will effectively show higher % of improvement due to
this. 

> 
> 
> > 
> > - What is "Patched Libfuse"? I am assuming that these are changes
> >    needed in libfuse to support atomic create + atomic open. Similarly
> >    assuming "Patched FuseK" means patched kernel with your changes.
> 
> Yes, I did that to ensure there is no regression with the patches, when the
> other side is not patched.
> 
> > 
> >    If this is correct, I would probably only be interested in
> >    looking at "Patched Libfuse + Patched FuseK" numbers to figure out
> >    what's the effect of your changes w.r.t vanilla kernel + libfuse.
> >    Am I understanding it right?
> 
> Yes.
> 
> > 
> > - I am wondering why do we measure "Sequential" and "Random" patterns.
> >    These optimizations are primarily for file creation + file opening
> >    and I/O pattern should not matter.
> 
> bonnie++ does this automatically and it just convenient to take the bonnie++
> csv value and to paste them into a table.
> 
> In our HPC world mdtest is more common, but it has MPI as requirement - make
> it harder to run. Reproducing the values with bonnie++ should be rather easy
> for you.
> 
> Only issue with bonnie++ is that bonnie++ by default does not run
> multi-threaded and the old 3rd party perl scripts I had to let it run with
> multiple processes and to sum up the values don't work anymore with recent
> perl versions. I need to find some time to fix that.
> 
> 
> > 
> > - Also wondering why performance of Read/s improves. Assuming once
> >    file has been opened, I think your optimizations get out of the
> >    way (no create, no open) and we are just going through data path of
> >    reading file data and no lookups happening. If that's the case, why
> >    do Read/s numbers show an improvement.
> 
> That is now bonnie++ works. It creates the files, closes them (which causes
> the flush) and then re-opens for stat and read - atomic open comes into the
> picture here. Also read() is totally skipped when the files are empty -
> which is why one should use something like 1B files.
> 
> If you have another metadata benchmark - please let us know.

Once I was pointed at smallfile.

https://github.com/distributed-system-analysis/smallfile

I ran this when I was posting the patches for virtiofs.

https://patchwork.kernel.org/project/linux-fsdevel/cover/20181210171318.16998-1-vgoyal@redhat.com/

See if this is something interesting to you.


> 
> > 
> > - Why do we measure "Patched Libfuse". It shows performance regression
> >    of 4-5% in table 0B, Sequential workoad. That sounds bad. So without
> >    any optimization kicking in, it has a performance cost.
> 
> Yes, I'm not sure yet. There is not so much code that has changed on the
> libfuse side.
> However the table needs to be redone with fixed libfuse - limiting the
> number of threads caused a permanent libfuse thread creation and destruction
> 
> https://github.com/libfuse/libfuse/pull/652
> 
> The numbers in table are also with paththrough_ll, which has its own issue
> due to linear inode search. paththrough_hp uses a C++ map and avoids that. I
> noticed too late when I started to investigate why there are regressions....
> 
> Also the table made me to investigate/profile all the fuse operations, which
> resulted in my waitq question. Please see that thread for more details https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@fastmail.fm/T/
> 
> Regarding atomic-open/create with avoiding lookup/revalidate, our primary
> goal is to reduce network calls. A file system that handles it locally only
> reduces the number of fuse kernel/user space crossing. A network file system
> that fully supports it needs to do the atomic open (or in old terms
> lookup-intent-open) on the server side of the network and needs to transfer
> attributes together with the open result.

Oh, I have no issues with the intent. I will like to see cut in network
traffic too (if we can do this without introducing problems). My primary
interest is that this kind of change should benefit virtiofs as well.

I am just trying to understand how much performance improvement is
actually there. And also trying to improve the quality of implementation.
Frankly speaking, it all seems very twisted and hard to read (and
hence maintain) code at this point fo time.

That's why I am going into the details to understand and suggest some
improvements.

Thanks
Vivek

> 
> Lustre does this, although I cannot easily point you to the right code. It
> all started almost two decades ago:
> https://groups.google.com/g/lucky.linux.fsdevel/c/iYNFIIrkJ1s
> 
> 
> BeeGFS does this as well
> https://git.beegfs.io/pub/v7/-/blob/master/client_module/source/filesystem/FhgfsOpsInode.c
> See for examaple FhgfsOps_atomicOpen() and FhgfsOps_createIntent()
> 
> (FhGFS is the old name when I was still involved in the project.)
> 
> From my head I'm not sure if NFS does it over the wire, maybe v4.
> 
> 
> Thanks,
> Bernd
> 
> 
> 
> 
> 
>
Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create
Posted by Miklos Szeredi 3 years, 12 months ago
On Thu, 5 May 2022 at 21:59, Vivek Goyal <vgoyal@redhat.com> wrote:

> Oh, I have no issues with the intent. I will like to see cut in network
> traffic too (if we can do this without introducing problems). My primary
> interest is that this kind of change should benefit virtiofs as well.

One issue with that appears to be checking permissions.   AFAIU this
patchset only enables the optimization if default_permissions is
turned off (i.e. all permission checking is done by the server).  But
virtiofs uses the default_permissions model.

I'm not quite sure about this limitation, guessing that it's related
to the fact that the permissions may be stale at the time of checking?

Thanks,
Miklos
Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create
Posted by Vivek Goyal 3 years, 12 months ago
On Wed, May 11, 2022 at 11:40:59AM +0200, Miklos Szeredi wrote:
> On Thu, 5 May 2022 at 21:59, Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > Oh, I have no issues with the intent. I will like to see cut in network
> > traffic too (if we can do this without introducing problems). My primary
> > interest is that this kind of change should benefit virtiofs as well.
> 
> One issue with that appears to be checking permissions.   AFAIU this
> patchset only enables the optimization if default_permissions is
> turned off (i.e. all permission checking is done by the server).  But
> virtiofs uses the default_permissions model.

IIUC, only 3rd patch mentions that default_permission should be turned
off. IOW, first patch where lookup + create + open is a single operation
and second patch which does "lookup + open" in a single operation does
not seem to require that default_permissions are not in effect.

So if first two patches work fine, I think virtiofs should benefit too.
(IMHO, 3rd patch is too hacky anyway)

W.r.t permission checks, looks like may_open() will finally be called
after ->atomic_open(). So even if we open the file, we should still be
able to check whether we have permissions to open the file or not
after the fact.

fs/namei.c

path_openat()
{
	open_last_lookups()  <--- This calls ->atomic_open()
	do_open()  <--- This calls may_open()
}

Thanks
Vivek

> 
> I'm not quite sure about this limitation, guessing that it's related
> to the fact that the permissions may be stale at the time of checking?
> 
> Thanks,
> Miklos
>
Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create
Posted by Vivek Goyal 3 years, 12 months ago
On Wed, May 11, 2022 at 01:21:13PM -0400, Vivek Goyal wrote:
> On Wed, May 11, 2022 at 11:40:59AM +0200, Miklos Szeredi wrote:
> > On Thu, 5 May 2022 at 21:59, Vivek Goyal <vgoyal@redhat.com> wrote:
> > 
> > > Oh, I have no issues with the intent. I will like to see cut in network
> > > traffic too (if we can do this without introducing problems). My primary
> > > interest is that this kind of change should benefit virtiofs as well.
> > 
> > One issue with that appears to be checking permissions.   AFAIU this
> > patchset only enables the optimization if default_permissions is
> > turned off (i.e. all permission checking is done by the server).  But
> > virtiofs uses the default_permissions model.
> 
> IIUC, only 3rd patch mentions that default_permission should be turned
> off. IOW, first patch where lookup + create + open is a single operation
> and second patch which does "lookup + open" in a single operation does
> not seem to require that default_permissions are not in effect.
> 
> So if first two patches work fine, I think virtiofs should benefit too.
> (IMHO, 3rd patch is too hacky anyway)
> 
> W.r.t permission checks, looks like may_open() will finally be called
> after ->atomic_open(). So even if we open the file, we should still be
> able to check whether we have permissions to open the file or not
> after the fact.
> 
> fs/namei.c
> 
> path_openat()
> {
> 	open_last_lookups()  <--- This calls ->atomic_open()
> 	do_open()  <--- This calls may_open()
> }

Actually I am not sure about it. I was playing with 

open(foo.txt, O_CREAT | O_RDWR, O_IRUSR)

This succeeds if file is newly created but if file already existed, this
fails with -EACCESS

So man 2 open says following. Thanks to Andy Price for pointing me to it.

    Note that mode applies only to future accesses of the newly cre‐
    ated  file;  the  open()  call that creates a read-only file may
    well return a read/write file descriptor.


Now I am wondering how will it look like with first patch. Assume file
already exists on the server (But there is no negative dentry present)
and I do following. And assume file only has read permission for user
and I am trying to open it read-write.

open(foo.txt, O_CREAT | O_RDWR, O_IRUSR)

In normal circumstances, user will expect -EACCESS as file is read-only
and user is trying to open it read-write.

I am wondering how will it look like with this first patch.

Current fuse ->atomic_open() looks up the dentry and does not open
the file if dentry is positive.

New implementation will skip lookup and open the file anyway and
set file->f_mode |= FMODE_CREATED; (First patch in series)

So first of all this seems wrong. I thought FMODE_CREATED should be
set only if file was newly created. Is that a correct understanding.

And I am looking at do_open() code. It does bunch of things based
on FMODE_CREATED flag. One of the things it does is reset acc_mode =0

        if (file->f_mode & FMODE_CREATED) {
                /* Don't check for write permission, don't truncate */
                open_flag &= ~O_TRUNC;
                acc_mode = 0;
	}
	error = may_open(mnt_userns, &nd->path, acc_mode, open_flag);

I suspect this is the code which allows opening a newly created read-only
file as O_RDWR. (Though I am not 100% sure).

I suspect with first patch this will be broken. We will set FMODE_CREATED
even if file already existed and VFS will assume a new file has been
created and do bunch of things which is wrong.

So looks like fuse ->atomic_open() should set FMODE_CREATED only if
it really created the file.

Thanks
Vivek

Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create
Posted by Dharmendra Hans 3 years, 12 months ago
On Thu, May 12, 2022 at 1:00 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Wed, May 11, 2022 at 01:21:13PM -0400, Vivek Goyal wrote:
> > On Wed, May 11, 2022 at 11:40:59AM +0200, Miklos Szeredi wrote:
> > > On Thu, 5 May 2022 at 21:59, Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > > Oh, I have no issues with the intent. I will like to see cut in network
> > > > traffic too (if we can do this without introducing problems). My primary
> > > > interest is that this kind of change should benefit virtiofs as well.
> > >
> > > One issue with that appears to be checking permissions.   AFAIU this
> > > patchset only enables the optimization if default_permissions is
> > > turned off (i.e. all permission checking is done by the server).  But
> > > virtiofs uses the default_permissions model.
> >
> > IIUC, only 3rd patch mentions that default_permission should be turned
> > off. IOW, first patch where lookup + create + open is a single operation
> > and second patch which does "lookup + open" in a single operation does
> > not seem to require that default_permissions are not in effect.
> >
> > So if first two patches work fine, I think virtiofs should benefit too.
> > (IMHO, 3rd patch is too hacky anyway)
> >
> > W.r.t permission checks, looks like may_open() will finally be called
> > after ->atomic_open(). So even if we open the file, we should still be
> > able to check whether we have permissions to open the file or not
> > after the fact.
> >
> > fs/namei.c
> >
> > path_openat()
> > {
> >       open_last_lookups()  <--- This calls ->atomic_open()
> >       do_open()  <--- This calls may_open()
> > }
>
> Actually I am not sure about it. I was playing with
>
> open(foo.txt, O_CREAT | O_RDWR, O_IRUSR)
>
> This succeeds if file is newly created but if file already existed, this
> fails with -EACCESS
>
> So man 2 open says following. Thanks to Andy Price for pointing me to it.
>
>     Note that mode applies only to future accesses of the newly cre‐
>     ated  file;  the  open()  call that creates a read-only file may
>     well return a read/write file descriptor.
>
>
> Now I am wondering how will it look like with first patch. Assume file
> already exists on the server (But there is no negative dentry present)
> and I do following. And assume file only has read permission for user
> and I am trying to open it read-write.
>
> open(foo.txt, O_CREAT | O_RDWR, O_IRUSR)
>
> In normal circumstances, user will expect -EACCESS as file is read-only
> and user is trying to open it read-write.
>
> I am wondering how will it look like with this first patch.
>
> Current fuse ->atomic_open() looks up the dentry and does not open
> the file if dentry is positive.
>
> New implementation will skip lookup and open the file anyway and
> set file->f_mode |= FMODE_CREATED; (First patch in series)
>
> So first of all this seems wrong. I thought FMODE_CREATED should be
> set only if file was newly created. Is that a correct understanding.

You are right. we should mark in f_mode only if the file was actually created.
Thanks for catching this. Appreciate your efforts:)

>
> And I am looking at do_open() code. It does bunch of things based
> on FMODE_CREATED flag. One of the things it does is reset acc_mode =0
>
>         if (file->f_mode & FMODE_CREATED) {
>                 /* Don't check for write permission, don't truncate */
>                 open_flag &= ~O_TRUNC;
>                 acc_mode = 0;
>         }
>         error = may_open(mnt_userns, &nd->path, acc_mode, open_flag);
>
> I suspect this is the code which allows opening a newly created read-only
> file as O_RDWR. (Though I am not 100% sure).

Correct. I could see it.

>
> I suspect with first patch this will be broken. We will set FMODE_CREATED
> even if file already existed and VFS will assume a new file has been
> created and do bunch of things which is wrong.
>
> So looks like fuse ->atomic_open() should set FMODE_CREATED only if
> it really created the file.

This looks like an obvious bug with new patches. But If I do not miss
anything then its a bug on distributed file systems with current code
as well.
It could happen this way(Taking your example which is perfect to trace
this on distributed systems):
Lets start with File is non-existent yet on the file system. There
comes the first client which does a lookup on the file, it does not
find the file. Meanwhile another client created the file on the File
system. Now  when this first client goes to create the file, before
going down it sets FMODE_CREATED on the file handle and then goes down
the lower layers. It comes back with inode but f->mode as
FMODE_CREATED which is incorrect. This mode then results in acc_mode
being set to zero which then allows access to the file as O_RDWR.

I think Miklos proposed to return the flag from user space if the file
was actually created, this would solve two problems 1) this access
problem and code execution going the wrong path 2) correct update if
parent dir changed or not.
Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create
Posted by Vivek Goyal 3 years, 12 months ago
On Thu, May 12, 2022 at 01:46:12PM +0530, Dharmendra Hans wrote:
> On Thu, May 12, 2022 at 1:00 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Wed, May 11, 2022 at 01:21:13PM -0400, Vivek Goyal wrote:
> > > On Wed, May 11, 2022 at 11:40:59AM +0200, Miklos Szeredi wrote:
> > > > On Thu, 5 May 2022 at 21:59, Vivek Goyal <vgoyal@redhat.com> wrote:
> > > >
> > > > > Oh, I have no issues with the intent. I will like to see cut in network
> > > > > traffic too (if we can do this without introducing problems). My primary
> > > > > interest is that this kind of change should benefit virtiofs as well.
> > > >
> > > > One issue with that appears to be checking permissions.   AFAIU this
> > > > patchset only enables the optimization if default_permissions is
> > > > turned off (i.e. all permission checking is done by the server).  But
> > > > virtiofs uses the default_permissions model.
> > >
> > > IIUC, only 3rd patch mentions that default_permission should be turned
> > > off. IOW, first patch where lookup + create + open is a single operation
> > > and second patch which does "lookup + open" in a single operation does
> > > not seem to require that default_permissions are not in effect.
> > >
> > > So if first two patches work fine, I think virtiofs should benefit too.
> > > (IMHO, 3rd patch is too hacky anyway)
> > >
> > > W.r.t permission checks, looks like may_open() will finally be called
> > > after ->atomic_open(). So even if we open the file, we should still be
> > > able to check whether we have permissions to open the file or not
> > > after the fact.
> > >
> > > fs/namei.c
> > >
> > > path_openat()
> > > {
> > >       open_last_lookups()  <--- This calls ->atomic_open()
> > >       do_open()  <--- This calls may_open()
> > > }
> >
> > Actually I am not sure about it. I was playing with
> >
> > open(foo.txt, O_CREAT | O_RDWR, O_IRUSR)
> >
> > This succeeds if file is newly created but if file already existed, this
> > fails with -EACCESS
> >
> > So man 2 open says following. Thanks to Andy Price for pointing me to it.
> >
> >     Note that mode applies only to future accesses of the newly cre‐
> >     ated  file;  the  open()  call that creates a read-only file may
> >     well return a read/write file descriptor.
> >
> >
> > Now I am wondering how will it look like with first patch. Assume file
> > already exists on the server (But there is no negative dentry present)
> > and I do following. And assume file only has read permission for user
> > and I am trying to open it read-write.
> >
> > open(foo.txt, O_CREAT | O_RDWR, O_IRUSR)
> >
> > In normal circumstances, user will expect -EACCESS as file is read-only
> > and user is trying to open it read-write.
> >
> > I am wondering how will it look like with this first patch.
> >
> > Current fuse ->atomic_open() looks up the dentry and does not open
> > the file if dentry is positive.
> >
> > New implementation will skip lookup and open the file anyway and
> > set file->f_mode |= FMODE_CREATED; (First patch in series)
> >
> > So first of all this seems wrong. I thought FMODE_CREATED should be
> > set only if file was newly created. Is that a correct understanding.
> 
> You are right. we should mark in f_mode only if the file was actually created.
> Thanks for catching this. Appreciate your efforts:)
> 
> >
> > And I am looking at do_open() code. It does bunch of things based
> > on FMODE_CREATED flag. One of the things it does is reset acc_mode =0
> >
> >         if (file->f_mode & FMODE_CREATED) {
> >                 /* Don't check for write permission, don't truncate */
> >                 open_flag &= ~O_TRUNC;
> >                 acc_mode = 0;
> >         }
> >         error = may_open(mnt_userns, &nd->path, acc_mode, open_flag);
> >
> > I suspect this is the code which allows opening a newly created read-only
> > file as O_RDWR. (Though I am not 100% sure).
> 
> Correct. I could see it.
> 
> >
> > I suspect with first patch this will be broken. We will set FMODE_CREATED
> > even if file already existed and VFS will assume a new file has been
> > created and do bunch of things which is wrong.
> >
> > So looks like fuse ->atomic_open() should set FMODE_CREATED only if
> > it really created the file.
> 
> This looks like an obvious bug with new patches. But If I do not miss
> anything then its a bug on distributed file systems with current code
> as well.
> It could happen this way(Taking your example which is perfect to trace
> this on distributed systems):
> Lets start with File is non-existent yet on the file system. There
> comes the first client which does a lookup on the file, it does not
> find the file. Meanwhile another client created the file on the File
> system. Now  when this first client goes to create the file, before
> going down it sets FMODE_CREATED on the file handle and then goes down
> the lower layers. It comes back with inode but f->mode as
> FMODE_CREATED which is incorrect. This mode then results in acc_mode
> being set to zero which then allows access to the file as O_RDWR.

I think with current code (FUSE_CREATE), it is a race with shared
filesystems where multiple clients might be sharing this directory.

We are looking up dentry first and issue FUSE_CREATE only if dentry
is negative. Now it is possible that between lookup() + FUSE_CREATE,
another client dropped in same file in the directory. But one could
argue, that's something not detectable by this client. So from user's
perspective, this client created the file.

If we have a negative dentry, then we will not do the lookup and
assumption is that we created file (even if another client created
it between previous lookup and this creation).

So agreed, this is a race but might not be very harmful one because
looks like we are providing weaker coherency with FUSE as opposed
to local filesystems. In fact this case fuse server running on a
directory which is shared by multiple clients bothers me a lot. There
seem to be many corner cases where it is not clear what will happen
if another client is doing operation.

> 
> I think Miklos proposed to return the flag from user space if the file
> was actually created, this would solve two problems 1) this access
> problem and code execution going the wrong path 2) correct update if
> parent dir changed or not.

Agreed that we could use new command FUSE_ATOMIC_CREATE/FUSE_CREATE_EXT
to figure out if file was newly created or not and then set FMODE_CREATED
accordingly.

W.r.t dir change, I am assuming we are invalidating dir attributes just
because if we created file, it could change dir's mtime/ctime. So yes,
FUSE_ATOMIC_CREATE/FUSE_CREATE_EXT could return this info whether file
was actually crated or not and invalidate dir's attribute accordingly.

Thanks
Vivek

Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create
Posted by Bernd Schubert 3 years, 12 months ago

On 5/11/22 11:40, Miklos Szeredi wrote:
> On Thu, 5 May 2022 at 21:59, Vivek Goyal <vgoyal@redhat.com> wrote:
> 
>> Oh, I have no issues with the intent. I will like to see cut in network
>> traffic too (if we can do this without introducing problems). My primary
>> interest is that this kind of change should benefit virtiofs as well.
> 
> One issue with that appears to be checking permissions.   AFAIU this
> patchset only enables the optimization if default_permissions is
> turned off (i.e. all permission checking is done by the server).  But
> virtiofs uses the default_permissions model.
> 
> I'm not quite sure about this limitation, guessing that it's related
> to the fact that the permissions may be stale at the time of checking?

Yes exactly, I had actually pointed this out in one of the mails.

<quote myself from 2022-04-05 >
Adding in a vfs ->open_revalidate might have the advantage that we could 
also support 'default_permissions' - ->open_revalidate  needs to 
additionally check the retrieved file permissions and and needs to call 
into generic_permissions for that. Right now that is not easily 
feasible, without adding some code dup to convert flags in MAY_* flags - 
a vfs change would be needed here to pass the right flags.
</quote>


Avoiding lookup for create should work without default_permissions, though.


With the current patches this comment should be added in 
fuse_dentry_revalidate() to clarify things (somehow it got lost in the 
(internal) review process).

/* Default permissions actually also would not need revalidate, if 
atomic_open() could verify attributes after opening the file. But the 
MAY_ flags are missing and the vfs build_open_flags() to recreate these 
flags not exported. Thus, default_permissions() cannot be called from 
here - vfs updates would be required */


Thanks,
Bernd