[PATCH overlayfs-next v3 0/4] overlayfs: Optimize override/revert creds

Vinicius Costa Gomes posted 4 patches 2 weeks, 4 days ago
There is a newer version of this series
fs/backing-file.c        | 20 ++++++++++----------
fs/overlayfs/copy_up.c   |  2 +-
fs/overlayfs/dir.c       | 17 +++++++++++------
fs/overlayfs/file.c      | 14 +++++++-------
fs/overlayfs/inode.c     | 20 ++++++++++----------
fs/overlayfs/namei.c     | 10 +++++-----
fs/overlayfs/overlayfs.h |  1 +
fs/overlayfs/readdir.c   |  8 ++++----
fs/overlayfs/util.c      | 11 ++++++++---
fs/overlayfs/xattrs.c    |  9 ++++-----
include/linux/cred.h     | 18 ++++++++++++++++++
kernel/cred.c            |  6 +++---
12 files changed, 82 insertions(+), 54 deletions(-)
[PATCH overlayfs-next v3 0/4] overlayfs: Optimize override/revert creds
Posted by Vinicius Costa Gomes 2 weeks, 4 days ago
Hi,

This series is rebased on top of Amir's overlayfs-next branch.

Changes from v2:
 - Removed the "convert to guard()/scoped_guard()" patches (Miklos Szeredi);
 - In the overlayfs code, convert all users of override_creds()/revert_creds() to the _light() versions by:
      1. making ovl_override_creds() use override_creds_light();
      2. introduce ovl_revert_creds() which calls revert_creds_light();
      3. convert revert_creds() to ovl_revert_creds()
   (Amir Goldstein);
 - Fix an potential reference counting issue, as the lifetime
   expectations of the mounter credentials are different (Christian
   Brauner);

The series is now much simpler:

Patch 1: Introduce the _light() version of the override/revert cred operations;
Patch 2: Convert backing-file.c to use those;
Patch 3: Do the conversion to use the _light() version internally;
Patch 4: Fix a potential refcounting issue

Changes from v1:
 - Re-organized the series to be easier to follow, more details below
   (Miklos Szeredi and Amir Goldstein);

The series now reads as follows:

Patch 1: Introduce the _light() version of the override/revert cred operations;
Patch 2: Convert backing-file.c to use those;
Patch 3: Introduce the overlayfs specific _light() helper;
Patch 4: Document the cases that the helper cannot be used (critical
      section may change the cred->usage counter);
Patch 5: Convert the "rest" of overlayfs to the _light() helpers (mostly mechanical);
Patch 6: Introduce the GUARD() helpers;
Patch 7: Convert backing-file.c to the GUARD() helpers;
Patch 8-15: Convert each overlayfs/ file to use the GUARD() helpers,
      also explain the cases in which the scoped_guard() helper is
      used. Note that a 'goto' jump that crosses the guard() should
      fail to compile, gcc has a bug that fails to detect the
      error[1].
Patch 16: Remove the helper introduced in Patch 3 to close the series,
      as it is no longer used, everything was converted to use the
      safer/shorter GUARD() helpers.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91951

This bug was also noticed here:

https://lore.kernel.org/all/20240730050927.GC5334@ZenIV/

Link to v1:

https://lore.kernel.org/r/20240403021808.309900-1-vinicius.gomes@intel.com/

Changes from RFC v3:
 - Removed the warning "fixes" patches, as they could hide potencial
   bugs (Christian Brauner);
 - Added "cred-specific" macros (Christian Brauner), from my side,
   added a few '_' to the guards to signify that the newly introduced
   helper macros are preferred.
 - Changed a few guard() to scoped_guard() to fix the clang (17.0.6)
   compilation error about 'goto' bypassing variable initialization;

Link to RFC v3:

https://lore.kernel.org/r/20240216051640.197378-1-vinicius.gomes@intel.com/

Changes from RFC v2:
 - Added separate patches for the warnings for the discarded const
   when using the cleanup macros: one for DEFINE_GUARD() and one for
   DEFINE_LOCK_GUARD_1() (I am uncertain if it's better to squash them
   together);
 - Reordered the series so the backing file patch is the first user of
   the introduced helpers (Amir Goldstein);
 - Change the definition of the cleanup "class" from a GUARD to a
   LOCK_GUARD_1, which defines an implicit container, that allows us
   to remove some variable declarations to store the overriden
   credentials (Amir Goldstein);
 - Replaced most of the uses of scoped_guard() with guard(), to reduce
   the code churn, the remaining ones I wasn't sure if I was changing
   the behavior: either they were nested (overrides "inside"
   overrides) or something calls current_cred() (Amir Goldstein).

New questions:
 - The backing file callbacks are now called with the "light"
   overriden credentials, so they are kind of restricted in what they
   can do with their credentials, is this acceptable in general?
 - in ovl_rename() I had to manually call the "light" the overrides,
   both using the guard() macro or using the non-light version causes
   the workload to crash the kernel. I still have to investigate why
   this is happening. Hints are appreciated.

Link to the RFC v2:

https://lore.kernel.org/r/20240125235723.39507-1-vinicius.gomes@intel.com/

Original cover letter (lightly edited):

It was noticed that some workloads suffer from contention on
increasing/decrementing the ->usage counter in their credentials,
those refcount operations are associated with overriding/reverting the
current task credentials. (the linked thread adds more context)

In some specialized cases, overlayfs is one of them, the credentials
in question have a longer lifetime than the override/revert "critical
section". In the overlayfs case, the credentials are created when the
fs is mounted and destroyed when it's unmounted. In this case of long
lived credentials, the usage counter doesn't need to be
incremented/decremented.

Add a lighter version of credentials override/revert to be used in
these specialized cases. To make sure that the override/revert calls
are paired, add a cleanup guard macro. This was suggested here:

https://lore.kernel.org/all/20231219-marken-pochen-26d888fb9bb9@brauner/

With a small number of tweaks:
 - Used inline functions instead of macros;
 - A small change to store the credentials into the passed argument,
   the guard is now defined as (note the added '_T ='):

      DEFINE_GUARD(cred, const struct cred *, _T = override_creds_light(_T),
		  revert_creds_light(_T));

 - Allow "const" arguments to be used with these kind of guards;

Some comments:
 - If patch 1/5 and 2/5 are not a good idea (adding the cast), the
   alternative I can see is using some kind of container for the
   credentials;
 - The only user for the backing file ops is overlayfs, so these
   changes make sense, but may not make sense in the most general
   case;

For the numbers, some from 'perf c2c', before this series:
(edited to fit)

#
#        ----- HITM -----                                        Shared                          
#   Num  RmtHitm  LclHitm                      Symbol            Object         Source:Line  Node
# .....  .......  .......  ..........................  ................  ..................  ....
#
  -------------------------
      0      412     1028  
  -------------------------
	  41.50%   42.22%  [k] revert_creds            [kernel.vmlinux]  atomic64_64.h:39     0  1
	  15.05%   10.60%  [k] override_creds          [kernel.vmlinux]  atomic64_64.h:25     0  1
	   0.73%    0.58%  [k] init_file               [kernel.vmlinux]  atomic64_64.h:25     0  1
	   0.24%    0.10%  [k] revert_creds            [kernel.vmlinux]  cred.h:266           0  1
	  32.28%   37.16%  [k] generic_permission      [kernel.vmlinux]  mnt_idmapping.h:81   0  1
	   9.47%    8.75%  [k] generic_permission      [kernel.vmlinux]  mnt_idmapping.h:81   0  1
	   0.49%    0.58%  [k] inode_owner_or_capable  [kernel.vmlinux]  mnt_idmapping.h:81   0  1
	   0.24%    0.00%  [k] generic_permission      [kernel.vmlinux]  namei.c:354          0

  -------------------------
      1       50      103  
  -------------------------
	 100.00%  100.00%  [k] update_cfs_group  [kernel.vmlinux]  atomic64_64.h:15   0  1

  -------------------------
      2       50       98  
  -------------------------
	  96.00%   96.94%  [k] update_cfs_group  [kernel.vmlinux]  atomic64_64.h:15   0  1
	   2.00%    1.02%  [k] update_load_avg   [kernel.vmlinux]  atomic64_64.h:25   0  1
	   0.00%    2.04%  [k] update_load_avg   [kernel.vmlinux]  fair.c:4118        0
	   2.00%    0.00%  [k] update_cfs_group  [kernel.vmlinux]  fair.c:3932        0  1

after this series:

#
#        ----- HITM -----                                   Shared                        
#   Num  RmtHitm  LclHitm                 Symbol            Object       Source:Line  Node
# .....  .......  .......   ....................  ................  ................  ....
#
  -------------------------
      0       54       88  
  -------------------------
	 100.00%  100.00%   [k] update_cfs_group  [kernel.vmlinux]  atomic64_64.h:15   0  1

  -------------------------
      1       48       83  
  -------------------------
	  97.92%   97.59%   [k] update_cfs_group  [kernel.vmlinux]  atomic64_64.h:15   0  1
	   2.08%    1.20%   [k] update_load_avg   [kernel.vmlinux]  atomic64_64.h:25   0  1
	   0.00%    1.20%   [k] update_load_avg   [kernel.vmlinux]  fair.c:4118        0  1

  -------------------------
      2       28       44  
  -------------------------
	  85.71%   79.55%   [k] generic_permission      [kernel.vmlinux]  mnt_idmapping.h:81   0  1
	  14.29%   20.45%   [k] generic_permission      [kernel.vmlinux]  mnt_idmapping.h:81   0  1

The contention is practically gone.

Link: https://lore.kernel.org/all/20231018074553.41333-1-hu1.chen@intel.com/


Vinicius Costa Gomes (4):
  cred: Add a light version of override/revert_creds()
  fs/backing-file: Convert to revert/override_creds_light()
  fs/overlayfs: Optimize override/revert creds
  fs/overlayfs: Drop creds usage decrement for ovl_setup_for_create()

 fs/backing-file.c        | 20 ++++++++++----------
 fs/overlayfs/copy_up.c   |  2 +-
 fs/overlayfs/dir.c       | 17 +++++++++++------
 fs/overlayfs/file.c      | 14 +++++++-------
 fs/overlayfs/inode.c     | 20 ++++++++++----------
 fs/overlayfs/namei.c     | 10 +++++-----
 fs/overlayfs/overlayfs.h |  1 +
 fs/overlayfs/readdir.c   |  8 ++++----
 fs/overlayfs/util.c      | 11 ++++++++---
 fs/overlayfs/xattrs.c    |  9 ++++-----
 include/linux/cred.h     | 18 ++++++++++++++++++
 kernel/cred.c            |  6 +++---
 12 files changed, 82 insertions(+), 54 deletions(-)

-- 
2.47.0
Re: [PATCH overlayfs-next v3 0/4] overlayfs: Optimize override/revert creds
Posted by Amir Goldstein 2 weeks, 4 days ago
On Tue, Nov 5, 2024 at 8:35 PM Vinicius Costa Gomes
<vinicius.gomes@intel.com> wrote:
>
> Hi,
>
> This series is rebased on top of Amir's overlayfs-next branch.
>
> Changes from v2:
>  - Removed the "convert to guard()/scoped_guard()" patches (Miklos Szeredi);
>  - In the overlayfs code, convert all users of override_creds()/revert_creds() to the _light() versions by:
>       1. making ovl_override_creds() use override_creds_light();
>       2. introduce ovl_revert_creds() which calls revert_creds_light();
>       3. convert revert_creds() to ovl_revert_creds()
>    (Amir Goldstein);
>  - Fix an potential reference counting issue, as the lifetime
>    expectations of the mounter credentials are different (Christian
>    Brauner);
>

Hi Vicius,

The end result looks good to me, but we still need to do the series a
bit differently.

> The series is now much simpler:
>
> Patch 1: Introduce the _light() version of the override/revert cred operations;
> Patch 2: Convert backing-file.c to use those;
> Patch 3: Do the conversion to use the _light() version internally;

This patch mixes a small logic change and a large mechanical change
that is not a good mix.

I took the liberty to split out the large mechanical change to
ovl: use wrapper ovl_revert_creds()
and pushed it to branch
https://github.com/amir73il/linux/commits/ovl_creds

I then rebased overlayfs-next over this commit and resolved the
conflicts with the pure mechanical change.

Now you can rebase your patches over ovl_creds and they should
not be conflicting with overlayfs-next changes.

The reason I wanted to do this is that Christian could take your changes
as well as my ovl_creds branch through the vfs tree if he chooses to do so.

> Patch 4: Fix a potential refcounting issue

This patch cannot be separated from patch #3 because it would introduce the
refcount leak mid series.

But after I took out all the mechanical changes out of patch #3,
there should be no problem for you to squash patches #3 and #4 together.

One more nit: please use "ovl: ..." for commit titles instead of
"fs/overlayfs: ...".

Thanks,
Amir.
Re: [PATCH overlayfs-next v3 0/4] overlayfs: Optimize override/revert creds
Posted by Vinicius Costa Gomes 2 weeks, 4 days ago
Hi,

Amir Goldstein <amir73il@gmail.com> writes:

> On Tue, Nov 5, 2024 at 8:35 PM Vinicius Costa Gomes
> <vinicius.gomes@intel.com> wrote:
>>
>> Hi,
>>
>> This series is rebased on top of Amir's overlayfs-next branch.
>>
>> Changes from v2:
>>  - Removed the "convert to guard()/scoped_guard()" patches (Miklos Szeredi);
>>  - In the overlayfs code, convert all users of override_creds()/revert_creds() to the _light() versions by:
>>       1. making ovl_override_creds() use override_creds_light();
>>       2. introduce ovl_revert_creds() which calls revert_creds_light();
>>       3. convert revert_creds() to ovl_revert_creds()
>>    (Amir Goldstein);
>>  - Fix an potential reference counting issue, as the lifetime
>>    expectations of the mounter credentials are different (Christian
>>    Brauner);
>>
>
> Hi Vicius,
>
> The end result looks good to me, but we still need to do the series a
> bit differently.
>
>> The series is now much simpler:
>>
>> Patch 1: Introduce the _light() version of the override/revert cred operations;
>> Patch 2: Convert backing-file.c to use those;
>> Patch 3: Do the conversion to use the _light() version internally;
>
> This patch mixes a small logic change and a large mechanical change
> that is not a good mix.
>
> I took the liberty to split out the large mechanical change to
> ovl: use wrapper ovl_revert_creds()
> and pushed it to branch
> https://github.com/amir73il/linux/commits/ovl_creds
>
> I then rebased overlayfs-next over this commit and resolved the
> conflicts with the pure mechanical change.
>
> Now you can rebase your patches over ovl_creds and they should
> not be conflicting with overlayfs-next changes.
>
> The reason I wanted to do this is that Christian could take your changes
> as well as my ovl_creds branch through the vfs tree if he chooses to do so.
>

Makes sense.

>> Patch 4: Fix a potential refcounting issue
>
> This patch cannot be separated from patch #3 because it would introduce the
> refcount leak mid series.
>
> But after I took out all the mechanical changes out of patch #3,
> there should be no problem for you to squash patches #3 and #4 together.
>

Done. 

> One more nit: please use "ovl: ..." for commit titles instead of
> "fs/overlayfs: ...".
>

Also done. Will give the series a round of testing, just to be sure, and
will send the next version tomorrow.

> Thanks,
> Amir.


Cheers,
-- 
Vinicius