[v1] power: wire-up filesystem freeze/thaw with suspend/resume

[PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by Christian Brauner 10 months, 2 weeks ago

The whole shebang can also be found at:
https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze

I know nothing about power or hibernation. I've tested it as best as I
could. Works for me (TM).

I need to catch some actual sleep now...

---

Now all the pieces are in place to actually allow the power subsystem to
freeze/thaw filesystems during suspend/resume. Filesystems are only
frozen and thawed if the power subsystem does actually own the freeze.

Othwerwise it risks thawing filesystems it didn't own. This could be
done differently be e.g., keeping the filesystems that were actually
frozen on a list and then unfreezing them from that list. This is
disgustingly unclean though and reeks of an ugly hack.

If the filesystem is already frozen by the time we've frozen all
userspace processes we don't care to freeze it again. That's userspace's
job once the process resumes. We only actually freeze filesystems if we
absolutely have to and we ignore other failures to freeze.

We could bubble up errors and fail suspend/resume if the error isn't
EBUSY (aka it's already frozen) but I don't think that this is worth it.
Filesystem freezing during suspend/resume is best-effort. If the user
has 500 ext4 filesystems mounted and 4 fail to freeze for whatever
reason then we simply skip them.

What we have now is already a big improvement and let's see how we fare
with it before making our lives even harder (and uglier) than we have
to.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (3):
      fs: add owner of freeze/thaw
      fs: allow pagefault based writers to be frozen
      power: freeze filesystems during suspend/resume

Luis Chamberlain (3):
      ext4: replace kthread freezing with auto fs freezing
      btrfs: replace kthread freezing with auto fs freezing
      xfs: replace kthread freezing with auto fs freezing

 fs/btrfs/disk-io.c          |  4 +--
 fs/btrfs/scrub.c            |  2 +-
 fs/ext4/mballoc.c           |  2 +-
 fs/ext4/super.c             |  3 --
 fs/f2fs/gc.c                |  6 ++--
 fs/gfs2/super.c             | 20 ++++++-----
 fs/gfs2/sys.c               |  4 +--
 fs/ioctl.c                  |  8 ++---
 fs/super.c                  | 82 ++++++++++++++++++++++++++++++++++++---------
 fs/xfs/scrub/fscounters.c   |  4 +--
 fs/xfs/xfs_discard.c        |  2 +-
 fs/xfs/xfs_log.c            |  3 +-
 fs/xfs/xfs_log_cil.c        |  2 +-
 fs/xfs/xfs_mru_cache.c      |  2 +-
 fs/xfs/xfs_notify_failure.c |  6 ++--
 fs/xfs/xfs_pwork.c          |  2 +-
 fs/xfs/xfs_super.c          | 14 ++++----
 fs/xfs/xfs_trans_ail.c      |  3 --
 fs/xfs/xfs_zone_gc.c        |  2 --
 include/linux/fs.h          | 16 ++++++---
 kernel/power/hibernate.c    | 13 ++++++-
 kernel/power/suspend.c      |  8 +++++
 22 files changed, 139 insertions(+), 69 deletions(-)
---
base-commit: a68c99192db8060f383a2680333866c0be688ece
change-id: 20250401-work-freeze-693b5b5a78e0

Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by James Bottomley 10 months, 1 week ago

On Tue, 2025-04-01 at 02:32 +0200, Christian Brauner wrote:
> The whole shebang can also be found at:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> 
> I know nothing about power or hibernation. I've tested it as best as
> I could. Works for me (TM).

I'm testing the latest you have in work.freeze and it doesn't currently
work for me.  Patch 7b315c39b67d ("power: freeze filesystems during
suspend/resume") doesn't set filesystems_freeze_ptr so it ends up being
NULL and tripping over this check 

+static inline bool may_unfreeze(struct super_block *sb, enum
freeze_holder who,
+                               const void *freeze_owner)
+{
+       WARN_ON_ONCE((who & ~FREEZE_FLAGS));
+       WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
+
+       if (who & FREEZE_EXCL) {
+               if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
+                       return false;

in f15a9ae05a71 ("fs: add owner of freeze/thaw") and failing to resume
from hibernate.  Setting it to __builtin_return_address(0) in
filesystems_freeze() makes everything work as expected, so that's what
I'm testing now.

I suppose one minor, minor nit is that the vagaries of English grammar
mean that the verbs fail and succeed don't take the same grammatical
construction, so failed can take the infinitive (failed to thaw)
perfectly well, but succeeded takes a prepositional gerund construction
instead: "succeeded at/in thawing" instead of the infinitive "succeeded
to thaw" ... I've no idea why, but I'd probably blame the Victorians
...

Regards,

James

Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by Christian Brauner 10 months, 1 week ago

On Tue, Apr 01, 2025 at 01:02:07PM -0400, James Bottomley wrote:
> On Tue, 2025-04-01 at 02:32 +0200, Christian Brauner wrote:
> > The whole shebang can also be found at:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> > 
> > I know nothing about power or hibernation. I've tested it as best as
> > I could. Works for me (TM).
> 
> I'm testing the latest you have in work.freeze and it doesn't currently
> work for me.  Patch 7b315c39b67d ("power: freeze filesystems during
> suspend/resume") doesn't set filesystems_freeze_ptr so it ends up being
> NULL and tripping over this check 

I haven't pushed the new version there. Sorry about that. I only have it
locally.

> 
> +static inline bool may_unfreeze(struct super_block *sb, enum
> freeze_holder who,
> +                               const void *freeze_owner)
> +{
> +       WARN_ON_ONCE((who & ~FREEZE_FLAGS));
> +       WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
> +
> +       if (who & FREEZE_EXCL) {
> +               if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
> +                       return false;
> 
> 
> in f15a9ae05a71 ("fs: add owner of freeze/thaw") and failing to resume
> from hibernate.  Setting it to __builtin_return_address(0) in
> filesystems_freeze() makes everything work as expected, so that's what
> I'm testing now.

+1

I'll send the final version out in a bit.

Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by James Bottomley 10 months ago

On Wed, 2025-04-02 at 09:46 +0200, Christian Brauner wrote:
> On Tue, Apr 01, 2025 at 01:02:07PM -0400, James Bottomley wrote:
> > On Tue, 2025-04-01 at 02:32 +0200, Christian Brauner wrote:
> > > The whole shebang can also be found at:
> > > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> > > 
> > > I know nothing about power or hibernation. I've tested it as best
> > > as I could. Works for me (TM).
> > 
> > I'm testing the latest you have in work.freeze and it doesn't
> > currently work for me.  Patch 7b315c39b67d ("power: freeze
> > filesystems during suspend/resume") doesn't set
> > filesystems_freeze_ptr so it ends up being NULL and tripping over
> > this check 
> 
> I haven't pushed the new version there. Sorry about that. I only have
> it locally.
> 
> > 
> > +static inline bool may_unfreeze(struct super_block *sb, enum
> > freeze_holder who,
> > +                               const void *freeze_owner)
> > +{
> > +       WARN_ON_ONCE((who & ~FREEZE_FLAGS));
> > +       WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
> > +
> > +       if (who & FREEZE_EXCL) {
> > +               if (WARN_ON_ONCE(sb->s_writers.freeze_owner ==
> > NULL))
> > +                       return false;
> > 
> > 
> > in f15a9ae05a71 ("fs: add owner of freeze/thaw") and failing to
> > resume from hibernate.  Setting it to __builtin_return_address(0)
> > in filesystems_freeze() makes everything work as expected, so
> > that's what I'm testing now.
> 
> +1
> 
> I'll send the final version out in a bit.

I've now done some extensive testing on loop nested filesystems with
fio load on the upper. I've tried xfs on ext4 and ext4 on ext4.
Hibernate/Resume has currently worked on these without a hitch (and the
fio load burps a bit but then starts running at full speed within a few
seconds). What I'm doing is a single round of hibernate/resume followed
by a reboot. I'm relying on the fschecks to detect any filesystem
corruption. I've also tried doing a couple of fresh starts of the
hibernated image to check that we did correctly freeze the filesystems.

The problems I've noticed are:

   1. I'm using 9p to push host directories throught and that
      completely hangs after a resume. This is expected because the
      virtio server is out of sync, but it does indicate a need to
      address Jeff's question of what we should be doing for network
      filesystems (and is also the reason I have to reboot after
      resuming).
   2. Top doesn't show any CPU activity after resume even though fio is
      definitely running.  This seems to be a suspend issue and
      unrelated to filesystems, but I'll continue investigating.

Regards,

James

Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by Christian Brauner 10 months ago

On Tue, Apr 08, 2025 at 11:43:46AM -0400, James Bottomley wrote:
> On Wed, 2025-04-02 at 09:46 +0200, Christian Brauner wrote:
> > On Tue, Apr 01, 2025 at 01:02:07PM -0400, James Bottomley wrote:
> > > On Tue, 2025-04-01 at 02:32 +0200, Christian Brauner wrote:
> > > > The whole shebang can also be found at:
> > > > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> > > > 
> > > > I know nothing about power or hibernation. I've tested it as best
> > > > as I could. Works for me (TM).
> > > 
> > > I'm testing the latest you have in work.freeze and it doesn't
> > > currently work for me.  Patch 7b315c39b67d ("power: freeze
> > > filesystems during suspend/resume") doesn't set
> > > filesystems_freeze_ptr so it ends up being NULL and tripping over
> > > this check 
> > 
> > I haven't pushed the new version there. Sorry about that. I only have
> > it locally.
> > 
> > > 
> > > +static inline bool may_unfreeze(struct super_block *sb, enum
> > > freeze_holder who,
> > > +                               const void *freeze_owner)
> > > +{
> > > +       WARN_ON_ONCE((who & ~FREEZE_FLAGS));
> > > +       WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
> > > +
> > > +       if (who & FREEZE_EXCL) {
> > > +               if (WARN_ON_ONCE(sb->s_writers.freeze_owner ==
> > > NULL))
> > > +                       return false;
> > > 
> > > 
> > > in f15a9ae05a71 ("fs: add owner of freeze/thaw") and failing to
> > > resume from hibernate.  Setting it to __builtin_return_address(0)
> > > in filesystems_freeze() makes everything work as expected, so
> > > that's what I'm testing now.
> > 
> > +1
> > 
> > I'll send the final version out in a bit.
> 
> I've now done some extensive testing on loop nested filesystems with
> fio load on the upper. I've tried xfs on ext4 and ext4 on ext4.
> Hibernate/Resume has currently worked on these without a hitch (and the
> fio load burps a bit but then starts running at full speed within a few
> seconds). What I'm doing is a single round of hibernate/resume followed
> by a reboot. I'm relying on the fschecks to detect any filesystem
> corruption. I've also tried doing a couple of fresh starts of the
> hibernated image to check that we did correctly freeze the filesystems.
> 
> The problems I've noticed are:
> 
>    1. I'm using 9p to push host directories throught and that
>       completely hangs after a resume. This is expected because the
>       virtio server is out of sync, but it does indicate a need to
>       address Jeff's question of what we should be doing for network
>       filesystems (and is also the reason I have to reboot after
>       resuming).

No network filesystem supports freeze/thaw so they're not frozen/thawed
during hibernation.

virtiofs doesn't support filesystem freezing either and it warns about
virtio_driver based freezing not being implemented.

Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by Luis Chamberlain 10 months ago

On Tue, Apr 08, 2025 at 11:43:46AM -0400, James Bottomley wrote:
> On Wed, 2025-04-02 at 09:46 +0200, Christian Brauner wrote:
> > On Tue, Apr 01, 2025 at 01:02:07PM -0400, James Bottomley wrote:
> > > On Tue, 2025-04-01 at 02:32 +0200, Christian Brauner wrote:
> > > > The whole shebang can also be found at:
> > > > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> > > > 
> > > > I know nothing about power or hibernation. I've tested it as best
> > > > as I could. Works for me (TM).
> > > 
> > > I'm testing the latest you have in work.freeze and it doesn't
> > > currently work for me.  Patch 7b315c39b67d ("power: freeze
> > > filesystems during suspend/resume") doesn't set
> > > filesystems_freeze_ptr so it ends up being NULL and tripping over
> > > this check 
> > 
> > I haven't pushed the new version there. Sorry about that. I only have
> > it locally.
> > 
> > > 
> > > +static inline bool may_unfreeze(struct super_block *sb, enum
> > > freeze_holder who,
> > > +                               const void *freeze_owner)
> > > +{
> > > +       WARN_ON_ONCE((who & ~FREEZE_FLAGS));
> > > +       WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
> > > +
> > > +       if (who & FREEZE_EXCL) {
> > > +               if (WARN_ON_ONCE(sb->s_writers.freeze_owner ==
> > > NULL))
> > > +                       return false;
> > > 
> > > 
> > > in f15a9ae05a71 ("fs: add owner of freeze/thaw") and failing to
> > > resume from hibernate.  Setting it to __builtin_return_address(0)
> > > in filesystems_freeze() makes everything work as expected, so
> > > that's what I'm testing now.
> > 
> > +1
> > 
> > I'll send the final version out in a bit.
> 
> I've now done some extensive testing on loop nested filesystems with
> fio load on the upper. I've tried xfs on ext4 and ext4 on ext4.
> Hibernate/Resume has currently worked on these without a hitch (and the
> fio load burps a bit but then starts running at full speed within a few
> seconds). What I'm doing is a single round of hibernate/resume followed
> by a reboot. I'm relying on the fschecks to detect any filesystem
> corruption. I've also tried doing a couple of fresh starts of the
> hibernated image to check that we did correctly freeze the filesystems.
> 
> The problems I've noticed are:
> 
>    1. I'm using 9p to push host directories throught and that
>       completely hangs after a resume. This is expected because the
>       virtio server is out of sync, but it does indicate a need to
>       address Jeff's question of what we should be doing for network
>       filesystems (and is also the reason I have to reboot after
>       resuming).
>    2. Top doesn't show any CPU activity after resume even though fio is
>       definitely running.  This seems to be a suspend issue and
>       unrelated to filesystems, but I'll continue investigating.

To be clear, on the fio run -- are you running fio *while*
suspend/resume cycle on XFS? That used to stall / break suspend
resume. We may want to test dd against a drive too, that will use
the block device cache, and I forget if we have a freeze/thaw for it.

  Luis

Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by James Bottomley 10 months ago

On Tue, 2025-04-08 at 10:09 -0700, Luis Chamberlain wrote:
> On Tue, Apr 08, 2025 at 11:43:46AM -0400, James Bottomley wrote:
[...]
> > I've now done some extensive testing on loop nested filesystems
> > with fio load on the upper. I've tried xfs on ext4 and ext4 on
> > ext4. Hibernate/Resume has currently worked on these without a
> > hitch (and the fio load burps a bit but then starts running at full
> > speed within a few seconds). What I'm doing is a single round of
> > hibernate/resume followed by a reboot. I'm relying on the fschecks
> > to detect any filesystem corruption. I've also tried doing a couple
> > of fresh starts of the hibernated image to check that we did
> > correctly freeze the filesystems.
> > 
> > The problems I've noticed are:
> > 
> >    1. I'm using 9p to push host directories throught and that
> >       completely hangs after a resume. This is expected because the
> >       virtio server is out of sync, but it does indicate a need to
> >       address Jeff's question of what we should be doing for
> > network
> >       filesystems (and is also the reason I have to reboot after
> >       resuming).
> >    2. Top doesn't show any CPU activity after resume even though
> > fio is
> >       definitely running.  This seems to be a suspend issue and
> >       unrelated to filesystems, but I'll continue investigating.
> 
> To be clear, on the fio run -- are you running fio *while*
> suspend/resume cycle on XFS?

Yes, that's why I said "the fio load burps a bit" (as in after resume)
"but then starts running full speed after a few seconds".

>  That used to stall / break suspend resume. We may want to test dd
> against a drive too, that will use the block device cache, and I
> forget if we have a freeze/thaw for it.

fio is running a read/write test, but I think all my caches are write
through for safety (although I have verified that the device cache
flush is sent as the last sequence of hibernate).

Regards,

James

Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by Luis Chamberlain 10 months ago

And in case its useful, to test this on a VM you'll need on libvirt:

</os>                                                                                                                                                                                                                                      
<pm>                                                                                                                                                                                                                                     
<suspend-to-mem enabled='yes'/>                                                                                                                                                                                                          
<suspend-to-disk enabled='yes'/>                                                                                                                                                                                                         
</pm>   

The litmus test which used to stall without ever returning was:

while true; do                                                                                                                                                                                                                               
dd if=/dev/zero of=/path/to/xfs/foo bs=1M count=1024 &> /dev/null
done 

To suspend you can use one of these two:

echo mem > /sys/power/state
systemctl suspend

Verify you can resume from suspend:
virsh qemu-monitor-command guest_foo 'query-current-machine'
{"return":{"wakeup-suspend-support":true},"id":"libvirt-415"}

To resume the guest:

virsh dompmwakeup guest_foo

  Luis

Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by James Bottomley 10 months ago

On Tue, 2025-04-08 at 10:20 -0700, Luis Chamberlain wrote:
> And in case its useful, to test this on a VM you'll need on libvirt:

Just so we're clear, I'm only doing hibernate, not suspend tests.  I
figure they should be a bit harsher, but you never know.  The reason is
I set up my test rig for efivarfs, which only has a testable problem on
hibernate.

Regards,

James

Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by Peter Zijlstra 10 months, 1 week ago

On Tue, Apr 01, 2025 at 02:32:45AM +0200, Christian Brauner wrote:
> The whole shebang can also be found at:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> 
> I know nothing about power or hibernation. I've tested it as best as I
> could. Works for me (TM).
> 
> I need to catch some actual sleep now...
> 
> ---
> 
> Now all the pieces are in place to actually allow the power subsystem to
> freeze/thaw filesystems during suspend/resume. Filesystems are only
> frozen and thawed if the power subsystem does actually own the freeze.

Urgh, I was relying on all kthreads to be freezable for live-patching:

  https://lkml.kernel.org/r/20250324134909.GA14718@noisy.programming.kicks-ass.net

So I understand the problem with freezing filesystems, but can't we
leave the TASK_FREEZABLE in the kthreads? The way I understand it, the
power subsystem will first freeze the filesystems before it goes freeze
threads anyway. So them remaining freezable should not affect anything,
right?

Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by Christian Brauner 10 months, 1 week ago

On Tue, Apr 01, 2025 at 04:14:07PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 01, 2025 at 02:32:45AM +0200, Christian Brauner wrote:
> > The whole shebang can also be found at:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> > 
> > I know nothing about power or hibernation. I've tested it as best as I
> > could. Works for me (TM).
> > 
> > I need to catch some actual sleep now...
> > 
> > ---
> > 
> > Now all the pieces are in place to actually allow the power subsystem to
> > freeze/thaw filesystems during suspend/resume. Filesystems are only
> > frozen and thawed if the power subsystem does actually own the freeze.
> 
> Urgh, I was relying on all kthreads to be freezable for live-patching:
> 
>   https://lkml.kernel.org/r/20250324134909.GA14718@noisy.programming.kicks-ass.net
> 
> So I understand the problem with freezing filesystems, but can't we
> leave the TASK_FREEZABLE in the kthreads? The way I understand it, the

Yeah, we can.

> power subsystem will first freeze the filesystems before it goes freeze
> threads anyway. So them remaining freezable should not affect anything,
> right?

Yes. I've dropped the other patches. I've discussed this later
downthread with Jan.
>

Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by Peter Zijlstra 10 months, 1 week ago

On Tue, Apr 01, 2025 at 04:40:33PM +0200, Christian Brauner wrote:
> On Tue, Apr 01, 2025 at 04:14:07PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 01, 2025 at 02:32:45AM +0200, Christian Brauner wrote:
> > > The whole shebang can also be found at:
> > > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> > > 
> > > I know nothing about power or hibernation. I've tested it as best as I
> > > could. Works for me (TM).
> > > 
> > > I need to catch some actual sleep now...
> > > 
> > > ---
> > > 
> > > Now all the pieces are in place to actually allow the power subsystem to
> > > freeze/thaw filesystems during suspend/resume. Filesystems are only
> > > frozen and thawed if the power subsystem does actually own the freeze.
> > 
> > Urgh, I was relying on all kthreads to be freezable for live-patching:
> > 
> >   https://lkml.kernel.org/r/20250324134909.GA14718@noisy.programming.kicks-ass.net
> > 
> > So I understand the problem with freezing filesystems, but can't we
> > leave the TASK_FREEZABLE in the kthreads? The way I understand it, the
> 
> Yeah, we can.
> 
> > power subsystem will first freeze the filesystems before it goes freeze
> > threads anyway. So them remaining freezable should not affect anything,
> > right?
> 
> Yes. I've dropped the other patches. I've discussed this later
> downthread with Jan.

Thanks!

Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by Jan Kara 10 months, 1 week ago

On Tue 01-04-25 02:32:45, Christian Brauner wrote:
> The whole shebang can also be found at:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> 
> I know nothing about power or hibernation. I've tested it as best as I
> could. Works for me (TM).
> 
> I need to catch some actual sleep now...
> 
> ---
> 
> Now all the pieces are in place to actually allow the power subsystem to
> freeze/thaw filesystems during suspend/resume. Filesystems are only
> frozen and thawed if the power subsystem does actually own the freeze.
> 
> Othwerwise it risks thawing filesystems it didn't own. This could be
> done differently be e.g., keeping the filesystems that were actually
> frozen on a list and then unfreezing them from that list. This is
> disgustingly unclean though and reeks of an ugly hack.
> 
> If the filesystem is already frozen by the time we've frozen all
> userspace processes we don't care to freeze it again. That's userspace's
> job once the process resumes. We only actually freeze filesystems if we
> absolutely have to and we ignore other failures to freeze.

Hum, I don't follow here. I supposed we'll use FREEZE_MAY_NEST |
FREEZE_HOLDER_KERNEL for freezing from power subsystem. As far as I
remember we have specifically designed nesting of freeze counters so that
this way power subsystem can be sure freezing succeeds even if the
filesystem is already frozen (by userspace or the kernel) and similarly
power subsystem cannot thaw a filesystem frozen by somebody else. It will
just drop its freeze refcount... What am I missing?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by Christian Brauner 10 months, 1 week ago

On Tue, Apr 01, 2025 at 11:32:49AM +0200, Jan Kara wrote:
> On Tue 01-04-25 02:32:45, Christian Brauner wrote:
> > The whole shebang can also be found at:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> > 
> > I know nothing about power or hibernation. I've tested it as best as I
> > could. Works for me (TM).
> > 
> > I need to catch some actual sleep now...
> > 
> > ---
> > 
> > Now all the pieces are in place to actually allow the power subsystem to
> > freeze/thaw filesystems during suspend/resume. Filesystems are only
> > frozen and thawed if the power subsystem does actually own the freeze.
> > 
> > Othwerwise it risks thawing filesystems it didn't own. This could be
> > done differently be e.g., keeping the filesystems that were actually
> > frozen on a list and then unfreezing them from that list. This is
> > disgustingly unclean though and reeks of an ugly hack.
> > 
> > If the filesystem is already frozen by the time we've frozen all
> > userspace processes we don't care to freeze it again. That's userspace's
> > job once the process resumes. We only actually freeze filesystems if we
> > absolutely have to and we ignore other failures to freeze.
> 
> Hum, I don't follow here. I supposed we'll use FREEZE_MAY_NEST |
> FREEZE_HOLDER_KERNEL for freezing from power subsystem. As far as I
> remember we have specifically designed nesting of freeze counters so that
> this way power subsystem can be sure freezing succeeds even if the
> filesystem is already frozen (by userspace or the kernel) and similarly
> power subsystem cannot thaw a filesystem frozen by somebody else. It will
> just drop its freeze refcount... What am I missing?

If we have 10 filesystems and suspend/hibernate manges to freeze 5 and
then fails on the 6th for whatever odd reason (current or future) then
power needs to undo the freeze of the first 5 filesystems. We can't just
walk the list again because while it's unlikely that a new filesystem
got added in the meantime we still cannot tell what filesystems the
power subsystem actually managed to get a freeze reference count on that
we need to drop during thaw.

There's various ways out of this ugliness. Either we record the
filesystems the power subsystem managed to freeze on a temporary list in
the callbacks and then walk that list backwards during thaw to undo the
freezing or we make sure that the power subsystem just actually
exclusively freezes things it can freeze and marking such filesystems as
being owned by power for the duration of the suspend or resume cycle. I
opted for the latter as that seemed the clean thing to do even if it
means more code changes. What are your thoughts on this?

Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by Jan Kara 10 months, 1 week ago

On Tue 01-04-25 15:03:33, Christian Brauner wrote:
> On Tue, Apr 01, 2025 at 11:32:49AM +0200, Jan Kara wrote:
> > On Tue 01-04-25 02:32:45, Christian Brauner wrote:
> > > The whole shebang can also be found at:
> > > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> > > 
> > > I know nothing about power or hibernation. I've tested it as best as I
> > > could. Works for me (TM).
> > > 
> > > I need to catch some actual sleep now...
> > > 
> > > ---
> > > 
> > > Now all the pieces are in place to actually allow the power subsystem to
> > > freeze/thaw filesystems during suspend/resume. Filesystems are only
> > > frozen and thawed if the power subsystem does actually own the freeze.
> > > 
> > > Othwerwise it risks thawing filesystems it didn't own. This could be
> > > done differently be e.g., keeping the filesystems that were actually
> > > frozen on a list and then unfreezing them from that list. This is
> > > disgustingly unclean though and reeks of an ugly hack.
> > > 
> > > If the filesystem is already frozen by the time we've frozen all
> > > userspace processes we don't care to freeze it again. That's userspace's
> > > job once the process resumes. We only actually freeze filesystems if we
> > > absolutely have to and we ignore other failures to freeze.
> > 
> > Hum, I don't follow here. I supposed we'll use FREEZE_MAY_NEST |
> > FREEZE_HOLDER_KERNEL for freezing from power subsystem. As far as I
> > remember we have specifically designed nesting of freeze counters so that
> > this way power subsystem can be sure freezing succeeds even if the
> > filesystem is already frozen (by userspace or the kernel) and similarly
> > power subsystem cannot thaw a filesystem frozen by somebody else. It will
> > just drop its freeze refcount... What am I missing?
> 
> If we have 10 filesystems and suspend/hibernate manges to freeze 5 and
> then fails on the 6th for whatever odd reason (current or future) then
> power needs to undo the freeze of the first 5 filesystems. We can't just
> walk the list again because while it's unlikely that a new filesystem
> got added in the meantime we still cannot tell what filesystems the
> power subsystem actually managed to get a freeze reference count on that
> we need to drop during thaw.
> 
> There's various ways out of this ugliness. Either we record the
> filesystems the power subsystem managed to freeze on a temporary list in
> the callbacks and then walk that list backwards during thaw to undo the
> freezing or we make sure that the power subsystem just actually
> exclusively freezes things it can freeze and marking such filesystems as
> being owned by power for the duration of the suspend or resume cycle. I
> opted for the latter as that seemed the clean thing to do even if it
> means more code changes. What are your thoughts on this?

Ah, I see. Thanks for explanation. So failure to freeze filesystem should
be rare (mostly only due to IO errors or similar serious issues) hence
I'd consider failing hibernation in case we fail to freeze some filesystem
appropriate. The function that's walking all superblocks and freezing them
could just walk from the superblock where freezing failed towards the end
and thaw all filesystems. That way the function also has the nice property
that it either freezes everything or keeps things as they were.

But you've touched on an interesting case I didn't consider: New
superblocks can be added to the end of the list while we are walking it.
These superblocks will not be frozen and on resume (or error recovery) this
will confuse things. Your "freeze owner" stuff deals with this problem
nicely. Somewhat lighter fix for this may be to provide the superblock to
start from / end with to these loops iterating and freezing / thawing
superblocks. It doesn't seem too hacky but if you prefer your freeze owner
approach I won't object.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[PATCH v2 0/4] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by Christian Brauner 10 months, 1 week ago

Now all the pieces are in place to actually allow the power subsystem to
freeze/thaw filesystems during suspend/resume. Filesystems are only
frozen and thawed if the power subsystem does actually own the freeze.

Othwerwise it risks thawing filesystems it didn't own. This could be
done differently be e.g., keeping the filesystems that were actually
frozen on a list and then unfreezing them from that list. This is
disgustingly unclean though and reeks of an ugly hack.

If the filesystem is already frozen by the time we've frozen all
userspace processes we don't care to freeze it again. That's userspace's
job once the process resumes. We only actually freeze filesystems if we
absolutely have to and we ignore other failures to freeze.

We could bubble up errors and fail suspend/resume if the error isn't
EBUSY (aka it's already frozen) but I don't think that this is worth it.
Filesystem freezing during suspend/resume is best-effort. If the user
has 500 ext4 filesystems mounted and 4 fail to freeze for whatever
reason then we simply skip them.

What we have now is already a big improvement and let's see how we fare
with it before making our lives even harder (and uglier) than we have
to.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v2:
- Drop all patches that remove TASK_FREEZABLE.
- Expand commit messages a bit.
- Link to v1: https://lore.kernel.org/r/20250401-work-freeze-v1-0-d000611d4ab0@kernel.org

---
Christian Brauner (4):
      fs: add owner of freeze/thaw
      fs: allow all writers to be frozen
      power: freeze filesystems during suspend/resume
      kernfs: add warning about implementing freeze/thaw

 fs/f2fs/gc.c                |  6 ++--
 fs/gfs2/super.c             | 20 ++++++-----
 fs/gfs2/sys.c               |  4 +--
 fs/ioctl.c                  |  8 ++---
 fs/kernfs/mount.c           | 15 +++++++++
 fs/super.c                  | 82 ++++++++++++++++++++++++++++++++++++---------
 fs/xfs/scrub/fscounters.c   |  4 +--
 fs/xfs/xfs_notify_failure.c |  6 ++--
 include/linux/fs.h          | 16 +++++----
 kernel/power/hibernate.c    | 16 ++++++++-
 kernel/power/main.c         | 31 +++++++++++++++++
 kernel/power/power.h        |  4 +++
 kernel/power/suspend.c      |  7 ++++
 13 files changed, 174 insertions(+), 45 deletions(-)
---
base-commit: 62dfd8d59e2d16873398ede5b1835e302df789b3
change-id: 20250401-work-freeze-693b5b5a78e0

Re: [PATCH v2 0/4] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by Askar Safin 6 months, 3 weeks ago

Hi, Christian Brauner, Jan Kara and other contributors of this patchset.

I did experiments on my laptop, and these experiments show that this patchset does not solve various longstanding problems related to suspend and filesystems. (Even if I enable /sys/power/freeze_filesystems )

Now let me describe problems I had in the past (and still have!) and then experiments I did and their results.

So, I had these 3 problems:

- Suspend doesn't work if fstrim in progress (note that I use btrfs as root file system)

- Suspend doesn't work if scrub in progress

- Suspend doesn't work if we try to read from fuse-sshfs filesystem while network is down

Let me describe third problem in more detail. To reproduce you need to do this:

- Mount remote filesystem using sshfs (it is based on ssh and fuse)

- Disable internet

- Run command "ls" in that sshfs filesystem (this command will, of course, hang, because network is down)

- Then suspend

Suspend will not work.

Does your patchset supposed to fix these problems?

Okay, so just now I was able to reproduce all 3 problems on latest mainline ( f4a40a4282f467ec99745c6ba62cb84346e42139 ), which (as well as I understand) has this patchset applied.

I reproduced them with /sys/power/freeze_filesystems set to both 0 and 1 (thus I did 3 * 2 = 6 experiments).

I'm available for further testing.

--
Askar Safin

Re: [PATCH v2 0/4] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by Jan Kara 6 months, 3 weeks ago

Hi!

On Sun 20-07-25 22:23:36, Askar Safin wrote:
> I did experiments on my laptop, and these experiments show that this
> patchset does not solve various longstanding problems related to suspend
> and filesystems. (Even if I enable /sys/power/freeze_filesystems )
> 
> Now let me describe problems I had in the past (and still have!) and then
> experiments I did and their results.
> 
> So, I had these 3 problems:
> 
> - Suspend doesn't work if fstrim in progress (note that I use btrfs as
> root file system)

Right, this is expected because the FITRIM ioctl (syscall as any other)
likely takes too long and so the suspend code looses its patience. There's
nothing VFS can do about this. You can talk to btrfs developers to
periodically check for pending signal / freezing event like e.g. ext4 does
in ext4_trim_interrupted() to avoid suspend failures when FITRIM is
running.

> - Suspend doesn't work if scrub in progress

Similar situation as with FITRIM. This is fully in control of the
filesystem and unless the filesystem adds checks and early abort paths, VFS
cannot do anything.

> - Suspend doesn't work if we try to read from fuse-sshfs filesystem while
> network is down

On the surface the problem is the same as the above two but the details
here are subtly different. Here I expect (although I'm not 100% sure) the
blocked process is inside the FUSE filesystem waiting for the FUSE daemon
to reply (a /proc/<pid>/stack of the blocked process would be useful here).
In theory, FUSE filesystem should be able to make the wait for reply in
TASK_FREEZABLE state which would fix your issue. In any case this is very
likely work for FUSE developers.

So I'm sorry but the patch set you speak about isn't supposed to fix any of
the above issues you hit.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [PATCH v2 0/4] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by Miklos Szeredi 6 months, 1 week ago

On Mon, 21 Jul 2025 at 14:09, Jan Kara <jack@suse.cz> wrote:
>
> Hi!
>
> On Sun 20-07-25 22:23:36, Askar Safin wrote:

> > - Suspend doesn't work if we try to read from fuse-sshfs filesystem while
> > network is down
>
> On the surface the problem is the same as the above two but the details
> here are subtly different. Here I expect (although I'm not 100% sure) the
> blocked process is inside the FUSE filesystem waiting for the FUSE daemon
> to reply (a /proc/<pid>/stack of the blocked process would be useful here).
> In theory, FUSE filesystem should be able to make the wait for reply in
> TASK_FREEZABLE state which would fix your issue. In any case this is very
> likely work for FUSE developers.

This is a known problem with an unknown solution.

We can fix some of the cases, but changing all filesystem locks to be
freezable is likely not workable.

Thanks,
Miklos

Re: [PATCH v2 0/4] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by Askar Safin 6 months, 1 week ago

 ---- On Mon, 04 Aug 2025 09:31:00 +0400  Miklos Szeredi <miklos@szeredi.hu> wrote --- 
 > This is a known problem with an unknown solution.
 > 
 > We can fix some of the cases, but changing all filesystem locks to be
 > freezable is likely not workable.

So what to do in case of networked FUSE, such as sshfs?
We should put workaround to other place?
Where? To libfuse or to each networked FUSE daemon, such as sshfs?
I hit this bug in the past, and I'm very angry.
A lot of other people hit this FUSE bug, too: https://github.com/systemd/systemd/issues/37590 .

What about this timeout solution: https://lore.kernel.org/linux-fsdevel/20250122215528.1270478-1-joannelkoong@gmail.com/ ?
Will it work? As well as I understand, currently kernel waits 20 seconds, when it tries to freeze processes when suspending.
So, what if we use this Joanne Koong's timeout patch, and set timeout to 19 seconds?

So, in short, if we cannot properly fix kernel, then where fix belongs? What should we do?
--
Askar Safin
https://types.pl/@safinaskar

Re: [PATCH v2 0/4] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by Sergey Senozhatsky 6 months, 1 week ago

On (25/08/04 10:02), Askar Safin wrote:
> What about this timeout solution: https://lore.kernel.org/linux-fsdevel/20250122215528.1270478-1-joannelkoong@gmail.com/ ?
> Will it work? As well as I understand, currently kernel waits 20 seconds, when it tries to freeze processes when suspending.
> So, what if we use this Joanne Koong's timeout patch, and set timeout to 19 seconds?

I think the problem with this approach is that not all fuse connections
are remote (over the network).  One can use fuse to mount vfat/ntfs or
even archives.  Destroying those connections for suspend is unlikely to
be loved by users.

[PATCH v2 1/4] fs: add owner of freeze/thaw

Posted by Christian Brauner 10 months, 1 week ago

For some kernel subsystems it is paramount that they are guaranteed that
they are the owner of the freeze to avoid any risk of deadlocks. This is
the case for the power subsystem. Enable it to recognize whether it did
actually freeze the filesystem.

If userspace has 10 filesystems and suspend/hibernate manges to freeze 5
and then fails on the 6th for whatever odd reason (current or future)
then power needs to undo the freeze of the first 5 filesystems. It can't
just walk the list again because while it's unlikely that a new
filesystem got added in the meantime it still cannot tell which
filesystems the power subsystem actually managed to get a freeze
reference count on that needs to be dropped during thaw.

There's various ways out of this ugliness. For example, record the
filesystems the power subsystem managed to freeze on a temporary list in
the callbacks and then walk that list backwards during thaw to undo the
freezing or make sure that the power subsystem just actually exclusively
freezes things it can freeze and marking such filesystems as being owned
by power for the duration of the suspend or resume cycle. I opted for
the latter as that seemed the clean thing to do even if it means more
code changes.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/f2fs/gc.c                |  6 ++--
 fs/gfs2/super.c             | 20 ++++++------
 fs/gfs2/sys.c               |  4 +--
 fs/ioctl.c                  |  8 ++---
 fs/super.c                  | 76 ++++++++++++++++++++++++++++++++++++---------
 fs/xfs/scrub/fscounters.c   |  4 +--
 fs/xfs/xfs_notify_failure.c |  6 ++--
 include/linux/fs.h          | 13 +++++---
 8 files changed, 95 insertions(+), 42 deletions(-)

diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index 2b8f9239bede..3e8af62c9e15 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -2271,12 +2271,12 @@ int f2fs_resize_fs(struct file *filp, __u64 block_count)
 	if (err)
 		return err;
 
-	err = freeze_super(sbi->sb, FREEZE_HOLDER_USERSPACE);
+	err = freeze_super(sbi->sb, FREEZE_HOLDER_USERSPACE, NULL);
 	if (err)
 		return err;
 
 	if (f2fs_readonly(sbi->sb)) {
-		err = thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE);
+		err = thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE, NULL);
 		if (err)
 			return err;
 		return -EROFS;
@@ -2333,6 +2333,6 @@ int f2fs_resize_fs(struct file *filp, __u64 block_count)
 out_err:
 	f2fs_up_write(&sbi->cp_global_sem);
 	f2fs_up_write(&sbi->gc_lock);
-	thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE);
+	thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE, NULL);
 	return err;
 }
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 44e5658b896c..519943189109 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -674,7 +674,7 @@ static int gfs2_sync_fs(struct super_block *sb, int wait)
 	return sdp->sd_log_error;
 }
 
-static int gfs2_do_thaw(struct gfs2_sbd *sdp)
+static int gfs2_do_thaw(struct gfs2_sbd *sdp, enum freeze_holder who, const void *freeze_owner)
 {
 	struct super_block *sb = sdp->sd_vfs;
 	int error;
@@ -682,7 +682,7 @@ static int gfs2_do_thaw(struct gfs2_sbd *sdp)
 	error = gfs2_freeze_lock_shared(sdp);
 	if (error)
 		goto fail;
-	error = thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+	error = thaw_super(sb, who, freeze_owner);
 	if (!error)
 		return 0;
 
@@ -703,14 +703,14 @@ void gfs2_freeze_func(struct work_struct *work)
 	if (test_bit(SDF_FROZEN, &sdp->sd_flags))
 		goto freeze_failed;
 
-	error = freeze_super(sb, FREEZE_HOLDER_USERSPACE);
+	error = freeze_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
 	if (error)
 		goto freeze_failed;
 
 	gfs2_freeze_unlock(sdp);
 	set_bit(SDF_FROZEN, &sdp->sd_flags);
 
-	error = gfs2_do_thaw(sdp);
+	error = gfs2_do_thaw(sdp, FREEZE_HOLDER_USERSPACE, NULL);
 	if (error)
 		goto out;
 
@@ -731,7 +731,8 @@ void gfs2_freeze_func(struct work_struct *work)
  *
  */
 
-static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who)
+static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who,
+			     const void *freeze_owner)
 {
 	struct gfs2_sbd *sdp = sb->s_fs_info;
 	int error;
@@ -744,7 +745,7 @@ static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who)
 	}
 
 	for (;;) {
-		error = freeze_super(sb, FREEZE_HOLDER_USERSPACE);
+		error = freeze_super(sb, who, freeze_owner);
 		if (error) {
 			fs_info(sdp, "GFS2: couldn't freeze filesystem: %d\n",
 				error);
@@ -758,7 +759,7 @@ static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who)
 			break;
 		}
 
-		error = gfs2_do_thaw(sdp);
+		error = gfs2_do_thaw(sdp, who, freeze_owner);
 		if (error)
 			goto out;
 
@@ -799,7 +800,8 @@ static int gfs2_freeze_fs(struct super_block *sb)
  *
  */
 
-static int gfs2_thaw_super(struct super_block *sb, enum freeze_holder who)
+static int gfs2_thaw_super(struct super_block *sb, enum freeze_holder who,
+			   const void *freeze_owner)
 {
 	struct gfs2_sbd *sdp = sb->s_fs_info;
 	int error;
@@ -814,7 +816,7 @@ static int gfs2_thaw_super(struct super_block *sb, enum freeze_holder who)
 	atomic_inc(&sb->s_active);
 	gfs2_freeze_unlock(sdp);
 
-	error = gfs2_do_thaw(sdp);
+	error = gfs2_do_thaw(sdp, who, freeze_owner);
 
 	if (!error) {
 		clear_bit(SDF_FREEZE_INITIATOR, &sdp->sd_flags);
diff --git a/fs/gfs2/sys.c b/fs/gfs2/sys.c
index ecc699f8d9fc..748125653d6c 100644
--- a/fs/gfs2/sys.c
+++ b/fs/gfs2/sys.c
@@ -174,10 +174,10 @@ static ssize_t freeze_store(struct gfs2_sbd *sdp, const char *buf, size_t len)
 
 	switch (n) {
 	case 0:
-		error = thaw_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE);
+		error = thaw_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE, NULL);
 		break;
 	case 1:
-		error = freeze_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE);
+		error = freeze_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE, NULL);
 		break;
 	default:
 		return -EINVAL;
diff --git a/fs/ioctl.c b/fs/ioctl.c
index c91fd2b46a77..bedc83fc2f20 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -396,8 +396,8 @@ static int ioctl_fsfreeze(struct file *filp)
 
 	/* Freeze */
 	if (sb->s_op->freeze_super)
-		return sb->s_op->freeze_super(sb, FREEZE_HOLDER_USERSPACE);
-	return freeze_super(sb, FREEZE_HOLDER_USERSPACE);
+		return sb->s_op->freeze_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
+	return freeze_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
 }
 
 static int ioctl_fsthaw(struct file *filp)
@@ -409,8 +409,8 @@ static int ioctl_fsthaw(struct file *filp)
 
 	/* Thaw */
 	if (sb->s_op->thaw_super)
-		return sb->s_op->thaw_super(sb, FREEZE_HOLDER_USERSPACE);
-	return thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+		return sb->s_op->thaw_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
+	return thaw_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
 }
 
 static int ioctl_file_dedupe_range(struct file *file,
diff --git a/fs/super.c b/fs/super.c
index 3c4a496d6438..3ddded4360c6 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -39,7 +39,8 @@
 #include <uapi/linux/mount.h>
 #include "internal.h"
 
-static int thaw_super_locked(struct super_block *sb, enum freeze_holder who);
+static int thaw_super_locked(struct super_block *sb, enum freeze_holder who,
+			     const void *freeze_owner);
 
 static LIST_HEAD(super_blocks);
 static DEFINE_SPINLOCK(sb_lock);
@@ -1148,7 +1149,7 @@ static void do_thaw_all_callback(struct super_block *sb, void *unused)
 	if (IS_ENABLED(CONFIG_BLOCK))
 		while (sb->s_bdev && !bdev_thaw(sb->s_bdev))
 			pr_warn("Emergency Thaw on %pg\n", sb->s_bdev);
-	thaw_super_locked(sb, FREEZE_HOLDER_USERSPACE);
+	thaw_super_locked(sb, FREEZE_HOLDER_USERSPACE, NULL);
 	return;
 }
 
@@ -1195,9 +1196,9 @@ static void filesystems_freeze_callback(struct super_block *sb, void *unused)
 		return;
 
 	if (sb->s_op->freeze_super)
-		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
 	else
-		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
 
 	deactivate_super(sb);
 }
@@ -1217,9 +1218,9 @@ static void filesystems_thaw_callback(struct super_block *sb, void *unused)
 		return;
 
 	if (sb->s_op->thaw_super)
-		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
 	else
-		thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+		thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
 
 	deactivate_super(sb);
 }
@@ -1522,10 +1523,10 @@ static int fs_bdev_freeze(struct block_device *bdev)
 
 	if (sb->s_op->freeze_super)
 		error = sb->s_op->freeze_super(sb,
-				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
+				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
 	else
 		error = freeze_super(sb,
-				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
+				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
 	if (!error)
 		error = sync_blockdev(bdev);
 	deactivate_super(sb);
@@ -1571,10 +1572,10 @@ static int fs_bdev_thaw(struct block_device *bdev)
 
 	if (sb->s_op->thaw_super)
 		error = sb->s_op->thaw_super(sb,
-				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
+				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
 	else
 		error = thaw_super(sb,
-				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
+				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
 	deactivate_super(sb);
 	return error;
 }
@@ -1946,7 +1947,7 @@ static int wait_for_partially_frozen(struct super_block *sb)
 }
 
 #define FREEZE_HOLDERS (FREEZE_HOLDER_KERNEL | FREEZE_HOLDER_USERSPACE)
-#define FREEZE_FLAGS (FREEZE_HOLDERS | FREEZE_MAY_NEST)
+#define FREEZE_FLAGS (FREEZE_HOLDERS | FREEZE_MAY_NEST | FREEZE_EXCL)
 
 static inline int freeze_inc(struct super_block *sb, enum freeze_holder who)
 {
@@ -1977,6 +1978,21 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
 	WARN_ON_ONCE((who & ~FREEZE_FLAGS));
 	WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
 
+	if (who & FREEZE_EXCL) {
+		if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
+			return false;
+
+		if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
+			return false;
+
+		return (sb->s_writers.freeze_kcount +
+			sb->s_writers.freeze_ucount) == 0;
+	}
+
+	/* This filesystem is already exclusively frozen. */
+	if (sb->s_writers.freeze_owner)
+		return false;
+
 	if (who & FREEZE_HOLDER_KERNEL)
 		return (who & FREEZE_MAY_NEST) ||
 		       sb->s_writers.freeze_kcount == 0;
@@ -1986,10 +2002,30 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
 	return false;
 }
 
+static inline bool may_unfreeze(struct super_block *sb, enum freeze_holder who,
+				const void *freeze_owner)
+{
+	WARN_ON_ONCE((who & ~FREEZE_FLAGS));
+	WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
+
+	if (who & FREEZE_EXCL) {
+		if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
+			return false;
+		if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
+			return false;
+		if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
+			return false;
+		return sb->s_writers.freeze_owner == freeze_owner;
+	}
+
+	return sb->s_writers.freeze_owner == NULL;
+}
+
 /**
  * freeze_super - lock the filesystem and force it into a consistent state
  * @sb: the super to lock
  * @who: context that wants to freeze
+ * @freeze_owner: owner of the freeze
  *
  * Syncs the super to make sure the filesystem is consistent and calls the fs's
  * freeze_fs.  Subsequent calls to this without first thawing the fs may return
@@ -2041,7 +2077,7 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
  * Return: If the freeze was successful zero is returned. If the freeze
  *         failed a negative error code is returned.
  */
-int freeze_super(struct super_block *sb, enum freeze_holder who)
+int freeze_super(struct super_block *sb, enum freeze_holder who, const void *freeze_owner)
 {
 	int ret;
 
@@ -2075,6 +2111,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
 	if (sb_rdonly(sb)) {
 		/* Nothing to do really... */
 		WARN_ON_ONCE(freeze_inc(sb, who) > 1);
+		sb->s_writers.freeze_owner = freeze_owner;
 		sb->s_writers.frozen = SB_FREEZE_COMPLETE;
 		wake_up_var(&sb->s_writers.frozen);
 		super_unlock_excl(sb);
@@ -2122,6 +2159,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
 	 * when frozen is set to SB_FREEZE_COMPLETE, and for thaw_super().
 	 */
 	WARN_ON_ONCE(freeze_inc(sb, who) > 1);
+	sb->s_writers.freeze_owner = freeze_owner;
 	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
 	wake_up_var(&sb->s_writers.frozen);
 	lockdep_sb_freeze_release(sb);
@@ -2136,13 +2174,17 @@ EXPORT_SYMBOL(freeze_super);
  * removes that state without releasing the other state or unlocking the
  * filesystem.
  */
-static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
+static int thaw_super_locked(struct super_block *sb, enum freeze_holder who,
+			     const void *freeze_owner)
 {
 	int error = -EINVAL;
 
 	if (sb->s_writers.frozen != SB_FREEZE_COMPLETE)
 		goto out_unlock;
 
+	if (!may_unfreeze(sb, who, freeze_owner))
+		goto out_unlock;
+
 	/*
 	 * All freezers share a single active reference.
 	 * So just unlock in case there are any left.
@@ -2152,6 +2194,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
 
 	if (sb_rdonly(sb)) {
 		sb->s_writers.frozen = SB_UNFROZEN;
+		sb->s_writers.freeze_owner = NULL;
 		wake_up_var(&sb->s_writers.frozen);
 		goto out_deactivate;
 	}
@@ -2169,6 +2212,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
 	}
 
 	sb->s_writers.frozen = SB_UNFROZEN;
+	sb->s_writers.freeze_owner = NULL;
 	wake_up_var(&sb->s_writers.frozen);
 	sb_freeze_unlock(sb, SB_FREEZE_FS);
 out_deactivate:
@@ -2184,6 +2228,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
  * thaw_super -- unlock filesystem
  * @sb: the super to thaw
  * @who: context that wants to freeze
+ * @freeze_owner: owner of the freeze
  *
  * Unlocks the filesystem and marks it writeable again after freeze_super()
  * if there are no remaining freezes on the filesystem.
@@ -2197,13 +2242,14 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
  * have been frozen through the block layer via multiple block devices.
  * The filesystem remains frozen until all block devices are unfrozen.
  */
-int thaw_super(struct super_block *sb, enum freeze_holder who)
+int thaw_super(struct super_block *sb, enum freeze_holder who,
+	       const void *freeze_owner)
 {
 	if (!super_lock_excl(sb)) {
 		WARN_ON_ONCE("Dying superblock while thawing!");
 		return -EINVAL;
 	}
-	return thaw_super_locked(sb, who);
+	return thaw_super_locked(sb, who, freeze_owner);
 }
 EXPORT_SYMBOL(thaw_super);
 
diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c
index e629663e460a..9b598c5790ad 100644
--- a/fs/xfs/scrub/fscounters.c
+++ b/fs/xfs/scrub/fscounters.c
@@ -123,7 +123,7 @@ xchk_fsfreeze(
 {
 	int			error;
 
-	error = freeze_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL);
+	error = freeze_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL, NULL);
 	trace_xchk_fsfreeze(sc, error);
 	return error;
 }
@@ -135,7 +135,7 @@ xchk_fsthaw(
 	int			error;
 
 	/* This should always succeed, we have a kernel freeze */
-	error = thaw_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL);
+	error = thaw_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL, NULL);
 	trace_xchk_fsthaw(sc, error);
 	return error;
 }
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index ed8d8ed42f0a..3545dc1d953c 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -127,7 +127,7 @@ xfs_dax_notify_failure_freeze(
 	struct super_block	*sb = mp->m_super;
 	int			error;
 
-	error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
+	error = freeze_super(sb, FREEZE_HOLDER_KERNEL, NULL);
 	if (error)
 		xfs_emerg(mp, "already frozen by kernel, err=%d", error);
 
@@ -143,7 +143,7 @@ xfs_dax_notify_failure_thaw(
 	int			error;
 
 	if (kernel_frozen) {
-		error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
+		error = thaw_super(sb, FREEZE_HOLDER_KERNEL, NULL);
 		if (error)
 			xfs_emerg(mp, "still frozen after notify failure, err=%d",
 				error);
@@ -153,7 +153,7 @@ xfs_dax_notify_failure_thaw(
 	 * Also thaw userspace call anyway because the device is about to be
 	 * removed immediately.
 	 */
-	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+	thaw_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
 }
 
 static int
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1aa578412f1b..b379a46b5576 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1307,6 +1307,7 @@ struct sb_writers {
 	unsigned short			frozen;		/* Is sb frozen? */
 	int				freeze_kcount;	/* How many kernel freeze requests? */
 	int				freeze_ucount;	/* How many userspace freeze requests? */
+	const void			*freeze_owner;	/* Owner of the freeze */
 	struct percpu_rw_semaphore	rw_sem[SB_FREEZE_LEVELS];
 };
 
@@ -2270,6 +2271,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
  * @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem
  * @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem
  * @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed
+ * @FREEZE_EXCL: whether actual freezing must be done by the caller
  *
  * Indicate who the owner of the freeze or thaw request is and whether
  * the freeze needs to be exclusive or can nest.
@@ -2283,6 +2285,7 @@ enum freeze_holder {
 	FREEZE_HOLDER_KERNEL	= (1U << 0),
 	FREEZE_HOLDER_USERSPACE	= (1U << 1),
 	FREEZE_MAY_NEST		= (1U << 2),
+	FREEZE_EXCL		= (1U << 3),
 };
 
 struct super_operations {
@@ -2296,9 +2299,9 @@ struct super_operations {
 	void (*evict_inode) (struct inode *);
 	void (*put_super) (struct super_block *);
 	int (*sync_fs)(struct super_block *sb, int wait);
-	int (*freeze_super) (struct super_block *, enum freeze_holder who);
+	int (*freeze_super) (struct super_block *, enum freeze_holder who, const void *owner);
 	int (*freeze_fs) (struct super_block *);
-	int (*thaw_super) (struct super_block *, enum freeze_holder who);
+	int (*thaw_super) (struct super_block *, enum freeze_holder who, const void *owner);
 	int (*unfreeze_fs) (struct super_block *);
 	int (*statfs) (struct dentry *, struct kstatfs *);
 	int (*remount_fs) (struct super_block *, int *, char *);
@@ -2706,8 +2709,10 @@ extern int unregister_filesystem(struct file_system_type *);
 extern int vfs_statfs(const struct path *, struct kstatfs *);
 extern int user_statfs(const char __user *, struct kstatfs *);
 extern int fd_statfs(int, struct kstatfs *);
-int freeze_super(struct super_block *super, enum freeze_holder who);
-int thaw_super(struct super_block *super, enum freeze_holder who);
+int freeze_super(struct super_block *super, enum freeze_holder who,
+		 const void *freeze_owner);
+int thaw_super(struct super_block *super, enum freeze_holder who,
+	       const void *freeze_owner);
 extern __printf(2, 3)
 int super_setup_bdi_name(struct super_block *sb, char *fmt, ...);
 extern int super_setup_bdi(struct super_block *sb);

-- 
2.47.2

Re: [PATCH v2 1/4] fs: add owner of freeze/thaw

Posted by Jan Kara 10 months, 1 week ago

On Wed 02-04-25 16:07:31, Christian Brauner wrote:
> For some kernel subsystems it is paramount that they are guaranteed that
> they are the owner of the freeze to avoid any risk of deadlocks. This is
> the case for the power subsystem. Enable it to recognize whether it did
> actually freeze the filesystem.
> 
> If userspace has 10 filesystems and suspend/hibernate manges to freeze 5
> and then fails on the 6th for whatever odd reason (current or future)
> then power needs to undo the freeze of the first 5 filesystems. It can't
> just walk the list again because while it's unlikely that a new
> filesystem got added in the meantime it still cannot tell which
> filesystems the power subsystem actually managed to get a freeze
> reference count on that needs to be dropped during thaw.
> 
> There's various ways out of this ugliness. For example, record the
> filesystems the power subsystem managed to freeze on a temporary list in
> the callbacks and then walk that list backwards during thaw to undo the
> freezing or make sure that the power subsystem just actually exclusively
> freezes things it can freeze and marking such filesystems as being owned
> by power for the duration of the suspend or resume cycle. I opted for
> the latter as that seemed the clean thing to do even if it means more
> code changes.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

I have realized a slight catch with this approach that if hibernation races
with filesystem freezing (e.g. DM reconfiguration), then hibernation need
not freeze a filesystem because it's already frozen but userspace may thaw
the filesystem before hibernation actually happens (relatively harmless).
If the race happens the other way around, DM reconfiguration may
unexpectedly fail with EBUSY (rather unexpected). So somehow tracking which
fs was frozen by suspend while properly nesting with other freeze users may
be actually a better approach (maybe just a sb flag even though it's
somewhat hacky?).

								Honza

> ---
>  fs/f2fs/gc.c                |  6 ++--
>  fs/gfs2/super.c             | 20 ++++++------
>  fs/gfs2/sys.c               |  4 +--
>  fs/ioctl.c                  |  8 ++---
>  fs/super.c                  | 76 ++++++++++++++++++++++++++++++++++++---------
>  fs/xfs/scrub/fscounters.c   |  4 +--
>  fs/xfs/xfs_notify_failure.c |  6 ++--
>  include/linux/fs.h          | 13 +++++---
>  8 files changed, 95 insertions(+), 42 deletions(-)
> 
> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> index 2b8f9239bede..3e8af62c9e15 100644
> --- a/fs/f2fs/gc.c
> +++ b/fs/f2fs/gc.c
> @@ -2271,12 +2271,12 @@ int f2fs_resize_fs(struct file *filp, __u64 block_count)
>  	if (err)
>  		return err;
>  
> -	err = freeze_super(sbi->sb, FREEZE_HOLDER_USERSPACE);
> +	err = freeze_super(sbi->sb, FREEZE_HOLDER_USERSPACE, NULL);
>  	if (err)
>  		return err;
>  
>  	if (f2fs_readonly(sbi->sb)) {
> -		err = thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE);
> +		err = thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE, NULL);
>  		if (err)
>  			return err;
>  		return -EROFS;
> @@ -2333,6 +2333,6 @@ int f2fs_resize_fs(struct file *filp, __u64 block_count)
>  out_err:
>  	f2fs_up_write(&sbi->cp_global_sem);
>  	f2fs_up_write(&sbi->gc_lock);
> -	thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE);
> +	thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE, NULL);
>  	return err;
>  }
> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> index 44e5658b896c..519943189109 100644
> --- a/fs/gfs2/super.c
> +++ b/fs/gfs2/super.c
> @@ -674,7 +674,7 @@ static int gfs2_sync_fs(struct super_block *sb, int wait)
>  	return sdp->sd_log_error;
>  }
>  
> -static int gfs2_do_thaw(struct gfs2_sbd *sdp)
> +static int gfs2_do_thaw(struct gfs2_sbd *sdp, enum freeze_holder who, const void *freeze_owner)
>  {
>  	struct super_block *sb = sdp->sd_vfs;
>  	int error;
> @@ -682,7 +682,7 @@ static int gfs2_do_thaw(struct gfs2_sbd *sdp)
>  	error = gfs2_freeze_lock_shared(sdp);
>  	if (error)
>  		goto fail;
> -	error = thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> +	error = thaw_super(sb, who, freeze_owner);
>  	if (!error)
>  		return 0;
>  
> @@ -703,14 +703,14 @@ void gfs2_freeze_func(struct work_struct *work)
>  	if (test_bit(SDF_FROZEN, &sdp->sd_flags))
>  		goto freeze_failed;
>  
> -	error = freeze_super(sb, FREEZE_HOLDER_USERSPACE);
> +	error = freeze_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
>  	if (error)
>  		goto freeze_failed;
>  
>  	gfs2_freeze_unlock(sdp);
>  	set_bit(SDF_FROZEN, &sdp->sd_flags);
>  
> -	error = gfs2_do_thaw(sdp);
> +	error = gfs2_do_thaw(sdp, FREEZE_HOLDER_USERSPACE, NULL);
>  	if (error)
>  		goto out;
>  
> @@ -731,7 +731,8 @@ void gfs2_freeze_func(struct work_struct *work)
>   *
>   */
>  
> -static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who)
> +static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who,
> +			     const void *freeze_owner)
>  {
>  	struct gfs2_sbd *sdp = sb->s_fs_info;
>  	int error;
> @@ -744,7 +745,7 @@ static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who)
>  	}
>  
>  	for (;;) {
> -		error = freeze_super(sb, FREEZE_HOLDER_USERSPACE);
> +		error = freeze_super(sb, who, freeze_owner);
>  		if (error) {
>  			fs_info(sdp, "GFS2: couldn't freeze filesystem: %d\n",
>  				error);
> @@ -758,7 +759,7 @@ static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who)
>  			break;
>  		}
>  
> -		error = gfs2_do_thaw(sdp);
> +		error = gfs2_do_thaw(sdp, who, freeze_owner);
>  		if (error)
>  			goto out;
>  
> @@ -799,7 +800,8 @@ static int gfs2_freeze_fs(struct super_block *sb)
>   *
>   */
>  
> -static int gfs2_thaw_super(struct super_block *sb, enum freeze_holder who)
> +static int gfs2_thaw_super(struct super_block *sb, enum freeze_holder who,
> +			   const void *freeze_owner)
>  {
>  	struct gfs2_sbd *sdp = sb->s_fs_info;
>  	int error;
> @@ -814,7 +816,7 @@ static int gfs2_thaw_super(struct super_block *sb, enum freeze_holder who)
>  	atomic_inc(&sb->s_active);
>  	gfs2_freeze_unlock(sdp);
>  
> -	error = gfs2_do_thaw(sdp);
> +	error = gfs2_do_thaw(sdp, who, freeze_owner);
>  
>  	if (!error) {
>  		clear_bit(SDF_FREEZE_INITIATOR, &sdp->sd_flags);
> diff --git a/fs/gfs2/sys.c b/fs/gfs2/sys.c
> index ecc699f8d9fc..748125653d6c 100644
> --- a/fs/gfs2/sys.c
> +++ b/fs/gfs2/sys.c
> @@ -174,10 +174,10 @@ static ssize_t freeze_store(struct gfs2_sbd *sdp, const char *buf, size_t len)
>  
>  	switch (n) {
>  	case 0:
> -		error = thaw_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE);
> +		error = thaw_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE, NULL);
>  		break;
>  	case 1:
> -		error = freeze_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE);
> +		error = freeze_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE, NULL);
>  		break;
>  	default:
>  		return -EINVAL;
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index c91fd2b46a77..bedc83fc2f20 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -396,8 +396,8 @@ static int ioctl_fsfreeze(struct file *filp)
>  
>  	/* Freeze */
>  	if (sb->s_op->freeze_super)
> -		return sb->s_op->freeze_super(sb, FREEZE_HOLDER_USERSPACE);
> -	return freeze_super(sb, FREEZE_HOLDER_USERSPACE);
> +		return sb->s_op->freeze_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
> +	return freeze_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
>  }
>  
>  static int ioctl_fsthaw(struct file *filp)
> @@ -409,8 +409,8 @@ static int ioctl_fsthaw(struct file *filp)
>  
>  	/* Thaw */
>  	if (sb->s_op->thaw_super)
> -		return sb->s_op->thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> -	return thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> +		return sb->s_op->thaw_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
> +	return thaw_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
>  }
>  
>  static int ioctl_file_dedupe_range(struct file *file,
> diff --git a/fs/super.c b/fs/super.c
> index 3c4a496d6438..3ddded4360c6 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -39,7 +39,8 @@
>  #include <uapi/linux/mount.h>
>  #include "internal.h"
>  
> -static int thaw_super_locked(struct super_block *sb, enum freeze_holder who);
> +static int thaw_super_locked(struct super_block *sb, enum freeze_holder who,
> +			     const void *freeze_owner);
>  
>  static LIST_HEAD(super_blocks);
>  static DEFINE_SPINLOCK(sb_lock);
> @@ -1148,7 +1149,7 @@ static void do_thaw_all_callback(struct super_block *sb, void *unused)
>  	if (IS_ENABLED(CONFIG_BLOCK))
>  		while (sb->s_bdev && !bdev_thaw(sb->s_bdev))
>  			pr_warn("Emergency Thaw on %pg\n", sb->s_bdev);
> -	thaw_super_locked(sb, FREEZE_HOLDER_USERSPACE);
> +	thaw_super_locked(sb, FREEZE_HOLDER_USERSPACE, NULL);
>  	return;
>  }
>  
> @@ -1195,9 +1196,9 @@ static void filesystems_freeze_callback(struct super_block *sb, void *unused)
>  		return;
>  
>  	if (sb->s_op->freeze_super)
> -		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
> +		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
>  	else
> -		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
> +		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
>  
>  	deactivate_super(sb);
>  }
> @@ -1217,9 +1218,9 @@ static void filesystems_thaw_callback(struct super_block *sb, void *unused)
>  		return;
>  
>  	if (sb->s_op->thaw_super)
> -		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
> +		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
>  	else
> -		thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
> +		thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
>  
>  	deactivate_super(sb);
>  }
> @@ -1522,10 +1523,10 @@ static int fs_bdev_freeze(struct block_device *bdev)
>  
>  	if (sb->s_op->freeze_super)
>  		error = sb->s_op->freeze_super(sb,
> -				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
> +				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
>  	else
>  		error = freeze_super(sb,
> -				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
> +				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
>  	if (!error)
>  		error = sync_blockdev(bdev);
>  	deactivate_super(sb);
> @@ -1571,10 +1572,10 @@ static int fs_bdev_thaw(struct block_device *bdev)
>  
>  	if (sb->s_op->thaw_super)
>  		error = sb->s_op->thaw_super(sb,
> -				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
> +				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
>  	else
>  		error = thaw_super(sb,
> -				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
> +				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
>  	deactivate_super(sb);
>  	return error;
>  }
> @@ -1946,7 +1947,7 @@ static int wait_for_partially_frozen(struct super_block *sb)
>  }
>  
>  #define FREEZE_HOLDERS (FREEZE_HOLDER_KERNEL | FREEZE_HOLDER_USERSPACE)
> -#define FREEZE_FLAGS (FREEZE_HOLDERS | FREEZE_MAY_NEST)
> +#define FREEZE_FLAGS (FREEZE_HOLDERS | FREEZE_MAY_NEST | FREEZE_EXCL)
>  
>  static inline int freeze_inc(struct super_block *sb, enum freeze_holder who)
>  {
> @@ -1977,6 +1978,21 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
>  	WARN_ON_ONCE((who & ~FREEZE_FLAGS));
>  	WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
>  
> +	if (who & FREEZE_EXCL) {
> +		if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
> +			return false;
> +
> +		if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
> +			return false;
> +
> +		return (sb->s_writers.freeze_kcount +
> +			sb->s_writers.freeze_ucount) == 0;
> +	}
> +
> +	/* This filesystem is already exclusively frozen. */
> +	if (sb->s_writers.freeze_owner)
> +		return false;
> +
>  	if (who & FREEZE_HOLDER_KERNEL)
>  		return (who & FREEZE_MAY_NEST) ||
>  		       sb->s_writers.freeze_kcount == 0;
> @@ -1986,10 +2002,30 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
>  	return false;
>  }
>  
> +static inline bool may_unfreeze(struct super_block *sb, enum freeze_holder who,
> +				const void *freeze_owner)
> +{
> +	WARN_ON_ONCE((who & ~FREEZE_FLAGS));
> +	WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
> +
> +	if (who & FREEZE_EXCL) {
> +		if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
> +			return false;
> +		if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
> +			return false;
> +		if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
> +			return false;
> +		return sb->s_writers.freeze_owner == freeze_owner;
> +	}
> +
> +	return sb->s_writers.freeze_owner == NULL;
> +}
> +
>  /**
>   * freeze_super - lock the filesystem and force it into a consistent state
>   * @sb: the super to lock
>   * @who: context that wants to freeze
> + * @freeze_owner: owner of the freeze
>   *
>   * Syncs the super to make sure the filesystem is consistent and calls the fs's
>   * freeze_fs.  Subsequent calls to this without first thawing the fs may return
> @@ -2041,7 +2077,7 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
>   * Return: If the freeze was successful zero is returned. If the freeze
>   *         failed a negative error code is returned.
>   */
> -int freeze_super(struct super_block *sb, enum freeze_holder who)
> +int freeze_super(struct super_block *sb, enum freeze_holder who, const void *freeze_owner)
>  {
>  	int ret;
>  
> @@ -2075,6 +2111,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
>  	if (sb_rdonly(sb)) {
>  		/* Nothing to do really... */
>  		WARN_ON_ONCE(freeze_inc(sb, who) > 1);
> +		sb->s_writers.freeze_owner = freeze_owner;
>  		sb->s_writers.frozen = SB_FREEZE_COMPLETE;
>  		wake_up_var(&sb->s_writers.frozen);
>  		super_unlock_excl(sb);
> @@ -2122,6 +2159,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
>  	 * when frozen is set to SB_FREEZE_COMPLETE, and for thaw_super().
>  	 */
>  	WARN_ON_ONCE(freeze_inc(sb, who) > 1);
> +	sb->s_writers.freeze_owner = freeze_owner;
>  	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
>  	wake_up_var(&sb->s_writers.frozen);
>  	lockdep_sb_freeze_release(sb);
> @@ -2136,13 +2174,17 @@ EXPORT_SYMBOL(freeze_super);
>   * removes that state without releasing the other state or unlocking the
>   * filesystem.
>   */
> -static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
> +static int thaw_super_locked(struct super_block *sb, enum freeze_holder who,
> +			     const void *freeze_owner)
>  {
>  	int error = -EINVAL;
>  
>  	if (sb->s_writers.frozen != SB_FREEZE_COMPLETE)
>  		goto out_unlock;
>  
> +	if (!may_unfreeze(sb, who, freeze_owner))
> +		goto out_unlock;
> +
>  	/*
>  	 * All freezers share a single active reference.
>  	 * So just unlock in case there are any left.
> @@ -2152,6 +2194,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
>  
>  	if (sb_rdonly(sb)) {
>  		sb->s_writers.frozen = SB_UNFROZEN;
> +		sb->s_writers.freeze_owner = NULL;
>  		wake_up_var(&sb->s_writers.frozen);
>  		goto out_deactivate;
>  	}
> @@ -2169,6 +2212,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
>  	}
>  
>  	sb->s_writers.frozen = SB_UNFROZEN;
> +	sb->s_writers.freeze_owner = NULL;
>  	wake_up_var(&sb->s_writers.frozen);
>  	sb_freeze_unlock(sb, SB_FREEZE_FS);
>  out_deactivate:
> @@ -2184,6 +2228,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
>   * thaw_super -- unlock filesystem
>   * @sb: the super to thaw
>   * @who: context that wants to freeze
> + * @freeze_owner: owner of the freeze
>   *
>   * Unlocks the filesystem and marks it writeable again after freeze_super()
>   * if there are no remaining freezes on the filesystem.
> @@ -2197,13 +2242,14 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
>   * have been frozen through the block layer via multiple block devices.
>   * The filesystem remains frozen until all block devices are unfrozen.
>   */
> -int thaw_super(struct super_block *sb, enum freeze_holder who)
> +int thaw_super(struct super_block *sb, enum freeze_holder who,
> +	       const void *freeze_owner)
>  {
>  	if (!super_lock_excl(sb)) {
>  		WARN_ON_ONCE("Dying superblock while thawing!");
>  		return -EINVAL;
>  	}
> -	return thaw_super_locked(sb, who);
> +	return thaw_super_locked(sb, who, freeze_owner);
>  }
>  EXPORT_SYMBOL(thaw_super);
>  
> diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c
> index e629663e460a..9b598c5790ad 100644
> --- a/fs/xfs/scrub/fscounters.c
> +++ b/fs/xfs/scrub/fscounters.c
> @@ -123,7 +123,7 @@ xchk_fsfreeze(
>  {
>  	int			error;
>  
> -	error = freeze_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL);
> +	error = freeze_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL, NULL);
>  	trace_xchk_fsfreeze(sc, error);
>  	return error;
>  }
> @@ -135,7 +135,7 @@ xchk_fsthaw(
>  	int			error;
>  
>  	/* This should always succeed, we have a kernel freeze */
> -	error = thaw_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL);
> +	error = thaw_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL, NULL);
>  	trace_xchk_fsthaw(sc, error);
>  	return error;
>  }
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index ed8d8ed42f0a..3545dc1d953c 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -127,7 +127,7 @@ xfs_dax_notify_failure_freeze(
>  	struct super_block	*sb = mp->m_super;
>  	int			error;
>  
> -	error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
> +	error = freeze_super(sb, FREEZE_HOLDER_KERNEL, NULL);
>  	if (error)
>  		xfs_emerg(mp, "already frozen by kernel, err=%d", error);
>  
> @@ -143,7 +143,7 @@ xfs_dax_notify_failure_thaw(
>  	int			error;
>  
>  	if (kernel_frozen) {
> -		error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> +		error = thaw_super(sb, FREEZE_HOLDER_KERNEL, NULL);
>  		if (error)
>  			xfs_emerg(mp, "still frozen after notify failure, err=%d",
>  				error);
> @@ -153,7 +153,7 @@ xfs_dax_notify_failure_thaw(
>  	 * Also thaw userspace call anyway because the device is about to be
>  	 * removed immediately.
>  	 */
> -	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> +	thaw_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
>  }
>  
>  static int
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 1aa578412f1b..b379a46b5576 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1307,6 +1307,7 @@ struct sb_writers {
>  	unsigned short			frozen;		/* Is sb frozen? */
>  	int				freeze_kcount;	/* How many kernel freeze requests? */
>  	int				freeze_ucount;	/* How many userspace freeze requests? */
> +	const void			*freeze_owner;	/* Owner of the freeze */
>  	struct percpu_rw_semaphore	rw_sem[SB_FREEZE_LEVELS];
>  };
>  
> @@ -2270,6 +2271,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
>   * @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem
>   * @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem
>   * @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed
> + * @FREEZE_EXCL: whether actual freezing must be done by the caller
>   *
>   * Indicate who the owner of the freeze or thaw request is and whether
>   * the freeze needs to be exclusive or can nest.
> @@ -2283,6 +2285,7 @@ enum freeze_holder {
>  	FREEZE_HOLDER_KERNEL	= (1U << 0),
>  	FREEZE_HOLDER_USERSPACE	= (1U << 1),
>  	FREEZE_MAY_NEST		= (1U << 2),
> +	FREEZE_EXCL		= (1U << 3),
>  };
>  
>  struct super_operations {
> @@ -2296,9 +2299,9 @@ struct super_operations {
>  	void (*evict_inode) (struct inode *);
>  	void (*put_super) (struct super_block *);
>  	int (*sync_fs)(struct super_block *sb, int wait);
> -	int (*freeze_super) (struct super_block *, enum freeze_holder who);
> +	int (*freeze_super) (struct super_block *, enum freeze_holder who, const void *owner);
>  	int (*freeze_fs) (struct super_block *);
> -	int (*thaw_super) (struct super_block *, enum freeze_holder who);
> +	int (*thaw_super) (struct super_block *, enum freeze_holder who, const void *owner);
>  	int (*unfreeze_fs) (struct super_block *);
>  	int (*statfs) (struct dentry *, struct kstatfs *);
>  	int (*remount_fs) (struct super_block *, int *, char *);
> @@ -2706,8 +2709,10 @@ extern int unregister_filesystem(struct file_system_type *);
>  extern int vfs_statfs(const struct path *, struct kstatfs *);
>  extern int user_statfs(const char __user *, struct kstatfs *);
>  extern int fd_statfs(int, struct kstatfs *);
> -int freeze_super(struct super_block *super, enum freeze_holder who);
> -int thaw_super(struct super_block *super, enum freeze_holder who);
> +int freeze_super(struct super_block *super, enum freeze_holder who,
> +		 const void *freeze_owner);
> +int thaw_super(struct super_block *super, enum freeze_holder who,
> +	       const void *freeze_owner);
>  extern __printf(2, 3)
>  int super_setup_bdi_name(struct super_block *sb, char *fmt, ...);
>  extern int super_setup_bdi(struct super_block *sb);
> 
> -- 
> 2.47.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [PATCH v2 1/4] fs: add owner of freeze/thaw

Posted by Christian Brauner 10 months, 1 week ago

On Thu, Apr 03, 2025 at 04:56:57PM +0200, Jan Kara wrote:
> On Wed 02-04-25 16:07:31, Christian Brauner wrote:
> > For some kernel subsystems it is paramount that they are guaranteed that
> > they are the owner of the freeze to avoid any risk of deadlocks. This is
> > the case for the power subsystem. Enable it to recognize whether it did
> > actually freeze the filesystem.
> > 
> > If userspace has 10 filesystems and suspend/hibernate manges to freeze 5
> > and then fails on the 6th for whatever odd reason (current or future)
> > then power needs to undo the freeze of the first 5 filesystems. It can't
> > just walk the list again because while it's unlikely that a new
> > filesystem got added in the meantime it still cannot tell which
> > filesystems the power subsystem actually managed to get a freeze
> > reference count on that needs to be dropped during thaw.
> > 
> > There's various ways out of this ugliness. For example, record the
> > filesystems the power subsystem managed to freeze on a temporary list in
> > the callbacks and then walk that list backwards during thaw to undo the
> > freezing or make sure that the power subsystem just actually exclusively
> > freezes things it can freeze and marking such filesystems as being owned
> > by power for the duration of the suspend or resume cycle. I opted for
> > the latter as that seemed the clean thing to do even if it means more
> > code changes.
> > 
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> 
> I have realized a slight catch with this approach that if hibernation races
> with filesystem freezing (e.g. DM reconfiguration), then hibernation need
> not freeze a filesystem because it's already frozen but userspace may thaw
> the filesystem before hibernation actually happens (relatively harmless).
> If the race happens the other way around, DM reconfiguration may
> unexpectedly fail with EBUSY (rather unexpected). So somehow tracking which
> fs was frozen by suspend while properly nesting with other freeze users may
> be actually a better approach (maybe just a sb flag even though it's
> somewhat hacky?).

The approach that I originally had was to add FREEZE_POWER which adds a
simple boolean into the sb_writers instead of a holder and then this
simply nests with the rest. I'll try to post that diff tomorrow.

[PATCH] fs: allow nesting with FREEZE_EXCL

Posted by Christian Brauner 10 months, 1 week ago

If hibernation races with filesystem freezing (e.g. DM reconfiguration),
then hibernation need not freeze a filesystem because it's already
frozen but userspace may thaw the filesystem before hibernation actually
happens.

If the race happens the other way around, DM reconfiguration may
unexpectedly fail with EBUSY.

So allow FREEZE_EXCL to nest with other holders. An exclusive freezer
cannot be undone by any of the other concurrent freezers.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/super.c         | 71 ++++++++++++++++++++++++++++++++++++++++++------------
 include/linux/fs.h |  2 +-
 2 files changed, 56 insertions(+), 17 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index b4bdbc509dba..e2fee655fbed 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1979,26 +1979,34 @@ static inline int freeze_dec(struct super_block *sb, enum freeze_holder who)
 	return sb->s_writers.freeze_kcount + sb->s_writers.freeze_ucount;
 }
 
-static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
+static inline bool may_freeze(struct super_block *sb, enum freeze_holder who,
+			      const void *freeze_owner)
 {
+	lockdep_assert_held(&sb->s_umount);
+
 	WARN_ON_ONCE((who & ~FREEZE_FLAGS));
 	WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
 
 	if (who & FREEZE_EXCL) {
 		if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
 			return false;
-
-		if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
+		if (WARN_ON_ONCE(who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL)))
 			return false;
-
-		return (sb->s_writers.freeze_kcount +
-			sb->s_writers.freeze_ucount) == 0;
+		if (WARN_ON_ONCE(!freeze_owner))
+			return false;
+		/* This freeze already has a specific owner. */
+		if (sb->s_writers.freeze_owner)
+			return false;
+		/*
+		 * This is already frozen multiple times so we're just
+		 * going to take a reference count and mark it as
+		 * belonging to use.
+		 */
+		if (sb->s_writers.freeze_kcount + sb->s_writers.freeze_ucount)
+			sb->s_writers.freeze_owner = freeze_owner;
+		return true;
 	}
 
-	/* This filesystem is already exclusively frozen. */
-	if (sb->s_writers.freeze_owner)
-		return false;
-
 	if (who & FREEZE_HOLDER_KERNEL)
 		return (who & FREEZE_MAY_NEST) ||
 		       sb->s_writers.freeze_kcount == 0;
@@ -2011,20 +2019,51 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
 static inline bool may_unfreeze(struct super_block *sb, enum freeze_holder who,
 				const void *freeze_owner)
 {
+	lockdep_assert_held(&sb->s_umount);
+
 	WARN_ON_ONCE((who & ~FREEZE_FLAGS));
 	WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
 
 	if (who & FREEZE_EXCL) {
-		if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
-			return false;
 		if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
 			return false;
-		if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
+		if (WARN_ON_ONCE(who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL)))
+			return false;
+		if (WARN_ON_ONCE(!freeze_owner))
+			return false;
+		if (WARN_ON_ONCE(sb->s_writers.freeze_kcount == 0))
 			return false;
-		return sb->s_writers.freeze_owner == freeze_owner;
+		/* This isn't exclusively frozen. */
+		if (!sb->s_writers.freeze_owner)
+			return false;
+		/* This isn't exclusively frozen by us. */
+		if (sb->s_writers.freeze_owner != freeze_owner)
+			return false;
+		/*
+		 * This is still frozen multiple times so we're just
+		 * going to drop our reference count and undo our
+		 * exclusive freeze.
+		 */
+		if ((sb->s_writers.freeze_kcount + sb->s_writers.freeze_ucount) > 1)
+			sb->s_writers.freeze_owner = NULL;
+		return true;
+	}
+
+	if (who & FREEZE_HOLDER_KERNEL) {
+		/*
+		 * Someone's trying to steal the reference belonging to
+		 * @sb->s_writers.freeze_owner.
+		 */
+		if (sb->s_writers.freeze_kcount == 1 &&
+		    sb->s_writers.freeze_owner)
+			return false;
+		return sb->s_writers.freeze_kcount > 0;
 	}
 
-	return sb->s_writers.freeze_owner == NULL;
+	if (who & FREEZE_HOLDER_USERSPACE)
+		return sb->s_writers.freeze_ucount > 0;
+
+	return false;
 }
 
 /**
@@ -2095,7 +2134,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who, const void *fre
 
 retry:
 	if (sb->s_writers.frozen == SB_FREEZE_COMPLETE) {
-		if (may_freeze(sb, who))
+		if (may_freeze(sb, who, freeze_owner))
 			ret = !!WARN_ON_ONCE(freeze_inc(sb, who) == 1);
 		else
 			ret = -EBUSY;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1edcba3cd68e..7a3f821d2723 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2270,7 +2270,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
  * @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem
  * @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem
  * @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed
- * @FREEZE_EXCL: whether actual freezing must be done by the caller
+ * @FREEZE_EXCL: a freeze that can only be undone by the owner
  *
  * Indicate who the owner of the freeze or thaw request is and whether
  * the freeze needs to be exclusive or can nest.

---
base-commit: a83fe97e0d53f7d2b0fc62fd9a322a963cb30306
change-id: 20250404-work-freeze-5eacb515f044

Re: [PATCH] fs: allow nesting with FREEZE_EXCL

Posted by Jan Kara 9 months, 1 week ago

On Fri 04-04-25 12:24:09, Christian Brauner wrote:
> If hibernation races with filesystem freezing (e.g. DM reconfiguration),
> then hibernation need not freeze a filesystem because it's already
> frozen but userspace may thaw the filesystem before hibernation actually
> happens.
> 
> If the race happens the other way around, DM reconfiguration may
> unexpectedly fail with EBUSY.
> 
> So allow FREEZE_EXCL to nest with other holders. An exclusive freezer
> cannot be undone by any of the other concurrent freezers.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

This has fallen through the cracks in my inbox but the patch now looks good
to me. Maybe we should fold it into "fs: add owner of freeze/thaw" to not
have strange intermediate state in the series?

								Honza

> ---
>  fs/super.c         | 71 ++++++++++++++++++++++++++++++++++++++++++------------
>  include/linux/fs.h |  2 +-
>  2 files changed, 56 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index b4bdbc509dba..e2fee655fbed 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1979,26 +1979,34 @@ static inline int freeze_dec(struct super_block *sb, enum freeze_holder who)
>  	return sb->s_writers.freeze_kcount + sb->s_writers.freeze_ucount;
>  }
>  
> -static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
> +static inline bool may_freeze(struct super_block *sb, enum freeze_holder who,
> +			      const void *freeze_owner)
>  {
> +	lockdep_assert_held(&sb->s_umount);
> +
>  	WARN_ON_ONCE((who & ~FREEZE_FLAGS));
>  	WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
>  
>  	if (who & FREEZE_EXCL) {
>  		if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
>  			return false;
> -
> -		if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
> +		if (WARN_ON_ONCE(who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL)))
>  			return false;
> -
> -		return (sb->s_writers.freeze_kcount +
> -			sb->s_writers.freeze_ucount) == 0;
> +		if (WARN_ON_ONCE(!freeze_owner))
> +			return false;
> +		/* This freeze already has a specific owner. */
> +		if (sb->s_writers.freeze_owner)
> +			return false;
> +		/*
> +		 * This is already frozen multiple times so we're just
> +		 * going to take a reference count and mark it as
> +		 * belonging to use.
> +		 */
> +		if (sb->s_writers.freeze_kcount + sb->s_writers.freeze_ucount)
> +			sb->s_writers.freeze_owner = freeze_owner;
> +		return true;
>  	}
>  
> -	/* This filesystem is already exclusively frozen. */
> -	if (sb->s_writers.freeze_owner)
> -		return false;
> -
>  	if (who & FREEZE_HOLDER_KERNEL)
>  		return (who & FREEZE_MAY_NEST) ||
>  		       sb->s_writers.freeze_kcount == 0;
> @@ -2011,20 +2019,51 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
>  static inline bool may_unfreeze(struct super_block *sb, enum freeze_holder who,
>  				const void *freeze_owner)
>  {
> +	lockdep_assert_held(&sb->s_umount);
> +
>  	WARN_ON_ONCE((who & ~FREEZE_FLAGS));
>  	WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
>  
>  	if (who & FREEZE_EXCL) {
> -		if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
> -			return false;
>  		if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
>  			return false;
> -		if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
> +		if (WARN_ON_ONCE(who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL)))
> +			return false;
> +		if (WARN_ON_ONCE(!freeze_owner))
> +			return false;
> +		if (WARN_ON_ONCE(sb->s_writers.freeze_kcount == 0))
>  			return false;
> -		return sb->s_writers.freeze_owner == freeze_owner;
> +		/* This isn't exclusively frozen. */
> +		if (!sb->s_writers.freeze_owner)
> +			return false;
> +		/* This isn't exclusively frozen by us. */
> +		if (sb->s_writers.freeze_owner != freeze_owner)
> +			return false;
> +		/*
> +		 * This is still frozen multiple times so we're just
> +		 * going to drop our reference count and undo our
> +		 * exclusive freeze.
> +		 */
> +		if ((sb->s_writers.freeze_kcount + sb->s_writers.freeze_ucount) > 1)
> +			sb->s_writers.freeze_owner = NULL;
> +		return true;
> +	}
> +
> +	if (who & FREEZE_HOLDER_KERNEL) {
> +		/*
> +		 * Someone's trying to steal the reference belonging to
> +		 * @sb->s_writers.freeze_owner.
> +		 */
> +		if (sb->s_writers.freeze_kcount == 1 &&
> +		    sb->s_writers.freeze_owner)
> +			return false;
> +		return sb->s_writers.freeze_kcount > 0;
>  	}
>  
> -	return sb->s_writers.freeze_owner == NULL;
> +	if (who & FREEZE_HOLDER_USERSPACE)
> +		return sb->s_writers.freeze_ucount > 0;
> +
> +	return false;
>  }
>  
>  /**
> @@ -2095,7 +2134,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who, const void *fre
>  
>  retry:
>  	if (sb->s_writers.frozen == SB_FREEZE_COMPLETE) {
> -		if (may_freeze(sb, who))
> +		if (may_freeze(sb, who, freeze_owner))
>  			ret = !!WARN_ON_ONCE(freeze_inc(sb, who) == 1);
>  		else
>  			ret = -EBUSY;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 1edcba3cd68e..7a3f821d2723 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2270,7 +2270,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
>   * @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem
>   * @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem
>   * @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed
> - * @FREEZE_EXCL: whether actual freezing must be done by the caller
> + * @FREEZE_EXCL: a freeze that can only be undone by the owner
>   *
>   * Indicate who the owner of the freeze or thaw request is and whether
>   * the freeze needs to be exclusive or can nest.
> 
> ---
> base-commit: a83fe97e0d53f7d2b0fc62fd9a322a963cb30306
> change-id: 20250404-work-freeze-5eacb515f044
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [PATCH] fs: allow nesting with FREEZE_EXCL

Posted by Christian Brauner 9 months ago

On Wed, May 07, 2025 at 01:18:34PM +0200, Jan Kara wrote:
> On Fri 04-04-25 12:24:09, Christian Brauner wrote:
> > If hibernation races with filesystem freezing (e.g. DM reconfiguration),
> > then hibernation need not freeze a filesystem because it's already
> > frozen but userspace may thaw the filesystem before hibernation actually
> > happens.
> > 
> > If the race happens the other way around, DM reconfiguration may
> > unexpectedly fail with EBUSY.
> > 
> > So allow FREEZE_EXCL to nest with other holders. An exclusive freezer
> > cannot be undone by any of the other concurrent freezers.
> > 
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> 
> This has fallen through the cracks in my inbox but the patch now looks good
> to me. Maybe we should fold it into "fs: add owner of freeze/thaw" to not
> have strange intermediate state in the series?

Done.

Re: [PATCH] fs: allow nesting with FREEZE_EXCL

Posted by Christoph Hellwig 10 months, 1 week ago

On Fri, Apr 04, 2025 at 12:24:09PM +0200, Christian Brauner wrote:
> If hibernation races with filesystem freezing (e.g. DM reconfiguration),
> then hibernation need not freeze a filesystem because it's already
> frozen but userspace may thaw the filesystem before hibernation actually
> happens.
> 
> If the race happens the other way around, DM reconfiguration may
> unexpectedly fail with EBUSY.
> 
> So allow FREEZE_EXCL to nest with other holders. An exclusive freezer
> cannot be undone by any of the other concurrent freezers.

What is FREEZE_EXCL?  I can't find it anywhere including in linux-next.

[PATCH v2 2/4] fs: allow all writers to be frozen

Posted by Christian Brauner 10 months, 1 week ago

During freeze/thaw we need to be able to freeze all writers during
suspend/hibernate. Otherwise tasks such as systemd-journald that mmap a
file and write to it will not be frozen after we've already frozen the
filesystem.

This has some risk of not being able to freeze processes in case a
process has acquired SB_FREEZE_PAGEFAULT under mmap_sem or
SB_FREEZE_INTERNAL under some other filesytem specific lock. If the
filesystem is frozen, a task can block on the frozen filesystem with
e.g., mmap_sem held. If some other task then blocks on grabbing that
mmap_sem, hibernation ill fail because it is unable to hibernate a task
holding mmap_sem. This could be fixed by making a range of filesystem
related locks use freezable sleeping. That's impractical and not
warranted just for suspend/hibernate. Assume that this is an infrequent
problem and we've given userspace a way to skip filesystem freezing
through a sysfs file.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/linux/fs.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b379a46b5576..1edcba3cd68e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1781,8 +1781,7 @@ static inline void __sb_end_write(struct super_block *sb, int level)
 
 static inline void __sb_start_write(struct super_block *sb, int level)
 {
-	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
-				   level == SB_FREEZE_WRITE);
+	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1, true);
 }
 
 static inline bool __sb_start_write_trylock(struct super_block *sb, int level)

-- 
2.47.2

Re: [PATCH v2 2/4] fs: allow all writers to be frozen

Posted by Jan Kara 10 months, 1 week ago

On Wed 02-04-25 16:07:32, Christian Brauner wrote:
> During freeze/thaw we need to be able to freeze all writers during
> suspend/hibernate. Otherwise tasks such as systemd-journald that mmap a
> file and write to it will not be frozen after we've already frozen the
> filesystem.
> 
> This has some risk of not being able to freeze processes in case a
> process has acquired SB_FREEZE_PAGEFAULT under mmap_sem or
> SB_FREEZE_INTERNAL under some other filesytem specific lock. If the
> filesystem is frozen, a task can block on the frozen filesystem with
> e.g., mmap_sem held. If some other task then blocks on grabbing that
> mmap_sem, hibernation ill fail because it is unable to hibernate a task
> holding mmap_sem. This could be fixed by making a range of filesystem
> related locks use freezable sleeping. That's impractical and not
> warranted just for suspend/hibernate. Assume that this is an infrequent
> problem and we've given userspace a way to skip filesystem freezing
> through a sysfs file.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

BTW, I agree about the need to silence lockdep somehow.

								Honza

> ---
>  include/linux/fs.h | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b379a46b5576..1edcba3cd68e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1781,8 +1781,7 @@ static inline void __sb_end_write(struct super_block *sb, int level)
>  
>  static inline void __sb_start_write(struct super_block *sb, int level)
>  {
> -	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
> -				   level == SB_FREEZE_WRITE);
> +	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1, true);
>  }
>  
>  static inline bool __sb_start_write_trylock(struct super_block *sb, int level)
> 
> -- 
> 2.47.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [PATCH v2 2/4] fs: allow all writers to be frozen

Posted by Christian Brauner 10 months, 1 week ago

On Wed, Apr 02, 2025 at 04:07:32PM +0200, Christian Brauner wrote:
> During freeze/thaw we need to be able to freeze all writers during
> suspend/hibernate. Otherwise tasks such as systemd-journald that mmap a
> file and write to it will not be frozen after we've already frozen the
> filesystem.
> 
> This has some risk of not being able to freeze processes in case a
> process has acquired SB_FREEZE_PAGEFAULT under mmap_sem or
> SB_FREEZE_INTERNAL under some other filesytem specific lock. If the
> filesystem is frozen, a task can block on the frozen filesystem with
> e.g., mmap_sem held. If some other task then blocks on grabbing that
> mmap_sem, hibernation ill fail because it is unable to hibernate a task
> holding mmap_sem. This could be fixed by making a range of filesystem
> related locks use freezable sleeping. That's impractical and not
> warranted just for suspend/hibernate. Assume that this is an infrequent
> problem and we've given userspace a way to skip filesystem freezing
> through a sysfs file.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  include/linux/fs.h | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b379a46b5576..1edcba3cd68e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1781,8 +1781,7 @@ static inline void __sb_end_write(struct super_block *sb, int level)
>  
>  static inline void __sb_start_write(struct super_block *sb, int level)
>  {
> -	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
> -				   level == SB_FREEZE_WRITE);
> +	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1, true);
>  }

Jan, one more thought about freezability here. We know that there will
can be at least one process during hibernation that ends up generating
page faults and that's systemd-journald. When systemd-sleep requests
writing a hibernation image via /sys/power/ files it will inevitably end
up freezing systemd-journald and it may be generating a page fault with
->mmap_lock held. systemd-journald is now sleeping with
SB_FREEZE_PAGEFAULT and TASK_FREEZABLE. We know this can cause
hibernation to fail. That part is fine. What isn't is that we will very
likely always trigger:

#ifdef CONFIG_LOCKDEP
        /*
         * It's dangerous to freeze with locks held; there be dragons there.
         */
        if (!(state & __TASK_FREEZABLE_UNSAFE))
                WARN_ON_ONCE(debug_locks && p->lockdep_depth);
#endif

with lockdep enabled.

So we really actually need percpu_rswem_read_freezable_unsafe(), i.e.,
TASK_FREEZABLE_UNSAFE.

Re: [PATCH v2 2/4] fs: allow all writers to be frozen

Posted by James Bottomley 10 months, 1 week ago

On Wed, 2025-04-02 at 17:32 +0200, Christian Brauner wrote:
[...]
> Jan, one more thought about freezability here. We know that there
> will can be at least one process during hibernation that ends up
> generating page faults and that's systemd-journald. When systemd-
> sleep requests writing a hibernation image via /sys/power/ files it
> will inevitably end up freezing systemd-journald and it may be
> generating a page fault with ->mmap_lock held. systemd-journald is
> now sleeping with SB_FREEZE_PAGEFAULT and TASK_FREEZABLE. We know
> this can cause hibernation to fail. That part is fine. What isn't is
> that we will very likely always trigger:
> 
> #ifdef CONFIG_LOCKDEP
>         /*
>          * It's dangerous to freeze with locks held; there be dragons
> there.
>          */
>         if (!(state & __TASK_FREEZABLE_UNSAFE))
>                 WARN_ON_ONCE(debug_locks && p->lockdep_depth);
> #endif
> 
> with lockdep enabled.
> 
> So we really actually need percpu_rswem_read_freezable_unsafe(),
> i.e., TASK_FREEZABLE_UNSAFE.

The sched people have pretty strong views about people not doing this,
expressed in the comment in sched.h and commit f5d39b020809
("freezer,sched: Rewrite core freezer logic") where most of the _unsafe
variants got removed with prejudice.

If we do get into this situation the worst that can happen is that
another upper lock acquisition triggers a hibernate failure and we thaw
everything, thus we can never truly deadlock, which is the fear, so
perhaps they might be OK with this.  Note that Rafael's solution to
this was to disable lockdep around hibernate/suspend and resume, which
is another possibility.

Regards,

James

Re: [PATCH v2 2/4] fs: allow all writers to be frozen

Posted by Christian Brauner 10 months, 1 week ago

On Wed, Apr 02, 2025 at 12:03:24PM -0400, James Bottomley wrote:
> On Wed, 2025-04-02 at 17:32 +0200, Christian Brauner wrote:
> [...]
> > Jan, one more thought about freezability here. We know that there
> > will can be at least one process during hibernation that ends up
> > generating page faults and that's systemd-journald. When systemd-
> > sleep requests writing a hibernation image via /sys/power/ files it
> > will inevitably end up freezing systemd-journald and it may be
> > generating a page fault with ->mmap_lock held. systemd-journald is
> > now sleeping with SB_FREEZE_PAGEFAULT and TASK_FREEZABLE. We know
> > this can cause hibernation to fail. That part is fine. What isn't is
> > that we will very likely always trigger:
> > 
> > #ifdef CONFIG_LOCKDEP
> >         /*
> >          * It's dangerous to freeze with locks held; there be dragons
> > there.
> >          */
> >         if (!(state & __TASK_FREEZABLE_UNSAFE))
> >                 WARN_ON_ONCE(debug_locks && p->lockdep_depth);
> > #endif
> > 
> > with lockdep enabled.
> > 
> > So we really actually need percpu_rswem_read_freezable_unsafe(),
> > i.e., TASK_FREEZABLE_UNSAFE.
> 
> The sched people have pretty strong views about people not doing this,
> expressed in the comment in sched.h and commit f5d39b020809
> ("freezer,sched: Rewrite core freezer logic") where most of the _unsafe
> variants got removed with prejudice.
> 
> If we do get into this situation the worst that can happen is that
> another upper lock acquisition triggers a hibernate failure and we thaw
> everything, thus we can never truly deadlock, which is the fear, so

Yes, I know that it's harmless but we need to not generate misleading
lockdep splats when lockdep is turned on and confuse users.

> perhaps they might be OK with this.  Note that Rafael's solution to
> this was to disable lockdep around hibernate/suspend and resume, which
> is another possibility.

It can be done as a follow-up. I'm just saying it needs treatment.

[PATCH v2 3/4] power: freeze filesystems during suspend/resume

Posted by Christian Brauner 10 months, 1 week ago

Now all the pieces are in place to actually allow the power subsystem
to freeze/thaw filesystems during suspend/resume. Filesystems are only
frozen and thawed if the power subsystem does actually own the freeze.

We could bubble up errors and fail suspend/resume if the error isn't
EBUSY (aka it's already frozen) but I don't think that this is worth it.
Filesystem freezing during suspend/resume is best-effort. If the user
has 500 ext4 filesystems mounted and 4 fail to freeze for whatever
reason then we simply skip them.

What we have now is already a big improvement and let's see how we fare
with it before making our lives even harder (and uglier) than we have
to.

We add a new sysctl know /sys/power/freeze_filesystems that will allow
userspace to freeze filesystems during suspend/hibernate. For now it
defaults to off. The thaw logic doesn't require checking whether
freezing is enabled because the power subsystem exclusively owns frozen
filesystems for the duration of suspend/hibernate and is able to skip
filesystems it doesn't need to freeze.

Also it is technically possible that filesystem
filesystem_freeze_enabled is true and power freezes the filesystems but
before freezing all processes another process disables
filesystem_freeze_enabled. If power were to place the filesystems_thaw()
call under filesystems_freeze_enabled it would fail to thaw the
fileystems it frozw. The exclusive holder mechanism makes it possible to
iterate through the list without any concern making sure that no
filesystems are left frozen.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/super.c               | 14 ++++++++++----
 kernel/power/hibernate.c | 16 +++++++++++++++-
 kernel/power/main.c      | 31 +++++++++++++++++++++++++++++++
 kernel/power/power.h     |  4 ++++
 kernel/power/suspend.c   |  7 +++++++
 5 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 3ddded4360c6..b4bdbc509dba 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1187,6 +1187,8 @@ static inline bool get_active_super(struct super_block *sb)
 	return active;
 }
 
+static const char *filesystems_freeze_ptr = "filesystems_freeze";
+
 static void filesystems_freeze_callback(struct super_block *sb, void *unused)
 {
 	if (!sb->s_op->freeze_fs && !sb->s_op->freeze_super)
@@ -1196,9 +1198,11 @@ static void filesystems_freeze_callback(struct super_block *sb, void *unused)
 		return;
 
 	if (sb->s_op->freeze_super)
-		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
+		sb->s_op->freeze_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
+				       filesystems_freeze_ptr);
 	else
-		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
+		freeze_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
+			     filesystems_freeze_ptr);
 
 	deactivate_super(sb);
 }
@@ -1218,9 +1222,11 @@ static void filesystems_thaw_callback(struct super_block *sb, void *unused)
 		return;
 
 	if (sb->s_op->thaw_super)
-		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
+		sb->s_op->thaw_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
+				     filesystems_freeze_ptr);
 	else
-		thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
+		thaw_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
+			   filesystems_freeze_ptr);
 
 	deactivate_super(sb);
 }
diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index 50ec26ea696b..37d733945c59 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -777,6 +777,8 @@ int hibernate(void)
 		goto Restore;
 
 	ksys_sync_helper();
+	if (filesystem_freeze_enabled)
+		filesystems_freeze();
 
 	error = freeze_processes();
 	if (error)
@@ -845,6 +847,7 @@ int hibernate(void)
 	/* Don't bother checking whether freezer_test_done is true */
 	freezer_test_done = false;
  Exit:
+	filesystems_thaw();
 	pm_notifier_call_chain(PM_POST_HIBERNATION);
  Restore:
 	pm_restore_console();
@@ -881,6 +884,9 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data)
 	if (error)
 		goto restore;
 
+	if (filesystem_freeze_enabled)
+		filesystems_freeze();
+
 	error = freeze_processes();
 	if (error)
 		goto exit;
@@ -940,6 +946,7 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data)
 	thaw_processes();
 
 exit:
+	filesystems_thaw();
 	pm_notifier_call_chain(PM_POST_HIBERNATION);
 
 restore:
@@ -1028,19 +1035,26 @@ static int software_resume(void)
 	if (error)
 		goto Restore;
 
+	if (filesystem_freeze_enabled)
+		filesystems_freeze();
+
 	pm_pr_dbg("Preparing processes for hibernation restore.\n");
 	error = freeze_processes();
-	if (error)
+	if (error) {
+		filesystems_thaw();
 		goto Close_Finish;
+	}
 
 	error = freeze_kernel_threads();
 	if (error) {
 		thaw_processes();
+		filesystems_thaw();
 		goto Close_Finish;
 	}
 
 	error = load_image_and_restore();
 	thaw_processes();
+	filesystems_thaw();
  Finish:
 	pm_notifier_call_chain(PM_POST_RESTORE);
  Restore:
diff --git a/kernel/power/main.c b/kernel/power/main.c
index 6254814d4817..0b0e76324c43 100644
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -962,6 +962,34 @@ power_attr(pm_freeze_timeout);
 
 #endif	/* CONFIG_FREEZER*/
 
+#if defined(CONFIG_SUSPEND) || defined(CONFIG_HIBERNATION)
+bool filesystem_freeze_enabled = false;
+
+static ssize_t freeze_filesystems_show(struct kobject *kobj,
+				       struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%d\n", filesystem_freeze_enabled);
+}
+
+static ssize_t freeze_filesystems_store(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					const char *buf, size_t n)
+{
+	unsigned long val;
+
+	if (kstrtoul(buf, 10, &val))
+		return -EINVAL;
+
+	if (val > 1)
+		return -EINVAL;
+
+	filesystem_freeze_enabled = !!val;
+	return n;
+}
+
+power_attr(freeze_filesystems);
+#endif /* CONFIG_SUSPEND || CONFIG_HIBERNATION */
+
 static struct attribute * g[] = {
 	&state_attr.attr,
 #ifdef CONFIG_PM_TRACE
@@ -991,6 +1019,9 @@ static struct attribute * g[] = {
 #endif
 #ifdef CONFIG_FREEZER
 	&pm_freeze_timeout_attr.attr,
+#endif
+#if defined(CONFIG_SUSPEND) || defined(CONFIG_HIBERNATION)
+	&freeze_filesystems_attr.attr,
 #endif
 	NULL,
 };
diff --git a/kernel/power/power.h b/kernel/power/power.h
index c352dea2f67b..2eb81662b8fa 100644
--- a/kernel/power/power.h
+++ b/kernel/power/power.h
@@ -18,6 +18,10 @@ struct swsusp_info {
 	unsigned long		size;
 } __aligned(PAGE_SIZE);
 
+#if defined(CONFIG_SUSPEND) || defined(CONFIG_HIBERNATION)
+extern bool filesystem_freeze_enabled;
+#endif
+
 #ifdef CONFIG_HIBERNATION
 /* kernel/power/snapshot.c */
 extern void __init hibernate_reserved_size_init(void);
diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index 8eaec4ab121d..76b141b9aac0 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -30,6 +30,7 @@
 #include <trace/events/power.h>
 #include <linux/compiler.h>
 #include <linux/moduleparam.h>
+#include <linux/fs.h>
 
 #include "power.h"
 
@@ -374,6 +375,8 @@ static int suspend_prepare(suspend_state_t state)
 	if (error)
 		goto Restore;
 
+	if (filesystem_freeze_enabled)
+		filesystems_freeze();
 	trace_suspend_resume(TPS("freeze_processes"), 0, true);
 	error = suspend_freeze_processes();
 	trace_suspend_resume(TPS("freeze_processes"), 0, false);
@@ -550,6 +553,7 @@ int suspend_devices_and_enter(suspend_state_t state)
 static void suspend_finish(void)
 {
 	suspend_thaw_processes();
+	filesystems_thaw();
 	pm_notifier_call_chain(PM_POST_SUSPEND);
 	pm_restore_console();
 }
@@ -588,6 +592,8 @@ static int enter_state(suspend_state_t state)
 		ksys_sync_helper();
 		trace_suspend_resume(TPS("sync_filesystems"), 0, false);
 	}
+	if (filesystem_freeze_enabled)
+		filesystems_freeze();
 
 	pm_pr_dbg("Preparing system for sleep (%s)\n", mem_sleep_labels[state]);
 	pm_suspend_clear_flags();
@@ -609,6 +615,7 @@ static int enter_state(suspend_state_t state)
 	pm_pr_dbg("Finishing wakeup.\n");
 	suspend_finish();
  Unlock:
+	filesystems_thaw();
 	mutex_unlock(&system_transition_mutex);
 	return error;
 }

-- 
2.47.2

Re: [PATCH v2 3/4] power: freeze filesystems during suspend/resume

Posted by Jan Kara 10 months, 1 week ago

On Wed 02-04-25 16:07:33, Christian Brauner wrote:
> Now all the pieces are in place to actually allow the power subsystem
> to freeze/thaw filesystems during suspend/resume. Filesystems are only
> frozen and thawed if the power subsystem does actually own the freeze.
> 
> We could bubble up errors and fail suspend/resume if the error isn't
> EBUSY (aka it's already frozen) but I don't think that this is worth it.
> Filesystem freezing during suspend/resume is best-effort. If the user
> has 500 ext4 filesystems mounted and 4 fail to freeze for whatever
> reason then we simply skip them.
> 
> What we have now is already a big improvement and let's see how we fare
> with it before making our lives even harder (and uglier) than we have
> to.
> 
> We add a new sysctl know /sys/power/freeze_filesystems that will allow
> userspace to freeze filesystems during suspend/hibernate. For now it
> defaults to off. The thaw logic doesn't require checking whether
> freezing is enabled because the power subsystem exclusively owns frozen
> filesystems for the duration of suspend/hibernate and is able to skip
> filesystems it doesn't need to freeze.
> 
> Also it is technically possible that filesystem
> filesystem_freeze_enabled is true and power freezes the filesystems but
> before freezing all processes another process disables
> filesystem_freeze_enabled. If power were to place the filesystems_thaw()
> call under filesystems_freeze_enabled it would fail to thaw the
> fileystems it frozw. The exclusive holder mechanism makes it possible to
> iterate through the list without any concern making sure that no
> filesystems are left frozen.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Looks good modulo the nesting issue I've mentioned in my comments to patch
1. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/super.c               | 14 ++++++++++----
>  kernel/power/hibernate.c | 16 +++++++++++++++-
>  kernel/power/main.c      | 31 +++++++++++++++++++++++++++++++
>  kernel/power/power.h     |  4 ++++
>  kernel/power/suspend.c   |  7 +++++++
>  5 files changed, 67 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 3ddded4360c6..b4bdbc509dba 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1187,6 +1187,8 @@ static inline bool get_active_super(struct super_block *sb)
>  	return active;
>  }
>  
> +static const char *filesystems_freeze_ptr = "filesystems_freeze";
> +
>  static void filesystems_freeze_callback(struct super_block *sb, void *unused)
>  {
>  	if (!sb->s_op->freeze_fs && !sb->s_op->freeze_super)
> @@ -1196,9 +1198,11 @@ static void filesystems_freeze_callback(struct super_block *sb, void *unused)
>  		return;
>  
>  	if (sb->s_op->freeze_super)
> -		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
> +		sb->s_op->freeze_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
> +				       filesystems_freeze_ptr);
>  	else
> -		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
> +		freeze_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
> +			     filesystems_freeze_ptr);
>  
>  	deactivate_super(sb);
>  }
> @@ -1218,9 +1222,11 @@ static void filesystems_thaw_callback(struct super_block *sb, void *unused)
>  		return;
>  
>  	if (sb->s_op->thaw_super)
> -		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
> +		sb->s_op->thaw_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
> +				     filesystems_freeze_ptr);
>  	else
> -		thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
> +		thaw_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
> +			   filesystems_freeze_ptr);
>  
>  	deactivate_super(sb);
>  }
> diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
> index 50ec26ea696b..37d733945c59 100644
> --- a/kernel/power/hibernate.c
> +++ b/kernel/power/hibernate.c
> @@ -777,6 +777,8 @@ int hibernate(void)
>  		goto Restore;
>  
>  	ksys_sync_helper();
> +	if (filesystem_freeze_enabled)
> +		filesystems_freeze();
>  
>  	error = freeze_processes();
>  	if (error)
> @@ -845,6 +847,7 @@ int hibernate(void)
>  	/* Don't bother checking whether freezer_test_done is true */
>  	freezer_test_done = false;
>   Exit:
> +	filesystems_thaw();
>  	pm_notifier_call_chain(PM_POST_HIBERNATION);
>   Restore:
>  	pm_restore_console();
> @@ -881,6 +884,9 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data)
>  	if (error)
>  		goto restore;
>  
> +	if (filesystem_freeze_enabled)
> +		filesystems_freeze();
> +
>  	error = freeze_processes();
>  	if (error)
>  		goto exit;
> @@ -940,6 +946,7 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data)
>  	thaw_processes();
>  
>  exit:
> +	filesystems_thaw();
>  	pm_notifier_call_chain(PM_POST_HIBERNATION);
>  
>  restore:
> @@ -1028,19 +1035,26 @@ static int software_resume(void)
>  	if (error)
>  		goto Restore;
>  
> +	if (filesystem_freeze_enabled)
> +		filesystems_freeze();
> +
>  	pm_pr_dbg("Preparing processes for hibernation restore.\n");
>  	error = freeze_processes();
> -	if (error)
> +	if (error) {
> +		filesystems_thaw();
>  		goto Close_Finish;
> +	}
>  
>  	error = freeze_kernel_threads();
>  	if (error) {
>  		thaw_processes();
> +		filesystems_thaw();
>  		goto Close_Finish;
>  	}
>  
>  	error = load_image_and_restore();
>  	thaw_processes();
> +	filesystems_thaw();
>   Finish:
>  	pm_notifier_call_chain(PM_POST_RESTORE);
>   Restore:
> diff --git a/kernel/power/main.c b/kernel/power/main.c
> index 6254814d4817..0b0e76324c43 100644
> --- a/kernel/power/main.c
> +++ b/kernel/power/main.c
> @@ -962,6 +962,34 @@ power_attr(pm_freeze_timeout);
>  
>  #endif	/* CONFIG_FREEZER*/
>  
> +#if defined(CONFIG_SUSPEND) || defined(CONFIG_HIBERNATION)
> +bool filesystem_freeze_enabled = false;
> +
> +static ssize_t freeze_filesystems_show(struct kobject *kobj,
> +				       struct kobj_attribute *attr, char *buf)
> +{
> +	return sysfs_emit(buf, "%d\n", filesystem_freeze_enabled);
> +}
> +
> +static ssize_t freeze_filesystems_store(struct kobject *kobj,
> +					struct kobj_attribute *attr,
> +					const char *buf, size_t n)
> +{
> +	unsigned long val;
> +
> +	if (kstrtoul(buf, 10, &val))
> +		return -EINVAL;
> +
> +	if (val > 1)
> +		return -EINVAL;
> +
> +	filesystem_freeze_enabled = !!val;
> +	return n;
> +}
> +
> +power_attr(freeze_filesystems);
> +#endif /* CONFIG_SUSPEND || CONFIG_HIBERNATION */
> +
>  static struct attribute * g[] = {
>  	&state_attr.attr,
>  #ifdef CONFIG_PM_TRACE
> @@ -991,6 +1019,9 @@ static struct attribute * g[] = {
>  #endif
>  #ifdef CONFIG_FREEZER
>  	&pm_freeze_timeout_attr.attr,
> +#endif
> +#if defined(CONFIG_SUSPEND) || defined(CONFIG_HIBERNATION)
> +	&freeze_filesystems_attr.attr,
>  #endif
>  	NULL,
>  };
> diff --git a/kernel/power/power.h b/kernel/power/power.h
> index c352dea2f67b..2eb81662b8fa 100644
> --- a/kernel/power/power.h
> +++ b/kernel/power/power.h
> @@ -18,6 +18,10 @@ struct swsusp_info {
>  	unsigned long		size;
>  } __aligned(PAGE_SIZE);
>  
> +#if defined(CONFIG_SUSPEND) || defined(CONFIG_HIBERNATION)
> +extern bool filesystem_freeze_enabled;
> +#endif
> +
>  #ifdef CONFIG_HIBERNATION
>  /* kernel/power/snapshot.c */
>  extern void __init hibernate_reserved_size_init(void);
> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> index 8eaec4ab121d..76b141b9aac0 100644
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -30,6 +30,7 @@
>  #include <trace/events/power.h>
>  #include <linux/compiler.h>
>  #include <linux/moduleparam.h>
> +#include <linux/fs.h>
>  
>  #include "power.h"
>  
> @@ -374,6 +375,8 @@ static int suspend_prepare(suspend_state_t state)
>  	if (error)
>  		goto Restore;
>  
> +	if (filesystem_freeze_enabled)
> +		filesystems_freeze();
>  	trace_suspend_resume(TPS("freeze_processes"), 0, true);
>  	error = suspend_freeze_processes();
>  	trace_suspend_resume(TPS("freeze_processes"), 0, false);
> @@ -550,6 +553,7 @@ int suspend_devices_and_enter(suspend_state_t state)
>  static void suspend_finish(void)
>  {
>  	suspend_thaw_processes();
> +	filesystems_thaw();
>  	pm_notifier_call_chain(PM_POST_SUSPEND);
>  	pm_restore_console();
>  }
> @@ -588,6 +592,8 @@ static int enter_state(suspend_state_t state)
>  		ksys_sync_helper();
>  		trace_suspend_resume(TPS("sync_filesystems"), 0, false);
>  	}
> +	if (filesystem_freeze_enabled)
> +		filesystems_freeze();
>  
>  	pm_pr_dbg("Preparing system for sleep (%s)\n", mem_sleep_labels[state]);
>  	pm_suspend_clear_flags();
> @@ -609,6 +615,7 @@ static int enter_state(suspend_state_t state)
>  	pm_pr_dbg("Finishing wakeup.\n");
>  	suspend_finish();
>   Unlock:
> +	filesystems_thaw();
>  	mutex_unlock(&system_transition_mutex);
>  	return error;
>  }
> 
> -- 
> 2.47.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[PATCH v2 4/4] kernfs: add warning about implementing freeze/thaw

Posted by Christian Brauner 10 months, 1 week ago

Sysfs is built on top of kernfs and sysfs provides the power management
infrastructure to support suspend/hibernate by writing to various files
in /sys/power/. As filesystems may be automatically frozen during
suspend/hibernate implementing freeze/thaw support for kernfs
generically will cause deadlocks as the suspending/hibernation
initiating task will hold a VFS lock that it will then wait upon to be
released. If freeze/thaw for kernfs is needed talk to the VFS.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/kernfs/mount.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 1358c21837f1..d2073bb2b633 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,21 @@ const struct super_operations kernfs_sops = {
 
 	.show_options	= kernfs_sop_show_options,
 	.show_path	= kernfs_sop_show_path,
+
+	/*
+	 * sysfs is built on top of kernfs and sysfs provides the power
+	 * management infrastructure to support suspend/hibernate by
+	 * writing to various files in /sys/power/. As filesystems may
+	 * be automatically frozen during suspend/hibernate implementing
+	 * freeze/thaw support for kernfs generically will cause
+	 * deadlocks as the suspending/hibernation initiating task will
+	 * hold a VFS lock that it will then wait upon to be released.
+	 * If freeze/thaw for kernfs is needed talk to the VFS.
+	 */
+	.freeze_fs	= NULL,
+	.unfreeze_fs	= NULL,
+	.freeze_super	= NULL,
+	.thaw_super	= NULL,
 };
 
 static int kernfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,

-- 
2.47.2

Re: [PATCH v2 4/4] kernfs: add warning about implementing freeze/thaw

Posted by Jan Kara 10 months, 1 week ago

On Wed 02-04-25 16:07:34, Christian Brauner wrote:
> Sysfs is built on top of kernfs and sysfs provides the power management
> infrastructure to support suspend/hibernate by writing to various files
> in /sys/power/. As filesystems may be automatically frozen during
> suspend/hibernate implementing freeze/thaw support for kernfs
> generically will cause deadlocks as the suspending/hibernation
> initiating task will hold a VFS lock that it will then wait upon to be
> released. If freeze/thaw for kernfs is needed talk to the VFS.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Yeah, good idea. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/kernfs/mount.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index 1358c21837f1..d2073bb2b633 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -62,6 +62,21 @@ const struct super_operations kernfs_sops = {
>  
>  	.show_options	= kernfs_sop_show_options,
>  	.show_path	= kernfs_sop_show_path,
> +
> +	/*
> +	 * sysfs is built on top of kernfs and sysfs provides the power
> +	 * management infrastructure to support suspend/hibernate by
> +	 * writing to various files in /sys/power/. As filesystems may
> +	 * be automatically frozen during suspend/hibernate implementing
> +	 * freeze/thaw support for kernfs generically will cause
> +	 * deadlocks as the suspending/hibernation initiating task will
> +	 * hold a VFS lock that it will then wait upon to be released.
> +	 * If freeze/thaw for kernfs is needed talk to the VFS.
> +	 */
> +	.freeze_fs	= NULL,
> +	.unfreeze_fs	= NULL,
> +	.freeze_super	= NULL,
> +	.thaw_super	= NULL,
>  };
>  
>  static int kernfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
> 
> -- 
> 2.47.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume

Posted by Christian Brauner 10 months, 1 week ago

On Tue, Apr 01, 2025 at 02:32:45AM +0200, Christian Brauner wrote:
> The whole shebang can also be found at:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> 
> I know nothing about power or hibernation.

I would like to place this behind a Kconfig option and add a
/sys/power/freeze_on_suspend option as these changes are pretty
sensitive and to give userspace the ability to experiment with this for
a while until we remove it. That means we should skip the removal of all
the freezer changes in the filesystems until we're happy enough that
this works reliable enough.