fs/btrfs/disk-io.c | 4 +-- fs/btrfs/scrub.c | 2 +- fs/ext4/mballoc.c | 2 +- fs/ext4/super.c | 3 -- fs/f2fs/gc.c | 6 ++-- fs/gfs2/super.c | 20 ++++++----- fs/gfs2/sys.c | 4 +-- fs/ioctl.c | 8 ++--- fs/super.c | 82 ++++++++++++++++++++++++++++++++++++--------- fs/xfs/scrub/fscounters.c | 4 +-- fs/xfs/xfs_discard.c | 2 +- fs/xfs/xfs_log.c | 3 +- fs/xfs/xfs_log_cil.c | 2 +- fs/xfs/xfs_mru_cache.c | 2 +- fs/xfs/xfs_notify_failure.c | 6 ++-- fs/xfs/xfs_pwork.c | 2 +- fs/xfs/xfs_super.c | 14 ++++---- fs/xfs/xfs_trans_ail.c | 3 -- fs/xfs/xfs_zone_gc.c | 2 -- include/linux/fs.h | 16 ++++++--- kernel/power/hibernate.c | 13 ++++++- kernel/power/suspend.c | 8 +++++ 22 files changed, 139 insertions(+), 69 deletions(-)
The whole shebang can also be found at:
https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
I know nothing about power or hibernation. I've tested it as best as I
could. Works for me (TM).
I need to catch some actual sleep now...
---
Now all the pieces are in place to actually allow the power subsystem to
freeze/thaw filesystems during suspend/resume. Filesystems are only
frozen and thawed if the power subsystem does actually own the freeze.
Othwerwise it risks thawing filesystems it didn't own. This could be
done differently be e.g., keeping the filesystems that were actually
frozen on a list and then unfreezing them from that list. This is
disgustingly unclean though and reeks of an ugly hack.
If the filesystem is already frozen by the time we've frozen all
userspace processes we don't care to freeze it again. That's userspace's
job once the process resumes. We only actually freeze filesystems if we
absolutely have to and we ignore other failures to freeze.
We could bubble up errors and fail suspend/resume if the error isn't
EBUSY (aka it's already frozen) but I don't think that this is worth it.
Filesystem freezing during suspend/resume is best-effort. If the user
has 500 ext4 filesystems mounted and 4 fail to freeze for whatever
reason then we simply skip them.
What we have now is already a big improvement and let's see how we fare
with it before making our lives even harder (and uglier) than we have
to.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (3):
fs: add owner of freeze/thaw
fs: allow pagefault based writers to be frozen
power: freeze filesystems during suspend/resume
Luis Chamberlain (3):
ext4: replace kthread freezing with auto fs freezing
btrfs: replace kthread freezing with auto fs freezing
xfs: replace kthread freezing with auto fs freezing
fs/btrfs/disk-io.c | 4 +--
fs/btrfs/scrub.c | 2 +-
fs/ext4/mballoc.c | 2 +-
fs/ext4/super.c | 3 --
fs/f2fs/gc.c | 6 ++--
fs/gfs2/super.c | 20 ++++++-----
fs/gfs2/sys.c | 4 +--
fs/ioctl.c | 8 ++---
fs/super.c | 82 ++++++++++++++++++++++++++++++++++++---------
fs/xfs/scrub/fscounters.c | 4 +--
fs/xfs/xfs_discard.c | 2 +-
fs/xfs/xfs_log.c | 3 +-
fs/xfs/xfs_log_cil.c | 2 +-
fs/xfs/xfs_mru_cache.c | 2 +-
fs/xfs/xfs_notify_failure.c | 6 ++--
fs/xfs/xfs_pwork.c | 2 +-
fs/xfs/xfs_super.c | 14 ++++----
fs/xfs/xfs_trans_ail.c | 3 --
fs/xfs/xfs_zone_gc.c | 2 --
include/linux/fs.h | 16 ++++++---
kernel/power/hibernate.c | 13 ++++++-
kernel/power/suspend.c | 8 +++++
22 files changed, 139 insertions(+), 69 deletions(-)
---
base-commit: a68c99192db8060f383a2680333866c0be688ece
change-id: 20250401-work-freeze-693b5b5a78e0
On Tue, 2025-04-01 at 02:32 +0200, Christian Brauner wrote:
> The whole shebang can also be found at:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
>
> I know nothing about power or hibernation. I've tested it as best as
> I could. Works for me (TM).
I'm testing the latest you have in work.freeze and it doesn't currently
work for me. Patch 7b315c39b67d ("power: freeze filesystems during
suspend/resume") doesn't set filesystems_freeze_ptr so it ends up being
NULL and tripping over this check
+static inline bool may_unfreeze(struct super_block *sb, enum
freeze_holder who,
+ const void *freeze_owner)
+{
+ WARN_ON_ONCE((who & ~FREEZE_FLAGS));
+ WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
+
+ if (who & FREEZE_EXCL) {
+ if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
+ return false;
in f15a9ae05a71 ("fs: add owner of freeze/thaw") and failing to resume
from hibernate. Setting it to __builtin_return_address(0) in
filesystems_freeze() makes everything work as expected, so that's what
I'm testing now.
I suppose one minor, minor nit is that the vagaries of English grammar
mean that the verbs fail and succeed don't take the same grammatical
construction, so failed can take the infinitive (failed to thaw)
perfectly well, but succeeded takes a prepositional gerund construction
instead: "succeeded at/in thawing" instead of the infinitive "succeeded
to thaw" ... I've no idea why, but I'd probably blame the Victorians
...
Regards,
James
On Tue, Apr 01, 2025 at 01:02:07PM -0400, James Bottomley wrote:
> On Tue, 2025-04-01 at 02:32 +0200, Christian Brauner wrote:
> > The whole shebang can also be found at:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> >
> > I know nothing about power or hibernation. I've tested it as best as
> > I could. Works for me (TM).
>
> I'm testing the latest you have in work.freeze and it doesn't currently
> work for me. Patch 7b315c39b67d ("power: freeze filesystems during
> suspend/resume") doesn't set filesystems_freeze_ptr so it ends up being
> NULL and tripping over this check
I haven't pushed the new version there. Sorry about that. I only have it
locally.
>
> +static inline bool may_unfreeze(struct super_block *sb, enum
> freeze_holder who,
> + const void *freeze_owner)
> +{
> + WARN_ON_ONCE((who & ~FREEZE_FLAGS));
> + WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
> +
> + if (who & FREEZE_EXCL) {
> + if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
> + return false;
>
>
> in f15a9ae05a71 ("fs: add owner of freeze/thaw") and failing to resume
> from hibernate. Setting it to __builtin_return_address(0) in
> filesystems_freeze() makes everything work as expected, so that's what
> I'm testing now.
+1
I'll send the final version out in a bit.
On Wed, 2025-04-02 at 09:46 +0200, Christian Brauner wrote:
> On Tue, Apr 01, 2025 at 01:02:07PM -0400, James Bottomley wrote:
> > On Tue, 2025-04-01 at 02:32 +0200, Christian Brauner wrote:
> > > The whole shebang can also be found at:
> > > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> > >
> > > I know nothing about power or hibernation. I've tested it as best
> > > as I could. Works for me (TM).
> >
> > I'm testing the latest you have in work.freeze and it doesn't
> > currently work for me. Patch 7b315c39b67d ("power: freeze
> > filesystems during suspend/resume") doesn't set
> > filesystems_freeze_ptr so it ends up being NULL and tripping over
> > this check
>
> I haven't pushed the new version there. Sorry about that. I only have
> it locally.
>
> >
> > +static inline bool may_unfreeze(struct super_block *sb, enum
> > freeze_holder who,
> > + const void *freeze_owner)
> > +{
> > + WARN_ON_ONCE((who & ~FREEZE_FLAGS));
> > + WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
> > +
> > + if (who & FREEZE_EXCL) {
> > + if (WARN_ON_ONCE(sb->s_writers.freeze_owner ==
> > NULL))
> > + return false;
> >
> >
> > in f15a9ae05a71 ("fs: add owner of freeze/thaw") and failing to
> > resume from hibernate. Setting it to __builtin_return_address(0)
> > in filesystems_freeze() makes everything work as expected, so
> > that's what I'm testing now.
>
> +1
>
> I'll send the final version out in a bit.
I've now done some extensive testing on loop nested filesystems with
fio load on the upper. I've tried xfs on ext4 and ext4 on ext4.
Hibernate/Resume has currently worked on these without a hitch (and the
fio load burps a bit but then starts running at full speed within a few
seconds). What I'm doing is a single round of hibernate/resume followed
by a reboot. I'm relying on the fschecks to detect any filesystem
corruption. I've also tried doing a couple of fresh starts of the
hibernated image to check that we did correctly freeze the filesystems.
The problems I've noticed are:
1. I'm using 9p to push host directories throught and that
completely hangs after a resume. This is expected because the
virtio server is out of sync, but it does indicate a need to
address Jeff's question of what we should be doing for network
filesystems (and is also the reason I have to reboot after
resuming).
2. Top doesn't show any CPU activity after resume even though fio is
definitely running. This seems to be a suspend issue and
unrelated to filesystems, but I'll continue investigating.
Regards,
James
On Tue, Apr 08, 2025 at 11:43:46AM -0400, James Bottomley wrote:
> On Wed, 2025-04-02 at 09:46 +0200, Christian Brauner wrote:
> > On Tue, Apr 01, 2025 at 01:02:07PM -0400, James Bottomley wrote:
> > > On Tue, 2025-04-01 at 02:32 +0200, Christian Brauner wrote:
> > > > The whole shebang can also be found at:
> > > > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> > > >
> > > > I know nothing about power or hibernation. I've tested it as best
> > > > as I could. Works for me (TM).
> > >
> > > I'm testing the latest you have in work.freeze and it doesn't
> > > currently work for me. Patch 7b315c39b67d ("power: freeze
> > > filesystems during suspend/resume") doesn't set
> > > filesystems_freeze_ptr so it ends up being NULL and tripping over
> > > this check
> >
> > I haven't pushed the new version there. Sorry about that. I only have
> > it locally.
> >
> > >
> > > +static inline bool may_unfreeze(struct super_block *sb, enum
> > > freeze_holder who,
> > > + const void *freeze_owner)
> > > +{
> > > + WARN_ON_ONCE((who & ~FREEZE_FLAGS));
> > > + WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
> > > +
> > > + if (who & FREEZE_EXCL) {
> > > + if (WARN_ON_ONCE(sb->s_writers.freeze_owner ==
> > > NULL))
> > > + return false;
> > >
> > >
> > > in f15a9ae05a71 ("fs: add owner of freeze/thaw") and failing to
> > > resume from hibernate. Setting it to __builtin_return_address(0)
> > > in filesystems_freeze() makes everything work as expected, so
> > > that's what I'm testing now.
> >
> > +1
> >
> > I'll send the final version out in a bit.
>
> I've now done some extensive testing on loop nested filesystems with
> fio load on the upper. I've tried xfs on ext4 and ext4 on ext4.
> Hibernate/Resume has currently worked on these without a hitch (and the
> fio load burps a bit but then starts running at full speed within a few
> seconds). What I'm doing is a single round of hibernate/resume followed
> by a reboot. I'm relying on the fschecks to detect any filesystem
> corruption. I've also tried doing a couple of fresh starts of the
> hibernated image to check that we did correctly freeze the filesystems.
>
> The problems I've noticed are:
>
> 1. I'm using 9p to push host directories throught and that
> completely hangs after a resume. This is expected because the
> virtio server is out of sync, but it does indicate a need to
> address Jeff's question of what we should be doing for network
> filesystems (and is also the reason I have to reboot after
> resuming).
No network filesystem supports freeze/thaw so they're not frozen/thawed
during hibernation.
virtiofs doesn't support filesystem freezing either and it warns about
virtio_driver based freezing not being implemented.
On Tue, Apr 08, 2025 at 11:43:46AM -0400, James Bottomley wrote:
> On Wed, 2025-04-02 at 09:46 +0200, Christian Brauner wrote:
> > On Tue, Apr 01, 2025 at 01:02:07PM -0400, James Bottomley wrote:
> > > On Tue, 2025-04-01 at 02:32 +0200, Christian Brauner wrote:
> > > > The whole shebang can also be found at:
> > > > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> > > >
> > > > I know nothing about power or hibernation. I've tested it as best
> > > > as I could. Works for me (TM).
> > >
> > > I'm testing the latest you have in work.freeze and it doesn't
> > > currently work for me. Patch 7b315c39b67d ("power: freeze
> > > filesystems during suspend/resume") doesn't set
> > > filesystems_freeze_ptr so it ends up being NULL and tripping over
> > > this check
> >
> > I haven't pushed the new version there. Sorry about that. I only have
> > it locally.
> >
> > >
> > > +static inline bool may_unfreeze(struct super_block *sb, enum
> > > freeze_holder who,
> > > + const void *freeze_owner)
> > > +{
> > > + WARN_ON_ONCE((who & ~FREEZE_FLAGS));
> > > + WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
> > > +
> > > + if (who & FREEZE_EXCL) {
> > > + if (WARN_ON_ONCE(sb->s_writers.freeze_owner ==
> > > NULL))
> > > + return false;
> > >
> > >
> > > in f15a9ae05a71 ("fs: add owner of freeze/thaw") and failing to
> > > resume from hibernate. Setting it to __builtin_return_address(0)
> > > in filesystems_freeze() makes everything work as expected, so
> > > that's what I'm testing now.
> >
> > +1
> >
> > I'll send the final version out in a bit.
>
> I've now done some extensive testing on loop nested filesystems with
> fio load on the upper. I've tried xfs on ext4 and ext4 on ext4.
> Hibernate/Resume has currently worked on these without a hitch (and the
> fio load burps a bit but then starts running at full speed within a few
> seconds). What I'm doing is a single round of hibernate/resume followed
> by a reboot. I'm relying on the fschecks to detect any filesystem
> corruption. I've also tried doing a couple of fresh starts of the
> hibernated image to check that we did correctly freeze the filesystems.
>
> The problems I've noticed are:
>
> 1. I'm using 9p to push host directories throught and that
> completely hangs after a resume. This is expected because the
> virtio server is out of sync, but it does indicate a need to
> address Jeff's question of what we should be doing for network
> filesystems (and is also the reason I have to reboot after
> resuming).
> 2. Top doesn't show any CPU activity after resume even though fio is
> definitely running. This seems to be a suspend issue and
> unrelated to filesystems, but I'll continue investigating.
To be clear, on the fio run -- are you running fio *while*
suspend/resume cycle on XFS? That used to stall / break suspend
resume. We may want to test dd against a drive too, that will use
the block device cache, and I forget if we have a freeze/thaw for it.
Luis
On Tue, 2025-04-08 at 10:09 -0700, Luis Chamberlain wrote: > On Tue, Apr 08, 2025 at 11:43:46AM -0400, James Bottomley wrote: [...] > > I've now done some extensive testing on loop nested filesystems > > with fio load on the upper. I've tried xfs on ext4 and ext4 on > > ext4. Hibernate/Resume has currently worked on these without a > > hitch (and the fio load burps a bit but then starts running at full > > speed within a few seconds). What I'm doing is a single round of > > hibernate/resume followed by a reboot. I'm relying on the fschecks > > to detect any filesystem corruption. I've also tried doing a couple > > of fresh starts of the hibernated image to check that we did > > correctly freeze the filesystems. > > > > The problems I've noticed are: > > > > 1. I'm using 9p to push host directories throught and that > > completely hangs after a resume. This is expected because the > > virtio server is out of sync, but it does indicate a need to > > address Jeff's question of what we should be doing for > > network > > filesystems (and is also the reason I have to reboot after > > resuming). > > 2. Top doesn't show any CPU activity after resume even though > > fio is > > definitely running. This seems to be a suspend issue and > > unrelated to filesystems, but I'll continue investigating. > > To be clear, on the fio run -- are you running fio *while* > suspend/resume cycle on XFS? Yes, that's why I said "the fio load burps a bit" (as in after resume) "but then starts running full speed after a few seconds". > That used to stall / break suspend resume. We may want to test dd > against a drive too, that will use the block device cache, and I > forget if we have a freeze/thaw for it. fio is running a read/write test, but I think all my caches are write through for safety (although I have verified that the device cache flush is sent as the last sequence of hibernate). Regards, James
And in case its useful, to test this on a VM you'll need on libvirt:
</os>
<pm>
<suspend-to-mem enabled='yes'/>
<suspend-to-disk enabled='yes'/>
</pm>
The litmus test which used to stall without ever returning was:
while true; do
dd if=/dev/zero of=/path/to/xfs/foo bs=1M count=1024 &> /dev/null
done
To suspend you can use one of these two:
echo mem > /sys/power/state
systemctl suspend
Verify you can resume from suspend:
virsh qemu-monitor-command guest_foo 'query-current-machine'
{"return":{"wakeup-suspend-support":true},"id":"libvirt-415"}
To resume the guest:
virsh dompmwakeup guest_foo
Luis
On Tue, 2025-04-08 at 10:20 -0700, Luis Chamberlain wrote: > And in case its useful, to test this on a VM you'll need on libvirt: Just so we're clear, I'm only doing hibernate, not suspend tests. I figure they should be a bit harsher, but you never know. The reason is I set up my test rig for efivarfs, which only has a testable problem on hibernate. Regards, James
On Tue, Apr 01, 2025 at 02:32:45AM +0200, Christian Brauner wrote: > The whole shebang can also be found at: > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze > > I know nothing about power or hibernation. I've tested it as best as I > could. Works for me (TM). > > I need to catch some actual sleep now... > > --- > > Now all the pieces are in place to actually allow the power subsystem to > freeze/thaw filesystems during suspend/resume. Filesystems are only > frozen and thawed if the power subsystem does actually own the freeze. Urgh, I was relying on all kthreads to be freezable for live-patching: https://lkml.kernel.org/r/20250324134909.GA14718@noisy.programming.kicks-ass.net So I understand the problem with freezing filesystems, but can't we leave the TASK_FREEZABLE in the kthreads? The way I understand it, the power subsystem will first freeze the filesystems before it goes freeze threads anyway. So them remaining freezable should not affect anything, right?
On Tue, Apr 01, 2025 at 04:14:07PM +0200, Peter Zijlstra wrote: > On Tue, Apr 01, 2025 at 02:32:45AM +0200, Christian Brauner wrote: > > The whole shebang can also be found at: > > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze > > > > I know nothing about power or hibernation. I've tested it as best as I > > could. Works for me (TM). > > > > I need to catch some actual sleep now... > > > > --- > > > > Now all the pieces are in place to actually allow the power subsystem to > > freeze/thaw filesystems during suspend/resume. Filesystems are only > > frozen and thawed if the power subsystem does actually own the freeze. > > Urgh, I was relying on all kthreads to be freezable for live-patching: > > https://lkml.kernel.org/r/20250324134909.GA14718@noisy.programming.kicks-ass.net > > So I understand the problem with freezing filesystems, but can't we > leave the TASK_FREEZABLE in the kthreads? The way I understand it, the Yeah, we can. > power subsystem will first freeze the filesystems before it goes freeze > threads anyway. So them remaining freezable should not affect anything, > right? Yes. I've dropped the other patches. I've discussed this later downthread with Jan. >
On Tue, Apr 01, 2025 at 04:40:33PM +0200, Christian Brauner wrote: > On Tue, Apr 01, 2025 at 04:14:07PM +0200, Peter Zijlstra wrote: > > On Tue, Apr 01, 2025 at 02:32:45AM +0200, Christian Brauner wrote: > > > The whole shebang can also be found at: > > > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze > > > > > > I know nothing about power or hibernation. I've tested it as best as I > > > could. Works for me (TM). > > > > > > I need to catch some actual sleep now... > > > > > > --- > > > > > > Now all the pieces are in place to actually allow the power subsystem to > > > freeze/thaw filesystems during suspend/resume. Filesystems are only > > > frozen and thawed if the power subsystem does actually own the freeze. > > > > Urgh, I was relying on all kthreads to be freezable for live-patching: > > > > https://lkml.kernel.org/r/20250324134909.GA14718@noisy.programming.kicks-ass.net > > > > So I understand the problem with freezing filesystems, but can't we > > leave the TASK_FREEZABLE in the kthreads? The way I understand it, the > > Yeah, we can. > > > power subsystem will first freeze the filesystems before it goes freeze > > threads anyway. So them remaining freezable should not affect anything, > > right? > > Yes. I've dropped the other patches. I've discussed this later > downthread with Jan. Thanks!
On Tue 01-04-25 02:32:45, Christian Brauner wrote: > The whole shebang can also be found at: > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze > > I know nothing about power or hibernation. I've tested it as best as I > could. Works for me (TM). > > I need to catch some actual sleep now... > > --- > > Now all the pieces are in place to actually allow the power subsystem to > freeze/thaw filesystems during suspend/resume. Filesystems are only > frozen and thawed if the power subsystem does actually own the freeze. > > Othwerwise it risks thawing filesystems it didn't own. This could be > done differently be e.g., keeping the filesystems that were actually > frozen on a list and then unfreezing them from that list. This is > disgustingly unclean though and reeks of an ugly hack. > > If the filesystem is already frozen by the time we've frozen all > userspace processes we don't care to freeze it again. That's userspace's > job once the process resumes. We only actually freeze filesystems if we > absolutely have to and we ignore other failures to freeze. Hum, I don't follow here. I supposed we'll use FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL for freezing from power subsystem. As far as I remember we have specifically designed nesting of freeze counters so that this way power subsystem can be sure freezing succeeds even if the filesystem is already frozen (by userspace or the kernel) and similarly power subsystem cannot thaw a filesystem frozen by somebody else. It will just drop its freeze refcount... What am I missing? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
On Tue, Apr 01, 2025 at 11:32:49AM +0200, Jan Kara wrote: > On Tue 01-04-25 02:32:45, Christian Brauner wrote: > > The whole shebang can also be found at: > > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze > > > > I know nothing about power or hibernation. I've tested it as best as I > > could. Works for me (TM). > > > > I need to catch some actual sleep now... > > > > --- > > > > Now all the pieces are in place to actually allow the power subsystem to > > freeze/thaw filesystems during suspend/resume. Filesystems are only > > frozen and thawed if the power subsystem does actually own the freeze. > > > > Othwerwise it risks thawing filesystems it didn't own. This could be > > done differently be e.g., keeping the filesystems that were actually > > frozen on a list and then unfreezing them from that list. This is > > disgustingly unclean though and reeks of an ugly hack. > > > > If the filesystem is already frozen by the time we've frozen all > > userspace processes we don't care to freeze it again. That's userspace's > > job once the process resumes. We only actually freeze filesystems if we > > absolutely have to and we ignore other failures to freeze. > > Hum, I don't follow here. I supposed we'll use FREEZE_MAY_NEST | > FREEZE_HOLDER_KERNEL for freezing from power subsystem. As far as I > remember we have specifically designed nesting of freeze counters so that > this way power subsystem can be sure freezing succeeds even if the > filesystem is already frozen (by userspace or the kernel) and similarly > power subsystem cannot thaw a filesystem frozen by somebody else. It will > just drop its freeze refcount... What am I missing? If we have 10 filesystems and suspend/hibernate manges to freeze 5 and then fails on the 6th for whatever odd reason (current or future) then power needs to undo the freeze of the first 5 filesystems. We can't just walk the list again because while it's unlikely that a new filesystem got added in the meantime we still cannot tell what filesystems the power subsystem actually managed to get a freeze reference count on that we need to drop during thaw. There's various ways out of this ugliness. Either we record the filesystems the power subsystem managed to freeze on a temporary list in the callbacks and then walk that list backwards during thaw to undo the freezing or we make sure that the power subsystem just actually exclusively freezes things it can freeze and marking such filesystems as being owned by power for the duration of the suspend or resume cycle. I opted for the latter as that seemed the clean thing to do even if it means more code changes. What are your thoughts on this?
On Tue 01-04-25 15:03:33, Christian Brauner wrote: > On Tue, Apr 01, 2025 at 11:32:49AM +0200, Jan Kara wrote: > > On Tue 01-04-25 02:32:45, Christian Brauner wrote: > > > The whole shebang can also be found at: > > > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze > > > > > > I know nothing about power or hibernation. I've tested it as best as I > > > could. Works for me (TM). > > > > > > I need to catch some actual sleep now... > > > > > > --- > > > > > > Now all the pieces are in place to actually allow the power subsystem to > > > freeze/thaw filesystems during suspend/resume. Filesystems are only > > > frozen and thawed if the power subsystem does actually own the freeze. > > > > > > Othwerwise it risks thawing filesystems it didn't own. This could be > > > done differently be e.g., keeping the filesystems that were actually > > > frozen on a list and then unfreezing them from that list. This is > > > disgustingly unclean though and reeks of an ugly hack. > > > > > > If the filesystem is already frozen by the time we've frozen all > > > userspace processes we don't care to freeze it again. That's userspace's > > > job once the process resumes. We only actually freeze filesystems if we > > > absolutely have to and we ignore other failures to freeze. > > > > Hum, I don't follow here. I supposed we'll use FREEZE_MAY_NEST | > > FREEZE_HOLDER_KERNEL for freezing from power subsystem. As far as I > > remember we have specifically designed nesting of freeze counters so that > > this way power subsystem can be sure freezing succeeds even if the > > filesystem is already frozen (by userspace or the kernel) and similarly > > power subsystem cannot thaw a filesystem frozen by somebody else. It will > > just drop its freeze refcount... What am I missing? > > If we have 10 filesystems and suspend/hibernate manges to freeze 5 and > then fails on the 6th for whatever odd reason (current or future) then > power needs to undo the freeze of the first 5 filesystems. We can't just > walk the list again because while it's unlikely that a new filesystem > got added in the meantime we still cannot tell what filesystems the > power subsystem actually managed to get a freeze reference count on that > we need to drop during thaw. > > There's various ways out of this ugliness. Either we record the > filesystems the power subsystem managed to freeze on a temporary list in > the callbacks and then walk that list backwards during thaw to undo the > freezing or we make sure that the power subsystem just actually > exclusively freezes things it can freeze and marking such filesystems as > being owned by power for the duration of the suspend or resume cycle. I > opted for the latter as that seemed the clean thing to do even if it > means more code changes. What are your thoughts on this? Ah, I see. Thanks for explanation. So failure to freeze filesystem should be rare (mostly only due to IO errors or similar serious issues) hence I'd consider failing hibernation in case we fail to freeze some filesystem appropriate. The function that's walking all superblocks and freezing them could just walk from the superblock where freezing failed towards the end and thaw all filesystems. That way the function also has the nice property that it either freezes everything or keeps things as they were. But you've touched on an interesting case I didn't consider: New superblocks can be added to the end of the list while we are walking it. These superblocks will not be frozen and on resume (or error recovery) this will confuse things. Your "freeze owner" stuff deals with this problem nicely. Somewhat lighter fix for this may be to provide the superblock to start from / end with to these loops iterating and freezing / thawing superblocks. It doesn't seem too hacky but if you prefer your freeze owner approach I won't object. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
Now all the pieces are in place to actually allow the power subsystem to
freeze/thaw filesystems during suspend/resume. Filesystems are only
frozen and thawed if the power subsystem does actually own the freeze.
Othwerwise it risks thawing filesystems it didn't own. This could be
done differently be e.g., keeping the filesystems that were actually
frozen on a list and then unfreezing them from that list. This is
disgustingly unclean though and reeks of an ugly hack.
If the filesystem is already frozen by the time we've frozen all
userspace processes we don't care to freeze it again. That's userspace's
job once the process resumes. We only actually freeze filesystems if we
absolutely have to and we ignore other failures to freeze.
We could bubble up errors and fail suspend/resume if the error isn't
EBUSY (aka it's already frozen) but I don't think that this is worth it.
Filesystem freezing during suspend/resume is best-effort. If the user
has 500 ext4 filesystems mounted and 4 fail to freeze for whatever
reason then we simply skip them.
What we have now is already a big improvement and let's see how we fare
with it before making our lives even harder (and uglier) than we have
to.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v2:
- Drop all patches that remove TASK_FREEZABLE.
- Expand commit messages a bit.
- Link to v1: https://lore.kernel.org/r/20250401-work-freeze-v1-0-d000611d4ab0@kernel.org
---
Christian Brauner (4):
fs: add owner of freeze/thaw
fs: allow all writers to be frozen
power: freeze filesystems during suspend/resume
kernfs: add warning about implementing freeze/thaw
fs/f2fs/gc.c | 6 ++--
fs/gfs2/super.c | 20 ++++++-----
fs/gfs2/sys.c | 4 +--
fs/ioctl.c | 8 ++---
fs/kernfs/mount.c | 15 +++++++++
fs/super.c | 82 ++++++++++++++++++++++++++++++++++++---------
fs/xfs/scrub/fscounters.c | 4 +--
fs/xfs/xfs_notify_failure.c | 6 ++--
include/linux/fs.h | 16 +++++----
kernel/power/hibernate.c | 16 ++++++++-
kernel/power/main.c | 31 +++++++++++++++++
kernel/power/power.h | 4 +++
kernel/power/suspend.c | 7 ++++
13 files changed, 174 insertions(+), 45 deletions(-)
---
base-commit: 62dfd8d59e2d16873398ede5b1835e302df789b3
change-id: 20250401-work-freeze-693b5b5a78e0
Hi, Christian Brauner, Jan Kara and other contributors of this patchset. I did experiments on my laptop, and these experiments show that this patchset does not solve various longstanding problems related to suspend and filesystems. (Even if I enable /sys/power/freeze_filesystems ) Now let me describe problems I had in the past (and still have!) and then experiments I did and their results. So, I had these 3 problems: - Suspend doesn't work if fstrim in progress (note that I use btrfs as root file system) - Suspend doesn't work if scrub in progress - Suspend doesn't work if we try to read from fuse-sshfs filesystem while network is down Let me describe third problem in more detail. To reproduce you need to do this: - Mount remote filesystem using sshfs (it is based on ssh and fuse) - Disable internet - Run command "ls" in that sshfs filesystem (this command will, of course, hang, because network is down) - Then suspend Suspend will not work. Does your patchset supposed to fix these problems? Okay, so just now I was able to reproduce all 3 problems on latest mainline ( f4a40a4282f467ec99745c6ba62cb84346e42139 ), which (as well as I understand) has this patchset applied. I reproduced them with /sys/power/freeze_filesystems set to both 0 and 1 (thus I did 3 * 2 = 6 experiments). I'm available for further testing. -- Askar Safin
Hi! On Sun 20-07-25 22:23:36, Askar Safin wrote: > I did experiments on my laptop, and these experiments show that this > patchset does not solve various longstanding problems related to suspend > and filesystems. (Even if I enable /sys/power/freeze_filesystems ) > > Now let me describe problems I had in the past (and still have!) and then > experiments I did and their results. > > So, I had these 3 problems: > > - Suspend doesn't work if fstrim in progress (note that I use btrfs as > root file system) Right, this is expected because the FITRIM ioctl (syscall as any other) likely takes too long and so the suspend code looses its patience. There's nothing VFS can do about this. You can talk to btrfs developers to periodically check for pending signal / freezing event like e.g. ext4 does in ext4_trim_interrupted() to avoid suspend failures when FITRIM is running. > - Suspend doesn't work if scrub in progress Similar situation as with FITRIM. This is fully in control of the filesystem and unless the filesystem adds checks and early abort paths, VFS cannot do anything. > - Suspend doesn't work if we try to read from fuse-sshfs filesystem while > network is down On the surface the problem is the same as the above two but the details here are subtly different. Here I expect (although I'm not 100% sure) the blocked process is inside the FUSE filesystem waiting for the FUSE daemon to reply (a /proc/<pid>/stack of the blocked process would be useful here). In theory, FUSE filesystem should be able to make the wait for reply in TASK_FREEZABLE state which would fix your issue. In any case this is very likely work for FUSE developers. So I'm sorry but the patch set you speak about isn't supposed to fix any of the above issues you hit. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
On Mon, 21 Jul 2025 at 14:09, Jan Kara <jack@suse.cz> wrote: > > Hi! > > On Sun 20-07-25 22:23:36, Askar Safin wrote: > > - Suspend doesn't work if we try to read from fuse-sshfs filesystem while > > network is down > > On the surface the problem is the same as the above two but the details > here are subtly different. Here I expect (although I'm not 100% sure) the > blocked process is inside the FUSE filesystem waiting for the FUSE daemon > to reply (a /proc/<pid>/stack of the blocked process would be useful here). > In theory, FUSE filesystem should be able to make the wait for reply in > TASK_FREEZABLE state which would fix your issue. In any case this is very > likely work for FUSE developers. This is a known problem with an unknown solution. We can fix some of the cases, but changing all filesystem locks to be freezable is likely not workable. Thanks, Miklos
---- On Mon, 04 Aug 2025 09:31:00 +0400 Miklos Szeredi <miklos@szeredi.hu> wrote --- > This is a known problem with an unknown solution. > > We can fix some of the cases, but changing all filesystem locks to be > freezable is likely not workable. So what to do in case of networked FUSE, such as sshfs? We should put workaround to other place? Where? To libfuse or to each networked FUSE daemon, such as sshfs? I hit this bug in the past, and I'm very angry. A lot of other people hit this FUSE bug, too: https://github.com/systemd/systemd/issues/37590 . What about this timeout solution: https://lore.kernel.org/linux-fsdevel/20250122215528.1270478-1-joannelkoong@gmail.com/ ? Will it work? As well as I understand, currently kernel waits 20 seconds, when it tries to freeze processes when suspending. So, what if we use this Joanne Koong's timeout patch, and set timeout to 19 seconds? So, in short, if we cannot properly fix kernel, then where fix belongs? What should we do? -- Askar Safin https://types.pl/@safinaskar
On (25/08/04 10:02), Askar Safin wrote: > What about this timeout solution: https://lore.kernel.org/linux-fsdevel/20250122215528.1270478-1-joannelkoong@gmail.com/ ? > Will it work? As well as I understand, currently kernel waits 20 seconds, when it tries to freeze processes when suspending. > So, what if we use this Joanne Koong's timeout patch, and set timeout to 19 seconds? I think the problem with this approach is that not all fuse connections are remote (over the network). One can use fuse to mount vfat/ntfs or even archives. Destroying those connections for suspend is unlikely to be loved by users.
For some kernel subsystems it is paramount that they are guaranteed that
they are the owner of the freeze to avoid any risk of deadlocks. This is
the case for the power subsystem. Enable it to recognize whether it did
actually freeze the filesystem.
If userspace has 10 filesystems and suspend/hibernate manges to freeze 5
and then fails on the 6th for whatever odd reason (current or future)
then power needs to undo the freeze of the first 5 filesystems. It can't
just walk the list again because while it's unlikely that a new
filesystem got added in the meantime it still cannot tell which
filesystems the power subsystem actually managed to get a freeze
reference count on that needs to be dropped during thaw.
There's various ways out of this ugliness. For example, record the
filesystems the power subsystem managed to freeze on a temporary list in
the callbacks and then walk that list backwards during thaw to undo the
freezing or make sure that the power subsystem just actually exclusively
freezes things it can freeze and marking such filesystems as being owned
by power for the duration of the suspend or resume cycle. I opted for
the latter as that seemed the clean thing to do even if it means more
code changes.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/f2fs/gc.c | 6 ++--
fs/gfs2/super.c | 20 ++++++------
fs/gfs2/sys.c | 4 +--
fs/ioctl.c | 8 ++---
fs/super.c | 76 ++++++++++++++++++++++++++++++++++++---------
fs/xfs/scrub/fscounters.c | 4 +--
fs/xfs/xfs_notify_failure.c | 6 ++--
include/linux/fs.h | 13 +++++---
8 files changed, 95 insertions(+), 42 deletions(-)
diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index 2b8f9239bede..3e8af62c9e15 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -2271,12 +2271,12 @@ int f2fs_resize_fs(struct file *filp, __u64 block_count)
if (err)
return err;
- err = freeze_super(sbi->sb, FREEZE_HOLDER_USERSPACE);
+ err = freeze_super(sbi->sb, FREEZE_HOLDER_USERSPACE, NULL);
if (err)
return err;
if (f2fs_readonly(sbi->sb)) {
- err = thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE);
+ err = thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE, NULL);
if (err)
return err;
return -EROFS;
@@ -2333,6 +2333,6 @@ int f2fs_resize_fs(struct file *filp, __u64 block_count)
out_err:
f2fs_up_write(&sbi->cp_global_sem);
f2fs_up_write(&sbi->gc_lock);
- thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE);
+ thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE, NULL);
return err;
}
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 44e5658b896c..519943189109 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -674,7 +674,7 @@ static int gfs2_sync_fs(struct super_block *sb, int wait)
return sdp->sd_log_error;
}
-static int gfs2_do_thaw(struct gfs2_sbd *sdp)
+static int gfs2_do_thaw(struct gfs2_sbd *sdp, enum freeze_holder who, const void *freeze_owner)
{
struct super_block *sb = sdp->sd_vfs;
int error;
@@ -682,7 +682,7 @@ static int gfs2_do_thaw(struct gfs2_sbd *sdp)
error = gfs2_freeze_lock_shared(sdp);
if (error)
goto fail;
- error = thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+ error = thaw_super(sb, who, freeze_owner);
if (!error)
return 0;
@@ -703,14 +703,14 @@ void gfs2_freeze_func(struct work_struct *work)
if (test_bit(SDF_FROZEN, &sdp->sd_flags))
goto freeze_failed;
- error = freeze_super(sb, FREEZE_HOLDER_USERSPACE);
+ error = freeze_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
if (error)
goto freeze_failed;
gfs2_freeze_unlock(sdp);
set_bit(SDF_FROZEN, &sdp->sd_flags);
- error = gfs2_do_thaw(sdp);
+ error = gfs2_do_thaw(sdp, FREEZE_HOLDER_USERSPACE, NULL);
if (error)
goto out;
@@ -731,7 +731,8 @@ void gfs2_freeze_func(struct work_struct *work)
*
*/
-static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who)
+static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who,
+ const void *freeze_owner)
{
struct gfs2_sbd *sdp = sb->s_fs_info;
int error;
@@ -744,7 +745,7 @@ static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who)
}
for (;;) {
- error = freeze_super(sb, FREEZE_HOLDER_USERSPACE);
+ error = freeze_super(sb, who, freeze_owner);
if (error) {
fs_info(sdp, "GFS2: couldn't freeze filesystem: %d\n",
error);
@@ -758,7 +759,7 @@ static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who)
break;
}
- error = gfs2_do_thaw(sdp);
+ error = gfs2_do_thaw(sdp, who, freeze_owner);
if (error)
goto out;
@@ -799,7 +800,8 @@ static int gfs2_freeze_fs(struct super_block *sb)
*
*/
-static int gfs2_thaw_super(struct super_block *sb, enum freeze_holder who)
+static int gfs2_thaw_super(struct super_block *sb, enum freeze_holder who,
+ const void *freeze_owner)
{
struct gfs2_sbd *sdp = sb->s_fs_info;
int error;
@@ -814,7 +816,7 @@ static int gfs2_thaw_super(struct super_block *sb, enum freeze_holder who)
atomic_inc(&sb->s_active);
gfs2_freeze_unlock(sdp);
- error = gfs2_do_thaw(sdp);
+ error = gfs2_do_thaw(sdp, who, freeze_owner);
if (!error) {
clear_bit(SDF_FREEZE_INITIATOR, &sdp->sd_flags);
diff --git a/fs/gfs2/sys.c b/fs/gfs2/sys.c
index ecc699f8d9fc..748125653d6c 100644
--- a/fs/gfs2/sys.c
+++ b/fs/gfs2/sys.c
@@ -174,10 +174,10 @@ static ssize_t freeze_store(struct gfs2_sbd *sdp, const char *buf, size_t len)
switch (n) {
case 0:
- error = thaw_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE);
+ error = thaw_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE, NULL);
break;
case 1:
- error = freeze_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE);
+ error = freeze_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE, NULL);
break;
default:
return -EINVAL;
diff --git a/fs/ioctl.c b/fs/ioctl.c
index c91fd2b46a77..bedc83fc2f20 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -396,8 +396,8 @@ static int ioctl_fsfreeze(struct file *filp)
/* Freeze */
if (sb->s_op->freeze_super)
- return sb->s_op->freeze_super(sb, FREEZE_HOLDER_USERSPACE);
- return freeze_super(sb, FREEZE_HOLDER_USERSPACE);
+ return sb->s_op->freeze_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
+ return freeze_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
}
static int ioctl_fsthaw(struct file *filp)
@@ -409,8 +409,8 @@ static int ioctl_fsthaw(struct file *filp)
/* Thaw */
if (sb->s_op->thaw_super)
- return sb->s_op->thaw_super(sb, FREEZE_HOLDER_USERSPACE);
- return thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+ return sb->s_op->thaw_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
+ return thaw_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
}
static int ioctl_file_dedupe_range(struct file *file,
diff --git a/fs/super.c b/fs/super.c
index 3c4a496d6438..3ddded4360c6 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -39,7 +39,8 @@
#include <uapi/linux/mount.h>
#include "internal.h"
-static int thaw_super_locked(struct super_block *sb, enum freeze_holder who);
+static int thaw_super_locked(struct super_block *sb, enum freeze_holder who,
+ const void *freeze_owner);
static LIST_HEAD(super_blocks);
static DEFINE_SPINLOCK(sb_lock);
@@ -1148,7 +1149,7 @@ static void do_thaw_all_callback(struct super_block *sb, void *unused)
if (IS_ENABLED(CONFIG_BLOCK))
while (sb->s_bdev && !bdev_thaw(sb->s_bdev))
pr_warn("Emergency Thaw on %pg\n", sb->s_bdev);
- thaw_super_locked(sb, FREEZE_HOLDER_USERSPACE);
+ thaw_super_locked(sb, FREEZE_HOLDER_USERSPACE, NULL);
return;
}
@@ -1195,9 +1196,9 @@ static void filesystems_freeze_callback(struct super_block *sb, void *unused)
return;
if (sb->s_op->freeze_super)
- sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+ sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
else
- freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+ freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
deactivate_super(sb);
}
@@ -1217,9 +1218,9 @@ static void filesystems_thaw_callback(struct super_block *sb, void *unused)
return;
if (sb->s_op->thaw_super)
- sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+ sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
else
- thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+ thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
deactivate_super(sb);
}
@@ -1522,10 +1523,10 @@ static int fs_bdev_freeze(struct block_device *bdev)
if (sb->s_op->freeze_super)
error = sb->s_op->freeze_super(sb,
- FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
+ FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
else
error = freeze_super(sb,
- FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
+ FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
if (!error)
error = sync_blockdev(bdev);
deactivate_super(sb);
@@ -1571,10 +1572,10 @@ static int fs_bdev_thaw(struct block_device *bdev)
if (sb->s_op->thaw_super)
error = sb->s_op->thaw_super(sb,
- FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
+ FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
else
error = thaw_super(sb,
- FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
+ FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
deactivate_super(sb);
return error;
}
@@ -1946,7 +1947,7 @@ static int wait_for_partially_frozen(struct super_block *sb)
}
#define FREEZE_HOLDERS (FREEZE_HOLDER_KERNEL | FREEZE_HOLDER_USERSPACE)
-#define FREEZE_FLAGS (FREEZE_HOLDERS | FREEZE_MAY_NEST)
+#define FREEZE_FLAGS (FREEZE_HOLDERS | FREEZE_MAY_NEST | FREEZE_EXCL)
static inline int freeze_inc(struct super_block *sb, enum freeze_holder who)
{
@@ -1977,6 +1978,21 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
WARN_ON_ONCE((who & ~FREEZE_FLAGS));
WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
+ if (who & FREEZE_EXCL) {
+ if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
+ return false;
+
+ if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
+ return false;
+
+ return (sb->s_writers.freeze_kcount +
+ sb->s_writers.freeze_ucount) == 0;
+ }
+
+ /* This filesystem is already exclusively frozen. */
+ if (sb->s_writers.freeze_owner)
+ return false;
+
if (who & FREEZE_HOLDER_KERNEL)
return (who & FREEZE_MAY_NEST) ||
sb->s_writers.freeze_kcount == 0;
@@ -1986,10 +2002,30 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
return false;
}
+static inline bool may_unfreeze(struct super_block *sb, enum freeze_holder who,
+ const void *freeze_owner)
+{
+ WARN_ON_ONCE((who & ~FREEZE_FLAGS));
+ WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
+
+ if (who & FREEZE_EXCL) {
+ if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
+ return false;
+ if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
+ return false;
+ if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
+ return false;
+ return sb->s_writers.freeze_owner == freeze_owner;
+ }
+
+ return sb->s_writers.freeze_owner == NULL;
+}
+
/**
* freeze_super - lock the filesystem and force it into a consistent state
* @sb: the super to lock
* @who: context that wants to freeze
+ * @freeze_owner: owner of the freeze
*
* Syncs the super to make sure the filesystem is consistent and calls the fs's
* freeze_fs. Subsequent calls to this without first thawing the fs may return
@@ -2041,7 +2077,7 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
* Return: If the freeze was successful zero is returned. If the freeze
* failed a negative error code is returned.
*/
-int freeze_super(struct super_block *sb, enum freeze_holder who)
+int freeze_super(struct super_block *sb, enum freeze_holder who, const void *freeze_owner)
{
int ret;
@@ -2075,6 +2111,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
if (sb_rdonly(sb)) {
/* Nothing to do really... */
WARN_ON_ONCE(freeze_inc(sb, who) > 1);
+ sb->s_writers.freeze_owner = freeze_owner;
sb->s_writers.frozen = SB_FREEZE_COMPLETE;
wake_up_var(&sb->s_writers.frozen);
super_unlock_excl(sb);
@@ -2122,6 +2159,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
* when frozen is set to SB_FREEZE_COMPLETE, and for thaw_super().
*/
WARN_ON_ONCE(freeze_inc(sb, who) > 1);
+ sb->s_writers.freeze_owner = freeze_owner;
sb->s_writers.frozen = SB_FREEZE_COMPLETE;
wake_up_var(&sb->s_writers.frozen);
lockdep_sb_freeze_release(sb);
@@ -2136,13 +2174,17 @@ EXPORT_SYMBOL(freeze_super);
* removes that state without releasing the other state or unlocking the
* filesystem.
*/
-static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
+static int thaw_super_locked(struct super_block *sb, enum freeze_holder who,
+ const void *freeze_owner)
{
int error = -EINVAL;
if (sb->s_writers.frozen != SB_FREEZE_COMPLETE)
goto out_unlock;
+ if (!may_unfreeze(sb, who, freeze_owner))
+ goto out_unlock;
+
/*
* All freezers share a single active reference.
* So just unlock in case there are any left.
@@ -2152,6 +2194,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
if (sb_rdonly(sb)) {
sb->s_writers.frozen = SB_UNFROZEN;
+ sb->s_writers.freeze_owner = NULL;
wake_up_var(&sb->s_writers.frozen);
goto out_deactivate;
}
@@ -2169,6 +2212,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
}
sb->s_writers.frozen = SB_UNFROZEN;
+ sb->s_writers.freeze_owner = NULL;
wake_up_var(&sb->s_writers.frozen);
sb_freeze_unlock(sb, SB_FREEZE_FS);
out_deactivate:
@@ -2184,6 +2228,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
* thaw_super -- unlock filesystem
* @sb: the super to thaw
* @who: context that wants to freeze
+ * @freeze_owner: owner of the freeze
*
* Unlocks the filesystem and marks it writeable again after freeze_super()
* if there are no remaining freezes on the filesystem.
@@ -2197,13 +2242,14 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
* have been frozen through the block layer via multiple block devices.
* The filesystem remains frozen until all block devices are unfrozen.
*/
-int thaw_super(struct super_block *sb, enum freeze_holder who)
+int thaw_super(struct super_block *sb, enum freeze_holder who,
+ const void *freeze_owner)
{
if (!super_lock_excl(sb)) {
WARN_ON_ONCE("Dying superblock while thawing!");
return -EINVAL;
}
- return thaw_super_locked(sb, who);
+ return thaw_super_locked(sb, who, freeze_owner);
}
EXPORT_SYMBOL(thaw_super);
diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c
index e629663e460a..9b598c5790ad 100644
--- a/fs/xfs/scrub/fscounters.c
+++ b/fs/xfs/scrub/fscounters.c
@@ -123,7 +123,7 @@ xchk_fsfreeze(
{
int error;
- error = freeze_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL);
+ error = freeze_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL, NULL);
trace_xchk_fsfreeze(sc, error);
return error;
}
@@ -135,7 +135,7 @@ xchk_fsthaw(
int error;
/* This should always succeed, we have a kernel freeze */
- error = thaw_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL);
+ error = thaw_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL, NULL);
trace_xchk_fsthaw(sc, error);
return error;
}
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index ed8d8ed42f0a..3545dc1d953c 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -127,7 +127,7 @@ xfs_dax_notify_failure_freeze(
struct super_block *sb = mp->m_super;
int error;
- error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
+ error = freeze_super(sb, FREEZE_HOLDER_KERNEL, NULL);
if (error)
xfs_emerg(mp, "already frozen by kernel, err=%d", error);
@@ -143,7 +143,7 @@ xfs_dax_notify_failure_thaw(
int error;
if (kernel_frozen) {
- error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
+ error = thaw_super(sb, FREEZE_HOLDER_KERNEL, NULL);
if (error)
xfs_emerg(mp, "still frozen after notify failure, err=%d",
error);
@@ -153,7 +153,7 @@ xfs_dax_notify_failure_thaw(
* Also thaw userspace call anyway because the device is about to be
* removed immediately.
*/
- thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+ thaw_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
}
static int
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1aa578412f1b..b379a46b5576 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1307,6 +1307,7 @@ struct sb_writers {
unsigned short frozen; /* Is sb frozen? */
int freeze_kcount; /* How many kernel freeze requests? */
int freeze_ucount; /* How many userspace freeze requests? */
+ const void *freeze_owner; /* Owner of the freeze */
struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS];
};
@@ -2270,6 +2271,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
* @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem
* @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem
* @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed
+ * @FREEZE_EXCL: whether actual freezing must be done by the caller
*
* Indicate who the owner of the freeze or thaw request is and whether
* the freeze needs to be exclusive or can nest.
@@ -2283,6 +2285,7 @@ enum freeze_holder {
FREEZE_HOLDER_KERNEL = (1U << 0),
FREEZE_HOLDER_USERSPACE = (1U << 1),
FREEZE_MAY_NEST = (1U << 2),
+ FREEZE_EXCL = (1U << 3),
};
struct super_operations {
@@ -2296,9 +2299,9 @@ struct super_operations {
void (*evict_inode) (struct inode *);
void (*put_super) (struct super_block *);
int (*sync_fs)(struct super_block *sb, int wait);
- int (*freeze_super) (struct super_block *, enum freeze_holder who);
+ int (*freeze_super) (struct super_block *, enum freeze_holder who, const void *owner);
int (*freeze_fs) (struct super_block *);
- int (*thaw_super) (struct super_block *, enum freeze_holder who);
+ int (*thaw_super) (struct super_block *, enum freeze_holder who, const void *owner);
int (*unfreeze_fs) (struct super_block *);
int (*statfs) (struct dentry *, struct kstatfs *);
int (*remount_fs) (struct super_block *, int *, char *);
@@ -2706,8 +2709,10 @@ extern int unregister_filesystem(struct file_system_type *);
extern int vfs_statfs(const struct path *, struct kstatfs *);
extern int user_statfs(const char __user *, struct kstatfs *);
extern int fd_statfs(int, struct kstatfs *);
-int freeze_super(struct super_block *super, enum freeze_holder who);
-int thaw_super(struct super_block *super, enum freeze_holder who);
+int freeze_super(struct super_block *super, enum freeze_holder who,
+ const void *freeze_owner);
+int thaw_super(struct super_block *super, enum freeze_holder who,
+ const void *freeze_owner);
extern __printf(2, 3)
int super_setup_bdi_name(struct super_block *sb, char *fmt, ...);
extern int super_setup_bdi(struct super_block *sb);
--
2.47.2
On Wed 02-04-25 16:07:31, Christian Brauner wrote:
> For some kernel subsystems it is paramount that they are guaranteed that
> they are the owner of the freeze to avoid any risk of deadlocks. This is
> the case for the power subsystem. Enable it to recognize whether it did
> actually freeze the filesystem.
>
> If userspace has 10 filesystems and suspend/hibernate manges to freeze 5
> and then fails on the 6th for whatever odd reason (current or future)
> then power needs to undo the freeze of the first 5 filesystems. It can't
> just walk the list again because while it's unlikely that a new
> filesystem got added in the meantime it still cannot tell which
> filesystems the power subsystem actually managed to get a freeze
> reference count on that needs to be dropped during thaw.
>
> There's various ways out of this ugliness. For example, record the
> filesystems the power subsystem managed to freeze on a temporary list in
> the callbacks and then walk that list backwards during thaw to undo the
> freezing or make sure that the power subsystem just actually exclusively
> freezes things it can freeze and marking such filesystems as being owned
> by power for the duration of the suspend or resume cycle. I opted for
> the latter as that seemed the clean thing to do even if it means more
> code changes.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
I have realized a slight catch with this approach that if hibernation races
with filesystem freezing (e.g. DM reconfiguration), then hibernation need
not freeze a filesystem because it's already frozen but userspace may thaw
the filesystem before hibernation actually happens (relatively harmless).
If the race happens the other way around, DM reconfiguration may
unexpectedly fail with EBUSY (rather unexpected). So somehow tracking which
fs was frozen by suspend while properly nesting with other freeze users may
be actually a better approach (maybe just a sb flag even though it's
somewhat hacky?).
Honza
> ---
> fs/f2fs/gc.c | 6 ++--
> fs/gfs2/super.c | 20 ++++++------
> fs/gfs2/sys.c | 4 +--
> fs/ioctl.c | 8 ++---
> fs/super.c | 76 ++++++++++++++++++++++++++++++++++++---------
> fs/xfs/scrub/fscounters.c | 4 +--
> fs/xfs/xfs_notify_failure.c | 6 ++--
> include/linux/fs.h | 13 +++++---
> 8 files changed, 95 insertions(+), 42 deletions(-)
>
> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> index 2b8f9239bede..3e8af62c9e15 100644
> --- a/fs/f2fs/gc.c
> +++ b/fs/f2fs/gc.c
> @@ -2271,12 +2271,12 @@ int f2fs_resize_fs(struct file *filp, __u64 block_count)
> if (err)
> return err;
>
> - err = freeze_super(sbi->sb, FREEZE_HOLDER_USERSPACE);
> + err = freeze_super(sbi->sb, FREEZE_HOLDER_USERSPACE, NULL);
> if (err)
> return err;
>
> if (f2fs_readonly(sbi->sb)) {
> - err = thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE);
> + err = thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE, NULL);
> if (err)
> return err;
> return -EROFS;
> @@ -2333,6 +2333,6 @@ int f2fs_resize_fs(struct file *filp, __u64 block_count)
> out_err:
> f2fs_up_write(&sbi->cp_global_sem);
> f2fs_up_write(&sbi->gc_lock);
> - thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE);
> + thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE, NULL);
> return err;
> }
> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> index 44e5658b896c..519943189109 100644
> --- a/fs/gfs2/super.c
> +++ b/fs/gfs2/super.c
> @@ -674,7 +674,7 @@ static int gfs2_sync_fs(struct super_block *sb, int wait)
> return sdp->sd_log_error;
> }
>
> -static int gfs2_do_thaw(struct gfs2_sbd *sdp)
> +static int gfs2_do_thaw(struct gfs2_sbd *sdp, enum freeze_holder who, const void *freeze_owner)
> {
> struct super_block *sb = sdp->sd_vfs;
> int error;
> @@ -682,7 +682,7 @@ static int gfs2_do_thaw(struct gfs2_sbd *sdp)
> error = gfs2_freeze_lock_shared(sdp);
> if (error)
> goto fail;
> - error = thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> + error = thaw_super(sb, who, freeze_owner);
> if (!error)
> return 0;
>
> @@ -703,14 +703,14 @@ void gfs2_freeze_func(struct work_struct *work)
> if (test_bit(SDF_FROZEN, &sdp->sd_flags))
> goto freeze_failed;
>
> - error = freeze_super(sb, FREEZE_HOLDER_USERSPACE);
> + error = freeze_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
> if (error)
> goto freeze_failed;
>
> gfs2_freeze_unlock(sdp);
> set_bit(SDF_FROZEN, &sdp->sd_flags);
>
> - error = gfs2_do_thaw(sdp);
> + error = gfs2_do_thaw(sdp, FREEZE_HOLDER_USERSPACE, NULL);
> if (error)
> goto out;
>
> @@ -731,7 +731,8 @@ void gfs2_freeze_func(struct work_struct *work)
> *
> */
>
> -static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who)
> +static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who,
> + const void *freeze_owner)
> {
> struct gfs2_sbd *sdp = sb->s_fs_info;
> int error;
> @@ -744,7 +745,7 @@ static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who)
> }
>
> for (;;) {
> - error = freeze_super(sb, FREEZE_HOLDER_USERSPACE);
> + error = freeze_super(sb, who, freeze_owner);
> if (error) {
> fs_info(sdp, "GFS2: couldn't freeze filesystem: %d\n",
> error);
> @@ -758,7 +759,7 @@ static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who)
> break;
> }
>
> - error = gfs2_do_thaw(sdp);
> + error = gfs2_do_thaw(sdp, who, freeze_owner);
> if (error)
> goto out;
>
> @@ -799,7 +800,8 @@ static int gfs2_freeze_fs(struct super_block *sb)
> *
> */
>
> -static int gfs2_thaw_super(struct super_block *sb, enum freeze_holder who)
> +static int gfs2_thaw_super(struct super_block *sb, enum freeze_holder who,
> + const void *freeze_owner)
> {
> struct gfs2_sbd *sdp = sb->s_fs_info;
> int error;
> @@ -814,7 +816,7 @@ static int gfs2_thaw_super(struct super_block *sb, enum freeze_holder who)
> atomic_inc(&sb->s_active);
> gfs2_freeze_unlock(sdp);
>
> - error = gfs2_do_thaw(sdp);
> + error = gfs2_do_thaw(sdp, who, freeze_owner);
>
> if (!error) {
> clear_bit(SDF_FREEZE_INITIATOR, &sdp->sd_flags);
> diff --git a/fs/gfs2/sys.c b/fs/gfs2/sys.c
> index ecc699f8d9fc..748125653d6c 100644
> --- a/fs/gfs2/sys.c
> +++ b/fs/gfs2/sys.c
> @@ -174,10 +174,10 @@ static ssize_t freeze_store(struct gfs2_sbd *sdp, const char *buf, size_t len)
>
> switch (n) {
> case 0:
> - error = thaw_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE);
> + error = thaw_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE, NULL);
> break;
> case 1:
> - error = freeze_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE);
> + error = freeze_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE, NULL);
> break;
> default:
> return -EINVAL;
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index c91fd2b46a77..bedc83fc2f20 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -396,8 +396,8 @@ static int ioctl_fsfreeze(struct file *filp)
>
> /* Freeze */
> if (sb->s_op->freeze_super)
> - return sb->s_op->freeze_super(sb, FREEZE_HOLDER_USERSPACE);
> - return freeze_super(sb, FREEZE_HOLDER_USERSPACE);
> + return sb->s_op->freeze_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
> + return freeze_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
> }
>
> static int ioctl_fsthaw(struct file *filp)
> @@ -409,8 +409,8 @@ static int ioctl_fsthaw(struct file *filp)
>
> /* Thaw */
> if (sb->s_op->thaw_super)
> - return sb->s_op->thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> - return thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> + return sb->s_op->thaw_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
> + return thaw_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
> }
>
> static int ioctl_file_dedupe_range(struct file *file,
> diff --git a/fs/super.c b/fs/super.c
> index 3c4a496d6438..3ddded4360c6 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -39,7 +39,8 @@
> #include <uapi/linux/mount.h>
> #include "internal.h"
>
> -static int thaw_super_locked(struct super_block *sb, enum freeze_holder who);
> +static int thaw_super_locked(struct super_block *sb, enum freeze_holder who,
> + const void *freeze_owner);
>
> static LIST_HEAD(super_blocks);
> static DEFINE_SPINLOCK(sb_lock);
> @@ -1148,7 +1149,7 @@ static void do_thaw_all_callback(struct super_block *sb, void *unused)
> if (IS_ENABLED(CONFIG_BLOCK))
> while (sb->s_bdev && !bdev_thaw(sb->s_bdev))
> pr_warn("Emergency Thaw on %pg\n", sb->s_bdev);
> - thaw_super_locked(sb, FREEZE_HOLDER_USERSPACE);
> + thaw_super_locked(sb, FREEZE_HOLDER_USERSPACE, NULL);
> return;
> }
>
> @@ -1195,9 +1196,9 @@ static void filesystems_freeze_callback(struct super_block *sb, void *unused)
> return;
>
> if (sb->s_op->freeze_super)
> - sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
> + sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
> else
> - freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
> + freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
>
> deactivate_super(sb);
> }
> @@ -1217,9 +1218,9 @@ static void filesystems_thaw_callback(struct super_block *sb, void *unused)
> return;
>
> if (sb->s_op->thaw_super)
> - sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
> + sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
> else
> - thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
> + thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
>
> deactivate_super(sb);
> }
> @@ -1522,10 +1523,10 @@ static int fs_bdev_freeze(struct block_device *bdev)
>
> if (sb->s_op->freeze_super)
> error = sb->s_op->freeze_super(sb,
> - FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
> + FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
> else
> error = freeze_super(sb,
> - FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
> + FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
> if (!error)
> error = sync_blockdev(bdev);
> deactivate_super(sb);
> @@ -1571,10 +1572,10 @@ static int fs_bdev_thaw(struct block_device *bdev)
>
> if (sb->s_op->thaw_super)
> error = sb->s_op->thaw_super(sb,
> - FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
> + FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
> else
> error = thaw_super(sb,
> - FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
> + FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
> deactivate_super(sb);
> return error;
> }
> @@ -1946,7 +1947,7 @@ static int wait_for_partially_frozen(struct super_block *sb)
> }
>
> #define FREEZE_HOLDERS (FREEZE_HOLDER_KERNEL | FREEZE_HOLDER_USERSPACE)
> -#define FREEZE_FLAGS (FREEZE_HOLDERS | FREEZE_MAY_NEST)
> +#define FREEZE_FLAGS (FREEZE_HOLDERS | FREEZE_MAY_NEST | FREEZE_EXCL)
>
> static inline int freeze_inc(struct super_block *sb, enum freeze_holder who)
> {
> @@ -1977,6 +1978,21 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
> WARN_ON_ONCE((who & ~FREEZE_FLAGS));
> WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
>
> + if (who & FREEZE_EXCL) {
> + if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
> + return false;
> +
> + if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
> + return false;
> +
> + return (sb->s_writers.freeze_kcount +
> + sb->s_writers.freeze_ucount) == 0;
> + }
> +
> + /* This filesystem is already exclusively frozen. */
> + if (sb->s_writers.freeze_owner)
> + return false;
> +
> if (who & FREEZE_HOLDER_KERNEL)
> return (who & FREEZE_MAY_NEST) ||
> sb->s_writers.freeze_kcount == 0;
> @@ -1986,10 +2002,30 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
> return false;
> }
>
> +static inline bool may_unfreeze(struct super_block *sb, enum freeze_holder who,
> + const void *freeze_owner)
> +{
> + WARN_ON_ONCE((who & ~FREEZE_FLAGS));
> + WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
> +
> + if (who & FREEZE_EXCL) {
> + if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
> + return false;
> + if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
> + return false;
> + if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
> + return false;
> + return sb->s_writers.freeze_owner == freeze_owner;
> + }
> +
> + return sb->s_writers.freeze_owner == NULL;
> +}
> +
> /**
> * freeze_super - lock the filesystem and force it into a consistent state
> * @sb: the super to lock
> * @who: context that wants to freeze
> + * @freeze_owner: owner of the freeze
> *
> * Syncs the super to make sure the filesystem is consistent and calls the fs's
> * freeze_fs. Subsequent calls to this without first thawing the fs may return
> @@ -2041,7 +2077,7 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
> * Return: If the freeze was successful zero is returned. If the freeze
> * failed a negative error code is returned.
> */
> -int freeze_super(struct super_block *sb, enum freeze_holder who)
> +int freeze_super(struct super_block *sb, enum freeze_holder who, const void *freeze_owner)
> {
> int ret;
>
> @@ -2075,6 +2111,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
> if (sb_rdonly(sb)) {
> /* Nothing to do really... */
> WARN_ON_ONCE(freeze_inc(sb, who) > 1);
> + sb->s_writers.freeze_owner = freeze_owner;
> sb->s_writers.frozen = SB_FREEZE_COMPLETE;
> wake_up_var(&sb->s_writers.frozen);
> super_unlock_excl(sb);
> @@ -2122,6 +2159,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
> * when frozen is set to SB_FREEZE_COMPLETE, and for thaw_super().
> */
> WARN_ON_ONCE(freeze_inc(sb, who) > 1);
> + sb->s_writers.freeze_owner = freeze_owner;
> sb->s_writers.frozen = SB_FREEZE_COMPLETE;
> wake_up_var(&sb->s_writers.frozen);
> lockdep_sb_freeze_release(sb);
> @@ -2136,13 +2174,17 @@ EXPORT_SYMBOL(freeze_super);
> * removes that state without releasing the other state or unlocking the
> * filesystem.
> */
> -static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
> +static int thaw_super_locked(struct super_block *sb, enum freeze_holder who,
> + const void *freeze_owner)
> {
> int error = -EINVAL;
>
> if (sb->s_writers.frozen != SB_FREEZE_COMPLETE)
> goto out_unlock;
>
> + if (!may_unfreeze(sb, who, freeze_owner))
> + goto out_unlock;
> +
> /*
> * All freezers share a single active reference.
> * So just unlock in case there are any left.
> @@ -2152,6 +2194,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
>
> if (sb_rdonly(sb)) {
> sb->s_writers.frozen = SB_UNFROZEN;
> + sb->s_writers.freeze_owner = NULL;
> wake_up_var(&sb->s_writers.frozen);
> goto out_deactivate;
> }
> @@ -2169,6 +2212,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
> }
>
> sb->s_writers.frozen = SB_UNFROZEN;
> + sb->s_writers.freeze_owner = NULL;
> wake_up_var(&sb->s_writers.frozen);
> sb_freeze_unlock(sb, SB_FREEZE_FS);
> out_deactivate:
> @@ -2184,6 +2228,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
> * thaw_super -- unlock filesystem
> * @sb: the super to thaw
> * @who: context that wants to freeze
> + * @freeze_owner: owner of the freeze
> *
> * Unlocks the filesystem and marks it writeable again after freeze_super()
> * if there are no remaining freezes on the filesystem.
> @@ -2197,13 +2242,14 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
> * have been frozen through the block layer via multiple block devices.
> * The filesystem remains frozen until all block devices are unfrozen.
> */
> -int thaw_super(struct super_block *sb, enum freeze_holder who)
> +int thaw_super(struct super_block *sb, enum freeze_holder who,
> + const void *freeze_owner)
> {
> if (!super_lock_excl(sb)) {
> WARN_ON_ONCE("Dying superblock while thawing!");
> return -EINVAL;
> }
> - return thaw_super_locked(sb, who);
> + return thaw_super_locked(sb, who, freeze_owner);
> }
> EXPORT_SYMBOL(thaw_super);
>
> diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c
> index e629663e460a..9b598c5790ad 100644
> --- a/fs/xfs/scrub/fscounters.c
> +++ b/fs/xfs/scrub/fscounters.c
> @@ -123,7 +123,7 @@ xchk_fsfreeze(
> {
> int error;
>
> - error = freeze_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL);
> + error = freeze_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL, NULL);
> trace_xchk_fsfreeze(sc, error);
> return error;
> }
> @@ -135,7 +135,7 @@ xchk_fsthaw(
> int error;
>
> /* This should always succeed, we have a kernel freeze */
> - error = thaw_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL);
> + error = thaw_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL, NULL);
> trace_xchk_fsthaw(sc, error);
> return error;
> }
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index ed8d8ed42f0a..3545dc1d953c 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -127,7 +127,7 @@ xfs_dax_notify_failure_freeze(
> struct super_block *sb = mp->m_super;
> int error;
>
> - error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
> + error = freeze_super(sb, FREEZE_HOLDER_KERNEL, NULL);
> if (error)
> xfs_emerg(mp, "already frozen by kernel, err=%d", error);
>
> @@ -143,7 +143,7 @@ xfs_dax_notify_failure_thaw(
> int error;
>
> if (kernel_frozen) {
> - error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> + error = thaw_super(sb, FREEZE_HOLDER_KERNEL, NULL);
> if (error)
> xfs_emerg(mp, "still frozen after notify failure, err=%d",
> error);
> @@ -153,7 +153,7 @@ xfs_dax_notify_failure_thaw(
> * Also thaw userspace call anyway because the device is about to be
> * removed immediately.
> */
> - thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> + thaw_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
> }
>
> static int
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 1aa578412f1b..b379a46b5576 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1307,6 +1307,7 @@ struct sb_writers {
> unsigned short frozen; /* Is sb frozen? */
> int freeze_kcount; /* How many kernel freeze requests? */
> int freeze_ucount; /* How many userspace freeze requests? */
> + const void *freeze_owner; /* Owner of the freeze */
> struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS];
> };
>
> @@ -2270,6 +2271,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
> * @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem
> * @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem
> * @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed
> + * @FREEZE_EXCL: whether actual freezing must be done by the caller
> *
> * Indicate who the owner of the freeze or thaw request is and whether
> * the freeze needs to be exclusive or can nest.
> @@ -2283,6 +2285,7 @@ enum freeze_holder {
> FREEZE_HOLDER_KERNEL = (1U << 0),
> FREEZE_HOLDER_USERSPACE = (1U << 1),
> FREEZE_MAY_NEST = (1U << 2),
> + FREEZE_EXCL = (1U << 3),
> };
>
> struct super_operations {
> @@ -2296,9 +2299,9 @@ struct super_operations {
> void (*evict_inode) (struct inode *);
> void (*put_super) (struct super_block *);
> int (*sync_fs)(struct super_block *sb, int wait);
> - int (*freeze_super) (struct super_block *, enum freeze_holder who);
> + int (*freeze_super) (struct super_block *, enum freeze_holder who, const void *owner);
> int (*freeze_fs) (struct super_block *);
> - int (*thaw_super) (struct super_block *, enum freeze_holder who);
> + int (*thaw_super) (struct super_block *, enum freeze_holder who, const void *owner);
> int (*unfreeze_fs) (struct super_block *);
> int (*statfs) (struct dentry *, struct kstatfs *);
> int (*remount_fs) (struct super_block *, int *, char *);
> @@ -2706,8 +2709,10 @@ extern int unregister_filesystem(struct file_system_type *);
> extern int vfs_statfs(const struct path *, struct kstatfs *);
> extern int user_statfs(const char __user *, struct kstatfs *);
> extern int fd_statfs(int, struct kstatfs *);
> -int freeze_super(struct super_block *super, enum freeze_holder who);
> -int thaw_super(struct super_block *super, enum freeze_holder who);
> +int freeze_super(struct super_block *super, enum freeze_holder who,
> + const void *freeze_owner);
> +int thaw_super(struct super_block *super, enum freeze_holder who,
> + const void *freeze_owner);
> extern __printf(2, 3)
> int super_setup_bdi_name(struct super_block *sb, char *fmt, ...);
> extern int super_setup_bdi(struct super_block *sb);
>
> --
> 2.47.2
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Thu, Apr 03, 2025 at 04:56:57PM +0200, Jan Kara wrote: > On Wed 02-04-25 16:07:31, Christian Brauner wrote: > > For some kernel subsystems it is paramount that they are guaranteed that > > they are the owner of the freeze to avoid any risk of deadlocks. This is > > the case for the power subsystem. Enable it to recognize whether it did > > actually freeze the filesystem. > > > > If userspace has 10 filesystems and suspend/hibernate manges to freeze 5 > > and then fails on the 6th for whatever odd reason (current or future) > > then power needs to undo the freeze of the first 5 filesystems. It can't > > just walk the list again because while it's unlikely that a new > > filesystem got added in the meantime it still cannot tell which > > filesystems the power subsystem actually managed to get a freeze > > reference count on that needs to be dropped during thaw. > > > > There's various ways out of this ugliness. For example, record the > > filesystems the power subsystem managed to freeze on a temporary list in > > the callbacks and then walk that list backwards during thaw to undo the > > freezing or make sure that the power subsystem just actually exclusively > > freezes things it can freeze and marking such filesystems as being owned > > by power for the duration of the suspend or resume cycle. I opted for > > the latter as that seemed the clean thing to do even if it means more > > code changes. > > > > Signed-off-by: Christian Brauner <brauner@kernel.org> > > I have realized a slight catch with this approach that if hibernation races > with filesystem freezing (e.g. DM reconfiguration), then hibernation need > not freeze a filesystem because it's already frozen but userspace may thaw > the filesystem before hibernation actually happens (relatively harmless). > If the race happens the other way around, DM reconfiguration may > unexpectedly fail with EBUSY (rather unexpected). So somehow tracking which > fs was frozen by suspend while properly nesting with other freeze users may > be actually a better approach (maybe just a sb flag even though it's > somewhat hacky?). The approach that I originally had was to add FREEZE_POWER which adds a simple boolean into the sb_writers instead of a holder and then this simply nests with the rest. I'll try to post that diff tomorrow.
If hibernation races with filesystem freezing (e.g. DM reconfiguration),
then hibernation need not freeze a filesystem because it's already
frozen but userspace may thaw the filesystem before hibernation actually
happens.
If the race happens the other way around, DM reconfiguration may
unexpectedly fail with EBUSY.
So allow FREEZE_EXCL to nest with other holders. An exclusive freezer
cannot be undone by any of the other concurrent freezers.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/super.c | 71 ++++++++++++++++++++++++++++++++++++++++++------------
include/linux/fs.h | 2 +-
2 files changed, 56 insertions(+), 17 deletions(-)
diff --git a/fs/super.c b/fs/super.c
index b4bdbc509dba..e2fee655fbed 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1979,26 +1979,34 @@ static inline int freeze_dec(struct super_block *sb, enum freeze_holder who)
return sb->s_writers.freeze_kcount + sb->s_writers.freeze_ucount;
}
-static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
+static inline bool may_freeze(struct super_block *sb, enum freeze_holder who,
+ const void *freeze_owner)
{
+ lockdep_assert_held(&sb->s_umount);
+
WARN_ON_ONCE((who & ~FREEZE_FLAGS));
WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
if (who & FREEZE_EXCL) {
if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
return false;
-
- if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
+ if (WARN_ON_ONCE(who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL)))
return false;
-
- return (sb->s_writers.freeze_kcount +
- sb->s_writers.freeze_ucount) == 0;
+ if (WARN_ON_ONCE(!freeze_owner))
+ return false;
+ /* This freeze already has a specific owner. */
+ if (sb->s_writers.freeze_owner)
+ return false;
+ /*
+ * This is already frozen multiple times so we're just
+ * going to take a reference count and mark it as
+ * belonging to use.
+ */
+ if (sb->s_writers.freeze_kcount + sb->s_writers.freeze_ucount)
+ sb->s_writers.freeze_owner = freeze_owner;
+ return true;
}
- /* This filesystem is already exclusively frozen. */
- if (sb->s_writers.freeze_owner)
- return false;
-
if (who & FREEZE_HOLDER_KERNEL)
return (who & FREEZE_MAY_NEST) ||
sb->s_writers.freeze_kcount == 0;
@@ -2011,20 +2019,51 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
static inline bool may_unfreeze(struct super_block *sb, enum freeze_holder who,
const void *freeze_owner)
{
+ lockdep_assert_held(&sb->s_umount);
+
WARN_ON_ONCE((who & ~FREEZE_FLAGS));
WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
if (who & FREEZE_EXCL) {
- if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
- return false;
if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
return false;
- if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
+ if (WARN_ON_ONCE(who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL)))
+ return false;
+ if (WARN_ON_ONCE(!freeze_owner))
+ return false;
+ if (WARN_ON_ONCE(sb->s_writers.freeze_kcount == 0))
return false;
- return sb->s_writers.freeze_owner == freeze_owner;
+ /* This isn't exclusively frozen. */
+ if (!sb->s_writers.freeze_owner)
+ return false;
+ /* This isn't exclusively frozen by us. */
+ if (sb->s_writers.freeze_owner != freeze_owner)
+ return false;
+ /*
+ * This is still frozen multiple times so we're just
+ * going to drop our reference count and undo our
+ * exclusive freeze.
+ */
+ if ((sb->s_writers.freeze_kcount + sb->s_writers.freeze_ucount) > 1)
+ sb->s_writers.freeze_owner = NULL;
+ return true;
+ }
+
+ if (who & FREEZE_HOLDER_KERNEL) {
+ /*
+ * Someone's trying to steal the reference belonging to
+ * @sb->s_writers.freeze_owner.
+ */
+ if (sb->s_writers.freeze_kcount == 1 &&
+ sb->s_writers.freeze_owner)
+ return false;
+ return sb->s_writers.freeze_kcount > 0;
}
- return sb->s_writers.freeze_owner == NULL;
+ if (who & FREEZE_HOLDER_USERSPACE)
+ return sb->s_writers.freeze_ucount > 0;
+
+ return false;
}
/**
@@ -2095,7 +2134,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who, const void *fre
retry:
if (sb->s_writers.frozen == SB_FREEZE_COMPLETE) {
- if (may_freeze(sb, who))
+ if (may_freeze(sb, who, freeze_owner))
ret = !!WARN_ON_ONCE(freeze_inc(sb, who) == 1);
else
ret = -EBUSY;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1edcba3cd68e..7a3f821d2723 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2270,7 +2270,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
* @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem
* @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem
* @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed
- * @FREEZE_EXCL: whether actual freezing must be done by the caller
+ * @FREEZE_EXCL: a freeze that can only be undone by the owner
*
* Indicate who the owner of the freeze or thaw request is and whether
* the freeze needs to be exclusive or can nest.
---
base-commit: a83fe97e0d53f7d2b0fc62fd9a322a963cb30306
change-id: 20250404-work-freeze-5eacb515f044
On Fri 04-04-25 12:24:09, Christian Brauner wrote:
> If hibernation races with filesystem freezing (e.g. DM reconfiguration),
> then hibernation need not freeze a filesystem because it's already
> frozen but userspace may thaw the filesystem before hibernation actually
> happens.
>
> If the race happens the other way around, DM reconfiguration may
> unexpectedly fail with EBUSY.
>
> So allow FREEZE_EXCL to nest with other holders. An exclusive freezer
> cannot be undone by any of the other concurrent freezers.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
This has fallen through the cracks in my inbox but the patch now looks good
to me. Maybe we should fold it into "fs: add owner of freeze/thaw" to not
have strange intermediate state in the series?
Honza
> ---
> fs/super.c | 71 ++++++++++++++++++++++++++++++++++++++++++------------
> include/linux/fs.h | 2 +-
> 2 files changed, 56 insertions(+), 17 deletions(-)
>
> diff --git a/fs/super.c b/fs/super.c
> index b4bdbc509dba..e2fee655fbed 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1979,26 +1979,34 @@ static inline int freeze_dec(struct super_block *sb, enum freeze_holder who)
> return sb->s_writers.freeze_kcount + sb->s_writers.freeze_ucount;
> }
>
> -static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
> +static inline bool may_freeze(struct super_block *sb, enum freeze_holder who,
> + const void *freeze_owner)
> {
> + lockdep_assert_held(&sb->s_umount);
> +
> WARN_ON_ONCE((who & ~FREEZE_FLAGS));
> WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
>
> if (who & FREEZE_EXCL) {
> if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
> return false;
> -
> - if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
> + if (WARN_ON_ONCE(who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL)))
> return false;
> -
> - return (sb->s_writers.freeze_kcount +
> - sb->s_writers.freeze_ucount) == 0;
> + if (WARN_ON_ONCE(!freeze_owner))
> + return false;
> + /* This freeze already has a specific owner. */
> + if (sb->s_writers.freeze_owner)
> + return false;
> + /*
> + * This is already frozen multiple times so we're just
> + * going to take a reference count and mark it as
> + * belonging to use.
> + */
> + if (sb->s_writers.freeze_kcount + sb->s_writers.freeze_ucount)
> + sb->s_writers.freeze_owner = freeze_owner;
> + return true;
> }
>
> - /* This filesystem is already exclusively frozen. */
> - if (sb->s_writers.freeze_owner)
> - return false;
> -
> if (who & FREEZE_HOLDER_KERNEL)
> return (who & FREEZE_MAY_NEST) ||
> sb->s_writers.freeze_kcount == 0;
> @@ -2011,20 +2019,51 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
> static inline bool may_unfreeze(struct super_block *sb, enum freeze_holder who,
> const void *freeze_owner)
> {
> + lockdep_assert_held(&sb->s_umount);
> +
> WARN_ON_ONCE((who & ~FREEZE_FLAGS));
> WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
>
> if (who & FREEZE_EXCL) {
> - if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
> - return false;
> if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
> return false;
> - if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
> + if (WARN_ON_ONCE(who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL)))
> + return false;
> + if (WARN_ON_ONCE(!freeze_owner))
> + return false;
> + if (WARN_ON_ONCE(sb->s_writers.freeze_kcount == 0))
> return false;
> - return sb->s_writers.freeze_owner == freeze_owner;
> + /* This isn't exclusively frozen. */
> + if (!sb->s_writers.freeze_owner)
> + return false;
> + /* This isn't exclusively frozen by us. */
> + if (sb->s_writers.freeze_owner != freeze_owner)
> + return false;
> + /*
> + * This is still frozen multiple times so we're just
> + * going to drop our reference count and undo our
> + * exclusive freeze.
> + */
> + if ((sb->s_writers.freeze_kcount + sb->s_writers.freeze_ucount) > 1)
> + sb->s_writers.freeze_owner = NULL;
> + return true;
> + }
> +
> + if (who & FREEZE_HOLDER_KERNEL) {
> + /*
> + * Someone's trying to steal the reference belonging to
> + * @sb->s_writers.freeze_owner.
> + */
> + if (sb->s_writers.freeze_kcount == 1 &&
> + sb->s_writers.freeze_owner)
> + return false;
> + return sb->s_writers.freeze_kcount > 0;
> }
>
> - return sb->s_writers.freeze_owner == NULL;
> + if (who & FREEZE_HOLDER_USERSPACE)
> + return sb->s_writers.freeze_ucount > 0;
> +
> + return false;
> }
>
> /**
> @@ -2095,7 +2134,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who, const void *fre
>
> retry:
> if (sb->s_writers.frozen == SB_FREEZE_COMPLETE) {
> - if (may_freeze(sb, who))
> + if (may_freeze(sb, who, freeze_owner))
> ret = !!WARN_ON_ONCE(freeze_inc(sb, who) == 1);
> else
> ret = -EBUSY;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 1edcba3cd68e..7a3f821d2723 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2270,7 +2270,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
> * @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem
> * @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem
> * @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed
> - * @FREEZE_EXCL: whether actual freezing must be done by the caller
> + * @FREEZE_EXCL: a freeze that can only be undone by the owner
> *
> * Indicate who the owner of the freeze or thaw request is and whether
> * the freeze needs to be exclusive or can nest.
>
> ---
> base-commit: a83fe97e0d53f7d2b0fc62fd9a322a963cb30306
> change-id: 20250404-work-freeze-5eacb515f044
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Wed, May 07, 2025 at 01:18:34PM +0200, Jan Kara wrote: > On Fri 04-04-25 12:24:09, Christian Brauner wrote: > > If hibernation races with filesystem freezing (e.g. DM reconfiguration), > > then hibernation need not freeze a filesystem because it's already > > frozen but userspace may thaw the filesystem before hibernation actually > > happens. > > > > If the race happens the other way around, DM reconfiguration may > > unexpectedly fail with EBUSY. > > > > So allow FREEZE_EXCL to nest with other holders. An exclusive freezer > > cannot be undone by any of the other concurrent freezers. > > > > Signed-off-by: Christian Brauner <brauner@kernel.org> > > This has fallen through the cracks in my inbox but the patch now looks good > to me. Maybe we should fold it into "fs: add owner of freeze/thaw" to not > have strange intermediate state in the series? Done.
On Fri, Apr 04, 2025 at 12:24:09PM +0200, Christian Brauner wrote: > If hibernation races with filesystem freezing (e.g. DM reconfiguration), > then hibernation need not freeze a filesystem because it's already > frozen but userspace may thaw the filesystem before hibernation actually > happens. > > If the race happens the other way around, DM reconfiguration may > unexpectedly fail with EBUSY. > > So allow FREEZE_EXCL to nest with other holders. An exclusive freezer > cannot be undone by any of the other concurrent freezers. What is FREEZE_EXCL? I can't find it anywhere including in linux-next.
During freeze/thaw we need to be able to freeze all writers during
suspend/hibernate. Otherwise tasks such as systemd-journald that mmap a
file and write to it will not be frozen after we've already frozen the
filesystem.
This has some risk of not being able to freeze processes in case a
process has acquired SB_FREEZE_PAGEFAULT under mmap_sem or
SB_FREEZE_INTERNAL under some other filesytem specific lock. If the
filesystem is frozen, a task can block on the frozen filesystem with
e.g., mmap_sem held. If some other task then blocks on grabbing that
mmap_sem, hibernation ill fail because it is unable to hibernate a task
holding mmap_sem. This could be fixed by making a range of filesystem
related locks use freezable sleeping. That's impractical and not
warranted just for suspend/hibernate. Assume that this is an infrequent
problem and we've given userspace a way to skip filesystem freezing
through a sysfs file.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
include/linux/fs.h | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b379a46b5576..1edcba3cd68e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1781,8 +1781,7 @@ static inline void __sb_end_write(struct super_block *sb, int level)
static inline void __sb_start_write(struct super_block *sb, int level)
{
- percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
- level == SB_FREEZE_WRITE);
+ percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1, true);
}
static inline bool __sb_start_write_trylock(struct super_block *sb, int level)
--
2.47.2
On Wed 02-04-25 16:07:32, Christian Brauner wrote:
> During freeze/thaw we need to be able to freeze all writers during
> suspend/hibernate. Otherwise tasks such as systemd-journald that mmap a
> file and write to it will not be frozen after we've already frozen the
> filesystem.
>
> This has some risk of not being able to freeze processes in case a
> process has acquired SB_FREEZE_PAGEFAULT under mmap_sem or
> SB_FREEZE_INTERNAL under some other filesytem specific lock. If the
> filesystem is frozen, a task can block on the frozen filesystem with
> e.g., mmap_sem held. If some other task then blocks on grabbing that
> mmap_sem, hibernation ill fail because it is unable to hibernate a task
> holding mmap_sem. This could be fixed by making a range of filesystem
> related locks use freezable sleeping. That's impractical and not
> warranted just for suspend/hibernate. Assume that this is an infrequent
> problem and we've given userspace a way to skip filesystem freezing
> through a sysfs file.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
BTW, I agree about the need to silence lockdep somehow.
Honza
> ---
> include/linux/fs.h | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b379a46b5576..1edcba3cd68e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1781,8 +1781,7 @@ static inline void __sb_end_write(struct super_block *sb, int level)
>
> static inline void __sb_start_write(struct super_block *sb, int level)
> {
> - percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
> - level == SB_FREEZE_WRITE);
> + percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1, true);
> }
>
> static inline bool __sb_start_write_trylock(struct super_block *sb, int level)
>
> --
> 2.47.2
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Wed, Apr 02, 2025 at 04:07:32PM +0200, Christian Brauner wrote:
> During freeze/thaw we need to be able to freeze all writers during
> suspend/hibernate. Otherwise tasks such as systemd-journald that mmap a
> file and write to it will not be frozen after we've already frozen the
> filesystem.
>
> This has some risk of not being able to freeze processes in case a
> process has acquired SB_FREEZE_PAGEFAULT under mmap_sem or
> SB_FREEZE_INTERNAL under some other filesytem specific lock. If the
> filesystem is frozen, a task can block on the frozen filesystem with
> e.g., mmap_sem held. If some other task then blocks on grabbing that
> mmap_sem, hibernation ill fail because it is unable to hibernate a task
> holding mmap_sem. This could be fixed by making a range of filesystem
> related locks use freezable sleeping. That's impractical and not
> warranted just for suspend/hibernate. Assume that this is an infrequent
> problem and we've given userspace a way to skip filesystem freezing
> through a sysfs file.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> include/linux/fs.h | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b379a46b5576..1edcba3cd68e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1781,8 +1781,7 @@ static inline void __sb_end_write(struct super_block *sb, int level)
>
> static inline void __sb_start_write(struct super_block *sb, int level)
> {
> - percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
> - level == SB_FREEZE_WRITE);
> + percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1, true);
> }
Jan, one more thought about freezability here. We know that there will
can be at least one process during hibernation that ends up generating
page faults and that's systemd-journald. When systemd-sleep requests
writing a hibernation image via /sys/power/ files it will inevitably end
up freezing systemd-journald and it may be generating a page fault with
->mmap_lock held. systemd-journald is now sleeping with
SB_FREEZE_PAGEFAULT and TASK_FREEZABLE. We know this can cause
hibernation to fail. That part is fine. What isn't is that we will very
likely always trigger:
#ifdef CONFIG_LOCKDEP
/*
* It's dangerous to freeze with locks held; there be dragons there.
*/
if (!(state & __TASK_FREEZABLE_UNSAFE))
WARN_ON_ONCE(debug_locks && p->lockdep_depth);
#endif
with lockdep enabled.
So we really actually need percpu_rswem_read_freezable_unsafe(), i.e.,
TASK_FREEZABLE_UNSAFE.
On Wed, 2025-04-02 at 17:32 +0200, Christian Brauner wrote:
[...]
> Jan, one more thought about freezability here. We know that there
> will can be at least one process during hibernation that ends up
> generating page faults and that's systemd-journald. When systemd-
> sleep requests writing a hibernation image via /sys/power/ files it
> will inevitably end up freezing systemd-journald and it may be
> generating a page fault with ->mmap_lock held. systemd-journald is
> now sleeping with SB_FREEZE_PAGEFAULT and TASK_FREEZABLE. We know
> this can cause hibernation to fail. That part is fine. What isn't is
> that we will very likely always trigger:
>
> #ifdef CONFIG_LOCKDEP
> /*
> * It's dangerous to freeze with locks held; there be dragons
> there.
> */
> if (!(state & __TASK_FREEZABLE_UNSAFE))
> WARN_ON_ONCE(debug_locks && p->lockdep_depth);
> #endif
>
> with lockdep enabled.
>
> So we really actually need percpu_rswem_read_freezable_unsafe(),
> i.e., TASK_FREEZABLE_UNSAFE.
The sched people have pretty strong views about people not doing this,
expressed in the comment in sched.h and commit f5d39b020809
("freezer,sched: Rewrite core freezer logic") where most of the _unsafe
variants got removed with prejudice.
If we do get into this situation the worst that can happen is that
another upper lock acquisition triggers a hibernate failure and we thaw
everything, thus we can never truly deadlock, which is the fear, so
perhaps they might be OK with this. Note that Rafael's solution to
this was to disable lockdep around hibernate/suspend and resume, which
is another possibility.
Regards,
James
On Wed, Apr 02, 2025 at 12:03:24PM -0400, James Bottomley wrote:
> On Wed, 2025-04-02 at 17:32 +0200, Christian Brauner wrote:
> [...]
> > Jan, one more thought about freezability here. We know that there
> > will can be at least one process during hibernation that ends up
> > generating page faults and that's systemd-journald. When systemd-
> > sleep requests writing a hibernation image via /sys/power/ files it
> > will inevitably end up freezing systemd-journald and it may be
> > generating a page fault with ->mmap_lock held. systemd-journald is
> > now sleeping with SB_FREEZE_PAGEFAULT and TASK_FREEZABLE. We know
> > this can cause hibernation to fail. That part is fine. What isn't is
> > that we will very likely always trigger:
> >
> > #ifdef CONFIG_LOCKDEP
> > /*
> > * It's dangerous to freeze with locks held; there be dragons
> > there.
> > */
> > if (!(state & __TASK_FREEZABLE_UNSAFE))
> > WARN_ON_ONCE(debug_locks && p->lockdep_depth);
> > #endif
> >
> > with lockdep enabled.
> >
> > So we really actually need percpu_rswem_read_freezable_unsafe(),
> > i.e., TASK_FREEZABLE_UNSAFE.
>
> The sched people have pretty strong views about people not doing this,
> expressed in the comment in sched.h and commit f5d39b020809
> ("freezer,sched: Rewrite core freezer logic") where most of the _unsafe
> variants got removed with prejudice.
>
> If we do get into this situation the worst that can happen is that
> another upper lock acquisition triggers a hibernate failure and we thaw
> everything, thus we can never truly deadlock, which is the fear, so
Yes, I know that it's harmless but we need to not generate misleading
lockdep splats when lockdep is turned on and confuse users.
> perhaps they might be OK with this. Note that Rafael's solution to
> this was to disable lockdep around hibernate/suspend and resume, which
> is another possibility.
It can be done as a follow-up. I'm just saying it needs treatment.
Now all the pieces are in place to actually allow the power subsystem
to freeze/thaw filesystems during suspend/resume. Filesystems are only
frozen and thawed if the power subsystem does actually own the freeze.
We could bubble up errors and fail suspend/resume if the error isn't
EBUSY (aka it's already frozen) but I don't think that this is worth it.
Filesystem freezing during suspend/resume is best-effort. If the user
has 500 ext4 filesystems mounted and 4 fail to freeze for whatever
reason then we simply skip them.
What we have now is already a big improvement and let's see how we fare
with it before making our lives even harder (and uglier) than we have
to.
We add a new sysctl know /sys/power/freeze_filesystems that will allow
userspace to freeze filesystems during suspend/hibernate. For now it
defaults to off. The thaw logic doesn't require checking whether
freezing is enabled because the power subsystem exclusively owns frozen
filesystems for the duration of suspend/hibernate and is able to skip
filesystems it doesn't need to freeze.
Also it is technically possible that filesystem
filesystem_freeze_enabled is true and power freezes the filesystems but
before freezing all processes another process disables
filesystem_freeze_enabled. If power were to place the filesystems_thaw()
call under filesystems_freeze_enabled it would fail to thaw the
fileystems it frozw. The exclusive holder mechanism makes it possible to
iterate through the list without any concern making sure that no
filesystems are left frozen.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/super.c | 14 ++++++++++----
kernel/power/hibernate.c | 16 +++++++++++++++-
kernel/power/main.c | 31 +++++++++++++++++++++++++++++++
kernel/power/power.h | 4 ++++
kernel/power/suspend.c | 7 +++++++
5 files changed, 67 insertions(+), 5 deletions(-)
diff --git a/fs/super.c b/fs/super.c
index 3ddded4360c6..b4bdbc509dba 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1187,6 +1187,8 @@ static inline bool get_active_super(struct super_block *sb)
return active;
}
+static const char *filesystems_freeze_ptr = "filesystems_freeze";
+
static void filesystems_freeze_callback(struct super_block *sb, void *unused)
{
if (!sb->s_op->freeze_fs && !sb->s_op->freeze_super)
@@ -1196,9 +1198,11 @@ static void filesystems_freeze_callback(struct super_block *sb, void *unused)
return;
if (sb->s_op->freeze_super)
- sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
+ sb->s_op->freeze_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
+ filesystems_freeze_ptr);
else
- freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
+ freeze_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
+ filesystems_freeze_ptr);
deactivate_super(sb);
}
@@ -1218,9 +1222,11 @@ static void filesystems_thaw_callback(struct super_block *sb, void *unused)
return;
if (sb->s_op->thaw_super)
- sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
+ sb->s_op->thaw_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
+ filesystems_freeze_ptr);
else
- thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
+ thaw_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
+ filesystems_freeze_ptr);
deactivate_super(sb);
}
diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index 50ec26ea696b..37d733945c59 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -777,6 +777,8 @@ int hibernate(void)
goto Restore;
ksys_sync_helper();
+ if (filesystem_freeze_enabled)
+ filesystems_freeze();
error = freeze_processes();
if (error)
@@ -845,6 +847,7 @@ int hibernate(void)
/* Don't bother checking whether freezer_test_done is true */
freezer_test_done = false;
Exit:
+ filesystems_thaw();
pm_notifier_call_chain(PM_POST_HIBERNATION);
Restore:
pm_restore_console();
@@ -881,6 +884,9 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data)
if (error)
goto restore;
+ if (filesystem_freeze_enabled)
+ filesystems_freeze();
+
error = freeze_processes();
if (error)
goto exit;
@@ -940,6 +946,7 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data)
thaw_processes();
exit:
+ filesystems_thaw();
pm_notifier_call_chain(PM_POST_HIBERNATION);
restore:
@@ -1028,19 +1035,26 @@ static int software_resume(void)
if (error)
goto Restore;
+ if (filesystem_freeze_enabled)
+ filesystems_freeze();
+
pm_pr_dbg("Preparing processes for hibernation restore.\n");
error = freeze_processes();
- if (error)
+ if (error) {
+ filesystems_thaw();
goto Close_Finish;
+ }
error = freeze_kernel_threads();
if (error) {
thaw_processes();
+ filesystems_thaw();
goto Close_Finish;
}
error = load_image_and_restore();
thaw_processes();
+ filesystems_thaw();
Finish:
pm_notifier_call_chain(PM_POST_RESTORE);
Restore:
diff --git a/kernel/power/main.c b/kernel/power/main.c
index 6254814d4817..0b0e76324c43 100644
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -962,6 +962,34 @@ power_attr(pm_freeze_timeout);
#endif /* CONFIG_FREEZER*/
+#if defined(CONFIG_SUSPEND) || defined(CONFIG_HIBERNATION)
+bool filesystem_freeze_enabled = false;
+
+static ssize_t freeze_filesystems_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sysfs_emit(buf, "%d\n", filesystem_freeze_enabled);
+}
+
+static ssize_t freeze_filesystems_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t n)
+{
+ unsigned long val;
+
+ if (kstrtoul(buf, 10, &val))
+ return -EINVAL;
+
+ if (val > 1)
+ return -EINVAL;
+
+ filesystem_freeze_enabled = !!val;
+ return n;
+}
+
+power_attr(freeze_filesystems);
+#endif /* CONFIG_SUSPEND || CONFIG_HIBERNATION */
+
static struct attribute * g[] = {
&state_attr.attr,
#ifdef CONFIG_PM_TRACE
@@ -991,6 +1019,9 @@ static struct attribute * g[] = {
#endif
#ifdef CONFIG_FREEZER
&pm_freeze_timeout_attr.attr,
+#endif
+#if defined(CONFIG_SUSPEND) || defined(CONFIG_HIBERNATION)
+ &freeze_filesystems_attr.attr,
#endif
NULL,
};
diff --git a/kernel/power/power.h b/kernel/power/power.h
index c352dea2f67b..2eb81662b8fa 100644
--- a/kernel/power/power.h
+++ b/kernel/power/power.h
@@ -18,6 +18,10 @@ struct swsusp_info {
unsigned long size;
} __aligned(PAGE_SIZE);
+#if defined(CONFIG_SUSPEND) || defined(CONFIG_HIBERNATION)
+extern bool filesystem_freeze_enabled;
+#endif
+
#ifdef CONFIG_HIBERNATION
/* kernel/power/snapshot.c */
extern void __init hibernate_reserved_size_init(void);
diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index 8eaec4ab121d..76b141b9aac0 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -30,6 +30,7 @@
#include <trace/events/power.h>
#include <linux/compiler.h>
#include <linux/moduleparam.h>
+#include <linux/fs.h>
#include "power.h"
@@ -374,6 +375,8 @@ static int suspend_prepare(suspend_state_t state)
if (error)
goto Restore;
+ if (filesystem_freeze_enabled)
+ filesystems_freeze();
trace_suspend_resume(TPS("freeze_processes"), 0, true);
error = suspend_freeze_processes();
trace_suspend_resume(TPS("freeze_processes"), 0, false);
@@ -550,6 +553,7 @@ int suspend_devices_and_enter(suspend_state_t state)
static void suspend_finish(void)
{
suspend_thaw_processes();
+ filesystems_thaw();
pm_notifier_call_chain(PM_POST_SUSPEND);
pm_restore_console();
}
@@ -588,6 +592,8 @@ static int enter_state(suspend_state_t state)
ksys_sync_helper();
trace_suspend_resume(TPS("sync_filesystems"), 0, false);
}
+ if (filesystem_freeze_enabled)
+ filesystems_freeze();
pm_pr_dbg("Preparing system for sleep (%s)\n", mem_sleep_labels[state]);
pm_suspend_clear_flags();
@@ -609,6 +615,7 @@ static int enter_state(suspend_state_t state)
pm_pr_dbg("Finishing wakeup.\n");
suspend_finish();
Unlock:
+ filesystems_thaw();
mutex_unlock(&system_transition_mutex);
return error;
}
--
2.47.2
On Wed 02-04-25 16:07:33, Christian Brauner wrote:
> Now all the pieces are in place to actually allow the power subsystem
> to freeze/thaw filesystems during suspend/resume. Filesystems are only
> frozen and thawed if the power subsystem does actually own the freeze.
>
> We could bubble up errors and fail suspend/resume if the error isn't
> EBUSY (aka it's already frozen) but I don't think that this is worth it.
> Filesystem freezing during suspend/resume is best-effort. If the user
> has 500 ext4 filesystems mounted and 4 fail to freeze for whatever
> reason then we simply skip them.
>
> What we have now is already a big improvement and let's see how we fare
> with it before making our lives even harder (and uglier) than we have
> to.
>
> We add a new sysctl know /sys/power/freeze_filesystems that will allow
> userspace to freeze filesystems during suspend/hibernate. For now it
> defaults to off. The thaw logic doesn't require checking whether
> freezing is enabled because the power subsystem exclusively owns frozen
> filesystems for the duration of suspend/hibernate and is able to skip
> filesystems it doesn't need to freeze.
>
> Also it is technically possible that filesystem
> filesystem_freeze_enabled is true and power freezes the filesystems but
> before freezing all processes another process disables
> filesystem_freeze_enabled. If power were to place the filesystems_thaw()
> call under filesystems_freeze_enabled it would fail to thaw the
> fileystems it frozw. The exclusive holder mechanism makes it possible to
> iterate through the list without any concern making sure that no
> filesystems are left frozen.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
Looks good modulo the nesting issue I've mentioned in my comments to patch
1. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/super.c | 14 ++++++++++----
> kernel/power/hibernate.c | 16 +++++++++++++++-
> kernel/power/main.c | 31 +++++++++++++++++++++++++++++++
> kernel/power/power.h | 4 ++++
> kernel/power/suspend.c | 7 +++++++
> 5 files changed, 67 insertions(+), 5 deletions(-)
>
> diff --git a/fs/super.c b/fs/super.c
> index 3ddded4360c6..b4bdbc509dba 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1187,6 +1187,8 @@ static inline bool get_active_super(struct super_block *sb)
> return active;
> }
>
> +static const char *filesystems_freeze_ptr = "filesystems_freeze";
> +
> static void filesystems_freeze_callback(struct super_block *sb, void *unused)
> {
> if (!sb->s_op->freeze_fs && !sb->s_op->freeze_super)
> @@ -1196,9 +1198,11 @@ static void filesystems_freeze_callback(struct super_block *sb, void *unused)
> return;
>
> if (sb->s_op->freeze_super)
> - sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
> + sb->s_op->freeze_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
> + filesystems_freeze_ptr);
> else
> - freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
> + freeze_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
> + filesystems_freeze_ptr);
>
> deactivate_super(sb);
> }
> @@ -1218,9 +1222,11 @@ static void filesystems_thaw_callback(struct super_block *sb, void *unused)
> return;
>
> if (sb->s_op->thaw_super)
> - sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
> + sb->s_op->thaw_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
> + filesystems_freeze_ptr);
> else
> - thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
> + thaw_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
> + filesystems_freeze_ptr);
>
> deactivate_super(sb);
> }
> diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
> index 50ec26ea696b..37d733945c59 100644
> --- a/kernel/power/hibernate.c
> +++ b/kernel/power/hibernate.c
> @@ -777,6 +777,8 @@ int hibernate(void)
> goto Restore;
>
> ksys_sync_helper();
> + if (filesystem_freeze_enabled)
> + filesystems_freeze();
>
> error = freeze_processes();
> if (error)
> @@ -845,6 +847,7 @@ int hibernate(void)
> /* Don't bother checking whether freezer_test_done is true */
> freezer_test_done = false;
> Exit:
> + filesystems_thaw();
> pm_notifier_call_chain(PM_POST_HIBERNATION);
> Restore:
> pm_restore_console();
> @@ -881,6 +884,9 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data)
> if (error)
> goto restore;
>
> + if (filesystem_freeze_enabled)
> + filesystems_freeze();
> +
> error = freeze_processes();
> if (error)
> goto exit;
> @@ -940,6 +946,7 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data)
> thaw_processes();
>
> exit:
> + filesystems_thaw();
> pm_notifier_call_chain(PM_POST_HIBERNATION);
>
> restore:
> @@ -1028,19 +1035,26 @@ static int software_resume(void)
> if (error)
> goto Restore;
>
> + if (filesystem_freeze_enabled)
> + filesystems_freeze();
> +
> pm_pr_dbg("Preparing processes for hibernation restore.\n");
> error = freeze_processes();
> - if (error)
> + if (error) {
> + filesystems_thaw();
> goto Close_Finish;
> + }
>
> error = freeze_kernel_threads();
> if (error) {
> thaw_processes();
> + filesystems_thaw();
> goto Close_Finish;
> }
>
> error = load_image_and_restore();
> thaw_processes();
> + filesystems_thaw();
> Finish:
> pm_notifier_call_chain(PM_POST_RESTORE);
> Restore:
> diff --git a/kernel/power/main.c b/kernel/power/main.c
> index 6254814d4817..0b0e76324c43 100644
> --- a/kernel/power/main.c
> +++ b/kernel/power/main.c
> @@ -962,6 +962,34 @@ power_attr(pm_freeze_timeout);
>
> #endif /* CONFIG_FREEZER*/
>
> +#if defined(CONFIG_SUSPEND) || defined(CONFIG_HIBERNATION)
> +bool filesystem_freeze_enabled = false;
> +
> +static ssize_t freeze_filesystems_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf)
> +{
> + return sysfs_emit(buf, "%d\n", filesystem_freeze_enabled);
> +}
> +
> +static ssize_t freeze_filesystems_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t n)
> +{
> + unsigned long val;
> +
> + if (kstrtoul(buf, 10, &val))
> + return -EINVAL;
> +
> + if (val > 1)
> + return -EINVAL;
> +
> + filesystem_freeze_enabled = !!val;
> + return n;
> +}
> +
> +power_attr(freeze_filesystems);
> +#endif /* CONFIG_SUSPEND || CONFIG_HIBERNATION */
> +
> static struct attribute * g[] = {
> &state_attr.attr,
> #ifdef CONFIG_PM_TRACE
> @@ -991,6 +1019,9 @@ static struct attribute * g[] = {
> #endif
> #ifdef CONFIG_FREEZER
> &pm_freeze_timeout_attr.attr,
> +#endif
> +#if defined(CONFIG_SUSPEND) || defined(CONFIG_HIBERNATION)
> + &freeze_filesystems_attr.attr,
> #endif
> NULL,
> };
> diff --git a/kernel/power/power.h b/kernel/power/power.h
> index c352dea2f67b..2eb81662b8fa 100644
> --- a/kernel/power/power.h
> +++ b/kernel/power/power.h
> @@ -18,6 +18,10 @@ struct swsusp_info {
> unsigned long size;
> } __aligned(PAGE_SIZE);
>
> +#if defined(CONFIG_SUSPEND) || defined(CONFIG_HIBERNATION)
> +extern bool filesystem_freeze_enabled;
> +#endif
> +
> #ifdef CONFIG_HIBERNATION
> /* kernel/power/snapshot.c */
> extern void __init hibernate_reserved_size_init(void);
> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> index 8eaec4ab121d..76b141b9aac0 100644
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -30,6 +30,7 @@
> #include <trace/events/power.h>
> #include <linux/compiler.h>
> #include <linux/moduleparam.h>
> +#include <linux/fs.h>
>
> #include "power.h"
>
> @@ -374,6 +375,8 @@ static int suspend_prepare(suspend_state_t state)
> if (error)
> goto Restore;
>
> + if (filesystem_freeze_enabled)
> + filesystems_freeze();
> trace_suspend_resume(TPS("freeze_processes"), 0, true);
> error = suspend_freeze_processes();
> trace_suspend_resume(TPS("freeze_processes"), 0, false);
> @@ -550,6 +553,7 @@ int suspend_devices_and_enter(suspend_state_t state)
> static void suspend_finish(void)
> {
> suspend_thaw_processes();
> + filesystems_thaw();
> pm_notifier_call_chain(PM_POST_SUSPEND);
> pm_restore_console();
> }
> @@ -588,6 +592,8 @@ static int enter_state(suspend_state_t state)
> ksys_sync_helper();
> trace_suspend_resume(TPS("sync_filesystems"), 0, false);
> }
> + if (filesystem_freeze_enabled)
> + filesystems_freeze();
>
> pm_pr_dbg("Preparing system for sleep (%s)\n", mem_sleep_labels[state]);
> pm_suspend_clear_flags();
> @@ -609,6 +615,7 @@ static int enter_state(suspend_state_t state)
> pm_pr_dbg("Finishing wakeup.\n");
> suspend_finish();
> Unlock:
> + filesystems_thaw();
> mutex_unlock(&system_transition_mutex);
> return error;
> }
>
> --
> 2.47.2
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
Sysfs is built on top of kernfs and sysfs provides the power management
infrastructure to support suspend/hibernate by writing to various files
in /sys/power/. As filesystems may be automatically frozen during
suspend/hibernate implementing freeze/thaw support for kernfs
generically will cause deadlocks as the suspending/hibernation
initiating task will hold a VFS lock that it will then wait upon to be
released. If freeze/thaw for kernfs is needed talk to the VFS.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/kernfs/mount.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 1358c21837f1..d2073bb2b633 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,21 @@ const struct super_operations kernfs_sops = {
.show_options = kernfs_sop_show_options,
.show_path = kernfs_sop_show_path,
+
+ /*
+ * sysfs is built on top of kernfs and sysfs provides the power
+ * management infrastructure to support suspend/hibernate by
+ * writing to various files in /sys/power/. As filesystems may
+ * be automatically frozen during suspend/hibernate implementing
+ * freeze/thaw support for kernfs generically will cause
+ * deadlocks as the suspending/hibernation initiating task will
+ * hold a VFS lock that it will then wait upon to be released.
+ * If freeze/thaw for kernfs is needed talk to the VFS.
+ */
+ .freeze_fs = NULL,
+ .unfreeze_fs = NULL,
+ .freeze_super = NULL,
+ .thaw_super = NULL,
};
static int kernfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
--
2.47.2
On Wed 02-04-25 16:07:34, Christian Brauner wrote:
> Sysfs is built on top of kernfs and sysfs provides the power management
> infrastructure to support suspend/hibernate by writing to various files
> in /sys/power/. As filesystems may be automatically frozen during
> suspend/hibernate implementing freeze/thaw support for kernfs
> generically will cause deadlocks as the suspending/hibernation
> initiating task will hold a VFS lock that it will then wait upon to be
> released. If freeze/thaw for kernfs is needed talk to the VFS.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
Yeah, good idea. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/kernfs/mount.c | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index 1358c21837f1..d2073bb2b633 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -62,6 +62,21 @@ const struct super_operations kernfs_sops = {
>
> .show_options = kernfs_sop_show_options,
> .show_path = kernfs_sop_show_path,
> +
> + /*
> + * sysfs is built on top of kernfs and sysfs provides the power
> + * management infrastructure to support suspend/hibernate by
> + * writing to various files in /sys/power/. As filesystems may
> + * be automatically frozen during suspend/hibernate implementing
> + * freeze/thaw support for kernfs generically will cause
> + * deadlocks as the suspending/hibernation initiating task will
> + * hold a VFS lock that it will then wait upon to be released.
> + * If freeze/thaw for kernfs is needed talk to the VFS.
> + */
> + .freeze_fs = NULL,
> + .unfreeze_fs = NULL,
> + .freeze_super = NULL,
> + .thaw_super = NULL,
> };
>
> static int kernfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
>
> --
> 2.47.2
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Tue, Apr 01, 2025 at 02:32:45AM +0200, Christian Brauner wrote: > The whole shebang can also be found at: > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze > > I know nothing about power or hibernation. I would like to place this behind a Kconfig option and add a /sys/power/freeze_on_suspend option as these changes are pretty sensitive and to give userspace the ability to experiment with this for a while until we remove it. That means we should skip the removal of all the freezer changes in the filesystems until we're happy enough that this works reliable enough.
© 2016 - 2025 Red Hat, Inc.