From: Hou Tao <houtao1@huawei.com>

The freeing of relinquished volume will wake up the pending volume
acquisition by using wake_up_bit(), however it is mismatched with
wait_var_event() used in fscache_wait_on_volume_collision() and it will
never wake up the waiter in the wait-queue because these two functions
operate on different wait-queues.

According to the implementation in fscache_wait_on_volume_collision(),
if the wake-up of pending acquisition is delayed longer than 20 seconds
(e.g., due to the delay of on-demand fd closing), the first
wait_var_event_timeout() will timeout and the following wait_var_event()
will hang forever as shown below:

  FS-Cache: Potential volume collision new=00000024 old=00000022
  ......
  INFO: task mount:1148 blocked for more than 122 seconds.
        Not tainted 6.1.0-rc6+ #1
  task:mount state:D stack:0 pid:1148 ppid:1
  Call Trace:
   <TASK>
   __schedule+0x2f6/0xb80
   schedule+0x67/0xe0
   fscache_wait_on_volume_collision.cold+0x80/0x82
   __fscache_acquire_volume+0x40d/0x4e0
   erofs_fscache_register_volume+0x51/0xe0 [erofs]
   erofs_fscache_register_fs+0x19c/0x240 [erofs]
   erofs_fc_fill_super+0x746/0xaf0 [erofs]
   vfs_get_super+0x7d/0x100
   get_tree_nodev+0x16/0x20
   erofs_fc_get_tree+0x20/0x30 [erofs]
   vfs_get_tree+0x24/0xb0
   path_mount+0x2fa/0xa90
   do_mount+0x7c/0xa0
   __x64_sys_mount+0x8b/0xe0
   do_syscall_64+0x30/0x60
   entry_SYSCALL_64_after_hwframe+0x46/0xb0

Considering that wake_up_bit() is more selective, so fixing it by using
wait_on_bit() instead of wait_var_event() to wait for the freeing of
relinquished volume. In addition because waitqueue_active() is used in
wake_up_bit() and clear_bit() doesn't imply any memory barrier, so also
adding smp_mb__after_atomic() before wake_up_bit().

Fixes: 62ab63352350 ("fscache: Implement volume registration")
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 fs/fscache/volume.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/fs/fscache/volume.c b/fs/fscache/volume.c
index ab8ceddf9efa..fc3dd3bc851d 100644
--- a/fs/fscache/volume.c
+++ b/fs/fscache/volume.c
@@ -141,13 +141,14 @@ static bool fscache_is_acquire_pending(struct fscache_volume *volume)
 static void fscache_wait_on_volume_collision(struct fscache_volume *candidate,
 					     unsigned int collidee_debug_id)
 {
-	wait_var_event_timeout(&candidate->flags,
-			       !fscache_is_acquire_pending(candidate), 20 * HZ);
+	wait_on_bit_timeout(&candidate->flags, FSCACHE_VOLUME_ACQUIRE_PENDING,
+			    TASK_UNINTERRUPTIBLE, 20 * HZ);
 	if (fscache_is_acquire_pending(candidate)) {
 		pr_notice("Potential volume collision new=%08x old=%08x",
 			  candidate->debug_id, collidee_debug_id);
 		fscache_stat(&fscache_n_volumes_collision);
-		wait_var_event(&candidate->flags, !fscache_is_acquire_pending(candidate));
+		wait_on_bit(&candidate->flags, FSCACHE_VOLUME_ACQUIRE_PENDING,
+			    TASK_UNINTERRUPTIBLE);
 	}
 }
 
@@ -348,6 +349,11 @@ static void fscache_wake_pending_volume(struct fscache_volume *volume,
 		if (fscache_volume_same(cursor, volume)) {
 			fscache_see_volume(cursor, fscache_volume_see_hash_wake);
 			clear_bit(FSCACHE_VOLUME_ACQUIRE_PENDING, &cursor->flags);
+			/*
+			 * Paired with barrier in wait_on_bit(). Check
+			 * wake_up_bit() and waitqueue_active() for details.
+			 */
+			smp_mb__after_atomic();
 			wake_up_bit(&cursor->flags, FSCACHE_VOLUME_ACQUIRE_PENDING);
 			return;
 		}
--
2.29.2
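
The mismatch described in the commit message can be illustrated with a
small userspace model. This is only a sketch under stated assumptions:
the toy_* names, the single flat waiter table and the VAR_KEY_BIT marker
are invented for illustration and are not the kernel's data structures.
It takes from the commit message only the claim that a wake_up_bit()
style wake-up does not reach a waiter registered through a
wait_var_event() style call on the same word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: names and layout are hypothetical, not kernel code. */
#define VAR_KEY_BIT	(-1)	/* assumed marker: waiter sleeps on the whole variable */
#define MAX_WAITERS	8

struct toy_waiter {
	const void *addr;	/* address the waiter sleeps on */
	int bit;		/* bit number, or VAR_KEY_BIT */
	bool woken;
};

struct toy_table {
	struct toy_waiter waiters[MAX_WAITERS];
	size_t count;
};

static struct toy_waiter *toy_add(struct toy_table *t, const void *addr, int bit)
{
	struct toy_waiter *w;

	assert(t->count < MAX_WAITERS);
	w = &t->waiters[t->count++];
	w->addr = addr;
	w->bit = bit;
	w->woken = false;
	return w;
}

/* Models a waiter registered through a wait_var_event() style call. */
static struct toy_waiter *toy_wait_var(struct toy_table *t, const void *addr)
{
	return toy_add(t, addr, VAR_KEY_BIT);
}

/* Models a waiter registered through a wait_on_bit() style call. */
static struct toy_waiter *toy_wait_bit(struct toy_table *t, const void *addr, int bit)
{
	return toy_add(t, addr, bit);
}

/*
 * Models a wake_up_bit() style wake-up: only waiters whose (addr, bit)
 * key matches are woken. Returns the number of waiters woken.
 */
static int toy_wake_bit(struct toy_table *t, const void *addr, int bit)
{
	int woken = 0;

	for (size_t i = 0; i < t->count; i++) {
		struct toy_waiter *w = &t->waiters[i];

		if (!w->woken && w->addr == addr && w->bit == bit) {
			w->woken = true;
			woken++;
		}
	}
	return woken;
}
```

In this model a waiter added with toy_wait_var(&t, &flags) is not woken
by toy_wake_bit(&t, &flags, 2), while one added with
toy_wait_bit(&t, &flags, 2) is, which mirrors why the patch switches the
waiting side to wait_on_bit().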

On 12/26/22 6:33 PM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> The freeing of relinquished volume will wake up the pending volume
> acquisition by using wake_up_bit(), however it is mismatched with
> wait_var_event() used in fscache_wait_on_volume_collision() and it will
> never wake up the waiter in the wait-queue because these two functions
> operate on different wait-queues.
>
> According to the implementation in fscache_wait_on_volume_collision(),
> if the wake-up of pending acquisition is delayed longer than 20 seconds
> (e.g., due to the delay of on-demand fd closing), the first
> wait_var_event_timeout() will timeout and the following wait_var_event()
> will hang forever as shown below:
>
>   FS-Cache: Potential volume collision new=00000024 old=00000022
>   ......
>   INFO: task mount:1148 blocked for more than 122 seconds.
>         Not tainted 6.1.0-rc6+ #1
>   task:mount state:D stack:0 pid:1148 ppid:1
>   Call Trace:
>    <TASK>
>    __schedule+0x2f6/0xb80
>    schedule+0x67/0xe0
>    fscache_wait_on_volume_collision.cold+0x80/0x82
>    __fscache_acquire_volume+0x40d/0x4e0
>    erofs_fscache_register_volume+0x51/0xe0 [erofs]
>    erofs_fscache_register_fs+0x19c/0x240 [erofs]
>    erofs_fc_fill_super+0x746/0xaf0 [erofs]
>    vfs_get_super+0x7d/0x100
>    get_tree_nodev+0x16/0x20
>    erofs_fc_get_tree+0x20/0x30 [erofs]
>    vfs_get_tree+0x24/0xb0
>    path_mount+0x2fa/0xa90
>    do_mount+0x7c/0xa0
>    __x64_sys_mount+0x8b/0xe0
>    do_syscall_64+0x30/0x60
>    entry_SYSCALL_64_after_hwframe+0x46/0xb0
>
> Considering that wake_up_bit() is more selective, so fixing it by using
                                                       ^
                                                       fix
> wait_on_bit() instead of wait_var_event() to wait for the freeing of
> relinquished volume. In addition because waitqueue_active() is used in
> wake_up_bit() and clear_bit() doesn't imply any memory barrier, so also
> adding smp_mb__after_atomic() before wake_up_bit().

... doesn't imply any memory barrier, add ...

>
> Fixes: 62ab63352350 ("fscache: Implement volume registration")
> Signed-off-by: Hou Tao <houtao1@huawei.com>

Otherwise LGTM :)

Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>

> ---
>  fs/fscache/volume.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/fs/fscache/volume.c b/fs/fscache/volume.c
> index ab8ceddf9efa..fc3dd3bc851d 100644
> --- a/fs/fscache/volume.c
> +++ b/fs/fscache/volume.c
> @@ -141,13 +141,14 @@ static bool fscache_is_acquire_pending(struct fscache_volume *volume)
>  static void fscache_wait_on_volume_collision(struct fscache_volume *candidate,
>  					     unsigned int collidee_debug_id)
>  {
> -	wait_var_event_timeout(&candidate->flags,
> -			       !fscache_is_acquire_pending(candidate), 20 * HZ);
> +	wait_on_bit_timeout(&candidate->flags, FSCACHE_VOLUME_ACQUIRE_PENDING,
> +			    TASK_UNINTERRUPTIBLE, 20 * HZ);
>  	if (fscache_is_acquire_pending(candidate)) {
>  		pr_notice("Potential volume collision new=%08x old=%08x",
>  			  candidate->debug_id, collidee_debug_id);
>  		fscache_stat(&fscache_n_volumes_collision);
> -		wait_var_event(&candidate->flags, !fscache_is_acquire_pending(candidate));
> +		wait_on_bit(&candidate->flags, FSCACHE_VOLUME_ACQUIRE_PENDING,
> +			    TASK_UNINTERRUPTIBLE);
>  	}
>  }
>
> @@ -348,6 +349,11 @@ static void fscache_wake_pending_volume(struct fscache_volume *volume,
>  		if (fscache_volume_same(cursor, volume)) {
>  			fscache_see_volume(cursor, fscache_volume_see_hash_wake);
>  			clear_bit(FSCACHE_VOLUME_ACQUIRE_PENDING, &cursor->flags);
> +			/*
> +			 * Paired with barrier in wait_on_bit(). Check
> +			 * wake_up_bit() and waitqueue_active() for details.
> +			 */
> +			smp_mb__after_atomic();
>  			wake_up_bit(&cursor->flags, FSCACHE_VOLUME_ACQUIRE_PENDING);
>  			return;
>  		}

-- 
Thanks,
Jingbo

Hi,

On 1/12/2023 11:47 AM, Jingbo Xu wrote:
>
> On 12/26/22 6:33 PM, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> The freeing of relinquished volume will wake up the pending volume
>> acquisition by using wake_up_bit(), however it is mismatched with
>> wait_var_event() used in fscache_wait_on_volume_collision() and it will
>> never wake up the waiter in the wait-queue because these two functions
>> operate on different wait-queues.
>>
>> According to the implementation in fscache_wait_on_volume_collision(),
>> if the wake-up of pending acquisition is delayed longer than 20 seconds
>> (e.g., due to the delay of on-demand fd closing), the first
>> wait_var_event_timeout() will timeout and the following wait_var_event()
>> will hang forever as shown below:
>>
>>   FS-Cache: Potential volume collision new=00000024 old=00000022
>>   ......
>>   INFO: task mount:1148 blocked for more than 122 seconds.
>>         Not tainted 6.1.0-rc6+ #1
>>   task:mount state:D stack:0 pid:1148 ppid:1
>>   Call Trace:
>>    <TASK>
>>    __schedule+0x2f6/0xb80
>>    schedule+0x67/0xe0
>>    fscache_wait_on_volume_collision.cold+0x80/0x82
>>    __fscache_acquire_volume+0x40d/0x4e0
>>    erofs_fscache_register_volume+0x51/0xe0 [erofs]
>>    erofs_fscache_register_fs+0x19c/0x240 [erofs]
>>    erofs_fc_fill_super+0x746/0xaf0 [erofs]
>>    vfs_get_super+0x7d/0x100
>>    get_tree_nodev+0x16/0x20
>>    erofs_fc_get_tree+0x20/0x30 [erofs]
>>    vfs_get_tree+0x24/0xb0
>>    path_mount+0x2fa/0xa90
>>    do_mount+0x7c/0xa0
>>    __x64_sys_mount+0x8b/0xe0
>>    do_syscall_64+0x30/0x60
>>    entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>
>> Considering that wake_up_bit() is more selective, so fixing it by using
>                                                       ^
>                                                       fix
>> wait_on_bit() instead of wait_var_event() to wait for the freeing of
>> relinquished volume. In addition because waitqueue_active() is used in
>> wake_up_bit() and clear_bit() doesn't imply any memory barrier, so also
>> adding smp_mb__after_atomic() before wake_up_bit().
> ... doesn't imply any memory barrier, add ...

Thanks for suggestions above. Will update in v3.

>
>> Fixes: 62ab63352350 ("fscache: Implement volume registration")
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>
> Otherwise LGTM :)
>
> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>

Thanks for review.

>
>> ---
>>  fs/fscache/volume.c | 12 +++++++++---
>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/fscache/volume.c b/fs/fscache/volume.c
>> index ab8ceddf9efa..fc3dd3bc851d 100644
>> --- a/fs/fscache/volume.c
>> +++ b/fs/fscache/volume.c
>> @@ -141,13 +141,14 @@ static bool fscache_is_acquire_pending(struct fscache_volume *volume)
>>  static void fscache_wait_on_volume_collision(struct fscache_volume *candidate,
>>  					     unsigned int collidee_debug_id)
>>  {
>> -	wait_var_event_timeout(&candidate->flags,
>> -			       !fscache_is_acquire_pending(candidate), 20 * HZ);
>> +	wait_on_bit_timeout(&candidate->flags, FSCACHE_VOLUME_ACQUIRE_PENDING,
>> +			    TASK_UNINTERRUPTIBLE, 20 * HZ);
>>  	if (fscache_is_acquire_pending(candidate)) {
>>  		pr_notice("Potential volume collision new=%08x old=%08x",
>>  			  candidate->debug_id, collidee_debug_id);
>>  		fscache_stat(&fscache_n_volumes_collision);
>> -		wait_var_event(&candidate->flags, !fscache_is_acquire_pending(candidate));
>> +		wait_on_bit(&candidate->flags, FSCACHE_VOLUME_ACQUIRE_PENDING,
>> +			    TASK_UNINTERRUPTIBLE);
>>  	}
>>  }
>>
>> @@ -348,6 +349,11 @@ static void fscache_wake_pending_volume(struct fscache_volume *volume,
>>  		if (fscache_volume_same(cursor, volume)) {
>>  			fscache_see_volume(cursor, fscache_volume_see_hash_wake);
>>  			clear_bit(FSCACHE_VOLUME_ACQUIRE_PENDING, &cursor->flags);
>> +			/*
>> +			 * Paired with barrier in wait_on_bit(). Check
>> +			 * wake_up_bit() and waitqueue_active() for details.
>> +			 */
>> +			smp_mb__after_atomic();
>>  			wake_up_bit(&cursor->flags, FSCACHE_VOLUME_ACQUIRE_PENDING);
>>  			return;
>>  		}

Hou Tao <houtao@huaweicloud.com> wrote:

>  			clear_bit(FSCACHE_VOLUME_ACQUIRE_PENDING, &cursor->flags);
> +			/*
> +			 * Paired with barrier in wait_on_bit(). Check
> +			 * wake_up_bit() and waitqueue_active() for details.
> +			 */
> +			smp_mb__after_atomic();
>  			wake_up_bit(&cursor->flags, FSCACHE_VOLUME_ACQUIRE_PENDING);

What two values are you applying a partial ordering to?

David

On 1/12/23 12:06 AM, David Howells wrote:
> Hou Tao <houtao@huaweicloud.com> wrote:
>
>>  			clear_bit(FSCACHE_VOLUME_ACQUIRE_PENDING, &cursor->flags);
>> +			/*
>> +			 * Paired with barrier in wait_on_bit(). Check
>> +			 * wake_up_bit() and waitqueue_active() for details.
>> +			 */
>> +			smp_mb__after_atomic();
>>  			wake_up_bit(&cursor->flags, FSCACHE_VOLUME_ACQUIRE_PENDING);
>
> What two values are you applying a partial ordering to?

Yeah Hou Tao has explained that a full barrier is needed here to avoid
the potential reordering at the waker side.

As I was also researching on this these days, I'd like to share my
thought on this, hopefully if it could give some insight :)

Without the barrier at the waker side, it may suffer from the following
race:

```
CPU0 - waker                            CPU1 - waiter

if (waitqueue_active(wq_head)) <-- find no wq_entry in wq_head list
    wake_up(wq_head);

                                        for (;;) {
                                            prepare_to_wait(...);
                                            # add wq_entry into wq_head list

                                            if (@cond) <-- @cond is false
                                                break;
                                            schedule(); <-- wq_entry still in
                                                            wq_head list,
                                                            wait for next wakeup
                                        }
                                        finish_wait(&wq_head, &wait);

@cond = true;
```

in which case the waiter misses the wakeup for one time.

-- 
Thanks,
Jingbo

Hi,

On 1/12/2023 11:58 AM, Jingbo Xu wrote:
>
> On 1/12/23 12:06 AM, David Howells wrote:
>> Hou Tao <houtao@huaweicloud.com> wrote:
>>
>>>  			clear_bit(FSCACHE_VOLUME_ACQUIRE_PENDING, &cursor->flags);
>>> +			/*
>>> +			 * Paired with barrier in wait_on_bit(). Check
>>> +			 * wake_up_bit() and waitqueue_active() for details.
>>> +			 */
>>> +			smp_mb__after_atomic();
>>>  			wake_up_bit(&cursor->flags, FSCACHE_VOLUME_ACQUIRE_PENDING);
>> What two values are you applying a partial ordering to?
> Yeah Hou Tao has explained that a full barrier is needed here to avoid
> the potential reordering at the waker side.
>
> As I was also researching on this these days, I'd like to share my
> thought on this, hopefully if it could give some insight :)
>
> Without the barrier at the waker side, it may suffer from the following
> race:
>
> ```
> CPU0 - waker                            CPU1 - waiter
>
> if (waitqueue_active(wq_head)) <-- find no wq_entry in wq_head list
>     wake_up(wq_head);
>
>                                         for (;;) {
>                                             prepare_to_wait(...);
>                                             # add wq_entry into wq_head list
>
>                                             if (@cond) <-- @cond is false
>                                                 break;
>                                             schedule(); <-- wq_entry still in
>                                                             wq_head list,
>                                                             wait for next wakeup
>                                         }
>                                         finish_wait(&wq_head, &wait);
>
> @cond = true;
> ```
>
> in which case the waiter misses the wakeup for one time.

Thanks for the details annotation. It is exactly what I tried to say but
failed to.

>

Hi,

On 1/12/2023 12:06 AM, David Howells wrote:
> Hou Tao <houtao@huaweicloud.com> wrote:
>
>>  			clear_bit(FSCACHE_VOLUME_ACQUIRE_PENDING, &cursor->flags);
>> +			/*
>> +			 * Paired with barrier in wait_on_bit(). Check
>> +			 * wake_up_bit() and waitqueue_active() for details.
>> +			 */
>> +			smp_mb__after_atomic();
>>  			wake_up_bit(&cursor->flags, FSCACHE_VOLUME_ACQUIRE_PENDING);
> What two values are you applying a partial ordering to?

cursor->flags and wq->head. fscache_wake_pending_volume() will write
cursor->flags and read wq->head through waitqueue_active(), and the wait
will write wq->head then read cursor->flags.

>
> David
>
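
The ordering on cursor->flags and wq->head described above can be
checked with a small userspace enumeration. This is a sketch, not kernel
code: the toy_* names and the four-event model are assumptions made for
illustration. Each side performs one store followed by one load, a store
becomes visible only when it "commits", and a full barrier is modelled
as forcing the store to commit before the later load on the same CPU.
The program tries every global order of the four events and reports
whether the waker can see an empty wait-queue while the waiter still
sees the pending bit set.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: event names are hypothetical, chosen to mirror the thread. */
enum toy_event {
	WAKER_COMMIT_CLEAR,	/* clear_bit() on cursor->flags becomes visible */
	WAKER_LOAD_QUEUE,	/* waitqueue_active() reads wq->head */
	WAITER_COMMIT_ENQUEUE,	/* waiter's write to wq->head becomes visible */
	WAITER_LOAD_FLAG,	/* waiter re-reads cursor->flags */
	TOY_NEVENTS
};

/* pos[e] is the position of event e in one candidate global order. */
static bool toy_is_permutation(const int pos[TOY_NEVENTS])
{
	for (int i = 0; i < TOY_NEVENTS; i++) {
		assert(pos[i] >= 0 && pos[i] < TOY_NEVENTS);
		for (int j = i + 1; j < TOY_NEVENTS; j++)
			if (pos[i] == pos[j])
				return false;
	}
	return true;
}

static bool toy_order_loses_wakeup(const int pos[TOY_NEVENTS],
				   bool waker_barrier, bool waiter_barrier)
{
	/* A full barrier rules out orders where the load passes the store. */
	if (waker_barrier && pos[WAKER_COMMIT_CLEAR] > pos[WAKER_LOAD_QUEUE])
		return false;
	if (waiter_barrier && pos[WAITER_COMMIT_ENQUEUE] > pos[WAITER_LOAD_FLAG])
		return false;
	/* Waker sees an empty queue and waiter still sees the bit set. */
	return pos[WAKER_LOAD_QUEUE] < pos[WAITER_COMMIT_ENQUEUE] &&
	       pos[WAITER_LOAD_FLAG] < pos[WAKER_COMMIT_CLEAR];
}

/* Try all 24 global orders of the four events. */
static bool toy_lost_wakeup_possible(bool waker_barrier, bool waiter_barrier)
{
	int pos[TOY_NEVENTS];

	for (pos[0] = 0; pos[0] < TOY_NEVENTS; pos[0]++)
		for (pos[1] = 0; pos[1] < TOY_NEVENTS; pos[1]++)
			for (pos[2] = 0; pos[2] < TOY_NEVENTS; pos[2]++)
				for (pos[3] = 0; pos[3] < TOY_NEVENTS; pos[3]++)
					if (toy_is_permutation(pos) &&
					    toy_order_loses_wakeup(pos, waker_barrier,
								   waiter_barrier))
						return true;
	return false;
}
```

In this model toy_lost_wakeup_possible(false, true) is true, which
corresponds to the race in the diagram, and
toy_lost_wakeup_possible(true, true) is false, which corresponds to
adding smp_mb__after_atomic() on the waker side while the waiter side
already has its barrier.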