[PATCH] erofs: fix PSI memstall accounting

Gao Xiang posted 1 patch 1 year, 2 months ago
fs/erofs/zdata.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
[PATCH] erofs: fix PSI memstall accounting
Posted by Gao Xiang 1 year, 2 months ago
Max Kellermann recently reported psi_group_cpu.tasks[NR_MEMSTALL] is
incorrect in 6.11.9 kernel.

I think the root cause of the recent issue is that since the problematic
commit, bio can be NULL so psi_memstall_leave() could be missed in
z_erofs_submit_queue().

Reported-by: Max Kellermann <max.kellermann@ionos.com>
Closes: https://lore.kernel.org/r/CAKPOu+8tvSowiJADW2RuKyofL_CSkm_SuyZA7ME5vMLWmL6pqw@mail.gmail.com
Fixes: 9e2f9d34dd12 ("erofs: handle overlapped pclusters out of crafted images properly")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
 fs/erofs/zdata.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 01f147505487..19ef4ff2a134 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -1792,9 +1792,9 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
 			erofs_fscache_submit_bio(bio);
 		else
 			submit_bio(bio);
-		if (memstall)
-			psi_memstall_leave(&pflags);
 	}
+	if (memstall)
+		psi_memstall_leave(&pflags);
 
 	/*
 	 * although background is preferred, no one is pending for submission.
-- 
2.43.5
Re: [PATCH] erofs: fix PSI memstall accounting
Posted by Chao Yu 1 year, 1 month ago
On 2024/11/27 16:52, Gao Xiang wrote:
> Max Kellermann recently reported psi_group_cpu.tasks[NR_MEMSTALL] is
> incorrect in 6.11.9 kernel.
> 
> I think the root cause of the recent issue is that since the problematic
> commit, bio can be NULL so psi_memstall_leave() could be missed in
> z_erofs_submit_queue().
> 
> Reported-by: Max Kellermann <max.kellermann@ionos.com>
> Closes: https://lore.kernel.org/r/CAKPOu+8tvSowiJADW2RuKyofL_CSkm_SuyZA7ME5vMLWmL6pqw@mail.gmail.com
> Fixes: 9e2f9d34dd12 ("erofs: handle overlapped pclusters out of crafted images properly")
> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

Reviewed-by: Chao Yu <chao@kernel.org>

Thanks,
Re: [PATCH] erofs: fix PSI memstall accounting
Posted by Sandeep Dhavale 1 year, 2 months ago
Looks good,

Reviewed-by: Sandeep Dhavale <dhavale@google.com>

Thanks,
Sandeep.
Re: [PATCH] erofs: fix PSI memstall accounting
Posted by Gao Xiang 1 year, 2 months ago
Hi Sandeep,

On 2024/12/3 05:52, Sandeep Dhavale wrote:
> Looks good,
> 
> Reviewed-by: Sandeep Dhavale <dhavale@google.com>

Thanks for the review!  Max Kellermann reported the similar
issue can still be reproduced with this patch although I'm
not sure if it's the same issue though.


Hi Max,
I think at least this patch resolves a recent regression,
I guess we'd better to address it first.

As for your reports, I think we might need more information
about this, since I don't find more clues from the EROFS
codebase itself.

Thanks,
Gao Xiang

> 
> Thanks,
> Sandeep.
Re: [PATCH] erofs: fix PSI memstall accounting
Posted by Max Kellermann 1 year, 2 months ago
On Tue, Dec 3, 2024 at 2:42 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> I think at least this patch resolves a recent regression,
> I guess we'd better to address it first.

Certainly your patch is an improvement, but maybe this just wasn't the
bug I was experiencing.

> As for your reports, I think we might need more information
> about this, since I don't find more clues from the EROFS
> codebase itself.

What kind of information do you need? I posted some high-level
information about my setup here:
https://lore.kernel.org/lkml/CAKPOu+-_X9cc723v_f_BW4CwfHJe_mi=+cbUBP2tZO-kEcyoMA@mail.gmail.com/
What else do you want to know?
This triggers pretty often now - can we improve Suren's patch to
record and dump more information when this happens?

Right now, we don't even know if this is an erofs problem (and if this
ever was). It is just one of several candidates.

Max
Re: [PATCH] erofs: fix PSI memstall accounting
Posted by Gao Xiang 1 year, 2 months ago

On 2024/12/3 16:02, Max Kellermann via Linux-erofs wrote:
> On Tue, Dec 3, 2024 at 2:42 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>> I think at least this patch resolves a recent regression,
>> I guess we'd better to address it first.
> 
> Certainly your patch is an improvement, but maybe this just wasn't the
> bug I was experiencing.
> 
>> As for your reports, I think we might need more information
>> about this, since I don't find more clues from the EROFS
>> codebase itself.
> 
> What kind of information do you need? I posted some high-level
> information about my setup here:
> https://lore.kernel.org/lkml/CAKPOu+-_X9cc723v_f_BW4CwfHJe_mi=+cbUBP2tZO-kEcyoMA@mail.gmail.com/
> What else do you want to know?
> This triggers pretty often now - can we improve Suren's patch to
> record and dump more information when this happens?
> 
> Right now, we don't even know if this is an erofs problem (and if this
> ever was). It is just one of several candidates.

I have no idea since I'm not working on PSI stuffs.

But I guess you could record more frame addresses to get the caller
of readahead_expand()? e.g. __builtin_return_address(1)?

Also btw, is _RET_IP_ stable among all cases as readahead_expand()?

Thanks,
Gao Xiang

> 
> Max

Re: [PATCH] erofs: fix PSI memstall accounting
Posted by Max Kellermann 1 year, 2 months ago
On Tue, Dec 3, 2024 at 9:15 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> But I guess you could record more frame addresses to get the caller
> of readahead_expand()? e.g. __builtin_return_address(1)?

This turned out to be a big disaster. __builtin_return_address(1)
instantly crashes the kernel, causing the whole (production) cluster
to go down.
Re: [PATCH] erofs: fix PSI memstall accounting
Posted by Max Kellermann 1 year, 2 months ago
On Tue, Dec 3, 2024 at 2:30 PM Max Kellermann <max.kellermann@ionos.com> wrote:
> __builtin_return_address(1) instantly crashes the kernel

I was able to capture the __builtin_return_address crash on a serial
console (this has nothing to do with the PSI memstall bug):

[16438.331677] BUG: kernel NULL pointer dereference, address: 0000000000000008
[16438.338693] #PF: supervisor read access in kernel mode
[16438.343866] #PF: error_code(0x0000) - not-present page
[16438.349038] PGD 0 P4D 0
[16438.351588] Oops: Oops: 0000 [#1] SMP PTI
[16438.355625] CPU: 18 UID: 2147486501 PID: 23486 Comm: less Not
tainted 6.11.10-cm4all4-hp+ #291
[16438.364297] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380
Gen10, BIOS U30 09/05/2019
[16438.372880] RIP: 0010:psi_memstall_enter+0x7f/0xa0
[16438.377708] Code: 89 df 80 8b 19 05 00 00 01 48 8b 45 08 48 89 83
c0 08 00 00 48 8b 45 00 48 8b 40 08 48 89 83 c8 08 00 00 48 8b 45 00
48 8b 00 <48> 8b 40 08 48 89 83 d0 08 00 00 e8 d1 fe ff ff 4c 89 e7 e8
29 28
[16438.396609] RSP: 0000:ffff9ee683bc7bd0 EFLAGS: 00010002
[16438.401869] RAX: 0000000000000000 RBX: ffff90c70c291740 RCX: 0000000000000001
[16438.409052] RDX: 000000000000000a RSI: 0000000000000000 RDI: ffff90c70c291740
[16438.416234] RBP: ffff9ee683bc7be0 R08: 000000000007cec1 R09: 0000000000000007
[16438.423418] R10: ffff90c727d83678 R11: 0000000000000000 R12: ffff9124bd0ae000
[16438.430600] R13: 0000000000000014 R14: ffff9ee683bc7ce8 R15: 0000000000000000
[16438.437785] FS:  00007ff15f1f4740(0000) GS:ffff9124bd080000(0000)
knlGS:0000000000000000
[16438.445929] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16438.451714] CR2: 0000000000000008 CR3: 000000010c54a002 CR4: 00000000007706f0
[16438.458896] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[16438.466079] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[16438.473263] PKRU: 55555554
[16438.475984] Call Trace:
[16438.478446]  <TASK>
[16438.480557]  ? __die+0x1f/0x60
[16438.483636]  ? page_fault_oops+0x15c/0x450
[16438.487761]  ? exc_page_fault+0x5e/0x100
[16438.491710]  ? asm_exc_page_fault+0x22/0x30
[16438.495923]  ? psi_memstall_enter+0x7f/0xa0
[16438.500135]  read_pages+0x205/0x220
[16438.503647]  ? free_unref_page+0x2c1/0x420
[16438.507771]  page_cache_ra_order+0x1d3/0x2b0
[16438.512069]  filemap_fault+0x548/0xba0
[16438.515845]  __do_fault+0x32/0x90
[16438.519182]  do_fault+0x2a1/0x4a0
[16438.522519]  __handle_mm_fault+0x337/0x1ca0
[16438.526730]  handle_mm_fault+0x128/0x260
[16438.530677]  do_user_addr_fault+0x1d8/0x5b0
[16438.534889]  exc_page_fault+0x5e/0x100
[16438.538663]  asm_exc_page_fault+0x22/0x30
[16438.542697] RIP: 0033:0x55ebb60829a0
[16438.546298] Code: Unable to access opcode bytes at 0x55ebb6082976.
[16438.552519] RSP: 002b:00007fffd2a03658 EFLAGS: 00010246
[16438.557779] RAX: 0000000000000016 RBX: 000055ebdb88b410 RCX: 000055ebdb88b410
[16438.564963] RDX: 0000000000000ee4 RSI: 000055ec5b0b1e30 RDI: 000055ebdb88b910
[16438.572147] RBP: 0000000000000010 R08: 00000000ffffffff R09: 0000000000000000
[16438.579331] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000046
[16438.586514] R13: 000055ebb6099e6c R14: 000055ebb6099c93 R15: 00007fffd2a03706
[16438.593697]  </TASK>
[16438.595895] Modules linked in:
[16438.598970] CR2: 0000000000000008
[16438.602307] ---[ end trace 0000000000000000 ]---
Re: [PATCH] erofs: fix PSI memstall accounting
Posted by Gao Xiang 1 year, 2 months ago

On 2024/12/3 22:00, Max Kellermann wrote:
> On Tue, Dec 3, 2024 at 2:30 PM Max Kellermann <max.kellermann@ionos.com> wrote:
>> __builtin_return_address(1) instantly crashes the kernel
> 
> I was able to capture the __builtin_return_address crash on a serial
> console (this has nothing to do with the PSI memstall bug):

Ok.. so __builtin_return_address(1) doesn't actually work?

Thanks,
Gao Xiang

> 
> [16438.331677] BUG: kernel NULL pointer dereference, address: 0000000000000008
> [16438.338693] #PF: supervisor read access in kernel mode
> [16438.343866] #PF: error_code(0x0000) - not-present page
> [16438.349038] PGD 0 P4D 0
> [16438.351588] Oops: Oops: 0000 [#1] SMP PTI
> [16438.355625] CPU: 18 UID: 2147486501 PID: 23486 Comm: less Not
> tainted 6.11.10-cm4all4-hp+ #291
> [16438.364297] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380
> Gen10, BIOS U30 09/05/2019
> [16438.372880] RIP: 0010:psi_memstall_enter+0x7f/0xa0
> [16438.377708] Code: 89 df 80 8b 19 05 00 00 01 48 8b 45 08 48 89 83
> c0 08 00 00 48 8b 45 00 48 8b 40 08 48 89 83 c8 08 00 00 48 8b 45 00
> 48 8b 00 <48> 8b 40 08 48 89 83 d0 08 00 00 e8 d1 fe ff ff 4c 89 e7 e8
> 29 28
> [16438.396609] RSP: 0000:ffff9ee683bc7bd0 EFLAGS: 00010002
> [16438.401869] RAX: 0000000000000000 RBX: ffff90c70c291740 RCX: 0000000000000001
> [16438.409052] RDX: 000000000000000a RSI: 0000000000000000 RDI: ffff90c70c291740
> [16438.416234] RBP: ffff9ee683bc7be0 R08: 000000000007cec1 R09: 0000000000000007
> [16438.423418] R10: ffff90c727d83678 R11: 0000000000000000 R12: ffff9124bd0ae000
> [16438.430600] R13: 0000000000000014 R14: ffff9ee683bc7ce8 R15: 0000000000000000
> [16438.437785] FS:  00007ff15f1f4740(0000) GS:ffff9124bd080000(0000)
> knlGS:0000000000000000
> [16438.445929] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [16438.451714] CR2: 0000000000000008 CR3: 000000010c54a002 CR4: 00000000007706f0
> [16438.458896] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [16438.466079] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [16438.473263] PKRU: 55555554
> [16438.475984] Call Trace:
> [16438.478446]  <TASK>
> [16438.480557]  ? __die+0x1f/0x60
> [16438.483636]  ? page_fault_oops+0x15c/0x450
> [16438.487761]  ? exc_page_fault+0x5e/0x100
> [16438.491710]  ? asm_exc_page_fault+0x22/0x30
> [16438.495923]  ? psi_memstall_enter+0x7f/0xa0
> [16438.500135]  read_pages+0x205/0x220
> [16438.503647]  ? free_unref_page+0x2c1/0x420
> [16438.507771]  page_cache_ra_order+0x1d3/0x2b0
> [16438.512069]  filemap_fault+0x548/0xba0
> [16438.515845]  __do_fault+0x32/0x90
> [16438.519182]  do_fault+0x2a1/0x4a0
> [16438.522519]  __handle_mm_fault+0x337/0x1ca0
> [16438.526730]  handle_mm_fault+0x128/0x260
> [16438.530677]  do_user_addr_fault+0x1d8/0x5b0
> [16438.534889]  exc_page_fault+0x5e/0x100
> [16438.538663]  asm_exc_page_fault+0x22/0x30
> [16438.542697] RIP: 0033:0x55ebb60829a0
> [16438.546298] Code: Unable to access opcode bytes at 0x55ebb6082976.
> [16438.552519] RSP: 002b:00007fffd2a03658 EFLAGS: 00010246
> [16438.557779] RAX: 0000000000000016 RBX: 000055ebdb88b410 RCX: 000055ebdb88b410
> [16438.564963] RDX: 0000000000000ee4 RSI: 000055ec5b0b1e30 RDI: 000055ebdb88b910
> [16438.572147] RBP: 0000000000000010 R08: 00000000ffffffff R09: 0000000000000000
> [16438.579331] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000046
> [16438.586514] R13: 000055ebb6099e6c R14: 000055ebb6099c93 R15: 00007fffd2a03706
> [16438.593697]  </TASK>
> [16438.595895] Modules linked in:
> [16438.598970] CR2: 0000000000000008
> [16438.602307] ---[ end trace 0000000000000000 ]---

Re: [PATCH] erofs: fix PSI memstall accounting
Posted by Max Kellermann 1 year, 2 months ago
On Tue, Dec 3, 2024 at 9:15 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> But I guess you could record more frame addresses to get the caller
> of readahead_expand()? e.g. __builtin_return_address(1)?

ok, I'll try to figure that out.

> Also btw, is _RET_IP_ stable among all cases as readahead_expand()?

Yes, always the same address.