RE: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling

Luck, Tony posted 5 patches 9 months, 4 weeks ago
Posted by Luck, Tony 9 months, 4 weeks ago
> > We could, but I don't like it much. By taking the page offline from the relatively
> > kind environment of a regular interrupt, we often avoid taking a machine check
> > (which is an unfriendly environment for software).
>
> Right.
>
> > We could make the action in uc_decode_notifier() configurable. Default=off
> > but with a command line option to enable for systems that are stuck with
> > broadcast machine checks.
>
> So we can figure that out during boot - no need for yet another cmdline
> option.

Yup. I think the boot time test might be something like:

	// Enable UCNA offline for systems with broadcast machine check
	if (!(AMD || LMCE))
		mce_register_decode_chain(&mce_uc_nb);
>
> It still doesn't fix the race and I'd like to fix that instead, in the optimal
> case.
>
> But looking at Shuai's patch, I guess fixing the reporting is fine too - we
> need to fix the commit message to explain why this thing even happens.
>
> I.e., basically what you wrote and Shuai could use that explanation to write
> a commit message explaining what the situation is along with the background so
> that when we go back to this later, we will actually know what is going on.

Agreed. Shuai needs to harvest this thread to fill out the details in the commit
messages.

>
> But looking at
>
>   046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")
>
> That thing was trying to fix the same reporting fail. Why didn't it do that?
>
> Ooooh, now I see what the issue is. He doesn't want to kill the process which
> gets the wrong SIGBUS. Maybe the commit title should've said that:
>
>   mm/hwpoison: Do not send SIGBUS to processes with recovered clean pages
>
> or so.
>
> But how/why is that ok?
>
> Are we confident that
>
> +        * ret = 0 when poison page is a clean page and it's dropped, no
> +        * SIGBUS is needed.
>
> can *always* and *only* happen when there's a CMCI *and* a #MC race and the
> CMCI has won the race?

There are probably other races. Two CPUs both take local #MC on the same page
(maybe not all that rare in threaded processes ... or even with some hot code in 
a shared library).

> Can memory poison return 0 there too, for another reason and we end up *not
> killing* a process which we should have?
>
> Hmmm.

Hmmm indeed. Needs some thought. Though failing to kill a process likely means
it retries the access and comes right back to try again (without the race this time).

>
> > On Intel that would mean not registering the notifier at all. What about AMD?
> > Do you have similar races for MCE_DEFERRED_SEVERITY errors?
>
> Probably. Lemme ask around.
>
> > [1] Some OEMs still do not enable LMCE in their BIOS.
>
> Oh, ofc. Gotta love BIOS. They'll get the message when LMCE becomes obsolete,
> trust me.
>
> Are we force-enabling LMCE in this case when booting?

Linux tries to enable LMCE if it is supported, but BIOS has veto power.
See the bit in lmce_supported() that checks MSR_IA32_FEAT_CTL.

-Tony



Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
Posted by Borislav Petkov 9 months, 3 weeks ago
On Thu, Feb 20, 2025 at 05:50:14PM +0000, Luck, Tony wrote:
> Agreed. Shuai needs to harvest this thread to fill out the details in the commit
> messages.

Yap.

> There are probably other races. Two CPUs both take local #MC on the same page
> (maybe not all that rare in threaded processes ... or even with some hot code in 
> a shared library).

Yap, exactly. And I think there's nothing we can do - the hw is out there so
the sw needs to handle these cases correctly.

> Hmmm indeed. Needs some thought. Though failing to kill a process likely means
> it retries the access and comes right back to try again (without the race this time).

What happens if it fails to kill the process? It'll return to it, it'll try to
touch the faulty memory and raise another #MC? Right, I think so.


> > > On Intel that would mean not registering the notifier at all. What about AMD?
> > > Do you have similar races for MCE_DEFERRED_SEVERITY errors?
> >
> > Probably. Lemme ask around.

After talking to folks internally, yeah, I think we'll probably have a similar
thing. Haven't seen it happen yet.

> Linux tries to enable if LMCE is supported, but BIOS has veto power.
> See the bit in lmce_supported() that checks MSR_IA32_FEAT_CTL

I'm trying to educate our hw folks to not rely on OEM BIOS if possible. For
every chance I get. Otherwise you get crap like that and this is never getting
better.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
Posted by Shuai Xue 9 months, 4 weeks ago

On 2025/2/21 01:50, Luck, Tony wrote:
>>> We could, but I don't like it much. By taking the page offline from the relatively
>>> kind environment of a regular interrupt, we often avoid taking a machine check
>>> (which is an unfriendly environment for software).
>>
>> Right.
>>
>>> We could make the action in uc_decode_notifier() configurable. Default=off
>>> but with a command line option to enable for systems that are stuck with
>>> broadcast machine checks.
>>
>> So we can figure that out during boot - no need for yet another cmdline
>> option.
> 
> Yup. I think the boot time test might be something like:
> 
> 	// Enable UCNA offline for systems with broadcast machine check
> 	if (!(AMD || LMCE))
> 		mce_register_decode_chain(&mce_uc_nb);
>>
>> It still doesn't fix the race and I'd like to fix that instead, in the optimal
>> case.
>>
>> But looking at Shuai's patch, I guess fixing the reporting is fine too - we
>> need to fix the commit message to explain why this thing even happens.
>>
>> I.e., basically what you wrote and Shuai could use that explanation to write
>> a commit message explaining what the situation is along with the background so
>> that when we go back to this later, we will actually know what is going on.
> 
> Agreed. Shuai needs to harvest this thread to fill out the details in the commit
> messages.

Sure, I'd like to add more background details using Tony's explanation.

> 
>>
>> But looking at
>>
>>    046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")
>>
>> That thing was trying to fix the same reporting fail. Why didn't it do that?
>>
>> Ooooh, now I see what the issue is. He doesn't want to kill the process which
>> gets the wrong SIGBUS. Maybe the commit title should've said that:
>>
>>    mm/hwpoison: Do not send SIGBUS to processes with recovered clean pages
>>
>> or so.
>>
>> But how/why is that ok?
>>
>> Are we confident that
>>
>> +        * ret = 0 when poison page is a clean page and it's dropped, no
>> +        * SIGBUS is needed.
>>
>> can *always* and *only* happen when there's a CMCI *and* a #MC race and the
>> CMCI has won the race?
> 
> There are probably other races. Two CPUs both take local #MC on the same page
> (maybe not all that rare in threaded processes ... or even with some hot code in
> a shared library).
> 
>> Can memory poison return 0 there too, for another reason and we end up *not
>> killing* a process which we should have?
>>
>> Hmmm.
> 
> Hmmm indeed. Needs some thought. Though failing to kill a process likely means
> it retries the access and comes right back to try again (without the race this time).
> 

Emmm, if two threaded processes consume the same poisoned data, three CPUs may
race: two take a local #MC on the same page and one takes a CMCI. For example:

#perf script
kworker/48:1-mm 25516 [048]  1713.893549: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa25aa93 uc_decode_notifier+0x73 ([kernel.kallsyms])
         ffffffffaa3068bb notifier_call_chain+0x5b ([kernel.kallsyms])
         ffffffffaa306ae1 blocking_notifier_call_chain+0x41 ([kernel.kallsyms])
         ffffffffaa25bbfe mce_gen_pool_process+0x3e ([kernel.kallsyms])
         ffffffffaa2f455f process_one_work+0x19f ([kernel.kallsyms])
         ffffffffaa2f509c worker_thread+0x20c ([kernel.kallsyms])
         ffffffffaa2fec89 kthread+0xd9 ([kernel.kallsyms])
         ffffffffaa245131 ret_from_fork+0x31 ([kernel.kallsyms])
         ffffffffaa2076ca ret_from_fork_asm+0x1a ([kernel.kallsyms])

einj_mem_uc 44530 [184]  1713.908089: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 44531 [089]  1713.916319: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

It seems to complicate the issue further.

IMHO, we should focus on three main points:

- kill_accessing_process() is only called when the flags are set to
   MF_ACTION_REQUIRED, which means it is in the MCE path.
- Whether the page is clean determines the behavior of try_to_unmap. For a
   dirty page, try_to_unmap uses TTU_HWPOISON to unmap the PTE and convert the
   PTE entry to a swap entry. For a clean page, try_to_unmap uses ~TTU_HWPOISON
   and simply unmaps the PTE.
- When does walk_page_range() with hwpoison_walk_ops return 1?
   1. If the poison page still exists, we should of course kill the current
      process.
   2. If the poison page does not exist, but is_hwpoison_entry is true, meaning
      it is a dirty page, we should also kill the current process.
   3. Otherwise, it returns 0, which means the page is clean.


Thanks.
Shuai
Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
Posted by Borislav Petkov 9 months, 3 weeks ago
On Fri, Feb 21, 2025 at 02:05:28PM +0800, Shuai Xue wrote:
> #perf script
> kworker/48:1-mm 25516 [048]  1713.893549: probe:memory_failure: (ffffffffaa622db4)
>         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
>         ffffffffaa25aa93 uc_decode_notifier+0x73 ([kernel.kallsyms])
>         ffffffffaa3068bb notifier_call_chain+0x5b ([kernel.kallsyms])
>         ffffffffaa306ae1 blocking_notifier_call_chain+0x41 ([kernel.kallsyms])
>         ffffffffaa25bbfe mce_gen_pool_process+0x3e ([kernel.kallsyms])
>         ffffffffaa2f455f process_one_work+0x19f ([kernel.kallsyms])
>         ffffffffaa2f509c worker_thread+0x20c ([kernel.kallsyms])
>         ffffffffaa2fec89 kthread+0xd9 ([kernel.kallsyms])
>         ffffffffaa245131 ret_from_fork+0x31 ([kernel.kallsyms])
>         ffffffffaa2076ca ret_from_fork_asm+0x1a ([kernel.kallsyms])
> 
> einj_mem_uc 44530 [184]  1713.908089: probe:memory_failure: (ffffffffaa622db4)
>         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
>         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
>         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
>         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
>         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
>         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
>                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)
> 
> einj_mem_uc 44531 [089]  1713.916319: probe:memory_failure: (ffffffffaa622db4)
>         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
>         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
>         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
>         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
>         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
>         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
>                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

What are those stack traces supposed to say?

Two processes are injecting, cause a #MC and a kworker gets to handle the UC?

All injecting to the same page?

What's the upper limit on CPUs seeing the same hw error and all raising
a CMCI/#MC?

> - kill_accessing_process() is only called when the flags are set to
>   MF_ACTION_REQUIRED, which means it is in the MCE path.
> - Whether the page is clean determines the behavior of try_to_unmap. For a
>   dirty page, try_to_unmap uses TTU_HWPOISON to unmap the PTE and convert the
>   PTE entry to a swap entry. For a clean page, try_to_unmap uses ~TTU_HWPOISON
>   and simply unmaps the PTE.
> - When does walk_page_range() with hwpoison_walk_ops return 1?
>   1. If the poison page still exists, we should of course kill the current
>      process.
>   2. If the poison page does not exist, but is_hwpoison_entry is true, meaning
>      it is a dirty page, we should also kill the current process, too.
>   3. Otherwise, it returns 0, which means the page is clean.

I think you're too deep into detail. What I'd do is step back, think what
would be the *proper* recovery action and then make sure memory_failure does
that. If it doesn't - fix it to do so.

So, what should really happen wrt recovery action if any number of CPUs see
the same memory error?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
Posted by Shuai Xue 9 months, 3 weeks ago

On 2025/2/25 06:01, Borislav Petkov wrote:
> On Fri, Feb 21, 2025 at 02:05:28PM +0800, Shuai Xue wrote:
>> #perf script
>> kworker/48:1-mm 25516 [048]  1713.893549: probe:memory_failure: (ffffffffaa622db4)
>>          ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
>>          ffffffffaa25aa93 uc_decode_notifier+0x73 ([kernel.kallsyms])
>>          ffffffffaa3068bb notifier_call_chain+0x5b ([kernel.kallsyms])
>>          ffffffffaa306ae1 blocking_notifier_call_chain+0x41 ([kernel.kallsyms])
>>          ffffffffaa25bbfe mce_gen_pool_process+0x3e ([kernel.kallsyms])
>>          ffffffffaa2f455f process_one_work+0x19f ([kernel.kallsyms])
>>          ffffffffaa2f509c worker_thread+0x20c ([kernel.kallsyms])
>>          ffffffffaa2fec89 kthread+0xd9 ([kernel.kallsyms])
>>          ffffffffaa245131 ret_from_fork+0x31 ([kernel.kallsyms])
>>          ffffffffaa2076ca ret_from_fork_asm+0x1a ([kernel.kallsyms])
>>
>> einj_mem_uc 44530 [184]  1713.908089: probe:memory_failure: (ffffffffaa622db4)
>>          ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
>>          ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
>>          ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
>>          ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
>>          ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
>>          ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
>>                    405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)
>>
>> einj_mem_uc 44531 [089]  1713.916319: probe:memory_failure: (ffffffffaa622db4)
>>          ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
>>          ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
>>          ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
>>          ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
>>          ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
>>          ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
>>                    405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)
> 
> What are those stack traces supposed to say?
> 
> Two processes are injecting, cause a #MC and a kworker gets to handle the UC?
> 
> All injecting to the same page?

Yes, I inject poison into a page and create two threads with pthread_create(),
both of which read the same poisoned page.

> 
> What's the upper limit on CPUs seeing the same hw error and all raising
> a CMCI/#MC?

It depends on how many processes try to read the poison.

> 
>> - kill_accessing_process() is only called when the flags are set to
>>    MF_ACTION_REQUIRED, which means it is in the MCE path.
>> - Whether the page is clean determines the behavior of try_to_unmap. For a
>>    dirty page, try_to_unmap uses TTU_HWPOISON to unmap the PTE and convert the
>>    PTE entry to a swap entry. For a clean page, try_to_unmap uses ~TTU_HWPOISON
>>    and simply unmaps the PTE.
>> - When does walk_page_range() with hwpoison_walk_ops return 1?
>>    1. If the poison page still exists, we should of course kill the current
>>       process.
>>    2. If the poison page does not exist, but is_hwpoison_entry is true, meaning
>>       it is a dirty page, we should also kill the current process, too.
>>    3. Otherwise, it returns 0, which means the page is clean.
> 
> I think you're too deep into detail. What I'd do is step back, think what
> would be the *proper* recovery action and then make sure memory_failure does
> that. If it doesn't - fix it to do so.
> 
> So, what should really happen wrt recovery action if any number of CPUs see
> the same memory error?
> 

IMHO, we should send a SIGBUS signal to the processes running on the CPUs that
detect a memory error on a dirty page, which is the current behavior of
memory_failure().

Thanks
Shuai
Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
Posted by Borislav Petkov 9 months, 3 weeks ago
On Tue, Feb 25, 2025 at 09:51:25AM +0800, Shuai Xue wrote:
> It depends on how many processes try to read the poison.

And? Can you try creating more processes and see what happens then?

> IMHO, we should send a SIGBUS signal to the processes running on the CPUs that
> detect a memory error on a dirty page, which is the current behavior of
> memory_failure().

And for all those other processes which do get to see the already
poisoned/clean page, they should continue on their merry way instead of
getting killed by a SIGBUS?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
Posted by Shuai Xue 9 months, 2 weeks ago

On 2025/2/28 20:35, Borislav Petkov wrote:
> On Tue, Feb 25, 2025 at 09:51:25AM +0800, Shuai Xue wrote:
>> It depends on how many processes try to read the poison.
> 
> And? Can you try creating more processes and see what happens then?
> 

Sure.

The experimental model includes:

1. inject a UE into a memory buffer
2. create 10 processes
3. all 10 processes read the poisoned buffer
4. 10 MCEs and 1 UCNA will be triggered
5. each process receives a SIGBUS

Some details:

#perf record -e probe:memory_failure -agR -- ./einj_mem_uc thread
0: thread   vaddr = 0x7f65f08da400 paddr = 82702ec400
injecting ...
>> trigger_thread
>> trigger_thread
>> trigger_thread
>> trigger_thread
>> trigger_thread
>> trigger_thread
>> trigger_thread
>> trigger_thread
>> trigger_thread
>> trigger_thread
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
Test passed
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.640 MB perf.data (11 samples) ]


#perf script
einj_mem_uc 1722254 [151] 695128.161644: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722255 [014] 695128.161712: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722256 [153] 695128.161716: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722257 [124] 695128.161759: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722258 [154] 695128.161782: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722259 [026] 695128.161819: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722260 [157] 695128.161852: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722261 [158] 695128.161895: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

kworker/50:3-mm 1714430 [050] 695128.168736: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa25aa93 uc_decode_notifier+0x73 ([kernel.kallsyms])
         ffffffffaa3068bb notifier_call_chain+0x5b ([kernel.kallsyms])
         ffffffffaa306ae1 blocking_notifier_call_chain+0x41 ([kernel.kallsyms])
         ffffffffaa25bbfe mce_gen_pool_process+0x3e ([kernel.kallsyms])
         ffffffffaa2f455f process_one_work+0x19f ([kernel.kallsyms])
         ffffffffaa2f509c worker_thread+0x20c ([kernel.kallsyms])
         ffffffffaa2fec89 kthread+0xd9 ([kernel.kallsyms])
         ffffffffaa245131 ret_from_fork+0x31 ([kernel.kallsyms])
         ffffffffaa2076ca ret_from_fork_asm+0x1a ([kernel.kallsyms])

einj_mem_uc 1722252 [050] 695128.183025: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722253 [051] 695128.191348: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

>> IMHO, we should send a SIGBUS signal to the processes running on the CPUs that
>> detect a memory error on a dirty page, which is the current behavior of
>> memory_failure().
> 
> And for all those other processes which do get to see the already
> poisoned/clean page, they should continue on their merry way instead of
> getting killed by a SIGBUS?
> 

Yes, memory_failure() only sends a SIGBUS signal to the process that
is actively reading a poisoned page. Other processes that share the
poisoned page will not receive a SIGBUS signal unless they have the
PF_MCE_EARLY flag set.[1]

[1]https://lkml.kernel.org/r/20220218090118.1105-4-linmiaohe@huawei.com

Thanks.
Shuai