In the event that a guest process attempts to access memory that has
been poisoned in response to a deferred uncorrected MCE, an AMD system
will currently generate a SIGBUS error which will result in the entire
guest being shut down. Ideally, we only want to kill the guest process
that accessed poisoned memory in this case.

This support has been included in qemu for Intel hosts for a long time,
but there are a couple of changes needed for AMD hosts. First, we will
need to expose the SUCCOR cpuid bit to guests. Second, we need to modify
the MCE injection code to avoid Intel-specific behavior when we are
running on an AMD host.

v2:
  - Add "succor" feature word.
  - Add case to kvm_arch_get_supported_cpuid for the SUCCOR feature.

v3:
  - Reorder series. Only enable SUCCOR after bugs have been fixed.
  - Introduce new patch ignoring AO errors.

v4:
  - Remove redundant check for AO errors.

John Allen (2):
  i386: Fix MCE support for AMD hosts
  i386: Add support for SUCCOR feature

William Roche (1):
  i386: Explicitly ignore unsupported BUS_MCEERR_AO MCE on AMD guest

 target/i386/cpu.c     | 18 +++++++++++++++++-
 target/i386/cpu.h     |  4 ++++
 target/i386/helper.c  |  4 ++++
 target/i386/kvm/kvm.c | 28 ++++++++++++++++++++--------
 4 files changed, 45 insertions(+), 9 deletions(-)

-- 
2.39.3
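
For readers following along, here is a minimal sketch of the two ideas in the
cover letter: testing the SUCCOR bit and skipping "action optional" errors for
AMD guests. It is not code from the series. The helper names and the SKETCH_*
constants are hypothetical, and the rule that AO errors are still forwarded to
non-AMD guests is an assumption drawn from the patch titles. The bit position
(CPUID Fn8000_0007 EBX, bit 1) and the si_code values 4 and 5 match the AMD
manual and the Linux definitions of BUS_MCEERR_AR and BUS_MCEERR_AO.

```c
#include <stdbool.h>
#include <stdint.h>

/* SUCCOR is reported in CPUID Fn8000_0007 EBX, bit 1. */
#define CPUID_8000_0007_EBX_SUCCOR (1U << 1)

/*
 * Linux si_code values for SIGBUS memory errors, redefined locally so
 * the sketch builds without relying on <signal.h> feature macros.
 * The SKETCH_ prefix marks them as illustrative names.
 */
enum sketch_sigbus_code {
    SKETCH_BUS_MCEERR_AR = 4, /* action required: poison was consumed */
    SKETCH_BUS_MCEERR_AO = 5, /* action optional: poison was only found */
};

/* Return true if the given EBX value advertises SUCCOR. */
static bool ebx_has_succor(uint32_t ebx)
{
    return (ebx & CPUID_8000_0007_EBX_SUCCOR) != 0;
}

/*
 * Hypothetical policy helper: decide whether a SIGBUS memory error
 * should be turned into a guest MCE. AR errors are always forwarded;
 * AO errors are dropped for AMD guests (assumption based on the
 * "ignore unsupported BUS_MCEERR_AO" patch title).
 */
static bool should_inject_mce(bool amd_guest, int si_code)
{
    if (si_code == SKETCH_BUS_MCEERR_AR) {
        return true;
    }
    if (si_code == SKETCH_BUS_MCEERR_AO) {
        return !amd_guest;
    }
    return false; /* not a memory-error SIGBUS */
}
```

Keeping the decision in a small pure function like this makes the AMD versus
Intel difference easy to test in isolation; the real series makes the change
inside the existing injection path in target/i386/kvm/kvm.c.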
On 12/09/2023 22:18, John Allen wrote:
> In the event that a guest process attempts to access memory that has
> been poisoned in response to a deferred uncorrected MCE, an AMD system
> will currently generate a SIGBUS error which will result in the entire
> guest being shut down. Ideally, we only want to kill the guest process
> that accessed poisoned memory in this case.
>
> This support has been included in qemu for Intel hosts for a long time,
> but there are a couple of changes needed for AMD hosts. First, we will
> need to expose the SUCCOR cpuid bit to guests. Second, we need to modify
> the MCE injection code to avoid Intel-specific behavior when we are
> running on an AMD host.
>

Is there any update with respect to this series?

John's series should fix MCE injection on AMD; as of today it is just crashing
the guest (sadly) when an MCE happens in the hypervisor.

William, Paolo, I think the sort-of-dependency(?) of this, where we block
migration if there was a poisoned page, is already in Peter's migration
tree[1] (CC'ed). So perhaps this series just needs John to resend it, given
that it's been a couple of months since v4?

[1] https://lore.kernel.org/qemu-devel/20240130190640.139364-2-william.roche@oracle.com/

> v2:
>   - Add "succor" feature word.
>   - Add case to kvm_arch_get_supported_cpuid for the SUCCOR feature.
>
> v3:
>   - Reorder series. Only enable SUCCOR after bugs have been fixed.
>   - Introduce new patch ignoring AO errors.
>
> v4:
>   - Remove redundant check for AO errors.
>
> John Allen (2):
>   i386: Fix MCE support for AMD hosts
>   i386: Add support for SUCCOR feature
>
> William Roche (1):
>   i386: Explicitly ignore unsupported BUS_MCEERR_AO MCE on AMD guest
>
>  target/i386/cpu.c     | 18 +++++++++++++++++-
>  target/i386/cpu.h     |  4 ++++
>  target/i386/helper.c  |  4 ++++
>  target/i386/kvm/kvm.c | 28 ++++++++++++++++++++--------
>  4 files changed, 45 insertions(+), 9 deletions(-)
>
On Wed, Feb 07, 2024 at 11:21:05AM +0000, Joao Martins wrote:
> On 12/09/2023 22:18, John Allen wrote:
> > In the event that a guest process attempts to access memory that has
> > been poisoned in response to a deferred uncorrected MCE, an AMD system
> > will currently generate a SIGBUS error which will result in the entire
> > guest being shut down. Ideally, we only want to kill the guest process
> > that accessed poisoned memory in this case.
> >
> > This support has been included in qemu for Intel hosts for a long time,
> > but there are a couple of changes needed for AMD hosts. First, we will
> > need to expose the SUCCOR cpuid bit to guests. Second, we need to modify
> > the MCE injection code to avoid Intel-specific behavior when we are
> > running on an AMD host.
> >
>
> Is there any update with respect to this series?
>
> John's series should fix MCE injection on AMD; as of today it is just crashing
> the guest (sadly) when an MCE happens in the hypervisor.
>
> William, Paolo, I think the sort-of-dependency(?) of this, where we block
> migration if there was a poisoned page, is already in Peter's migration
> tree[1] (CC'ed). So perhaps this series just needs John to resend it, given
> that it's been a couple of months since v4?

It looks like this series still applies cleanly to latest qemu, but I
can resend if needed.

Thanks,
John

>
> [1]
> https://lore.kernel.org/qemu-devel/20240130190640.139364-2-william.roche@oracle.com/
>
> > v2:
> >   - Add "succor" feature word.
> >   - Add case to kvm_arch_get_supported_cpuid for the SUCCOR feature.
> >
> > v3:
> >   - Reorder series. Only enable SUCCOR after bugs have been fixed.
> >   - Introduce new patch ignoring AO errors.
> >
> > v4:
> >   - Remove redundant check for AO errors.
> >
> > John Allen (2):
> >   i386: Fix MCE support for AMD hosts
> >   i386: Add support for SUCCOR feature
> >
> > William Roche (1):
> >   i386: Explicitly ignore unsupported BUS_MCEERR_AO MCE on AMD guest
> >
> >  target/i386/cpu.c     | 18 +++++++++++++++++-
> >  target/i386/cpu.h     |  4 ++++
> >  target/i386/helper.c  |  4 ++++
> >  target/i386/kvm/kvm.c | 28 ++++++++++++++++++++--------
> >  4 files changed, 45 insertions(+), 9 deletions(-)
> >
>
On 20/02/2024 17:27, John Allen wrote:
> On Wed, Feb 07, 2024 at 11:21:05AM +0000, Joao Martins wrote:
>> On 12/09/2023 22:18, John Allen wrote:
>>> In the event that a guest process attempts to access memory that has
>>> been poisoned in response to a deferred uncorrected MCE, an AMD system
>>> will currently generate a SIGBUS error which will result in the entire
>>> guest being shut down. Ideally, we only want to kill the guest process
>>> that accessed poisoned memory in this case.
>>>
>>> This support has been included in qemu for Intel hosts for a long time,
>>> but there are a couple of changes needed for AMD hosts. First, we will
>>> need to expose the SUCCOR cpuid bit to guests. Second, we need to modify
>>> the MCE injection code to avoid Intel-specific behavior when we are
>>> running on an AMD host.
>>>
>>
>> Is there any update with respect to this series?
>>
>> John's series should fix MCE injection on AMD; as of today it is just crashing
>> the guest (sadly) when an MCE happens in the hypervisor.
>>
>> William, Paolo, I think the sort-of-dependency(?) of this, where we block
>> migration if there was a poisoned page, is already in Peter's migration
>> tree[1] (CC'ed). So perhaps this series just needs John to resend it, given
>> that it's been a couple of months since v4?
>
> It looks like this series still applies cleanly to latest qemu, but I
> can resend if needed.
>

That's great, I suppose. I was hoping Paolo would respond, so that we
understand the next steps. There's also the other kernel patch that Paolo
suggested[0], to declare the SUCCOR bit in the KVM supported CPUID? Maybe
the series is being held up because of that?

[0] https://lore.kernel.org/qemu-devel/d4c1bb9b-8438-ed00-c79d-e8ad2a7e4eed@redhat.com/

> Thanks,
> John
>
>>
>> [1]
>> https://lore.kernel.org/qemu-devel/20240130190640.139364-2-william.roche@oracle.com/
>>
>>> v2:
>>>   - Add "succor" feature word.
>>>   - Add case to kvm_arch_get_supported_cpuid for the SUCCOR feature.
>>>
>>> v3:
>>>   - Reorder series. Only enable SUCCOR after bugs have been fixed.
>>>   - Introduce new patch ignoring AO errors.
>>>
>>> v4:
>>>   - Remove redundant check for AO errors.
>>>
>>> John Allen (2):
>>>   i386: Fix MCE support for AMD hosts
>>>   i386: Add support for SUCCOR feature
>>>
>>> William Roche (1):
>>>   i386: Explicitly ignore unsupported BUS_MCEERR_AO MCE on AMD guest
>>>
>>>  target/i386/cpu.c     | 18 +++++++++++++++++-
>>>  target/i386/cpu.h     |  4 ++++
>>>  target/i386/helper.c  |  4 ++++
>>>  target/i386/kvm/kvm.c | 28 ++++++++++++++++++++--------
>>>  4 files changed, 45 insertions(+), 9 deletions(-)
>>>
>>
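
As a rough illustration of why the kernel-side SUCCOR declaration matters:
a guest feature can generally only be enabled if the hypervisor also reports
it as supported. The sketch below shows that masking step for a single
feature word. It is an assumption-laden illustration, not QEMU or KVM code;
the struct and function names are hypothetical, and only the two bit
positions in CPUID Fn8000_0007 EBX come from the AMD manual.

```c
#include <stdint.h>

/* CPUID Fn8000_0007 EBX bits relevant to machine-check recovery. */
#define CPUID_8000_0007_EBX_OVERFLOW_RECOV (1U << 0)
#define CPUID_8000_0007_EBX_SUCCOR         (1U << 1)

/* Hypothetical result type: what is exposed and what had to be dropped. */
struct feature_filter_result {
    uint32_t enabled;     /* requested bits the hypervisor supports */
    uint32_t unavailable; /* requested bits the hypervisor lacks */
};

/*
 * Mask the requested feature bits with the supported set. If the
 * kernel does not declare SUCCOR as supported, a requested SUCCOR bit
 * ends up in 'unavailable' unless userspace special-cases it.
 */
static struct feature_filter_result
filter_feature_word(uint32_t requested, uint32_t supported)
{
    struct feature_filter_result r = {
        .enabled = requested & supported,
        .unavailable = requested & ~supported,
    };
    return r;
}
```

With a supported mask of 0, a requested SUCCOR bit lands in `unavailable`,
which is the situation the v2 changelog entry about
kvm_arch_get_supported_cpuid appears to work around on the QEMU side.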