arch/arm64/kvm/arm.c | 51 +++++++----- arch/mips/kvm/mips.c | 37 ++++++++- arch/riscv/kvm/vcpu.c | 44 +++++++---- arch/s390/include/asm/entry-common.h | 10 +++ arch/s390/include/asm/kvm_host.h | 3 + arch/s390/kvm/kvm-s390.c | 49 +++++++++--- arch/s390/kvm/vsie.c | 17 ++-- arch/x86/kvm/svm/svm.c | 4 +- arch/x86/kvm/vmx/vmx.c | 4 +- arch/x86/kvm/x86.c | 4 +- arch/x86/kvm/x86.h | 45 ----------- include/linux/entry-common.h | 16 ++++ include/linux/kvm_host.h | 112 ++++++++++++++++++++++++++- kernel/entry/common.c | 3 +- 14 files changed, 286 insertions(+), 113 deletions(-)
Several architectures have latent bugs around guest entry/exit. This
series addresses those for:
arm64, mips, riscv, s390, x86
However, I'm not sure how to address powerpc and could do with some help
there. I have build-tested the arm64, mips, riscv, s390, and x86 cases,
but I don't have a suitable HW setup to test these, so any review and/or
testing would be much appreciated.
Issues include:
1) Several architectures enable interrupts between guest_enter() and
guest_exit(). As this period is an RCU extended quiescent state (EQS)
this is unsound unless the irq entry code explicitly wakes RCU, which
most architectures only do for entry from usersapce or idle.
I believe this affects: arm64, riscv, s390
I am not sure about powerpc.
2) Several architectures permit instrumentation of code between
guest_enter() and guest_exit(), e.g. KASAN, KCOV, KCSAN, etc. As
instrumentation may directly o indirectly use RCU, this has the same
problems as with interrupts.
I believe this affects: arm64, mips, powerpc, riscv, s390
3) Several architectures do not inform lockdep and tracing that
interrupts are enabled during the execution of the guest, or do so in
an incorrect order. Generally this means that logs will report IRQs
being masked for much longer than is actually the case, which is not
ideal for debugging. I don't know whether this affects the
correctness of lockdep.
I believe this affects: arm64, mips, powerpc, riscv, s390
This was previously fixed for x86 specifically in a series of commits:
87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
But other architectures were left broken, and the infrastructure for
handling this correctly is x86-specific.
This series introduces generic helper functions which can be used to
handle the problems above, and migrates architectures over to these,
fixing the latent issues. For s390, where the KVM guest EQS is
interruptible, I've added infrastructure to wake RCU during this EQS.
Since v1 [1]:
* Add arch_in_rcu_eqs()
* Convert s390
* Rename exit_to_guest_mode() -> guest_state_enter_irqoff()
* Rename enter_from_guest_mode() -> guest_state_exit_irqoff()
* Various commit message cleanups
I've pushed the series (based on v5.16) to my kvm/entry-rework branch:
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kvm/entry-rework
git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git kvm/entry-rework
... with this version tagged as kvm-entry-rework-20210119.
[1] https://lore.kernel.org/r/20220111153539.2532246-1-mark.rutland@arm.com/
Thanks,
Mark.
Mark Rutland (7):
entry: add arch_in_rcu_eqs()
kvm: add guest_state_{enter,exit}_irqoff()
kvm/arm64: rework guest entry logic
kvm/mips: rework guest entry logic
kvm/riscv: rework guest entry logic
kvm/s390: rework guest entry logic
kvm/x86: rework guest entry logic
arch/arm64/kvm/arm.c | 51 +++++++-----
arch/mips/kvm/mips.c | 37 ++++++++-
arch/riscv/kvm/vcpu.c | 44 +++++++----
arch/s390/include/asm/entry-common.h | 10 +++
arch/s390/include/asm/kvm_host.h | 3 +
arch/s390/kvm/kvm-s390.c | 49 +++++++++---
arch/s390/kvm/vsie.c | 17 ++--
arch/x86/kvm/svm/svm.c | 4 +-
arch/x86/kvm/vmx/vmx.c | 4 +-
arch/x86/kvm/x86.c | 4 +-
arch/x86/kvm/x86.h | 45 -----------
include/linux/entry-common.h | 16 ++++
include/linux/kvm_host.h | 112 ++++++++++++++++++++++++++-
kernel/entry/common.c | 3 +-
14 files changed, 286 insertions(+), 113 deletions(-)
--
2.30.2
Am 19.01.22 um 11:58 schrieb Mark Rutland:
CCing new emails for Anup and Atish so that they are aware of this thread.
> Several architectures have latent bugs around guest entry/exit. This
> series addresses those for:
>
> arm64, mips, riscv, s390, x86
>
> However, I'm not sure how to address powerpc and could do with some help
> there. I have build-tested the arm64, mips, riscv, s390, and x86 cases,
> but I don't have a suitable HW setup to test these, so any review and/or
> testing would be much appreciated.
>
> Issues include:
>
> 1) Several architectures enable interrupts between guest_enter() and
> guest_exit(). As this period is an RCU extended quiescent state (EQS)
> this is unsound unless the irq entry code explicitly wakes RCU, which
> most architectures only do for entry from usersapce or idle.
>
> I believe this affects: arm64, riscv, s390
>
> I am not sure about powerpc.
>
> 2) Several architectures permit instrumentation of code between
> guest_enter() and guest_exit(), e.g. KASAN, KCOV, KCSAN, etc. As
> instrumentation may directly o indirectly use RCU, this has the same
> problems as with interrupts.
>
> I believe this affects: arm64, mips, powerpc, riscv, s390
>
> 3) Several architectures do not inform lockdep and tracing that
> interrupts are enabled during the execution of the guest, or do so in
> an incorrect order. Generally this means that logs will report IRQs
> being masked for much longer than is actually the case, which is not
> ideal for debugging. I don't know whether this affects the
> correctness of lockdep.
>
> I believe this affects: arm64, mips, powerpc, riscv, s390
>
> This was previously fixed for x86 specifically in a series of commits:
>
> 87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
> 0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
> 9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
> 3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
> 135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
> 160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
> bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
>
> But other architectures were left broken, and the infrastructure for
> handling this correctly is x86-specific.
>
> This series introduces generic helper functions which can be used to
> handle the problems above, and migrates architectures over to these,
> fixing the latent issues. For s390, where the KVM guest EQS is
> interruptible, I've added infrastructure to wake RCU during this EQS.
>
> Since v1 [1]:
> * Add arch_in_rcu_eqs()
> * Convert s390
> * Rename exit_to_guest_mode() -> guest_state_enter_irqoff()
> * Rename enter_from_guest_mode() -> guest_state_exit_irqoff()
> * Various commit message cleanups
>
> I've pushed the series (based on v5.16) to my kvm/entry-rework branch:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kvm/entry-rework
> git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git kvm/entry-rework
>
> ... with this version tagged as kvm-entry-rework-20210119.
>
> [1] https://lore.kernel.org/r/20220111153539.2532246-1-mark.rutland@arm.com/
>
> Thanks,
> Mark.
>
> Mark Rutland (7):
> entry: add arch_in_rcu_eqs()
> kvm: add guest_state_{enter,exit}_irqoff()
> kvm/arm64: rework guest entry logic
> kvm/mips: rework guest entry logic
> kvm/riscv: rework guest entry logic
> kvm/s390: rework guest entry logic
> kvm/x86: rework guest entry logic
>
> arch/arm64/kvm/arm.c | 51 +++++++-----
> arch/mips/kvm/mips.c | 37 ++++++++-
> arch/riscv/kvm/vcpu.c | 44 +++++++----
> arch/s390/include/asm/entry-common.h | 10 +++
> arch/s390/include/asm/kvm_host.h | 3 +
> arch/s390/kvm/kvm-s390.c | 49 +++++++++---
> arch/s390/kvm/vsie.c | 17 ++--
> arch/x86/kvm/svm/svm.c | 4 +-
> arch/x86/kvm/vmx/vmx.c | 4 +-
> arch/x86/kvm/x86.c | 4 +-
> arch/x86/kvm/x86.h | 45 -----------
> include/linux/entry-common.h | 16 ++++
> include/linux/kvm_host.h | 112 ++++++++++++++++++++++++++-
> kernel/entry/common.c | 3 +-
> 14 files changed, 286 insertions(+), 113 deletions(-)
>
I just gave this a spin on s390 with debugging on and I got the following:
[ 457.151295] ------------[ cut here ]------------
[ 457.151311] WARNING: CPU: 14 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter.constprop.0+0xf8/0x118
[ 457.151324] Modules linked in: vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nft_compat nf_nat_tftp nft_objref nf_conntrack_tftp nft_counter kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ib_uverbs s390_trng ib_core genwqe_card crc_itu_t vfio_ccw mdev vfio_iommu_type1 eadm_sch vfio zcrypt_cex4 sch_fq_codel configfs ip_tables x_tables mlx5_core ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4
[ 457.151422] CPU: 14 PID: 0 Comm: swapper/14 Not tainted 5.16.0-00007-g89e9021389e2 #3
[ 457.151428] Hardware name: IBM 3906 M04 704 (LPAR)
[ 457.151432] Krnl PSW : 0404d00180000000 00000000a7c0495c (rcu_eqs_enter.constprop.0+0xfc/0x118)
[ 457.151440] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
[ 457.151445] Krnl GPRS: ffffffffebd81d31 4000000000000000 0000000000000070 00000000a7fd7024
[ 457.151450] 0000000000000000 0000000000000001 0000000000000000 0000000000000000
[ 457.151454] 000000000000000e 000000000000000e 00000000a84d3a88 0000001fd8645c00
[ 457.151458] 0000000000000000 0000000000000000 00000000a7c04882 0000038000653dc0
[ 457.151468] Krnl Code: 00000000a7c0494c: ebaff0a00004 lmg %r10,%r15,160(%r15)
00000000a7c04952: c0f4fffffef7 brcl 15,00000000a7c04740
#00000000a7c04958: af000000 mc 0,0
>00000000a7c0495c: a7f4ffa3 brc 15,00000000a7c048a2
00000000a7c04960: c0e500003f70 brasl %r14,00000000a7c0c840
00000000a7c04966: a7f4ffcd brc 15,00000000a7c04900
00000000a7c0496a: c0e500003f6b brasl %r14,00000000a7c0c840
00000000a7c04970: a7f4ffde brc 15,00000000a7c0492c
[ 457.151527] Call Trace:
[ 457.151530] [<00000000a7c0495c>] rcu_eqs_enter.constprop.0+0xfc/0x118
[ 457.151536] ([<00000000a7c04882>] rcu_eqs_enter.constprop.0+0x22/0x118)
[ 457.151540] [<00000000a7c14cd2>] default_idle_call+0x62/0xd8
[ 457.151545] [<00000000a6f816c6>] do_idle+0xf6/0x1b0
[ 457.151553] [<00000000a6f81a06>] cpu_startup_entry+0x36/0x40
[ 457.151558] [<00000000a7c16abe>] restart_int_handler+0x6e/0x90
[ 457.151563] no locks held by swapper/14/0.
[ 457.151567] Last Breaking-Event-Address:
[ 457.151570] [<00000000a7c0489e>] rcu_eqs_enter.constprop.0+0x3e/0x118
[ 457.151574] irq event stamp: 608654
[ 457.151578] hardirqs last enabled at (608653): [<00000000a70190d8>] tick_nohz_idle_enter+0xb0/0x130
[ 457.151584] hardirqs last disabled at (608654): [<00000000a6f8173e>] do_idle+0x16e/0x1b0
[ 457.151589] softirqs last enabled at (608586): [<00000000a7c1861a>] __do_softirq+0x4ba/0x668
[ 457.151594] softirqs last disabled at (608581): [<00000000a6f367c6>] __irq_exit_rcu+0x13e/0x170
[ 457.151600] ---[ end trace 2ae2154f9724de86 ]---
I can not see right now whats wrong, your patches look sane.
> [ 457.151295] ------------[ cut here ]------------
> [ 457.151311] WARNING: CPU: 14 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter.constprop.0+0xf8/0x118
> [ 457.151324] Modules linked in: vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nft_compat nf_nat_tftp nft_objref nf_conntrack_tftp nft_counter kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ib_uverbs s390_trng ib_core genwqe_card crc_itu_t vfio_ccw mdev vfio_iommu_type1 eadm_sch vfio zcrypt_cex4 sch_fq_codel configfs ip_tables x_tables mlx5_core ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4
> [ 457.151422] CPU: 14 PID: 0 Comm: swapper/14 Not tainted 5.16.0-00007-g89e9021389e2 #3
> [ 457.151428] Hardware name: IBM 3906 M04 704 (LPAR)
> [ 457.151432] Krnl PSW : 0404d00180000000 00000000a7c0495c (rcu_eqs_enter.constprop.0+0xfc/0x118)
> [ 457.151440] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
> [ 457.151445] Krnl GPRS: ffffffffebd81d31 4000000000000000 0000000000000070 00000000a7fd7024
> [ 457.151450] 0000000000000000 0000000000000001 0000000000000000 0000000000000000
> [ 457.151454] 000000000000000e 000000000000000e 00000000a84d3a88 0000001fd8645c00
> [ 457.151458] 0000000000000000 0000000000000000 00000000a7c04882 0000038000653dc0
> [ 457.151468] Krnl Code: 00000000a7c0494c: ebaff0a00004 lmg %r10,%r15,160(%r15)
> 00000000a7c04952: c0f4fffffef7 brcl 15,00000000a7c04740
> #00000000a7c04958: af000000 mc 0,0
> >00000000a7c0495c: a7f4ffa3 brc 15,00000000a7c048a2
> 00000000a7c04960: c0e500003f70 brasl %r14,00000000a7c0c840
> 00000000a7c04966: a7f4ffcd brc 15,00000000a7c04900
> 00000000a7c0496a: c0e500003f6b brasl %r14,00000000a7c0c840
> 00000000a7c04970: a7f4ffde brc 15,00000000a7c0492c
> [ 457.151527] Call Trace:
> [ 457.151530] [<00000000a7c0495c>] rcu_eqs_enter.constprop.0+0xfc/0x118
> [ 457.151536] ([<00000000a7c04882>] rcu_eqs_enter.constprop.0+0x22/0x118)
> [ 457.151540] [<00000000a7c14cd2>] default_idle_call+0x62/0xd8
> [ 457.151545] [<00000000a6f816c6>] do_idle+0xf6/0x1b0
> [ 457.151553] [<00000000a6f81a06>] cpu_startup_entry+0x36/0x40
> [ 457.151558] [<00000000a7c16abe>] restart_int_handler+0x6e/0x90
> [ 457.151563] no locks held by swapper/14/0.
> [ 457.151567] Last Breaking-Event-Address:
> [ 457.151570] [<00000000a7c0489e>] rcu_eqs_enter.constprop.0+0x3e/0x118
> [ 457.151574] irq event stamp: 608654
> [ 457.151578] hardirqs last enabled at (608653): [<00000000a70190d8>] tick_nohz_idle_enter+0xb0/0x130
> [ 457.151584] hardirqs last disabled at (608654): [<00000000a6f8173e>] do_idle+0x16e/0x1b0
> [ 457.151589] softirqs last enabled at (608586): [<00000000a7c1861a>] __do_softirq+0x4ba/0x668
> [ 457.151594] softirqs last disabled at (608581): [<00000000a6f367c6>] __irq_exit_rcu+0x13e/0x170
> [ 457.151600] ---[ end trace 2ae2154f9724de86 ]---
>
> I can not see right now whats wrong, your patches look sane.
Now followed by (might be a followup to the error above).
[ 574.301736] =============================
[ 574.301741] WARNING: suspicious RCU usage
[ 574.301746] 5.16.0-00007-g89e9021389e2 #3 Tainted: G W
[ 574.301751] -----------------------------
[ 574.301755] kernel/rcu/tree.c:821 Bad RCU dynticks_nmi_nesting counter
!
[ 574.301760]
other info that might help us debug this:
[ 574.301764]
rcu_scheduler_active = 2, debug_locks = 1
[ 574.301769] 2 locks held by CPU 0/KVM/8327:
[ 574.301773] #0: 00000001521956b8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x9a/0xa40 [kvm]
[ 574.301843] #1: 000000016e92c6a8 (&kvm->srcu){....}-{0:0}, at: __vcpu_run+0x332/0x5b8 [kvm]
[ 574.301873]
stack backtrace:
[ 574.301878] CPU: 46 PID: 8327 Comm: CPU 0/KVM Tainted: G W 5.16.0-00007-g89e9021389e2 #3
[ 574.301884] Hardware name: IBM 3906 M04 704 (LPAR)
[ 574.301888] Call Trace:
[ 574.301892] [<00000000a7c001c6>] dump_stack_lvl+0x8e/0xc8
[ 574.301903] [<00000000a6fee70e>] rcu_irq_exit_check_preempt+0x136/0x1c8
[ 574.301913] [<00000000a6ff9d1a>] irqentry_exit_cond_resched+0x32/0x80
[ 574.301921] [<00000000a7c0521c>] irqentry_exit+0xac/0x130
[ 574.301927] [<00000000a7c16626>] ext_int_handler+0xe6/0x120
[ 574.301933] [<00000000a6fc2482>] lock_acquire.part.0+0x12a/0x258
[ 574.301939] ([<00000000a6fc2450>] lock_acquire.part.0+0xf8/0x258)
[ 574.301944] [<00000000a6fc2660>] lock_acquire+0xb0/0x200
[ 574.302053] [<000003ff807092ce>] __vcpu_run+0x376/0x5b8 [kvm]
[ 574.302076] [<000003ff807099ba>] kvm_arch_vcpu_ioctl_run+0x10a/0x270 [kvm]
[ 574.302098] [<000003ff806f002c>] kvm_vcpu_ioctl+0x27c/0xa40 [kvm]
[ 574.302120] [<00000000a728b5b6>] __s390x_sys_ioctl+0xbe/0x100
[ 574.302129] [<00000000a7c038ea>] __do_syscall+0x1da/0x208
[ 574.302133] [<00000000a7c16352>] system_call+0x82/0xb0
[ 574.302138] 2 locks held by CPU 0/KVM/8327:
[ 574.302143] #0: 00000001521956b8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x9a/0xa40 [kvm]
[ 574.302172] #1: 000000016e92c6a8 (&kvm->srcu){....}-{0:0}, at: __vcpu_run+0x332/0x5b8 [kvm]
On Wed, Jan 19, 2022 at 07:25:20PM +0100, Christian Borntraeger wrote: > Am 19.01.22 um 11:58 schrieb Mark Rutland: > > > CCing new emails for Anup and Atish so that they are aware of this thread. Ah; whoops. I'd meant to fix the Ccs on the patches. Thanks! [...] > I just gave this a spin on s390 with debugging on and I got the following: > > [ 457.151295] ------------[ cut here ]------------ > [ 457.151311] WARNING: CPU: 14 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter.constprop.0+0xf8/0x118 Hmm, so IIUC that's: WARN_ON_ONCE(rdp->dynticks_nmi_nesting != DYNTICK_IRQ_NONIDLE); ... and we're clearly in the idle thread here. I wonder, is the s390 guest entry/exit *preemptible* ? If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance things before a ctx-switch to the idle thread, which would then be able to hit this. I'll need to go audit the other architectures for similar. Thanks, Mark. > [ 457.151324] Modules linked in: vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nft_compat nf_nat_tftp nft_objref nf_conntrack_tftp nft_counter kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ib_uverbs s390_trng ib_core genwqe_card crc_itu_t vfio_ccw mdev vfio_iommu_type1 eadm_sch vfio zcrypt_cex4 sch_fq_codel configfs ip_tables x_tables mlx5_core ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4 > [ 457.151422] CPU: 14 PID: 0 Comm: swapper/14 Not tainted 5.16.0-00007-g89e9021389e2 #3 > [ 457.151428] Hardware name: IBM 3906 M04 704 (LPAR) > [ 457.151432] Krnl PSW : 0404d00180000000 00000000a7c0495c (rcu_eqs_enter.constprop.0+0xfc/0x118) > [ 457.151440] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3 > [ 457.151445] Krnl GPRS: ffffffffebd81d31 4000000000000000 0000000000000070 00000000a7fd7024 > [ 457.151450] 0000000000000000 0000000000000001 0000000000000000 0000000000000000 > [ 457.151454] 000000000000000e 000000000000000e 00000000a84d3a88 0000001fd8645c00 > [ 457.151458] 0000000000000000 0000000000000000 00000000a7c04882 0000038000653dc0 > [ 457.151468] Krnl Code: 00000000a7c0494c: ebaff0a00004 lmg %r10,%r15,160(%r15) > 00000000a7c04952: c0f4fffffef7 brcl 15,00000000a7c04740 > #00000000a7c04958: af000000 mc 0,0 > >00000000a7c0495c: a7f4ffa3 brc 15,00000000a7c048a2 > 00000000a7c04960: c0e500003f70 brasl %r14,00000000a7c0c840 > 00000000a7c04966: a7f4ffcd brc 15,00000000a7c04900 > 00000000a7c0496a: c0e500003f6b brasl %r14,00000000a7c0c840 > 00000000a7c04970: a7f4ffde brc 15,00000000a7c0492c > [ 457.151527] Call Trace: > [ 457.151530] [<00000000a7c0495c>] rcu_eqs_enter.constprop.0+0xfc/0x118 > [ 457.151536] ([<00000000a7c04882>] rcu_eqs_enter.constprop.0+0x22/0x118) > [ 457.151540] [<00000000a7c14cd2>] default_idle_call+0x62/0xd8 > [ 457.151545] [<00000000a6f816c6>] do_idle+0xf6/0x1b0 > [ 457.151553] [<00000000a6f81a06>] cpu_startup_entry+0x36/0x40 > [ 457.151558] [<00000000a7c16abe>] restart_int_handler+0x6e/0x90 > [ 457.151563] no locks held by swapper/14/0. > [ 457.151567] Last Breaking-Event-Address: > [ 457.151570] [<00000000a7c0489e>] rcu_eqs_enter.constprop.0+0x3e/0x118 > [ 457.151574] irq event stamp: 608654 > [ 457.151578] hardirqs last enabled at (608653): [<00000000a70190d8>] tick_nohz_idle_enter+0xb0/0x130 > [ 457.151584] hardirqs last disabled at (608654): [<00000000a6f8173e>] do_idle+0x16e/0x1b0 > [ 457.151589] softirqs last enabled at (608586): [<00000000a7c1861a>] __do_softirq+0x4ba/0x668 > [ 457.151594] softirqs last disabled at (608581): [<00000000a6f367c6>] __irq_exit_rcu+0x13e/0x170 > [ 457.151600] ---[ end trace 2ae2154f9724de86 ]--- > > I can not see right now whats wrong, your patches look sane.
Am 19.01.22 um 20:22 schrieb Mark Rutland: > On Wed, Jan 19, 2022 at 07:25:20PM +0100, Christian Borntraeger wrote: >> Am 19.01.22 um 11:58 schrieb Mark Rutland: >> >> >> CCing new emails for Anup and Atish so that they are aware of this thread. > > Ah; whoops. I'd meant to fix the Ccs on the patches. > > Thanks! > > [...] > >> I just gave this a spin on s390 with debugging on and I got the following: >> >> [ 457.151295] ------------[ cut here ]------------ >> [ 457.151311] WARNING: CPU: 14 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter.constprop.0+0xf8/0x118 > > Hmm, so IIUC that's: > > WARN_ON_ONCE(rdp->dynticks_nmi_nesting != DYNTICK_IRQ_NONIDLE); > > ... and we're clearly in the idle thread here. > > I wonder, is the s390 guest entry/exit *preemptible* ? Looks like debug_defconfig is indeed using preemption: CONFIG_PREEMPT_BUILD=y # CONFIG_PREEMPT_NONE is not set # CONFIG_PREEMPT_VOLUNTARY is not set CONFIG_PREEMPT=y CONFIG_PREEMPT_COUNT=y CONFIG_PREEMPTION=y CONFIG_PREEMPT_RCU=y CONFIG_PREEMPT_NOTIFIERS=y CONFIG_DEBUG_PREEMPT=y CONFIG_PREEMPTIRQ_TRACEPOINTS=y CONFIG_TRACE_PREEMPT_TOGGLE=y CONFIG_PREEMPT_TRACER=y # CONFIG_PREEMPTIRQ_DELAY_TEST is not set > > If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance > things before a ctx-switch to the idle thread, which would then be able > to hit this. > > I'll need to go audit the other architectures for similar. > > Thanks, > Mark. > >> [ 457.151324] Modules linked in: vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nft_compat nf_nat_tftp nft_objref nf_conntrack_tftp nft_counter kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ib_uverbs s390_trng ib_core genwqe_card crc_itu_t vfio_ccw mdev vfio_iommu_type1 eadm_sch vfio zcrypt_cex4 sch_fq_codel configfs ip_tables x_tables mlx5_core ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4 >> [ 457.151422] CPU: 14 PID: 0 Comm: swapper/14 Not tainted 5.16.0-00007-g89e9021389e2 #3 >> [ 457.151428] Hardware name: IBM 3906 M04 704 (LPAR) >> [ 457.151432] Krnl PSW : 0404d00180000000 00000000a7c0495c (rcu_eqs_enter.constprop.0+0xfc/0x118) >> [ 457.151440] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3 >> [ 457.151445] Krnl GPRS: ffffffffebd81d31 4000000000000000 0000000000000070 00000000a7fd7024 >> [ 457.151450] 0000000000000000 0000000000000001 0000000000000000 0000000000000000 >> [ 457.151454] 000000000000000e 000000000000000e 00000000a84d3a88 0000001fd8645c00 >> [ 457.151458] 0000000000000000 0000000000000000 00000000a7c04882 0000038000653dc0 >> [ 457.151468] Krnl Code: 00000000a7c0494c: ebaff0a00004 lmg %r10,%r15,160(%r15) >> 00000000a7c04952: c0f4fffffef7 brcl 15,00000000a7c04740 >> #00000000a7c04958: af000000 mc 0,0 >> >00000000a7c0495c: a7f4ffa3 brc 15,00000000a7c048a2 >> 00000000a7c04960: c0e500003f70 brasl %r14,00000000a7c0c840 >> 00000000a7c04966: a7f4ffcd brc 15,00000000a7c04900 >> 00000000a7c0496a: c0e500003f6b brasl %r14,00000000a7c0c840 >> 00000000a7c04970: a7f4ffde brc 15,00000000a7c0492c >> [ 457.151527] Call Trace: >> [ 457.151530] [<00000000a7c0495c>] rcu_eqs_enter.constprop.0+0xfc/0x118 >> [ 457.151536] ([<00000000a7c04882>] rcu_eqs_enter.constprop.0+0x22/0x118) >> [ 457.151540] [<00000000a7c14cd2>] default_idle_call+0x62/0xd8 >> [ 457.151545] [<00000000a6f816c6>] do_idle+0xf6/0x1b0 >> [ 457.151553] [<00000000a6f81a06>] cpu_startup_entry+0x36/0x40 >> [ 457.151558] [<00000000a7c16abe>] restart_int_handler+0x6e/0x90 >> [ 457.151563] no locks held by swapper/14/0. >> [ 457.151567] Last Breaking-Event-Address: >> [ 457.151570] [<00000000a7c0489e>] rcu_eqs_enter.constprop.0+0x3e/0x118 >> [ 457.151574] irq event stamp: 608654 >> [ 457.151578] hardirqs last enabled at (608653): [<00000000a70190d8>] tick_nohz_idle_enter+0xb0/0x130 >> [ 457.151584] hardirqs last disabled at (608654): [<00000000a6f8173e>] do_idle+0x16e/0x1b0 >> [ 457.151589] softirqs last enabled at (608586): [<00000000a7c1861a>] __do_softirq+0x4ba/0x668 >> [ 457.151594] softirqs last disabled at (608581): [<00000000a6f367c6>] __irq_exit_rcu+0x13e/0x170 >> [ 457.151600] ---[ end trace 2ae2154f9724de86 ]--- >> >> I can not see right now whats wrong, your patches look sane.
On Wed, Jan 19, 2022 at 08:30:17PM +0100, Christian Borntraeger wrote:
>
>
> Am 19.01.22 um 20:22 schrieb Mark Rutland:
> > On Wed, Jan 19, 2022 at 07:25:20PM +0100, Christian Borntraeger wrote:
> > > Am 19.01.22 um 11:58 schrieb Mark Rutland:
> > >
> > >
> > > CCing new emails for Anup and Atish so that they are aware of this thread.
> >
> > Ah; whoops. I'd meant to fix the Ccs on the patches.
> >
> > Thanks!
> >
> > [...]
> >
> > > I just gave this a spin on s390 with debugging on and I got the following:
> > >
> > > [ 457.151295] ------------[ cut here ]------------
> > > [ 457.151311] WARNING: CPU: 14 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter.constprop.0+0xf8/0x118
> >
> > Hmm, so IIUC that's:
> >
> > WARN_ON_ONCE(rdp->dynticks_nmi_nesting != DYNTICK_IRQ_NONIDLE);
> >
> > ... and we're clearly in the idle thread here.
> >
> > I wonder, is the s390 guest entry/exit *preemptible* ?
>
> Looks like debug_defconfig is indeed using preemption:
>
> CONFIG_PREEMPT_BUILD=y
> # CONFIG_PREEMPT_NONE is not set
> # CONFIG_PREEMPT_VOLUNTARY is not set
> CONFIG_PREEMPT=y
> CONFIG_PREEMPT_COUNT=y
> CONFIG_PREEMPTION=y
> CONFIG_PREEMPT_RCU=y
> CONFIG_PREEMPT_NOTIFIERS=y
> CONFIG_DEBUG_PREEMPT=y
> CONFIG_PREEMPTIRQ_TRACEPOINTS=y
> CONFIG_TRACE_PREEMPT_TOGGLE=y
> CONFIG_PREEMPT_TRACER=y
> # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
Thanks for confirming!
Could you try with CONFIG_PROVE_RCU=y ? That can't be selected directly, but
selecting PROVE_LOCKING=y will enable it.
If I'm right, with that we should get a splat out of
rcu_irq_exit_check_preempt().
If so, I think we can solve this with preempt_{disable,enable}() around the
guest_timing_{enter,exit}_irqoff() calls. We'll also need to add some more
comments around arch_in_rcu_eqs() that arch-specific EQSs should be
non-preemptible.
Thanks,
Mark.
Am 20.01.22 um 12:57 schrieb Mark Rutland: > On Wed, Jan 19, 2022 at 08:30:17PM +0100, Christian Borntraeger wrote: >> >> >> Am 19.01.22 um 20:22 schrieb Mark Rutland: >>> On Wed, Jan 19, 2022 at 07:25:20PM +0100, Christian Borntraeger wrote: >>>> Am 19.01.22 um 11:58 schrieb Mark Rutland: >>>> >>>> >>>> CCing new emails for Anup and Atish so that they are aware of this thread. >>> >>> Ah; whoops. I'd meant to fix the Ccs on the patches. >>> >>> Thanks! >>> >>> [...] >>> >>>> I just gave this a spin on s390 with debugging on and I got the following: >>>> >>>> [ 457.151295] ------------[ cut here ]------------ >>>> [ 457.151311] WARNING: CPU: 14 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter.constprop.0+0xf8/0x118 >>> >>> Hmm, so IIUC that's: >>> >>> WARN_ON_ONCE(rdp->dynticks_nmi_nesting != DYNTICK_IRQ_NONIDLE); >>> >>> ... and we're clearly in the idle thread here. >>> >>> I wonder, is the s390 guest entry/exit *preemptible* ? >> >> Looks like debug_defconfig is indeed using preemption: >> >> CONFIG_PREEMPT_BUILD=y >> # CONFIG_PREEMPT_NONE is not set >> # CONFIG_PREEMPT_VOLUNTARY is not set >> CONFIG_PREEMPT=y >> CONFIG_PREEMPT_COUNT=y >> CONFIG_PREEMPTION=y >> CONFIG_PREEMPT_RCU=y >> CONFIG_PREEMPT_NOTIFIERS=y >> CONFIG_DEBUG_PREEMPT=y >> CONFIG_PREEMPTIRQ_TRACEPOINTS=y >> CONFIG_TRACE_PREEMPT_TOGGLE=y >> CONFIG_PREEMPT_TRACER=y >> # CONFIG_PREEMPTIRQ_DELAY_TEST is not set > > Thanks for confirming! > > Could you try with CONFIG_PROVE_RCU=y ? That can't be selected directly, but > selecting PROVE_LOCKING=y will enable it. PROVE_LOCKING was enabled in my runs as well as PROVE_RCU.
On 1/19/22 20:22, Mark Rutland wrote: > I wonder, is the s390 guest entry/exit*preemptible* ? > > If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance > things before a ctx-switch to the idle thread, which would then be able > to hit this. > > I'll need to go audit the other architectures for similar. They don't enable interrupts in the entry/exit path so they should be okay. RISC-V and x86 have an explicit preempt_disable/enable, while MIPS only has local_irq_disable/enable. (MIPS is a mess of dead code, I have patches to clean it up). Paolo
On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
> On 1/19/22 20:22, Mark Rutland wrote:
> > I wonder, is the s390 guest entry/exit*preemptible* ?
> >
> > If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
> > things before a ctx-switch to the idle thread, which would then be able
> > to hit this.
> >
> > I'll need to go audit the other architectures for similar.
>
> They don't enable interrupts in the entry/exit path so they should be okay.
True.
So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
right thing to do. I'll add that and explanatory commentary.
> RISC-V and x86 have an explicit preempt_disable/enable, while MIPS only has
> local_irq_disable/enable.
>
> (MIPS is a mess of dead code, I have patches to clean it up).
Sure; I haven't wrapped my head around ppc yet, but I assume they keep
interrupts disabled as with the other simple cases.
Thanks,
Mark.
Am 20.01.22 um 13:03 schrieb Mark Rutland:
> On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
>> On 1/19/22 20:22, Mark Rutland wrote:
>>> I wonder, is the s390 guest entry/exit*preemptible* ?
>>>
>>> If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
>>> things before a ctx-switch to the idle thread, which would then be able
>>> to hit this.
>>>
>>> I'll need to go audit the other architectures for similar.
>>
>> They don't enable interrupts in the entry/exit path so they should be okay.
>
> True.
>
> So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
> right thing to do. I'll add that and explanatory commentary.
That would not be trivial I guess. We do allow (and need) page faults on sie for guest
demand paging and
this piece of arch/s390/mm/fault.c
case GMAP_FAULT:
if (faulthandler_disabled() || !mm)
goto out;
break;
}
would no longer work since faulthandler_disabled checks for the preempt count.
Am 20.01.22 um 16:14 schrieb Christian Borntraeger:
>
>
> Am 20.01.22 um 13:03 schrieb Mark Rutland:
>> On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
>>> On 1/19/22 20:22, Mark Rutland wrote:
>>>> I wonder, is the s390 guest entry/exit*preemptible* ?
>>>>
>>>> If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
>>>> things before a ctx-switch to the idle thread, which would then be able
>>>> to hit this.
>>>>
>>>> I'll need to go audit the other architectures for similar.
>>>
>>> They don't enable interrupts in the entry/exit path so they should be okay.
>>
>> True.
>>
>> So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
>> right thing to do. I'll add that and explanatory commentary.
>
> That would not be trivial I guess. We do allow (and need) page faults on sie for guest
> demand paging and
>
> this piece of arch/s390/mm/fault.c
>
> case GMAP_FAULT:
> if (faulthandler_disabled() || !mm)
> goto out;
> break;
> }
>
> would no longer work since faulthandler_disabled checks for the preempt count.
>
Something like this
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index d30f5986fa85..1c7d45346e12 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -385,10 +385,18 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
return 0;
goto out;
case USER_FAULT:
- case GMAP_FAULT:
if (faulthandler_disabled() || !mm)
goto out;
break;
+ /*
+ * We know that we interrupted SIE and we are not in a IRQ.
+ * preemption might be disabled thus checking for in_atomic
+ * would result in failures
+ */
+ case GMAP_FAULT:
+ if (pagefault_disabled() || !mm)
+ goto out;
+ break;
}
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
seems to work with preemption disabled around sie. Not sure yet if this is correct.
Am 21.01.22 um 10:53 schrieb Christian Borntraeger:
> Am 20.01.22 um 16:14 schrieb Christian Borntraeger:
>>
>>
>> Am 20.01.22 um 13:03 schrieb Mark Rutland:
>>> On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
>>>> On 1/19/22 20:22, Mark Rutland wrote:
>>>>> I wonder, is the s390 guest entry/exit*preemptible* ?
>>>>>
>>>>> If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
>>>>> things before a ctx-switch to the idle thread, which would then be able
>>>>> to hit this.
>>>>>
>>>>> I'll need to go audit the other architectures for similar.
>>>>
>>>> They don't enable interrupts in the entry/exit path so they should be okay.
>>>
>>> True.
>>>
>>> So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
>>> right thing to do. I'll add that and explanatory commentary.
>>
>> That would not be trivial I guess. We do allow (and need) page faults on sie for guest
>> demand paging and
>>
>> this piece of arch/s390/mm/fault.c
>>
>> case GMAP_FAULT:
>> if (faulthandler_disabled() || !mm)
>> goto out;
>> break;
>> }
>>
>> would no longer work since faulthandler_disabled checks for the preempt count.
>>
>
>
> Something like this
>
>
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index d30f5986fa85..1c7d45346e12 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -385,10 +385,18 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> return 0;
> goto out;
> case USER_FAULT:
> - case GMAP_FAULT:
> if (faulthandler_disabled() || !mm)
> goto out;
> break;
> + /*
> + * We know that we interrupted SIE and we are not in a IRQ.
> + * preemption might be disabled thus checking for in_atomic
> + * would result in failures
> + */
> + case GMAP_FAULT:
> + if (pagefault_disabled() || !mm)
> + goto out;
> + break;
> }
>
> perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>
> seems to work with preemption disabled around sie. Not sure yet if this is correct.
No it does not work. scheduling while preemption is disabled.
[ 1880.448663] BUG: scheduling while atomic: qemu-system-s39/1806/0x00000002
[ 1880.448674] INFO: lockdep is turned off.
[ 1880.448676] Modules linked in: kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ib_uverbs s390_trng ib_core genwqe_card crc_itu_t vfio_ccw mdev vfio_iommu_type1 eadm_sch vfio zcrypt_cex4 sch_fq_codel configfs ip_tables x_tables mlx5_core ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4
[ 1880.448730] Preemption disabled at:
[ 1880.448731] [<000003ff8070da88>] ckc_irq_pending+0x30/0xe0 [kvm]
[ 1880.448778] CPU: 47 PID: 1806 Comm: qemu-system-s39 Tainted: G W 5.16.0-00007-g89e9021389e2-dirty #15
[ 1880.448782] Hardware name: IBM 3906 M04 704 (LPAR)
[ 1880.448784] Call Trace:
[ 1880.448785] [<000000000bf001d6>] dump_stack_lvl+0x8e/0xc8
[ 1880.448794] [<000000000b26e08a>] __schedule_bug+0xe2/0xf8
[ 1880.448801] [<000000000b26e212>] schedule_debug+0x172/0x1a8
[ 1880.448804] [<000000000bf0bcae>] __schedule+0x56/0x8b0
[ 1880.448808] [<000000000bf0c570>] schedule+0x68/0x110
[ 1880.448811] [<000000000bf13e76>] schedule_timeout+0x106/0x160
[ 1880.448815] [<000000000bf0ddf2>] wait_for_completion+0xc2/0x110
[ 1880.448818] [<000000000b258674>] __flush_work+0xd4/0x118
[ 1880.448823] [<000000000b4e3c88>] __drain_all_pages+0x218/0x308
[ 1880.448829] [<000000000b4ec3bc>] __alloc_pages_slowpath.constprop.0+0x5bc/0xc98
[ 1880.448832] [<000000000b4ece5c>] __alloc_pages+0x3c4/0x448
[ 1880.448835] [<000000000b5143cc>] alloc_pages_vma+0x9c/0x360
[ 1880.448841] [<000000000b4c0d6e>] do_swap_page+0x66e/0xca0
[ 1880.448845] [<000000000b4c3012>] __handle_mm_fault+0x29a/0x4b0
[ 1880.448869] [<000000000b4c33ac>] handle_mm_fault+0x184/0x3a8
[ 1880.448872] [<000000000b2062ce>] do_exception+0x136/0x490
[ 1880.448877] [<000000000b206b9a>] do_dat_exception+0x2a/0x50
[ 1880.448880] [<000000000bf03650>] __do_pgm_check+0x120/0x1f0
[ 1880.448882] [<000000000bf164ee>] pgm_check_handler+0x11e/0x180
[ 1880.448885] [<000000000bf16298>] sie_exit+0x0/0x48
[ 1880.448888] ([<000003ff8071e954>] kvm_s390_enter_exit_sie+0x64/0x98 [kvm])
[ 1880.448910] [<000003ff807061fa>] __vcpu_run+0x2a2/0x5b8 [kvm]
[ 1880.448931] [<000003ff807069ba>] kvm_arch_vcpu_ioctl_run+0x10a/0x270 [kvm]
[ 1880.448953] [<000003ff806ed02c>] kvm_vcpu_ioctl+0x27c/0xa40 [kvm]
[ 1880.448975] [<000000000b58b5c6>] __s390x_sys_ioctl+0xbe/0x100
[ 1880.448982] [<000000000bf038fa>] __do_syscall+0x1da/0x208
[ 1880.448984] [<000000000bf16362>] system_call+0x82/0xb0
On Fri, Jan 21, 2022 at 03:17:01PM +0100, Christian Borntraeger wrote:
>
>
> Am 21.01.22 um 10:53 schrieb Christian Borntraeger:
> > Am 20.01.22 um 16:14 schrieb Christian Borntraeger:
> > >
> > >
> > > Am 20.01.22 um 13:03 schrieb Mark Rutland:
> > > > On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
> > > > > On 1/19/22 20:22, Mark Rutland wrote:
> > > > > > I wonder, is the s390 guest entry/exit*preemptible* ?
> > > > > >
> > > > > > If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
> > > > > > things before a ctx-switch to the idle thread, which would then be able
> > > > > > to hit this.
> > > > > >
> > > > > > I'll need to go audit the other architectures for similar.
> > > > >
> > > > > They don't enable interrupts in the entry/exit path so they should be okay.
> > > >
> > > > True.
> > > >
> > > > So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
> > > > right thing to do. I'll add that and explanatory commentary.
> > >
> > > That would not be trivial I guess. We do allow (and need) page faults on sie for guest
> > > demand paging and
> > >
> > > this piece of arch/s390/mm/fault.c
> > >
> > > case GMAP_FAULT:
> > > if (faulthandler_disabled() || !mm)
> > > goto out;
> > > break;
> > > }
> > >
> > > would no longer work since faulthandler_disabled checks for the preempt count.
> > >
> >
> >
> > Something like this
> >
> >
> > diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> > index d30f5986fa85..1c7d45346e12 100644
> > --- a/arch/s390/mm/fault.c
> > +++ b/arch/s390/mm/fault.c
> > @@ -385,10 +385,18 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> > return 0;
> > goto out;
> > case USER_FAULT:
> > - case GMAP_FAULT:
> > if (faulthandler_disabled() || !mm)
> > goto out;
> > break;
> > + /*
> > + * We know that we interrupted SIE and we are not in a IRQ.
> > + * preemption might be disabled thus checking for in_atomic
> > + * would result in failures
> > + */
> > + case GMAP_FAULT:
> > + if (pagefault_disabled() || !mm)
> > + goto out;
> > + break;
> > }
> >
> > perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
> >
> > seems to work with preemption disabled around sie. Not sure yet if this is correct.
>
>
> No it does not work. scheduling while preemption is disabled.
Hmm... which exceptions do we expect to take from a guest?
I wonder if we can handle those more like other architectures by getting those
to immediately return from the sie64a() call with some status code that we can
handle outside of the guest_state_{enter,exit}_irqoff() critical section.
On arm64 in VHE mode, we swap our exception vectors when entering/exiting the
guest to allow us to do that; I wonder if we could do similar here?
Thanks,
Mark.
Am 21.01.22 um 15:30 schrieb Mark Rutland:
> On Fri, Jan 21, 2022 at 03:17:01PM +0100, Christian Borntraeger wrote:
>>
>>
>> Am 21.01.22 um 10:53 schrieb Christian Borntraeger:
>>> Am 20.01.22 um 16:14 schrieb Christian Borntraeger:
>>>>
>>>>
>>>> Am 20.01.22 um 13:03 schrieb Mark Rutland:
>>>>> On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
>>>>>> On 1/19/22 20:22, Mark Rutland wrote:
>>>>>>> I wonder, is the s390 guest entry/exit*preemptible* ?
>>>>>>>
>>>>>>> If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
>>>>>>> things before a ctx-switch to the idle thread, which would then be able
>>>>>>> to hit this.
>>>>>>>
>>>>>>> I'll need to go audit the other architectures for similar.
>>>>>>
>>>>>> They don't enable interrupts in the entry/exit path so they should be okay.
>>>>>
>>>>> True.
>>>>>
>>>>> So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
>>>>> right thing to do. I'll add that and explanatory commentary.
>>>>
>>>> That would not be trivial I guess. We do allow (and need) page faults on sie for guest
>>>> demand paging and
>>>>
>>>> this piece of arch/s390/mm/fault.c
>>>>
>>>> case GMAP_FAULT:
>>>> if (faulthandler_disabled() || !mm)
>>>> goto out;
>>>> break;
>>>> }
>>>>
>>>> would no longer work since faulthandler_disabled checks for the preempt count.
>>>>
>>>
>>>
>>> Something like this
>>>
>>>
>>> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
>>> index d30f5986fa85..1c7d45346e12 100644
>>> --- a/arch/s390/mm/fault.c
>>> +++ b/arch/s390/mm/fault.c
>>> @@ -385,10 +385,18 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>>> return 0;
>>> goto out;
>>> case USER_FAULT:
>>> - case GMAP_FAULT:
>>> if (faulthandler_disabled() || !mm)
>>> goto out;
>>> break;
>>> + /*
>>> + * We know that we interrupted SIE and we are not in a IRQ.
>>> + * preemption might be disabled thus checking for in_atomic
>>> + * would result in failures
>>> + */
>>> + case GMAP_FAULT:
>>> + if (pagefault_disabled() || !mm)
>>> + goto out;
>>> + break;
>>> }
>>>
>>> perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>>>
>>> seems to work with preemption disabled around sie. Not sure yet if this is correct.
>>
>>
>> No it does not work. scheduling while preemption is disabled.
>
> Hmm... which exceptions do we expect to take from a guest?
>
> I wonder if we can handle those more like other architectures by getting those
> to immediately return from the sie64a() call with some status code that we can
> handle outside of the guest_state_{enter,exit}_irqoff() critical section.
We take all kind of page faults (invalid->valid, ro->rw) on the sie instruction and
run that in the context of the pgm_check handler just like for userspace. the pgm_check
handler does a sie_exit (similar to the interrupt handlers) by setting the return IA.
> On arm64 in VHE mode, we swap our exception vectors when entering/exiting the
> guest to allow us to do that; I wonder if we could do similar here?
On Fri, Jan 21, 2022 at 03:42:48PM +0100, Christian Borntraeger wrote:
> Am 21.01.22 um 15:30 schrieb Mark Rutland:
> > On Fri, Jan 21, 2022 at 03:17:01PM +0100, Christian Borntraeger wrote:
> > > Am 21.01.22 um 10:53 schrieb Christian Borntraeger:
> > > > Am 20.01.22 um 16:14 schrieb Christian Borntraeger:
> > > > > Am 20.01.22 um 13:03 schrieb Mark Rutland:
> > > > > > On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
> > > > > > > On 1/19/22 20:22, Mark Rutland wrote:
> > > > > > > > I wonder, is the s390 guest entry/exit*preemptible* ?
> > > > > > > >
> > > > > > > > If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
> > > > > > > > things before a ctx-switch to the idle thread, which would then be able
> > > > > > > > to hit this.
> > > > > > > >
> > > > > > > > I'll need to go audit the other architectures for similar.
> > > > > > >
> > > > > > > They don't enable interrupts in the entry/exit path so they should be okay.
> > > > > >
> > > > > > True.
> > > > > >
> > > > > > So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
> > > > > > right thing to do. I'll add that and explanatory commentary.
> > > > >
> > > > > That would not be trivial I guess. We do allow (and need) page faults on sie for guest
> > > > > demand paging and
> > > > >
> > > > > this piece of arch/s390/mm/fault.c
> > > > >
> > > > > case GMAP_FAULT:
> > > > > if (faulthandler_disabled() || !mm)
> > > > > goto out;
> > > > > break;
> > > > > }
> > > > >
> > > > > would no longer work since faulthandler_disabled checks for the preempt count.
> > > > >
> > > >
> > > >
> > > > Something like this
> > > >
> > > >
> > > > diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> > > > index d30f5986fa85..1c7d45346e12 100644
> > > > --- a/arch/s390/mm/fault.c
> > > > +++ b/arch/s390/mm/fault.c
> > > > @@ -385,10 +385,18 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> > > > return 0;
> > > > goto out;
> > > > case USER_FAULT:
> > > > - case GMAP_FAULT:
> > > > if (faulthandler_disabled() || !mm)
> > > > goto out;
> > > > break;
> > > > + /*
> > > > + * We know that we interrupted SIE and we are not in a IRQ.
> > > > + * preemption might be disabled thus checking for in_atomic
> > > > + * would result in failures
> > > > + */
> > > > + case GMAP_FAULT:
> > > > + if (pagefault_disabled() || !mm)
> > > > + goto out;
> > > > + break;
> > > > }
> > > >
> > > > perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
> > > >
> > > > seems to work with preemption disabled around sie. Not sure yet if this is correct.
> > >
> > >
> > > No it does not work. scheduling while preemption is disabled.
> >
> > Hmm... which exceptions do we expect to take from a guest?
> >
> > I wonder if we can handle those more like other architectures by getting those
> > to immediately return from the sie64a() call with some status code that we can
> > handle outside of the guest_state_{enter,exit}_irqoff() critical section.
>
> We take all kind of page faults (invalid->valid, ro->rw) on the sie instruction and
> run that in the context of the pgm_check handler just like for userspace.
Do we only expect to tak faults from a guest (which IIUC at the GMAP_FAULT
cases in the bit above), or are there other esceptions we expect to take from
the middle of a SIE?
> the pgm_check handler does a sie_exit (similar to the interrupt handlers) by
> setting the return IA.
Sure, but that sie_exit happens *after* the handler runs, where as I'm asking
whether we can structure this like all the other architectures and turn all
exceptions into returns from sie64a() that we can handle after having called
guest_state_exit_irqoff().
> > On arm64 in VHE mode, we swap our exception vectors when entering/exiting the
> > guest to allow us to do that; I wonder if we could do similar here?
Does this idea sound at all plausible?
Thanks,
Mark.
Am 21.01.22 um 16:29 schrieb Mark Rutland:
> On Fri, Jan 21, 2022 at 03:42:48PM +0100, Christian Borntraeger wrote:
>> Am 21.01.22 um 15:30 schrieb Mark Rutland:
>>> On Fri, Jan 21, 2022 at 03:17:01PM +0100, Christian Borntraeger wrote:
>>>> Am 21.01.22 um 10:53 schrieb Christian Borntraeger:
>>>>> Am 20.01.22 um 16:14 schrieb Christian Borntraeger:
>>>>>> Am 20.01.22 um 13:03 schrieb Mark Rutland:
>>>>>>> On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
>>>>>>>> On 1/19/22 20:22, Mark Rutland wrote:
>>>>>>>>> I wonder, is the s390 guest entry/exit*preemptible* ?
>>>>>>>>>
>>>>>>>>> If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
>>>>>>>>> things before a ctx-switch to the idle thread, which would then be able
>>>>>>>>> to hit this.
>>>>>>>>>
>>>>>>>>> I'll need to go audit the other architectures for similar.
>>>>>>>>
>>>>>>>> They don't enable interrupts in the entry/exit path so they should be okay.
>>>>>>>
>>>>>>> True.
>>>>>>>
>>>>>>> So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
>>>>>>> right thing to do. I'll add that and explanatory commentary.
>>>>>>
>>>>>> That would not be trivial I guess. We do allow (and need) page faults on sie for guest
>>>>>> demand paging and
>>>>>>
>>>>>> this piece of arch/s390/mm/fault.c
>>>>>>
>>>>>> case GMAP_FAULT:
>>>>>> if (faulthandler_disabled() || !mm)
>>>>>> goto out;
>>>>>> break;
>>>>>> }
>>>>>>
>>>>>> would no longer work since faulthandler_disabled checks for the preempt count.
>>>>>>
>>>>>
>>>>>
>>>>> Something like this
>>>>>
>>>>>
>>>>> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
>>>>> index d30f5986fa85..1c7d45346e12 100644
>>>>> --- a/arch/s390/mm/fault.c
>>>>> +++ b/arch/s390/mm/fault.c
>>>>> @@ -385,10 +385,18 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>>>>> return 0;
>>>>> goto out;
>>>>> case USER_FAULT:
>>>>> - case GMAP_FAULT:
>>>>> if (faulthandler_disabled() || !mm)
>>>>> goto out;
>>>>> break;
>>>>> + /*
>>>>> + * We know that we interrupted SIE and we are not in a IRQ.
>>>>> + * preemption might be disabled thus checking for in_atomic
>>>>> + * would result in failures
>>>>> + */
>>>>> + case GMAP_FAULT:
>>>>> + if (pagefault_disabled() || !mm)
>>>>> + goto out;
>>>>> + break;
>>>>> }
>>>>>
>>>>> perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>>>>>
>>>>> seems to work with preemption disabled around sie. Not sure yet if this is correct.
>>>>
>>>>
>>>> No it does not work. scheduling while preemption is disabled.
>>>
>>> Hmm... which exceptions do we expect to take from a guest?
>>>
>>> I wonder if we can handle those more like other architectures by getting those
>>> to immediately return from the sie64a() call with some status code that we can
>>> handle outside of the guest_state_{enter,exit}_irqoff() critical section.
>>
>> We take all kind of page faults (invalid->valid, ro->rw) on the sie instruction and
>> run that in the context of the pgm_check handler just like for userspace.
>
> Do we only expect to tak faults from a guest (which IIUC at the GMAP_FAULT
> cases in the bit above), or are there other esceptions we expect to take from
> the middle of a SIE?
>
>> the pgm_check handler does a sie_exit (similar to the interrupt handlers) by
>> setting the return IA.
>
> Sure, but that sie_exit happens *after* the handler runs, where as I'm asking
> whether we can structure this like all the other architectures and turn all
> exceptions into returns from sie64a() that we can handle after having called
> guest_state_exit_irqoff().
>
>>> On arm64 in VHE mode, we swap our exception vectors when entering/exiting the
>>> guest to allow us to do that; I wonder if we could do similar here?
>
> Does this idea sound at all plausible?
Maybe. We already have something like that for async_pf (see kvm_arch_setup_async_pf)
where we handle the pagefault later on after first kicking off the resolution for major
page faults (via FAULT_FLAG_RETRY_NOWAIT) in the low level handler.
Let me think about it.
© 2016 - 2026 Red Hat, Inc.