[PATCH v2 0/7] kvm: fix latent guest entry/exit bugs

Mark Rutland posted 7 patches 4 years, 5 months ago
There is a newer version of this series
arch/arm64/kvm/arm.c                 |  51 +++++++-----
arch/mips/kvm/mips.c                 |  37 ++++++++-
arch/riscv/kvm/vcpu.c                |  44 +++++++----
arch/s390/include/asm/entry-common.h |  10 +++
arch/s390/include/asm/kvm_host.h     |   3 +
arch/s390/kvm/kvm-s390.c             |  49 +++++++++---
arch/s390/kvm/vsie.c                 |  17 ++--
arch/x86/kvm/svm/svm.c               |   4 +-
arch/x86/kvm/vmx/vmx.c               |   4 +-
arch/x86/kvm/x86.c                   |   4 +-
arch/x86/kvm/x86.h                   |  45 -----------
include/linux/entry-common.h         |  16 ++++
include/linux/kvm_host.h             | 112 ++++++++++++++++++++++++++-
kernel/entry/common.c                |   3 +-
14 files changed, 286 insertions(+), 113 deletions(-)
[PATCH v2 0/7] kvm: fix latent guest entry/exit bugs
Posted by Mark Rutland 4 years, 5 months ago
Several architectures have latent bugs around guest entry/exit. This
series addresses those for:

	arm64, mips, riscv, s390, x86

However, I'm not sure how to address powerpc and could do with some help
there. I have build-tested the arm64, mips, riscv, s390, and x86 cases,
but I don't have a suitable HW setup to test these, so any review and/or
testing would be much appreciated.

Issues include:

1) Several architectures enable interrupts between guest_enter() and
   guest_exit(). As this period is an RCU extended quiescent state (EQS)
   this is unsound unless the irq entry code explicitly wakes RCU, which
   most architectures only do for entry from usersapce or idle.

   I believe this affects: arm64, riscv, s390

   I am not sure about powerpc.

2) Several architectures permit instrumentation of code between
   guest_enter() and guest_exit(), e.g. KASAN, KCOV, KCSAN, etc. As
   instrumentation may directly o indirectly use RCU, this has the same
   problems as with interrupts.

   I believe this affects: arm64, mips, powerpc, riscv, s390

3) Several architectures do not inform lockdep and tracing that
   interrupts are enabled during the execution of the guest, or do so in
   an incorrect order. Generally this means that logs will report IRQs
   being masked for much longer than is actually the case, which is not
   ideal for debugging. I don't know whether this affects the
   correctness of lockdep.

   I believe this affects: arm64, mips, powerpc, riscv, s390

This was previously fixed for x86 specifically in a series of commits:

  87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
  0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
  9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
  3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
  135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
  160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
  bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")

But other architectures were left broken, and the infrastructure for
handling this correctly is x86-specific.

This series introduces generic helper functions which can be used to
handle the problems above, and migrates architectures over to these,
fixing the latent issues. For s390, where the KVM guest EQS is
interruptible, I've added infrastructure to wake RCU during this EQS.

Since v1 [1]:
* Add arch_in_rcu_eqs()
* Convert s390
* Rename exit_to_guest_mode() -> guest_state_enter_irqoff()
* Rename enter_from_guest_mode() -> guest_state_exit_irqoff()
* Various commit message cleanups

I've pushed the series (based on v5.16) to my kvm/entry-rework branch:

  https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kvm/entry-rework
  git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git kvm/entry-rework

... with this version tagged as kvm-entry-rework-20210119.

[1] https://lore.kernel.org/r/20220111153539.2532246-1-mark.rutland@arm.com/

Thanks,
Mark.

Mark Rutland (7):
  entry: add arch_in_rcu_eqs()
  kvm: add guest_state_{enter,exit}_irqoff()
  kvm/arm64: rework guest entry logic
  kvm/mips: rework guest entry logic
  kvm/riscv: rework guest entry logic
  kvm/s390: rework guest entry logic
  kvm/x86: rework guest entry logic

 arch/arm64/kvm/arm.c                 |  51 +++++++-----
 arch/mips/kvm/mips.c                 |  37 ++++++++-
 arch/riscv/kvm/vcpu.c                |  44 +++++++----
 arch/s390/include/asm/entry-common.h |  10 +++
 arch/s390/include/asm/kvm_host.h     |   3 +
 arch/s390/kvm/kvm-s390.c             |  49 +++++++++---
 arch/s390/kvm/vsie.c                 |  17 ++--
 arch/x86/kvm/svm/svm.c               |   4 +-
 arch/x86/kvm/vmx/vmx.c               |   4 +-
 arch/x86/kvm/x86.c                   |   4 +-
 arch/x86/kvm/x86.h                   |  45 -----------
 include/linux/entry-common.h         |  16 ++++
 include/linux/kvm_host.h             | 112 ++++++++++++++++++++++++++-
 kernel/entry/common.c                |   3 +-
 14 files changed, 286 insertions(+), 113 deletions(-)

-- 
2.30.2

Re: [PATCH v2 0/7] kvm: fix latent guest entry/exit bugs
Posted by Christian Borntraeger 4 years, 5 months ago
Am 19.01.22 um 11:58 schrieb Mark Rutland:


CCing new emails for Anup and Atish so that they are aware of this thread.

> Several architectures have latent bugs around guest entry/exit. This
> series addresses those for:
> 
> 	arm64, mips, riscv, s390, x86
> 
> However, I'm not sure how to address powerpc and could do with some help
> there. I have build-tested the arm64, mips, riscv, s390, and x86 cases,
> but I don't have a suitable HW setup to test these, so any review and/or
> testing would be much appreciated.
> 
> Issues include:
> 
> 1) Several architectures enable interrupts between guest_enter() and
>     guest_exit(). As this period is an RCU extended quiescent state (EQS)
>     this is unsound unless the irq entry code explicitly wakes RCU, which
>     most architectures only do for entry from usersapce or idle.
> 
>     I believe this affects: arm64, riscv, s390
> 
>     I am not sure about powerpc.
> 
> 2) Several architectures permit instrumentation of code between
>     guest_enter() and guest_exit(), e.g. KASAN, KCOV, KCSAN, etc. As
>     instrumentation may directly o indirectly use RCU, this has the same
>     problems as with interrupts.
> 
>     I believe this affects: arm64, mips, powerpc, riscv, s390
> 
> 3) Several architectures do not inform lockdep and tracing that
>     interrupts are enabled during the execution of the guest, or do so in
>     an incorrect order. Generally this means that logs will report IRQs
>     being masked for much longer than is actually the case, which is not
>     ideal for debugging. I don't know whether this affects the
>     correctness of lockdep.
> 
>     I believe this affects: arm64, mips, powerpc, riscv, s390
> 
> This was previously fixed for x86 specifically in a series of commits:
> 
>    87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
>    0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
>    9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
>    3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
>    135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
>    160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
>    bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
> 
> But other architectures were left broken, and the infrastructure for
> handling this correctly is x86-specific.
> 
> This series introduces generic helper functions which can be used to
> handle the problems above, and migrates architectures over to these,
> fixing the latent issues. For s390, where the KVM guest EQS is
> interruptible, I've added infrastructure to wake RCU during this EQS.
> 
> Since v1 [1]:
> * Add arch_in_rcu_eqs()
> * Convert s390
> * Rename exit_to_guest_mode() -> guest_state_enter_irqoff()
> * Rename enter_from_guest_mode() -> guest_state_exit_irqoff()
> * Various commit message cleanups
> 
> I've pushed the series (based on v5.16) to my kvm/entry-rework branch:
> 
>    https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kvm/entry-rework
>    git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git kvm/entry-rework
> 
> ... with this version tagged as kvm-entry-rework-20210119.
> 
> [1] https://lore.kernel.org/r/20220111153539.2532246-1-mark.rutland@arm.com/
> 
> Thanks,
> Mark.
> 
> Mark Rutland (7):
>    entry: add arch_in_rcu_eqs()
>    kvm: add guest_state_{enter,exit}_irqoff()
>    kvm/arm64: rework guest entry logic
>    kvm/mips: rework guest entry logic
>    kvm/riscv: rework guest entry logic
>    kvm/s390: rework guest entry logic
>    kvm/x86: rework guest entry logic
> 
>   arch/arm64/kvm/arm.c                 |  51 +++++++-----
>   arch/mips/kvm/mips.c                 |  37 ++++++++-
>   arch/riscv/kvm/vcpu.c                |  44 +++++++----
>   arch/s390/include/asm/entry-common.h |  10 +++
>   arch/s390/include/asm/kvm_host.h     |   3 +
>   arch/s390/kvm/kvm-s390.c             |  49 +++++++++---
>   arch/s390/kvm/vsie.c                 |  17 ++--
>   arch/x86/kvm/svm/svm.c               |   4 +-
>   arch/x86/kvm/vmx/vmx.c               |   4 +-
>   arch/x86/kvm/x86.c                   |   4 +-
>   arch/x86/kvm/x86.h                   |  45 -----------
>   include/linux/entry-common.h         |  16 ++++
>   include/linux/kvm_host.h             | 112 ++++++++++++++++++++++++++-
>   kernel/entry/common.c                |   3 +-
>   14 files changed, 286 insertions(+), 113 deletions(-)
> 


I just gave this a spin on s390 with debugging on and I got the following:

[  457.151295] ------------[ cut here ]------------
[  457.151311] WARNING: CPU: 14 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter.constprop.0+0xf8/0x118
[  457.151324] Modules linked in: vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nft_compat nf_nat_tftp nft_objref nf_conntrack_tftp nft_counter kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ib_uverbs s390_trng ib_core genwqe_card crc_itu_t vfio_ccw mdev vfio_iommu_type1 eadm_sch vfio zcrypt_cex4 sch_fq_codel configfs ip_tables x_tables mlx5_core ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4
[  457.151422] CPU: 14 PID: 0 Comm: swapper/14 Not tainted 5.16.0-00007-g89e9021389e2 #3
[  457.151428] Hardware name: IBM 3906 M04 704 (LPAR)
[  457.151432] Krnl PSW : 0404d00180000000 00000000a7c0495c (rcu_eqs_enter.constprop.0+0xfc/0x118)
[  457.151440]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
[  457.151445] Krnl GPRS: ffffffffebd81d31 4000000000000000 0000000000000070 00000000a7fd7024
[  457.151450]            0000000000000000 0000000000000001 0000000000000000 0000000000000000
[  457.151454]            000000000000000e 000000000000000e 00000000a84d3a88 0000001fd8645c00
[  457.151458]            0000000000000000 0000000000000000 00000000a7c04882 0000038000653dc0
[  457.151468] Krnl Code: 00000000a7c0494c: ebaff0a00004	lmg	%r10,%r15,160(%r15)
                           00000000a7c04952: c0f4fffffef7	brcl	15,00000000a7c04740
                          #00000000a7c04958: af000000		mc	0,0
                          >00000000a7c0495c: a7f4ffa3		brc	15,00000000a7c048a2
                           00000000a7c04960: c0e500003f70	brasl	%r14,00000000a7c0c840
                           00000000a7c04966: a7f4ffcd		brc	15,00000000a7c04900
                           00000000a7c0496a: c0e500003f6b	brasl	%r14,00000000a7c0c840
                           00000000a7c04970: a7f4ffde		brc	15,00000000a7c0492c
[  457.151527] Call Trace:
[  457.151530]  [<00000000a7c0495c>] rcu_eqs_enter.constprop.0+0xfc/0x118
[  457.151536] ([<00000000a7c04882>] rcu_eqs_enter.constprop.0+0x22/0x118)
[  457.151540]  [<00000000a7c14cd2>] default_idle_call+0x62/0xd8
[  457.151545]  [<00000000a6f816c6>] do_idle+0xf6/0x1b0
[  457.151553]  [<00000000a6f81a06>] cpu_startup_entry+0x36/0x40
[  457.151558]  [<00000000a7c16abe>] restart_int_handler+0x6e/0x90
[  457.151563] no locks held by swapper/14/0.
[  457.151567] Last Breaking-Event-Address:
[  457.151570]  [<00000000a7c0489e>] rcu_eqs_enter.constprop.0+0x3e/0x118
[  457.151574] irq event stamp: 608654
[  457.151578] hardirqs last  enabled at (608653): [<00000000a70190d8>] tick_nohz_idle_enter+0xb0/0x130
[  457.151584] hardirqs last disabled at (608654): [<00000000a6f8173e>] do_idle+0x16e/0x1b0
[  457.151589] softirqs last  enabled at (608586): [<00000000a7c1861a>] __do_softirq+0x4ba/0x668
[  457.151594] softirqs last disabled at (608581): [<00000000a6f367c6>] __irq_exit_rcu+0x13e/0x170
[  457.151600] ---[ end trace 2ae2154f9724de86 ]---

I can not see right now whats wrong, your patches look sane.
Re: [PATCH v2 0/7] kvm: fix latent guest entry/exit bugs
Posted by Christian Borntraeger 4 years, 5 months ago
> [  457.151295] ------------[ cut here ]------------
> [  457.151311] WARNING: CPU: 14 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter.constprop.0+0xf8/0x118
> [  457.151324] Modules linked in: vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nft_compat nf_nat_tftp nft_objref nf_conntrack_tftp nft_counter kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ib_uverbs s390_trng ib_core genwqe_card crc_itu_t vfio_ccw mdev vfio_iommu_type1 eadm_sch vfio zcrypt_cex4 sch_fq_codel configfs ip_tables x_tables mlx5_core ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4
> [  457.151422] CPU: 14 PID: 0 Comm: swapper/14 Not tainted 5.16.0-00007-g89e9021389e2 #3
> [  457.151428] Hardware name: IBM 3906 M04 704 (LPAR)
> [  457.151432] Krnl PSW : 0404d00180000000 00000000a7c0495c (rcu_eqs_enter.constprop.0+0xfc/0x118)
> [  457.151440]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
> [  457.151445] Krnl GPRS: ffffffffebd81d31 4000000000000000 0000000000000070 00000000a7fd7024
> [  457.151450]            0000000000000000 0000000000000001 0000000000000000 0000000000000000
> [  457.151454]            000000000000000e 000000000000000e 00000000a84d3a88 0000001fd8645c00
> [  457.151458]            0000000000000000 0000000000000000 00000000a7c04882 0000038000653dc0
> [  457.151468] Krnl Code: 00000000a7c0494c: ebaff0a00004    lmg    %r10,%r15,160(%r15)
>                            00000000a7c04952: c0f4fffffef7    brcl    15,00000000a7c04740
>                           #00000000a7c04958: af000000        mc    0,0
>                           >00000000a7c0495c: a7f4ffa3        brc    15,00000000a7c048a2
>                            00000000a7c04960: c0e500003f70    brasl    %r14,00000000a7c0c840
>                            00000000a7c04966: a7f4ffcd        brc    15,00000000a7c04900
>                            00000000a7c0496a: c0e500003f6b    brasl    %r14,00000000a7c0c840
>                            00000000a7c04970: a7f4ffde        brc    15,00000000a7c0492c
> [  457.151527] Call Trace:
> [  457.151530]  [<00000000a7c0495c>] rcu_eqs_enter.constprop.0+0xfc/0x118
> [  457.151536] ([<00000000a7c04882>] rcu_eqs_enter.constprop.0+0x22/0x118)
> [  457.151540]  [<00000000a7c14cd2>] default_idle_call+0x62/0xd8
> [  457.151545]  [<00000000a6f816c6>] do_idle+0xf6/0x1b0
> [  457.151553]  [<00000000a6f81a06>] cpu_startup_entry+0x36/0x40
> [  457.151558]  [<00000000a7c16abe>] restart_int_handler+0x6e/0x90
> [  457.151563] no locks held by swapper/14/0.
> [  457.151567] Last Breaking-Event-Address:
> [  457.151570]  [<00000000a7c0489e>] rcu_eqs_enter.constprop.0+0x3e/0x118
> [  457.151574] irq event stamp: 608654
> [  457.151578] hardirqs last  enabled at (608653): [<00000000a70190d8>] tick_nohz_idle_enter+0xb0/0x130
> [  457.151584] hardirqs last disabled at (608654): [<00000000a6f8173e>] do_idle+0x16e/0x1b0
> [  457.151589] softirqs last  enabled at (608586): [<00000000a7c1861a>] __do_softirq+0x4ba/0x668
> [  457.151594] softirqs last disabled at (608581): [<00000000a6f367c6>] __irq_exit_rcu+0x13e/0x170
> [  457.151600] ---[ end trace 2ae2154f9724de86 ]---
> 
> I can not see right now whats wrong, your patches look sane.


Now followed by (might be a followup to the error above).

[  574.301736] =============================
[  574.301741] WARNING: suspicious RCU usage
[  574.301746] 5.16.0-00007-g89e9021389e2 #3 Tainted: G        W
[  574.301751] -----------------------------
[  574.301755] kernel/rcu/tree.c:821 Bad RCU  dynticks_nmi_nesting counter
                !
[  574.301760]
                other info that might help us debug this:

[  574.301764]
                rcu_scheduler_active = 2, debug_locks = 1
[  574.301769] 2 locks held by CPU 0/KVM/8327:
[  574.301773]  #0: 00000001521956b8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x9a/0xa40 [kvm]
[  574.301843]  #1: 000000016e92c6a8 (&kvm->srcu){....}-{0:0}, at: __vcpu_run+0x332/0x5b8 [kvm]
[  574.301873]
                stack backtrace:
[  574.301878] CPU: 46 PID: 8327 Comm: CPU 0/KVM Tainted: G        W         5.16.0-00007-g89e9021389e2 #3
[  574.301884] Hardware name: IBM 3906 M04 704 (LPAR)
[  574.301888] Call Trace:
[  574.301892]  [<00000000a7c001c6>] dump_stack_lvl+0x8e/0xc8
[  574.301903]  [<00000000a6fee70e>] rcu_irq_exit_check_preempt+0x136/0x1c8
[  574.301913]  [<00000000a6ff9d1a>] irqentry_exit_cond_resched+0x32/0x80
[  574.301921]  [<00000000a7c0521c>] irqentry_exit+0xac/0x130
[  574.301927]  [<00000000a7c16626>] ext_int_handler+0xe6/0x120
[  574.301933]  [<00000000a6fc2482>] lock_acquire.part.0+0x12a/0x258
[  574.301939] ([<00000000a6fc2450>] lock_acquire.part.0+0xf8/0x258)
[  574.301944]  [<00000000a6fc2660>] lock_acquire+0xb0/0x200
[  574.302053]  [<000003ff807092ce>] __vcpu_run+0x376/0x5b8 [kvm]
[  574.302076]  [<000003ff807099ba>] kvm_arch_vcpu_ioctl_run+0x10a/0x270 [kvm]
[  574.302098]  [<000003ff806f002c>] kvm_vcpu_ioctl+0x27c/0xa40 [kvm]
[  574.302120]  [<00000000a728b5b6>] __s390x_sys_ioctl+0xbe/0x100
[  574.302129]  [<00000000a7c038ea>] __do_syscall+0x1da/0x208
[  574.302133]  [<00000000a7c16352>] system_call+0x82/0xb0
[  574.302138] 2 locks held by CPU 0/KVM/8327:
[  574.302143]  #0: 00000001521956b8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x9a/0xa40 [kvm]
[  574.302172]  #1: 000000016e92c6a8 (&kvm->srcu){....}-{0:0}, at: __vcpu_run+0x332/0x5b8 [kvm]

Re: [PATCH v2 0/7] kvm: fix latent guest entry/exit bugs
Posted by Mark Rutland 4 years, 5 months ago
On Wed, Jan 19, 2022 at 07:25:20PM +0100, Christian Borntraeger wrote:
> Am 19.01.22 um 11:58 schrieb Mark Rutland:
> 
> 
> CCing new emails for Anup and Atish so that they are aware of this thread.

Ah; whoops. I'd meant to fix the Ccs on the patches.

Thanks!

[...]

> I just gave this a spin on s390 with debugging on and I got the following:
> 
> [  457.151295] ------------[ cut here ]------------
> [  457.151311] WARNING: CPU: 14 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter.constprop.0+0xf8/0x118

Hmm, so IIUC that's:

	WARN_ON_ONCE(rdp->dynticks_nmi_nesting != DYNTICK_IRQ_NONIDLE);

... and we're clearly in the idle thread here.

I wonder, is the s390 guest entry/exit *preemptible* ?

If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
things before a ctx-switch to the idle thread, which would then be able
to hit this.

I'll need to go audit the other architectures for similar.

Thanks,
Mark.

> [  457.151324] Modules linked in: vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nft_compat nf_nat_tftp nft_objref nf_conntrack_tftp nft_counter kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ib_uverbs s390_trng ib_core genwqe_card crc_itu_t vfio_ccw mdev vfio_iommu_type1 eadm_sch vfio zcrypt_cex4 sch_fq_codel configfs ip_tables x_tables mlx5_core ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4
> [  457.151422] CPU: 14 PID: 0 Comm: swapper/14 Not tainted 5.16.0-00007-g89e9021389e2 #3
> [  457.151428] Hardware name: IBM 3906 M04 704 (LPAR)
> [  457.151432] Krnl PSW : 0404d00180000000 00000000a7c0495c (rcu_eqs_enter.constprop.0+0xfc/0x118)
> [  457.151440]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
> [  457.151445] Krnl GPRS: ffffffffebd81d31 4000000000000000 0000000000000070 00000000a7fd7024
> [  457.151450]            0000000000000000 0000000000000001 0000000000000000 0000000000000000
> [  457.151454]            000000000000000e 000000000000000e 00000000a84d3a88 0000001fd8645c00
> [  457.151458]            0000000000000000 0000000000000000 00000000a7c04882 0000038000653dc0
> [  457.151468] Krnl Code: 00000000a7c0494c: ebaff0a00004	lmg	%r10,%r15,160(%r15)
>                           00000000a7c04952: c0f4fffffef7	brcl	15,00000000a7c04740
>                          #00000000a7c04958: af000000		mc	0,0
>                          >00000000a7c0495c: a7f4ffa3		brc	15,00000000a7c048a2
>                           00000000a7c04960: c0e500003f70	brasl	%r14,00000000a7c0c840
>                           00000000a7c04966: a7f4ffcd		brc	15,00000000a7c04900
>                           00000000a7c0496a: c0e500003f6b	brasl	%r14,00000000a7c0c840
>                           00000000a7c04970: a7f4ffde		brc	15,00000000a7c0492c
> [  457.151527] Call Trace:
> [  457.151530]  [<00000000a7c0495c>] rcu_eqs_enter.constprop.0+0xfc/0x118
> [  457.151536] ([<00000000a7c04882>] rcu_eqs_enter.constprop.0+0x22/0x118)
> [  457.151540]  [<00000000a7c14cd2>] default_idle_call+0x62/0xd8
> [  457.151545]  [<00000000a6f816c6>] do_idle+0xf6/0x1b0
> [  457.151553]  [<00000000a6f81a06>] cpu_startup_entry+0x36/0x40
> [  457.151558]  [<00000000a7c16abe>] restart_int_handler+0x6e/0x90
> [  457.151563] no locks held by swapper/14/0.
> [  457.151567] Last Breaking-Event-Address:
> [  457.151570]  [<00000000a7c0489e>] rcu_eqs_enter.constprop.0+0x3e/0x118
> [  457.151574] irq event stamp: 608654
> [  457.151578] hardirqs last  enabled at (608653): [<00000000a70190d8>] tick_nohz_idle_enter+0xb0/0x130
> [  457.151584] hardirqs last disabled at (608654): [<00000000a6f8173e>] do_idle+0x16e/0x1b0
> [  457.151589] softirqs last  enabled at (608586): [<00000000a7c1861a>] __do_softirq+0x4ba/0x668
> [  457.151594] softirqs last disabled at (608581): [<00000000a6f367c6>] __irq_exit_rcu+0x13e/0x170
> [  457.151600] ---[ end trace 2ae2154f9724de86 ]---
> 
> I can not see right now whats wrong, your patches look sane.
Re: [PATCH v2 0/7] kvm: fix latent guest entry/exit bugs
Posted by Christian Borntraeger 4 years, 5 months ago

Am 19.01.22 um 20:22 schrieb Mark Rutland:
> On Wed, Jan 19, 2022 at 07:25:20PM +0100, Christian Borntraeger wrote:
>> Am 19.01.22 um 11:58 schrieb Mark Rutland:
>>
>>
>> CCing new emails for Anup and Atish so that they are aware of this thread.
> 
> Ah; whoops. I'd meant to fix the Ccs on the patches.
> 
> Thanks!
> 
> [...]
> 
>> I just gave this a spin on s390 with debugging on and I got the following:
>>
>> [  457.151295] ------------[ cut here ]------------
>> [  457.151311] WARNING: CPU: 14 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter.constprop.0+0xf8/0x118
> 
> Hmm, so IIUC that's:
> 
> 	WARN_ON_ONCE(rdp->dynticks_nmi_nesting != DYNTICK_IRQ_NONIDLE);
> 
> ... and we're clearly in the idle thread here.
> 
> I wonder, is the s390 guest entry/exit *preemptible* ?

Looks like debug_defconfig is indeed using preemption:

CONFIG_PREEMPT_BUILD=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y
CONFIG_PREEMPT_RCU=y
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_DEBUG_PREEMPT=y
CONFIG_PREEMPTIRQ_TRACEPOINTS=y
CONFIG_TRACE_PREEMPT_TOGGLE=y
CONFIG_PREEMPT_TRACER=y
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set

> 
> If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
> things before a ctx-switch to the idle thread, which would then be able
> to hit this.
> 
> I'll need to go audit the other architectures for similar.
> 
> Thanks,
> Mark.
> 
>> [  457.151324] Modules linked in: vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nft_compat nf_nat_tftp nft_objref nf_conntrack_tftp nft_counter kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ib_uverbs s390_trng ib_core genwqe_card crc_itu_t vfio_ccw mdev vfio_iommu_type1 eadm_sch vfio zcrypt_cex4 sch_fq_codel configfs ip_tables x_tables mlx5_core ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4
>> [  457.151422] CPU: 14 PID: 0 Comm: swapper/14 Not tainted 5.16.0-00007-g89e9021389e2 #3
>> [  457.151428] Hardware name: IBM 3906 M04 704 (LPAR)
>> [  457.151432] Krnl PSW : 0404d00180000000 00000000a7c0495c (rcu_eqs_enter.constprop.0+0xfc/0x118)
>> [  457.151440]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
>> [  457.151445] Krnl GPRS: ffffffffebd81d31 4000000000000000 0000000000000070 00000000a7fd7024
>> [  457.151450]            0000000000000000 0000000000000001 0000000000000000 0000000000000000
>> [  457.151454]            000000000000000e 000000000000000e 00000000a84d3a88 0000001fd8645c00
>> [  457.151458]            0000000000000000 0000000000000000 00000000a7c04882 0000038000653dc0
>> [  457.151468] Krnl Code: 00000000a7c0494c: ebaff0a00004	lmg	%r10,%r15,160(%r15)
>>                            00000000a7c04952: c0f4fffffef7	brcl	15,00000000a7c04740
>>                           #00000000a7c04958: af000000		mc	0,0
>>                           >00000000a7c0495c: a7f4ffa3		brc	15,00000000a7c048a2
>>                            00000000a7c04960: c0e500003f70	brasl	%r14,00000000a7c0c840
>>                            00000000a7c04966: a7f4ffcd		brc	15,00000000a7c04900
>>                            00000000a7c0496a: c0e500003f6b	brasl	%r14,00000000a7c0c840
>>                            00000000a7c04970: a7f4ffde		brc	15,00000000a7c0492c
>> [  457.151527] Call Trace:
>> [  457.151530]  [<00000000a7c0495c>] rcu_eqs_enter.constprop.0+0xfc/0x118
>> [  457.151536] ([<00000000a7c04882>] rcu_eqs_enter.constprop.0+0x22/0x118)
>> [  457.151540]  [<00000000a7c14cd2>] default_idle_call+0x62/0xd8
>> [  457.151545]  [<00000000a6f816c6>] do_idle+0xf6/0x1b0
>> [  457.151553]  [<00000000a6f81a06>] cpu_startup_entry+0x36/0x40
>> [  457.151558]  [<00000000a7c16abe>] restart_int_handler+0x6e/0x90
>> [  457.151563] no locks held by swapper/14/0.
>> [  457.151567] Last Breaking-Event-Address:
>> [  457.151570]  [<00000000a7c0489e>] rcu_eqs_enter.constprop.0+0x3e/0x118
>> [  457.151574] irq event stamp: 608654
>> [  457.151578] hardirqs last  enabled at (608653): [<00000000a70190d8>] tick_nohz_idle_enter+0xb0/0x130
>> [  457.151584] hardirqs last disabled at (608654): [<00000000a6f8173e>] do_idle+0x16e/0x1b0
>> [  457.151589] softirqs last  enabled at (608586): [<00000000a7c1861a>] __do_softirq+0x4ba/0x668
>> [  457.151594] softirqs last disabled at (608581): [<00000000a6f367c6>] __irq_exit_rcu+0x13e/0x170
>> [  457.151600] ---[ end trace 2ae2154f9724de86 ]---
>>
>> I can not see right now whats wrong, your patches look sane.
Re: [PATCH v2 0/7] kvm: fix latent guest entry/exit bugs
Posted by Mark Rutland 4 years, 5 months ago
On Wed, Jan 19, 2022 at 08:30:17PM +0100, Christian Borntraeger wrote:
> 
> 
> Am 19.01.22 um 20:22 schrieb Mark Rutland:
> > On Wed, Jan 19, 2022 at 07:25:20PM +0100, Christian Borntraeger wrote:
> > > Am 19.01.22 um 11:58 schrieb Mark Rutland:
> > > 
> > > 
> > > CCing new emails for Anup and Atish so that they are aware of this thread.
> > 
> > Ah; whoops. I'd meant to fix the Ccs on the patches.
> > 
> > Thanks!
> > 
> > [...]
> > 
> > > I just gave this a spin on s390 with debugging on and I got the following:
> > > 
> > > [  457.151295] ------------[ cut here ]------------
> > > [  457.151311] WARNING: CPU: 14 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter.constprop.0+0xf8/0x118
> > 
> > Hmm, so IIUC that's:
> > 
> > 	WARN_ON_ONCE(rdp->dynticks_nmi_nesting != DYNTICK_IRQ_NONIDLE);
> > 
> > ... and we're clearly in the idle thread here.
> > 
> > I wonder, is the s390 guest entry/exit *preemptible* ?
> 
> Looks like debug_defconfig is indeed using preemption:
> 
> CONFIG_PREEMPT_BUILD=y
> # CONFIG_PREEMPT_NONE is not set
> # CONFIG_PREEMPT_VOLUNTARY is not set
> CONFIG_PREEMPT=y
> CONFIG_PREEMPT_COUNT=y
> CONFIG_PREEMPTION=y
> CONFIG_PREEMPT_RCU=y
> CONFIG_PREEMPT_NOTIFIERS=y
> CONFIG_DEBUG_PREEMPT=y
> CONFIG_PREEMPTIRQ_TRACEPOINTS=y
> CONFIG_TRACE_PREEMPT_TOGGLE=y
> CONFIG_PREEMPT_TRACER=y
> # CONFIG_PREEMPTIRQ_DELAY_TEST is not set

Thanks for confirming!

Could you try with CONFIG_PROVE_RCU=y ? That can't be selected directly, but
selecting PROVE_LOCKING=y will enable it.

If I'm right, with that we should get a splat out of
rcu_irq_exit_check_preempt().

If so, I think we can solve this with preempt_{disable,enable}() around the
guest_timing_{enter,exit}_irqoff() calls. We'll also need to add some more
comments around arch_in_rcu_eqs() that arch-specific EQSs should be
non-preemptible.

Thanks,
Mark.
Re: [PATCH v2 0/7] kvm: fix latent guest entry/exit bugs
Posted by Christian Borntraeger 4 years, 5 months ago

Am 20.01.22 um 12:57 schrieb Mark Rutland:
> On Wed, Jan 19, 2022 at 08:30:17PM +0100, Christian Borntraeger wrote:
>>
>>
>> Am 19.01.22 um 20:22 schrieb Mark Rutland:
>>> On Wed, Jan 19, 2022 at 07:25:20PM +0100, Christian Borntraeger wrote:
>>>> Am 19.01.22 um 11:58 schrieb Mark Rutland:
>>>>
>>>>
>>>> CCing new emails for Anup and Atish so that they are aware of this thread.
>>>
>>> Ah; whoops. I'd meant to fix the Ccs on the patches.
>>>
>>> Thanks!
>>>
>>> [...]
>>>
>>>> I just gave this a spin on s390 with debugging on and I got the following:
>>>>
>>>> [  457.151295] ------------[ cut here ]------------
>>>> [  457.151311] WARNING: CPU: 14 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter.constprop.0+0xf8/0x118
>>>
>>> Hmm, so IIUC that's:
>>>
>>> 	WARN_ON_ONCE(rdp->dynticks_nmi_nesting != DYNTICK_IRQ_NONIDLE);
>>>
>>> ... and we're clearly in the idle thread here.
>>>
>>> I wonder, is the s390 guest entry/exit *preemptible* ?
>>
>> Looks like debug_defconfig is indeed using preemption:
>>
>> CONFIG_PREEMPT_BUILD=y
>> # CONFIG_PREEMPT_NONE is not set
>> # CONFIG_PREEMPT_VOLUNTARY is not set
>> CONFIG_PREEMPT=y
>> CONFIG_PREEMPT_COUNT=y
>> CONFIG_PREEMPTION=y
>> CONFIG_PREEMPT_RCU=y
>> CONFIG_PREEMPT_NOTIFIERS=y
>> CONFIG_DEBUG_PREEMPT=y
>> CONFIG_PREEMPTIRQ_TRACEPOINTS=y
>> CONFIG_TRACE_PREEMPT_TOGGLE=y
>> CONFIG_PREEMPT_TRACER=y
>> # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
> 
> Thanks for confirming!
> 
> Could you try with CONFIG_PROVE_RCU=y ? That can't be selected directly, but
> selecting PROVE_LOCKING=y will enable it.

PROVE_LOCKING was enabled in my runs as well as PROVE_RCU.
Re: [PATCH v2 0/7] kvm: fix latent guest entry/exit bugs
Posted by Paolo Bonzini 4 years, 5 months ago
On 1/19/22 20:22, Mark Rutland wrote:
> I wonder, is the s390 guest entry/exit*preemptible*  ?
> 
> If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
> things before a ctx-switch to the idle thread, which would then be able
> to hit this.
> 
> I'll need to go audit the other architectures for similar.

They don't enable interrupts in the entry/exit path so they should be 
okay.  RISC-V and x86 have an explicit preempt_disable/enable, while 
MIPS only has local_irq_disable/enable.

(MIPS is a mess of dead code, I have patches to clean it up).

Paolo

Re: [PATCH v2 0/7] kvm: fix latent guest entry/exit bugs
Posted by Mark Rutland 4 years, 5 months ago
On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
> On 1/19/22 20:22, Mark Rutland wrote:
> > I wonder, is the s390 guest entry/exit*preemptible*  ?
> > 
> > If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
> > things before a ctx-switch to the idle thread, which would then be able
> > to hit this.
> > 
> > I'll need to go audit the other architectures for similar.
> 
> They don't enable interrupts in the entry/exit path so they should be okay.

True.

So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
right thing to do. I'll add that and explanatory commentary.

> RISC-V and x86 have an explicit preempt_disable/enable, while MIPS only has
> local_irq_disable/enable.
> 
> (MIPS is a mess of dead code, I have patches to clean it up).

Sure; I haven't wrapped my head around ppc yet, but I assume they keep
interrupts disabled as with the other simple cases.

Thanks,
Mark.
Re: [PATCH v2 0/7] kvm: fix latent guest entry/exit bugs
Posted by Christian Borntraeger 4 years, 5 months ago

Am 20.01.22 um 13:03 schrieb Mark Rutland:
> On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
>> On 1/19/22 20:22, Mark Rutland wrote:
>>> I wonder, is the s390 guest entry/exit*preemptible*  ?
>>>
>>> If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
>>> things before a ctx-switch to the idle thread, which would then be able
>>> to hit this.
>>>
>>> I'll need to go audit the other architectures for similar.
>>
>> They don't enable interrupts in the entry/exit path so they should be okay.
> 
> True.
> 
> So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
> right thing to do. I'll add that and explanatory commentary.

That would not be trivial I guess. We do allow (and need) page faults on sie for guest
demand paging and

this piece of arch/s390/mm/fault.c

        case GMAP_FAULT:
                 if (faulthandler_disabled() || !mm)
                         goto out;
                 break;
         }

would no longer work since faulthandler_disabled checks for the preempt count.

Re: [PATCH v2 0/7] kvm: fix latent guest entry/exit bugs
Posted by Christian Borntraeger 4 years, 5 months ago
Am 20.01.22 um 16:14 schrieb Christian Borntraeger:
> 
> 
> Am 20.01.22 um 13:03 schrieb Mark Rutland:
>> On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
>>> On 1/19/22 20:22, Mark Rutland wrote:
>>>> I wonder, is the s390 guest entry/exit*preemptible*  ?
>>>>
>>>> If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
>>>> things before a ctx-switch to the idle thread, which would then be able
>>>> to hit this.
>>>>
>>>> I'll need to go audit the other architectures for similar.
>>>
>>> They don't enable interrupts in the entry/exit path so they should be okay.
>>
>> True.
>>
>> So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
>> right thing to do. I'll add that and explanatory commentary.
> 
> That would not be trivial I guess. We do allow (and need) page faults on sie for guest
> demand paging and
> 
> this piece of arch/s390/mm/fault.c
> 
>         case GMAP_FAULT:
>                  if (faulthandler_disabled() || !mm)
>                          goto out;
>                  break;
>          }
> 
> would no longer work since faulthandler_disabled checks for the preempt count.
> 


Something like this


diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index d30f5986fa85..1c7d45346e12 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -385,10 +385,18 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
                         return 0;
                 goto out;
         case USER_FAULT:
-       case GMAP_FAULT:
                 if (faulthandler_disabled() || !mm)
                         goto out;
                 break;
+               /*
+                * We know that we interrupted SIE and we are not in a IRQ.
+                * preemption might be disabled thus checking for in_atomic
+                * would result in failures
+                */
+       case GMAP_FAULT:
+               if (pagefault_disabled() || !mm)
+                       goto out;
+               break;
         }
  
         perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);

seems to work with preemption disabled around sie. Not sure yet if this is correct.
Re: [PATCH v2 0/7] kvm: fix latent guest entry/exit bugs
Posted by Christian Borntraeger 4 years, 5 months ago

Am 21.01.22 um 10:53 schrieb Christian Borntraeger:
> Am 20.01.22 um 16:14 schrieb Christian Borntraeger:
>>
>>
>> Am 20.01.22 um 13:03 schrieb Mark Rutland:
>>> On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
>>>> On 1/19/22 20:22, Mark Rutland wrote:
>>>>> I wonder, is the s390 guest entry/exit*preemptible*  ?
>>>>>
>>>>> If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
>>>>> things before a ctx-switch to the idle thread, which would then be able
>>>>> to hit this.
>>>>>
>>>>> I'll need to go audit the other architectures for similar.
>>>>
>>>> They don't enable interrupts in the entry/exit path so they should be okay.
>>>
>>> True.
>>>
>>> So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
>>> right thing to do. I'll add that and explanatory commentary.
>>
>> That would not be trivial I guess. We do allow (and need) page faults on sie for guest
>> demand paging and
>>
>> this piece of arch/s390/mm/fault.c
>>
>>         case GMAP_FAULT:
>>                  if (faulthandler_disabled() || !mm)
>>                          goto out;
>>                  break;
>>          }
>>
>> would no longer work since faulthandler_disabled checks for the preempt count.
>>
> 
> 
> Something like this
> 
> 
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index d30f5986fa85..1c7d45346e12 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -385,10 +385,18 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>                          return 0;
>                  goto out;
>          case USER_FAULT:
> -       case GMAP_FAULT:
>                  if (faulthandler_disabled() || !mm)
>                          goto out;
>                  break;
> +               /*
> +                * We know that we interrupted SIE and we are not in a IRQ.
> +                * preemption might be disabled thus checking for in_atomic
> +                * would result in failures
> +                */
> +       case GMAP_FAULT:
> +               if (pagefault_disabled() || !mm)
> +                       goto out;
> +               break;
>          }
> 
>          perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
> 
> seems to work with preemption disabled around sie. Not sure yet if this is correct.


No it does not work. scheduling while preemption is disabled.
[ 1880.448663] BUG: scheduling while atomic: qemu-system-s39/1806/0x00000002
[ 1880.448674] INFO: lockdep is turned off.
[ 1880.448676] Modules linked in: kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ib_uverbs s390_trng ib_core genwqe_card crc_itu_t vfio_ccw mdev vfio_iommu_type1 eadm_sch vfio zcrypt_cex4 sch_fq_codel configfs ip_tables x_tables mlx5_core ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4
[ 1880.448730] Preemption disabled at:
[ 1880.448731] [<000003ff8070da88>] ckc_irq_pending+0x30/0xe0 [kvm]
[ 1880.448778] CPU: 47 PID: 1806 Comm: qemu-system-s39 Tainted: G        W         5.16.0-00007-g89e9021389e2-dirty #15
[ 1880.448782] Hardware name: IBM 3906 M04 704 (LPAR)
[ 1880.448784] Call Trace:
[ 1880.448785]  [<000000000bf001d6>] dump_stack_lvl+0x8e/0xc8
[ 1880.448794]  [<000000000b26e08a>] __schedule_bug+0xe2/0xf8
[ 1880.448801]  [<000000000b26e212>] schedule_debug+0x172/0x1a8
[ 1880.448804]  [<000000000bf0bcae>] __schedule+0x56/0x8b0
[ 1880.448808]  [<000000000bf0c570>] schedule+0x68/0x110
[ 1880.448811]  [<000000000bf13e76>] schedule_timeout+0x106/0x160
[ 1880.448815]  [<000000000bf0ddf2>] wait_for_completion+0xc2/0x110
[ 1880.448818]  [<000000000b258674>] __flush_work+0xd4/0x118
[ 1880.448823]  [<000000000b4e3c88>] __drain_all_pages+0x218/0x308
[ 1880.448829]  [<000000000b4ec3bc>] __alloc_pages_slowpath.constprop.0+0x5bc/0xc98
[ 1880.448832]  [<000000000b4ece5c>] __alloc_pages+0x3c4/0x448
[ 1880.448835]  [<000000000b5143cc>] alloc_pages_vma+0x9c/0x360
[ 1880.448841]  [<000000000b4c0d6e>] do_swap_page+0x66e/0xca0
[ 1880.448845]  [<000000000b4c3012>] __handle_mm_fault+0x29a/0x4b0
[ 1880.448869]  [<000000000b4c33ac>] handle_mm_fault+0x184/0x3a8
[ 1880.448872]  [<000000000b2062ce>] do_exception+0x136/0x490
[ 1880.448877]  [<000000000b206b9a>] do_dat_exception+0x2a/0x50
[ 1880.448880]  [<000000000bf03650>] __do_pgm_check+0x120/0x1f0
[ 1880.448882]  [<000000000bf164ee>] pgm_check_handler+0x11e/0x180
[ 1880.448885]  [<000000000bf16298>] sie_exit+0x0/0x48
[ 1880.448888] ([<000003ff8071e954>] kvm_s390_enter_exit_sie+0x64/0x98 [kvm])
[ 1880.448910]  [<000003ff807061fa>] __vcpu_run+0x2a2/0x5b8 [kvm]
[ 1880.448931]  [<000003ff807069ba>] kvm_arch_vcpu_ioctl_run+0x10a/0x270 [kvm]
[ 1880.448953]  [<000003ff806ed02c>] kvm_vcpu_ioctl+0x27c/0xa40 [kvm]
[ 1880.448975]  [<000000000b58b5c6>] __s390x_sys_ioctl+0xbe/0x100
[ 1880.448982]  [<000000000bf038fa>] __do_syscall+0x1da/0x208
[ 1880.448984]  [<000000000bf16362>] system_call+0x82/0xb0
Re: [PATCH v2 0/7] kvm: fix latent guest entry/exit bugs
Posted by Mark Rutland 4 years, 5 months ago
On Fri, Jan 21, 2022 at 03:17:01PM +0100, Christian Borntraeger wrote:
> 
> 
> Am 21.01.22 um 10:53 schrieb Christian Borntraeger:
> > Am 20.01.22 um 16:14 schrieb Christian Borntraeger:
> > > 
> > > 
> > > Am 20.01.22 um 13:03 schrieb Mark Rutland:
> > > > On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
> > > > > On 1/19/22 20:22, Mark Rutland wrote:
> > > > > > I wonder, is the s390 guest entry/exit*preemptible*  ?
> > > > > > 
> > > > > > If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
> > > > > > things before a ctx-switch to the idle thread, which would then be able
> > > > > > to hit this.
> > > > > > 
> > > > > > I'll need to go audit the other architectures for similar.
> > > > > 
> > > > > They don't enable interrupts in the entry/exit path so they should be okay.
> > > > 
> > > > True.
> > > > 
> > > > So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
> > > > right thing to do. I'll add that and explanatory commentary.
> > > 
> > > That would not be trivial I guess. We do allow (and need) page faults on sie for guest
> > > demand paging and
> > > 
> > > this piece of arch/s390/mm/fault.c
> > > 
> > >         case GMAP_FAULT:
> > >                  if (faulthandler_disabled() || !mm)
> > >                          goto out;
> > >                  break;
> > >          }
> > > 
> > > would no longer work since faulthandler_disabled checks for the preempt count.
> > > 
> > 
> > 
> > Something like this
> > 
> > 
> > diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> > index d30f5986fa85..1c7d45346e12 100644
> > --- a/arch/s390/mm/fault.c
> > +++ b/arch/s390/mm/fault.c
> > @@ -385,10 +385,18 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> >                          return 0;
> >                  goto out;
> >          case USER_FAULT:
> > -       case GMAP_FAULT:
> >                  if (faulthandler_disabled() || !mm)
> >                          goto out;
> >                  break;
> > +               /*
> > +                * We know that we interrupted SIE and we are not in a IRQ.
> > +                * preemption might be disabled thus checking for in_atomic
> > +                * would result in failures
> > +                */
> > +       case GMAP_FAULT:
> > +               if (pagefault_disabled() || !mm)
> > +                       goto out;
> > +               break;
> >          }
> > 
> >          perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
> > 
> > seems to work with preemption disabled around sie. Not sure yet if this is correct.
> 
> 
> No it does not work. scheduling while preemption is disabled.

Hmm... which exceptions do we expect to take from a guest?

I wonder if we can handle those more like other architectures by getting those
to immediately return from the sie64a() call with some status code that we can
handle outside of the guest_state_{enter,exit}_irqoff() critical section.

On arm64 in VHE mode, we swap our exception vectors when entering/exiting the
guest to allow us to do that; I wonder if we could do similar here?

Thanks,
Mark.
Re: [PATCH v2 0/7] kvm: fix latent guest entry/exit bugs
Posted by Christian Borntraeger 4 years, 5 months ago

Am 21.01.22 um 15:30 schrieb Mark Rutland:
> On Fri, Jan 21, 2022 at 03:17:01PM +0100, Christian Borntraeger wrote:
>>
>>
>> Am 21.01.22 um 10:53 schrieb Christian Borntraeger:
>>> Am 20.01.22 um 16:14 schrieb Christian Borntraeger:
>>>>
>>>>
>>>> Am 20.01.22 um 13:03 schrieb Mark Rutland:
>>>>> On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
>>>>>> On 1/19/22 20:22, Mark Rutland wrote:
>>>>>>> I wonder, is the s390 guest entry/exit*preemptible*  ?
>>>>>>>
>>>>>>> If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
>>>>>>> things before a ctx-switch to the idle thread, which would then be able
>>>>>>> to hit this.
>>>>>>>
>>>>>>> I'll need to go audit the other architectures for similar.
>>>>>>
>>>>>> They don't enable interrupts in the entry/exit path so they should be okay.
>>>>>
>>>>> True.
>>>>>
>>>>> So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
>>>>> right thing to do. I'll add that and explanatory commentary.
>>>>
>>>> That would not be trivial I guess. We do allow (and need) page faults on sie for guest
>>>> demand paging and
>>>>
>>>> this piece of arch/s390/mm/fault.c
>>>>
>>>>          case GMAP_FAULT:
>>>>                   if (faulthandler_disabled() || !mm)
>>>>                           goto out;
>>>>                   break;
>>>>           }
>>>>
>>>> would no longer work since faulthandler_disabled checks for the preempt count.
>>>>
>>>
>>>
>>> Something like this
>>>
>>>
>>> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
>>> index d30f5986fa85..1c7d45346e12 100644
>>> --- a/arch/s390/mm/fault.c
>>> +++ b/arch/s390/mm/fault.c
>>> @@ -385,10 +385,18 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>>>                           return 0;
>>>                   goto out;
>>>           case USER_FAULT:
>>> -       case GMAP_FAULT:
>>>                   if (faulthandler_disabled() || !mm)
>>>                           goto out;
>>>                   break;
>>> +               /*
>>> +                * We know that we interrupted SIE and we are not in a IRQ.
>>> +                * preemption might be disabled thus checking for in_atomic
>>> +                * would result in failures
>>> +                */
>>> +       case GMAP_FAULT:
>>> +               if (pagefault_disabled() || !mm)
>>> +                       goto out;
>>> +               break;
>>>           }
>>>
>>>           perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>>>
>>> seems to work with preemption disabled around sie. Not sure yet if this is correct.
>>
>>
>> No it does not work. scheduling while preemption is disabled.
> 
> Hmm... which exceptions do we expect to take from a guest?
> 
> I wonder if we can handle those more like other architectures by getting those
> to immediately return from the sie64a() call with some status code that we can
> handle outside of the guest_state_{enter,exit}_irqoff() critical section.

We take all kind of page faults (invalid->valid, ro->rw) on the sie instruction and
run that in the context of the pgm_check handler just like for userspace. the pgm_check
handler does a sie_exit (similar to the interrupt handlers) by setting the return IA.
  
> On arm64 in VHE mode, we swap our exception vectors when entering/exiting the
> guest to allow us to do that; I wonder if we could do similar here?


Re: [PATCH v2 0/7] kvm: fix latent guest entry/exit bugs
Posted by Mark Rutland 4 years, 5 months ago
On Fri, Jan 21, 2022 at 03:42:48PM +0100, Christian Borntraeger wrote:
> Am 21.01.22 um 15:30 schrieb Mark Rutland:
> > On Fri, Jan 21, 2022 at 03:17:01PM +0100, Christian Borntraeger wrote:
> > > Am 21.01.22 um 10:53 schrieb Christian Borntraeger:
> > > > Am 20.01.22 um 16:14 schrieb Christian Borntraeger:
> > > > > Am 20.01.22 um 13:03 schrieb Mark Rutland:
> > > > > > On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
> > > > > > > On 1/19/22 20:22, Mark Rutland wrote:
> > > > > > > > I wonder, is the s390 guest entry/exit*preemptible*  ?
> > > > > > > > 
> > > > > > > > If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
> > > > > > > > things before a ctx-switch to the idle thread, which would then be able
> > > > > > > > to hit this.
> > > > > > > > 
> > > > > > > > I'll need to go audit the other architectures for similar.
> > > > > > > 
> > > > > > > They don't enable interrupts in the entry/exit path so they should be okay.
> > > > > > 
> > > > > > True.
> > > > > > 
> > > > > > So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
> > > > > > right thing to do. I'll add that and explanatory commentary.
> > > > > 
> > > > > That would not be trivial I guess. We do allow (and need) page faults on sie for guest
> > > > > demand paging and
> > > > > 
> > > > > this piece of arch/s390/mm/fault.c
> > > > > 
> > > > >          case GMAP_FAULT:
> > > > >                   if (faulthandler_disabled() || !mm)
> > > > >                           goto out;
> > > > >                   break;
> > > > >           }
> > > > > 
> > > > > would no longer work since faulthandler_disabled checks for the preempt count.
> > > > > 
> > > > 
> > > > 
> > > > Something like this
> > > > 
> > > > 
> > > > diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> > > > index d30f5986fa85..1c7d45346e12 100644
> > > > --- a/arch/s390/mm/fault.c
> > > > +++ b/arch/s390/mm/fault.c
> > > > @@ -385,10 +385,18 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> > > >                           return 0;
> > > >                   goto out;
> > > >           case USER_FAULT:
> > > > -       case GMAP_FAULT:
> > > >                   if (faulthandler_disabled() || !mm)
> > > >                           goto out;
> > > >                   break;
> > > > +               /*
> > > > +                * We know that we interrupted SIE and we are not in a IRQ.
> > > > +                * preemption might be disabled thus checking for in_atomic
> > > > +                * would result in failures
> > > > +                */
> > > > +       case GMAP_FAULT:
> > > > +               if (pagefault_disabled() || !mm)
> > > > +                       goto out;
> > > > +               break;
> > > >           }
> > > > 
> > > >           perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
> > > > 
> > > > seems to work with preemption disabled around sie. Not sure yet if this is correct.
> > > 
> > > 
> > > No it does not work. scheduling while preemption is disabled.
> > 
> > Hmm... which exceptions do we expect to take from a guest?
> > 
> > I wonder if we can handle those more like other architectures by getting those
> > to immediately return from the sie64a() call with some status code that we can
> > handle outside of the guest_state_{enter,exit}_irqoff() critical section.
> 
> We take all kind of page faults (invalid->valid, ro->rw) on the sie instruction and
> run that in the context of the pgm_check handler just like for userspace.

Do we only expect to tak faults from a guest (which IIUC at the GMAP_FAULT
cases in the bit above), or are there other esceptions we expect to take from
the middle of a SIE?

> the pgm_check handler does a sie_exit (similar to the interrupt handlers) by
> setting the return IA.

Sure, but that sie_exit happens *after* the handler runs, where as I'm asking
whether we can structure this like all the other architectures and turn all
exceptions into returns from sie64a() that we can handle after having called
guest_state_exit_irqoff().

> > On arm64 in VHE mode, we swap our exception vectors when entering/exiting the
> > guest to allow us to do that; I wonder if we could do similar here?

Does this idea sound at all plausible?

Thanks,
Mark.
Re: [PATCH v2 0/7] kvm: fix latent guest entry/exit bugs
Posted by Christian Borntraeger 4 years, 5 months ago

Am 21.01.22 um 16:29 schrieb Mark Rutland:
> On Fri, Jan 21, 2022 at 03:42:48PM +0100, Christian Borntraeger wrote:
>> Am 21.01.22 um 15:30 schrieb Mark Rutland:
>>> On Fri, Jan 21, 2022 at 03:17:01PM +0100, Christian Borntraeger wrote:
>>>> Am 21.01.22 um 10:53 schrieb Christian Borntraeger:
>>>>> Am 20.01.22 um 16:14 schrieb Christian Borntraeger:
>>>>>> Am 20.01.22 um 13:03 schrieb Mark Rutland:
>>>>>>> On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
>>>>>>>> On 1/19/22 20:22, Mark Rutland wrote:
>>>>>>>>> I wonder, is the s390 guest entry/exit*preemptible*  ?
>>>>>>>>>
>>>>>>>>> If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
>>>>>>>>> things before a ctx-switch to the idle thread, which would then be able
>>>>>>>>> to hit this.
>>>>>>>>>
>>>>>>>>> I'll need to go audit the other architectures for similar.
>>>>>>>>
>>>>>>>> They don't enable interrupts in the entry/exit path so they should be okay.
>>>>>>>
>>>>>>> True.
>>>>>>>
>>>>>>> So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
>>>>>>> right thing to do. I'll add that and explanatory commentary.
>>>>>>
>>>>>> That would not be trivial I guess. We do allow (and need) page faults on sie for guest
>>>>>> demand paging and
>>>>>>
>>>>>> this piece of arch/s390/mm/fault.c
>>>>>>
>>>>>>           case GMAP_FAULT:
>>>>>>                    if (faulthandler_disabled() || !mm)
>>>>>>                            goto out;
>>>>>>                    break;
>>>>>>            }
>>>>>>
>>>>>> would no longer work since faulthandler_disabled checks for the preempt count.
>>>>>>
>>>>>
>>>>>
>>>>> Something like this
>>>>>
>>>>>
>>>>> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
>>>>> index d30f5986fa85..1c7d45346e12 100644
>>>>> --- a/arch/s390/mm/fault.c
>>>>> +++ b/arch/s390/mm/fault.c
>>>>> @@ -385,10 +385,18 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>>>>>                            return 0;
>>>>>                    goto out;
>>>>>            case USER_FAULT:
>>>>> -       case GMAP_FAULT:
>>>>>                    if (faulthandler_disabled() || !mm)
>>>>>                            goto out;
>>>>>                    break;
>>>>> +               /*
>>>>> +                * We know that we interrupted SIE and we are not in a IRQ.
>>>>> +                * preemption might be disabled thus checking for in_atomic
>>>>> +                * would result in failures
>>>>> +                */
>>>>> +       case GMAP_FAULT:
>>>>> +               if (pagefault_disabled() || !mm)
>>>>> +                       goto out;
>>>>> +               break;
>>>>>            }
>>>>>
>>>>>            perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>>>>>
>>>>> seems to work with preemption disabled around sie. Not sure yet if this is correct.
>>>>
>>>>
>>>> No it does not work. scheduling while preemption is disabled.
>>>
>>> Hmm... which exceptions do we expect to take from a guest?
>>>
>>> I wonder if we can handle those more like other architectures by getting those
>>> to immediately return from the sie64a() call with some status code that we can
>>> handle outside of the guest_state_{enter,exit}_irqoff() critical section.
>>
>> We take all kind of page faults (invalid->valid, ro->rw) on the sie instruction and
>> run that in the context of the pgm_check handler just like for userspace.
> 
> Do we only expect to tak faults from a guest (which IIUC at the GMAP_FAULT
> cases in the bit above), or are there other esceptions we expect to take from
> the middle of a SIE?
> 
>> the pgm_check handler does a sie_exit (similar to the interrupt handlers) by
>> setting the return IA.
> 
> Sure, but that sie_exit happens *after* the handler runs, where as I'm asking
> whether we can structure this like all the other architectures and turn all
> exceptions into returns from sie64a() that we can handle after having called
> guest_state_exit_irqoff().
> 
>>> On arm64 in VHE mode, we swap our exception vectors when entering/exiting the
>>> guest to allow us to do that; I wonder if we could do similar here?
> 
> Does this idea sound at all plausible?

Maybe. We already have something like that for async_pf (see kvm_arch_setup_async_pf)
where we handle the pagefault later on after first kicking off the resolution for major
page faults (via FAULT_FLAG_RETRY_NOWAIT) in the low level handler.
Let me think about it.