[PATCH] Updates to Xen hypercall preemption

Per Bilse posted 1 patch 10 months, 1 week ago
Failed in applying to current master
There is a newer version of this series
arch/x86/entry/common.c | 34 +++++++++++++---------------------
drivers/xen/privcmd.c   | 12 ++++++------
include/xen/xen-ops.h   | 14 +++++++-------
3 files changed, 26 insertions(+), 34 deletions(-)
[PATCH] Updates to Xen hypercall preemption
Posted by Per Bilse 10 months, 1 week ago
Some Xen hypercalls issued by dom0 guests may run for many 10s of
seconds, potentially causing watchdog timeouts and other problems.
It's rare for this to happen, but it does in extreme circumstances,
for instance when shutting down VMs with very large memory allocations
(> 0.5 - 1TB).  These hypercalls are preemptible, but the fixes in the
kernel to ensure preemption have fallen into a state of disrepair, and
are currently ineffective.  This patch brings things up to date by way of:

1) Update general feature selection from XEN_PV to XEN_DOM0.
The issue is unique to dom0 Xen guests, but isn't unique to PV dom0s,
and will occur in future PVH dom0s.  XEN_DOM0 depends on either PV or PVH,
as well as the appropriate details for dom0.

2) Update specific feature selection from !PREEMPTION to !PREEMPT.
The following table shows the relationship between different preemption
features and their indicators/selectors (column key: np/vp/fp = none/
voluntary/full preemption, -s/-d = static/dynamic; Y = "=Y",
N = "is not set", . = absent):

                            | np-s | np-d | vp-s | vp-d | fp-s | fp-d
    CONFIG_PREEMPT_DYNAMIC      N      Y      N      Y      N      Y
         CONFIG_PREEMPTION      .      Y      .      Y      Y      Y
            CONFIG_PREEMPT      N      N      N      N      Y      Y
  CONFIG_PREEMPT_VOLUNTARY      N      N      Y      Y      N      N
       CONFIG_PREEMPT_NONE      Y      Y      N      N      N      N

Unless PREEMPT is set, we need to enable the fixes.

3) Update flag access from __this_cpu_XXX() to raw_cpu_XXX().
The long-running hypercalls are flagged by way of a per-cpu variable
which is set before and cleared after the relevant calls.  This elicits
a warning "BUG: using __this_cpu_write() in preemptible [00000000] code",
but xen_pv_evtchn_do_upcall() deals specifically with this.  For
consistency, flag testing is also updated, and the code is simplified
and tidied accordingly.

4) Update irqentry_exit_cond_resched() to raw_irqentry_exit_cond_resched().
The code will call irqentry_exit_cond_resched() if the flag (as noted
above) is set, but the dynamic preemption feature will livepatch that
function to a no-op unless full preemption is selected.  The code is
therefore updated to call raw_irqentry_exit_cond_resched().

Signed-off-by: Per Bilse <per.bilse@citrix.com>
---
 arch/x86/entry/common.c | 34 +++++++++++++---------------------
 drivers/xen/privcmd.c   | 12 ++++++------
 include/xen/xen-ops.h   | 14 +++++++-------
 3 files changed, 26 insertions(+), 34 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 6c2826417b33..19e8609a7a5a 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -20,7 +20,7 @@
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 
-#ifdef CONFIG_XEN_PV
+#ifdef CONFIG_XEN_DOM0
 #include <xen/xen-ops.h>
 #include <xen/events.h>
 #endif
@@ -252,17 +252,17 @@ SYSCALL_DEFINE0(ni_syscall)
 	return -ENOSYS;
 }
 
-#ifdef CONFIG_XEN_PV
-#ifndef CONFIG_PREEMPTION
+#ifdef CONFIG_XEN_DOM0
+#ifndef CONFIG_PREEMPT
 /*
  * Some hypercalls issued by the toolstack can take many 10s of
  * seconds. Allow tasks running hypercalls via the privcmd driver to
  * be voluntarily preempted even if full kernel preemption is
  * disabled.
  *
- * Such preemptible hypercalls are bracketed by
- * xen_preemptible_hcall_begin() and xen_preemptible_hcall_end()
- * calls.
+ * Such preemptible hypercalls are flagged by
+ * xen_preemptible_hcall_set(true/false), and status is
+ * returned by xen_preemptible_hcall_get().
  */
 DEFINE_PER_CPU(bool, xen_in_preemptible_hcall);
 EXPORT_SYMBOL_GPL(xen_in_preemptible_hcall);
@@ -271,21 +271,15 @@ EXPORT_SYMBOL_GPL(xen_in_preemptible_hcall);
  * In case of scheduling the flag must be cleared and restored after
  * returning from schedule as the task might move to a different CPU.
  */
-static __always_inline bool get_and_clear_inhcall(void)
+static __always_inline bool get_and_unset_hcall(void)
 {
-	bool inhcall = __this_cpu_read(xen_in_preemptible_hcall);
+	bool inhcall = xen_preemptible_hcall_get();
 
-	__this_cpu_write(xen_in_preemptible_hcall, false);
+	xen_preemptible_hcall_set(false);
 	return inhcall;
 }
-
-static __always_inline void restore_inhcall(bool inhcall)
-{
-	__this_cpu_write(xen_in_preemptible_hcall, inhcall);
-}
 #else
-static __always_inline bool get_and_clear_inhcall(void) { return false; }
-static __always_inline void restore_inhcall(bool inhcall) { }
+static __always_inline bool get_and_unset_hcall(void) { return false; }
 #endif
 
 static void __xen_pv_evtchn_do_upcall(struct pt_regs *regs)
@@ -302,16 +296,14 @@ static void __xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 {
 	irqentry_state_t state = irqentry_enter(regs);
-	bool inhcall;
 
 	instrumentation_begin();
 	run_sysvec_on_irqstack_cond(__xen_pv_evtchn_do_upcall, regs);
 
-	inhcall = get_and_clear_inhcall();
-	if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
-		irqentry_exit_cond_resched();
+	if (get_and_unset_hcall() && !WARN_ON_ONCE(state.exit_rcu)) {
+		raw_irqentry_exit_cond_resched();
 		instrumentation_end();
-		restore_inhcall(inhcall);
+		xen_preemptible_hcall_set(true);
 	} else {
 		instrumentation_end();
 		irqentry_exit(regs, state);
diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index e2f580e30a86..78c91227d2a5 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -77,12 +77,12 @@ static long privcmd_ioctl_hypercall(struct file *file, void __user *udata)
 	if (copy_from_user(&hypercall, udata, sizeof(hypercall)))
 		return -EFAULT;
 
-	xen_preemptible_hcall_begin();
+	xen_preemptible_hcall_set(true);
 	ret = privcmd_call(hypercall.op,
 			   hypercall.arg[0], hypercall.arg[1],
 			   hypercall.arg[2], hypercall.arg[3],
 			   hypercall.arg[4]);
-	xen_preemptible_hcall_end();
+	xen_preemptible_hcall_set(false);
 
 	return ret;
 }
@@ -688,9 +688,9 @@ static long privcmd_ioctl_dm_op(struct file *file, void __user *udata)
 		xbufs[i].size = kbufs[i].size;
 	}
 
-	xen_preemptible_hcall_begin();
+	xen_preemptible_hcall_set(true);
 	rc = HYPERVISOR_dm_op(kdata.dom, kdata.num, xbufs);
-	xen_preemptible_hcall_end();
+	xen_preemptible_hcall_set(false);
 
 out:
 	unlock_pages(pages, pinned);
@@ -790,9 +790,9 @@ static long privcmd_ioctl_mmap_resource(struct file *file,
 	xdata.nr_frames = kdata.num;
 	set_xen_guest_handle(xdata.frame_list, pfns);
 
-	xen_preemptible_hcall_begin();
+	xen_preemptible_hcall_set(true);
 	rc = HYPERVISOR_memory_op(XENMEM_acquire_resource, &xdata);
-	xen_preemptible_hcall_end();
+	xen_preemptible_hcall_set(false);
 
 	if (rc)
 		goto out;
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 47f11bec5e90..c5b06405f355 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -194,24 +194,24 @@ bool xen_running_on_version_or_later(unsigned int major, unsigned int minor);
 void xen_efi_runtime_setup(void);
 
 
-#if defined(CONFIG_XEN_PV) && !defined(CONFIG_PREEMPTION)
+#if defined(CONFIG_XEN_DOM0) && !defined(CONFIG_PREEMPT)
 
 DECLARE_PER_CPU(bool, xen_in_preemptible_hcall);
 
-static inline void xen_preemptible_hcall_begin(void)
+static inline void xen_preemptible_hcall_set(bool status)
 {
-	__this_cpu_write(xen_in_preemptible_hcall, true);
+	raw_cpu_write(xen_in_preemptible_hcall, status);
 }
 
-static inline void xen_preemptible_hcall_end(void)
+static inline bool xen_preemptible_hcall_get(void)
 {
-	__this_cpu_write(xen_in_preemptible_hcall, false);
+	return raw_cpu_read(xen_in_preemptible_hcall);
 }
 
 #else
 
-static inline void xen_preemptible_hcall_begin(void) { }
-static inline void xen_preemptible_hcall_end(void) { }
+static inline void xen_preemptible_hcall_set(bool status) { }
+static inline bool xen_preemptible_hcall_get(void) { return false; }
 
 #endif /* CONFIG_XEN_PV && !CONFIG_PREEMPTION */
 
-- 
2.31.1
Re: [PATCH] Updates to Xen hypercall preemption
Posted by Peter Zijlstra 10 months, 1 week ago
On Wed, Jun 21, 2023 at 03:14:42PM +0000, Per Bilse wrote:
> Some Xen hypercalls issued by dom0 guests may run for many 10s of
> seconds, potentially causing watchdog timeouts and other problems.
> It's rare for this to happen, but it does in extreme circumstances,
> for instance when shutting down VMs with very large memory allocations
> (> 0.5 - 1TB).  These hypercalls are preemptible, but the fixes in the
> kernel to ensure preemption have fallen into a state of disrepair, and
> are currently ineffective.  This patch brings things up to date by way of:

I don't understand it -- fundamentally, how can linux schedule when the
guest isn't even running? Hypercall transfers control to the
host/hypervisor and leaves the guest suspended.

> 1) Update general feature selection from XEN_PV to XEN_DOM0.
> The issue is unique to dom0 Xen guests, but isn't unique to PV dom0s,
> and will occur in future PVH dom0s.  XEN_DOM0 depends on either PV or PVH,
> as well as the appropriate details for dom0.
> 
> 2) Update specific feature selection from !PREEMPTION to !PREEMPT.
> The following table shows the relationship between different preemption
> features and their indicators/selectors (Y = "=Y", N = "is not set",
> . = absent):
> 
>                             | np-s | np-d | vp-s | vp-d | fp-s | fp-d
>     CONFIG_PREEMPT_DYNAMIC      N      Y      N      Y      N      Y
>          CONFIG_PREEMPTION      .      Y      .      Y      Y      Y
>             CONFIG_PREEMPT      N      N      N      N      Y      Y
>   CONFIG_PREEMPT_VOLUNTARY      N      N      Y      Y      N      N
>        CONFIG_PREEMPT_NONE      Y      Y      N      N      N      N
> 
> Unless PREEMPT is set, we need to enable the fixes.
> 
> 3) Update flag access from __this_cpu_XXX() to raw_cpu_XXX().
> The long-running hypercalls are flagged by way of a per-cpu variable
> which is set before and cleared after the relevant calls.  This elicits
> a warning "BUG: using __this_cpu_write() in preemptible [00000000] code",
> but xen_pv_evtchn_do_upcall() deals specifically with this.  For
> consistency, flag testing is also updated, and the code is simplified
> and tidied accordingly.

This makes no sense; the race that warning warns about is:

	CPU0			CPU1
	per-cpu write
	<preempt-out>
				<preempt-in>
				do-hypercall

So you wrote the value on CPU0, got migrated to CPU1 because you had
preemption enabled, and then continue with the percpu value of CPU1
because that's where you're at now.

Simply making the warning go away doesn't help: CPU1 does the hypercall
while the store was on CPU0.
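
Concretely, the pattern the warning guards against looks like this
(an illustrative sketch only, not code from the patch; the function
name is made up):

        /* Illustrative sketch, not patch code. */
        static long hypothetical_preemptible_hcall(unsigned int op)
        {
                long ret;

                raw_cpu_write(xen_in_preemptible_hcall, true);  /* store lands on CPU0 */
                /*
                 * If the task is preempted and migrated to CPU1 here, the
                 * hypercall below runs on CPU1, whose flag was never set,
                 * while CPU0's flag stays stale for whatever runs there next.
                 */
                ret = privcmd_call(op, 0, 0, 0, 0, 0);
                raw_cpu_write(xen_in_preemptible_hcall, false); /* may clear the wrong CPU */
                return ret;
        }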

> 4) Update irqentry_exit_cond_resched() to raw_irqentry_exit_cond_resched().
> The code will call irqentry_exit_cond_resched() if the flag (as noted
> above) is set, but the dynamic preemption feature will livepatch that
> function to a no-op unless full preemption is selected.  The code is
> therefore updated to call raw_irqentry_exit_cond_resched().

That, again, needs more explanation. Why do you want this if not
preemptible?

You're doing 4 things, that should be 4 patches. Also, please give more
clues for how this is supposed to work at all.
Re: [PATCH] Updates to Xen hypercall preemption
Posted by Per Bilse 10 months, 1 week ago
On 6/21/2023 5:40 PM, Peter Zijlstra wrote:
> I don't understand it -- fundamentally, how can linux schedule when the
> guest isn't even running? Hypercall transfers control to the
> host/hypervisor and leaves the guest suspended.

Hi Peter, as noted in earlier note to Andy, this is essentially existing
code that other commits have rendered ineffective over time.  Hence,
the finer details of how or why it works haven't changed since it was
first introduced.

> This makes no sense; the race that warning warns about is:
> 
> 	CPU0			CPU1
> 	per-cpu write
> 	<preempt-out>
> 				<preempt-in>
> 				do-hypercall
> 
> So you wrote the value on CPU0, got migrated to CPU1 because you had
> preemption enabled, and then continue with the percpu value of CPU1
> because that's where you're at now.

This issue was raised internally, and it was noted that the only way
for the preemptible code to switch task is via an interrupt that goes
through xen_pv_evtchn_do_upcall(), which handles this.  I'm happy to
check with my sources, but it's holiday season right now.

>> 4) Update irqentry_exit_cond_resched() to raw_irqentry_exit_cond_resched().
>> The code will call irqentry_exit_cond_resched() if the flag (as noted
>> above) is set, but the dynamic preemption feature will livepatch that
>> function to a no-op unless full preemption is selected.  The code is
>> therefore updated to call raw_irqentry_exit_cond_resched().
> 
> That, again, needs more explanation. Why do you want this if not
> preemptible?

I'm not quite sure what you mean here.  Dynamic preemption
will livepatch irqentry_exit_cond_resched() to be a no-op, while
raw_irqentry_exit_cond_resched() remains functional.  This was 
introduced in commit 4624a14f4daa last year which was said to fix
the problem, but doesn't.  You may remember, it was signed off by 
yourself and Mark Rutland.
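
For reference, the mechanism looks roughly like this (simplified from
kernel/entry/common.c; exact details differ between kernel versions):

        void raw_irqentry_exit_cond_resched(void)
        {
                if (!preempt_count() && need_resched())
                        preempt_schedule_irq();
        }

        #ifdef CONFIG_PREEMPT_DYNAMIC
        /*
         * Booting with "preempt=none" or "preempt=voluntary" patches this
         * static call to a no-op, so only the raw_ variant above is
         * guaranteed to actually reschedule on the irq-exit path.
         */
        DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
        #endif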

> You're doing 4 things, that should be 4 patches. Also, please give more
> clues for how this is supposed to work at all.

I respectfully have to disagree with that.  The fixes here are very
closely related, and we're not introducing anything new, we're merely
re-enabling code which has been rendered ineffective due to oversights
in commits made after the code was first introduced.  How the code is
supposed to work hasn't changed, and is beyond the scope of these fixes;
I'm sure it must have been discussed at great length at the time (commit 
fdfd811ddde3).

Best,

   -- Per

Re: [PATCH] Updates to Xen hypercall preemption
Posted by Peter Zijlstra 10 months, 1 week ago
On Wed, Jun 21, 2023 at 07:19:21PM +0000, Per Bilse wrote:
> On 6/21/2023 5:40 PM, Peter Zijlstra wrote:
> > I don't understand it -- fundamentally, how can linux schedule when the
> > guest isn't even running? Hypercall transfers control to the
> > host/hypervisor and leaves the guest suspended.
> 
> Hi Peter, as noted in earlier note to Andy, this is essentially existing
> code that other commits have rendered ineffective over time.  Hence,
> the finer details of how or why it works haven't changed since it was
> first introduced.

That doesn't mean you don't have to explain how stuff works.

> > This makes no sense; the race that warning warns about is:
> > 
> > 	CPU0			CPU1
> > 	per-cpu write
> > 	<preempt-out>
> > 				<preempt-in>
> > 				do-hypercall
> > 
> > So you wrote the value on CPU0, got migrated to CPU1 because you had
> > preemption enabled, and then continue with the percpu value of CPU1
> > because that's where you're at now.
> 
> This issue was raised internally, and it was noted that the only way
> for the preemptible code to switch task is via an interrupt that goes
> through xen_pv_evtchn_do_upcall(), which handles this.  I'm happy to
> check with my sources, but it's holiday season right now.

Then it should have all sorts of comments on it and a comprehensive
changelog.

> >> 4) Update irqentry_exit_cond_resched() to raw_irqentry_exit_cond_resched().
> >> The code will call irqentry_exit_cond_resched() if the flag (as noted
> >> above) is set, but the dynamic preemption feature will livepatch that
> >> function to a no-op unless full preemption is selected.  The code is
> >> therefore updated to call raw_irqentry_exit_cond_resched().
> > 
> > That, again, needs more explanation. Why do you want this if not
> > preemptible?
> 
> I'm not quite sure what you mean here.  Dynamic preemption
> will livepatch irqentry_exit_cond_resched() to be a no-op, while
> raw_irqentry_exit_cond_resched() remains functional.  This was 
> introduced in commit 4624a14f4daa last year which was said to fix
> the problem, but doesn't.  You may remember, it was signed off by 
> yourself and Mark Rutland.

I don't see the relation; what you're doing is making dynamic preempt
that's not configured for full preempt do preemption. That's weird, and
again no comments.

I'm with Andy in that simply forcing full preemption would make far more
sense -- but I'm still missing something fundamental, see below.

> > You're doing 4 things, that should be 4 patches. Also, please give more
> > clues for how this is supposed to work at all.
> 
> I respectfully have to disagree with that.  The fixes here are very
> closely related, and we're not introducing anything new, we're merely
> re-enabling code which has been rendered ineffective due to oversights
> in commits made after the code was first introduced.  How the code is
> supposed to work hasn't changed, and is beyond the scope of these fixes;
> I'm sure it must have been discussed at great length at the time (commit 
> fdfd811ddde3).

You didn't even so much as reference that commit, nor provide any other
explanation. And having now read that commit, I'm not much enlightened.

*HOW* can a hypercall, something that exits the Guest and has the
Host/Hypervisor run get preempted in the Guest -- that isn't running.

Or are you calling apples pears?
Re: [PATCH] Updates to Xen hypercall preemption
Posted by Per Bilse 10 months, 1 week ago
On 6/21/2023 9:04 PM, Peter Zijlstra wrote:
> On Wed, Jun 21, 2023 at 07:19:21PM +0000, Per Bilse wrote:
>> On 6/21/2023 5:40 PM, Peter Zijlstra wrote:
>>> I don't understand it -- fundamentally, how can linux schedule when the
>>> guest isn't even running? Hypercall transfers control to the
>>> host/hypervisor and leaves the guest suspended.
>>
>> Hi Peter, as noted in earlier note to Andy, this is essentially existing
>> code that other commits have rendered ineffective over time.  Hence,
>> the finer details of how or why it works haven't changed since it was
>> first introduced.
> 
> That doesn't mean you don't have to explain how stuff works.

Hi Peter, first of all I want to say that the vigilance and dedication
you and so many other people show for the kernel project and the
integrity of the code is highly appreciated.  It isn't so that "we"
(whomever that may be) are ignorant of that, and I have personally
enjoyed the freedoms of BSD, and later Linux, systems for longer than
I care to remember.  There is, however, a scope to every patch.

In this case we change CONFIG_PREEMPTION to CONFIG_PREEMPT in two
places; the other (quite minor) changes follow on from that, partly
to avoid a repeat in the future.  The code that is en- or disabled
by this change has already been through all the discussion, reviews,
considerations, and gnashing of teeth that time-honoured traditions
require.  Maybe things were different at the time it was committed,
but that's nothing new -- a few years ago I was fixing IPv6 code
from the end of the last century, and it's entirely unsustainable
to repeat and rehash libraries of old discussions and code simply
because an ifdef is changed.

In all its brutality, the current state of affairs is that the kernel
has bugs that will cause it to crash or otherwise malfunction in
certain circumstances.  There is code present in the kernel that will
prevent that, but this code has become ineffective due to oversights
in commits made since the code was introduced.  We thought it was
in everybody's best interests to ensure this code was re-enabled,
that's all.  I fully support you and everybody else in your desire to
maintain the highest possible standards, and as I said to Andy wrt
his suggestion, I'll see what I can do.  In the meantime, I highly
recommend not shooting the messenger.

Best,

   -- Per

Re: [PATCH] Updates to Xen hypercall preemption
Posted by Juergen Gross 10 months, 1 week ago
On 21.06.23 22:04, Peter Zijlstra wrote:
> On Wed, Jun 21, 2023 at 07:19:21PM +0000, Per Bilse wrote:
>> On 6/21/2023 5:40 PM, Peter Zijlstra wrote:
>>> I don't understand it -- fundamentally, how can linux schedule when the
>>> guest isn't even running? Hypercall transfers control to the
>>> host/hypervisor and leaves the guest suspended.
>>
>> Hi Peter, as noted in earlier note to Andy, this is essentially existing
>> code that other commits have rendered ineffective over time.  Hence,
>> the finer details of how or why it works haven't changed since it was
>> first introduced.
> 
> That doesn't mean you don't have to explain how stuff works.
> 
>>> This makes no sense; the race that warning warns about is:
>>>
>>> 	CPU0			CPU1
>>> 	per-cpu write
>>> 	<preempt-out>
>>> 				<preempt-in>
>>> 				do-hypercall
>>>
>>> So you wrote the value on CPU0, got migrated to CPU1 because you had
>>> preemption enabled, and then continue with the percpu value of CPU1
>>> because that's where you're at now.
>>
>> This issue was raised internally, and it was noted that the only way
>> for the preemptible code to switch task is via an interrupt that goes
>> through xen_pv_evtchn_do_upcall(), which handles this.  I'm happy to
>> check with my sources, but it's holiday season right now.
> 
> Then it should have all sorts of comments on it and a comprehensive
> changelog.
> 
>>>> 4) Update irqentry_exit_cond_resched() to raw_irqentry_exit_cond_resched().
>>>> The code will call irqentry_exit_cond_resched() if the flag (as noted
>>>> above) is set, but the dynamic preemption feature will livepatch that
>>>> function to a no-op unless full preemption is selected.  The code is
>>>> therefore updated to call raw_irqentry_exit_cond_resched().
>>>
>>> That, again, needs more explanation. Why do you want this if not
>>> preemptible?
>>
>> I'm not quite sure what you mean here.  Dynamic preemption
>> will livepatch irqentry_exit_cond_resched() to be a no-op, while
>> raw_irqentry_exit_cond_resched() remains functional.  This was
>> introduced in commit 4624a14f4daa last year which was said to fix
>> the problem, but doesn't.  You may remember, it was signed off by
>> yourself and Mark Rutland.
> 
> I don't see the relation; what you're doing is making dynamic preempt
> that's not configured for full preempt do preemption. That's weird, and
> again no comments.
> 
> I'm with Andy in that simply forcing full preemption would make far more
> sense -- but I'm still missing something fundamental, see below.
> 
>>> You're doing 4 things, that should be 4 patches. Also, please give more
>>> clues for how this is supposed to work at all.
>>
>> I respectfully have to disagree with that.  The fixes here are very
>> closely related, and we're not introducing anything new, we're merely
>> re-enabling code which has been rendered ineffective due to oversights
>> in commits made after the code was first introduced.  How the code is
>> supposed to work hasn't changed, and is beyond the scope of these fixes;
>> I'm sure it must have been discussed at great length at the time (commit
>> fdfd811ddde3).
> 
> You didn't even so much as reference that commit, nor provide any other
> explanation. And having now read that commit, I'm not much enlightened.
> 
> *HOW* can a hypercall, something that exits the Guest and has the
> Host/Hypervisor run get preempted in the Guest -- that isn't running.
> 
> Or are you calling apples pears?

The hypercalls we are talking of are synchronous ones. They are running
in the context of the vcpu doing the call (like a syscall from userland is
running in the process context).

The hypervisor will return to guest context from time to time by modifying
the registers such that the guest will do the hypercall again with different
input values for the hypervisor, resulting in a proper continuation of the
hypercall processing.
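
To illustrate (hypervisor-side pseudocode only, not Xen source; all
names are invented):

        /*
         * Pseudocode: how a long-running hypercall is "continued".  The
         * hypervisor rewinds the guest IP so the hypercall instruction is
         * re-executed on the next entry, with the remaining work encoded
         * in the argument registers.
         */
        static void sketch_create_continuation(struct guest_regs *regs,
                                               unsigned long progress)
        {
                regs->rip -= HYPERCALL_INSN_LEN;  /* re-issue the hypercall insn */
                regs->rdi  = progress;            /* continuation argument */
                /*
                 * Back in the guest, pending interrupts (and with them the
                 * preemption this patch re-enables) can be serviced before
                 * the hypercall is re-entered.
                 */
        }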

It is an awful interface and I agree that switching to full preemption in
dom0 seems to be the route which we should try to take. The downside would
be that some workloads might see worse performance due to backend I/O
handling might get preempted.

Just thinking - can full preemption be enabled per process?


Juergen
Re: [PATCH] Updates to Xen hypercall preemption
Posted by Peter Zijlstra 10 months, 1 week ago
On Thu, Jun 22, 2023 at 07:22:53AM +0200, Juergen Gross wrote:

> The hypercalls we are talking of are synchronous ones. They are running
> in the context of the vcpu doing the call (like a syscall from userland is
> running in the process context).

(so time actually passes from the guest's pov?)

> The hypervisor will return to guest context from time to time by modifying
> the registers such that the guest will do the hypercall again with different
> input values for the hypervisor, resulting in a proper continuation of the
> hypercall processing.

Eeeuw.. that's pretty terrible. And changing this isn't in the cards,
like at all?

That is, why isn't this whole thing written like:

	for (;;) {
		ret = hypercall(foo);
		if (ret == -EAGAIN) {
			cond_resched();
			continue;
		}
		break;
	}

> It is an awful interface and I agree that switching to full preemption in
> dom0 seems to be the route which we should try to take.

Well, I would very strongly suggest the route to take is to scrap the
whole thing and invest in doing something saner so we don't have to jump
through hoops like this.

This is quite possibly the worst possible interface for this Xen could
have come up with -- awards material for sure.

> The downside would be that some workloads might see worse performance
> due to backend I/O handling getting preempted.

Is that an actual concern? Mark this a legacy interface and anybody who
wants to get away from it updates.

> Just thinking - can full preemption be enabled per process?

Nope, that's a system wide thing. Preemption is something that's driven
by the requirements of the tasks that preempt, not something by the
tasks that get preempted.

Andy's idea of having that thing intercepted as an exception (EXTABLE
like) and relocating the IP to a place that does cond_resched() before
going back is an option.. gross, but possibly better, dunno.

Quite the mess indeed :/
Re: [PATCH] Updates to Xen hypercall preemption
Posted by Andrew Cooper 10 months, 1 week ago
On 22/06/2023 9:26 am, Peter Zijlstra wrote:
> On Thu, Jun 22, 2023 at 07:22:53AM +0200, Juergen Gross wrote:
>
>> The hypercalls we are talking of are synchronous ones. They are running
>> in the context of the vcpu doing the call (like a syscall from userland is
>> running in the process context).
> (so time actually passes from the guest's pov?)

Yes.  And in principle it's wired into stolen time.

>> The hypervisor will return to guest context from time to time by modifying
>> the registers such that the guest will do the hypercall again with different
>> input values for the hypervisor, resulting in a proper continuation of the
>> hypercall processing.
> Eeeuw.. that's pretty terrible. And changing this isn't in the cards,
> like at all?
>
> That is, why isn't this whole thing written like:
>
> 	for (;;) {
> 		ret = hypercall(foo);
> 		if (ret == -EAGAIN) {
> 			cond_resched();
> 			continue;
> 		}
> 		break;
> 	}

No, because that would have required the original authors to write a
coherent interface.

As an example, I present "long hypercall_get_dr(int reg);" which returns a
full %dr, or -EINVAL.  But other hypercalls have completely disjoint
API/ABIs so even if you could tell Xen not to subtract %rip to repeat the
SYSCALL/etc instruction, you still don't have any kind of way to spot a
continuation.

Making a new ABI/API which looks like that is specifically on the cards,
in some copious free time.

~Andrew
Re: [PATCH] Updates to Xen hypercall preemption
Posted by Juergen Gross 10 months, 1 week ago
On 22.06.23 15:05, Andrew Cooper wrote:
> On 22/06/2023 9:26 am, Peter Zijlstra wrote:
>> On Thu, Jun 22, 2023 at 07:22:53AM +0200, Juergen Gross wrote:
>>
>>> The hypercalls we are talking of are synchronous ones. They are running
>>> in the context of the vcpu doing the call (like a syscall from userland is
>>> running in the process context).
>> (so time actually passes from the guest's pov?)
> 
> Yes.  And in principle it's wired into stolen time.

Sure? I think stolen time is only increased if the vcpu is being descheduled
by Xen. Synchronous hypercalls should be accounted to the calling vcpu.


Juergen
Re: [PATCH] Updates to Xen hypercall preemption
Posted by Peter Zijlstra 10 months, 1 week ago
On Thu, Jun 22, 2023 at 02:05:13PM +0100, Andrew Cooper wrote:
> On 22/06/2023 9:26 am, Peter Zijlstra wrote:
> > On Thu, Jun 22, 2023 at 07:22:53AM +0200, Juergen Gross wrote:
> >
> >> The hypercalls we are talking of are synchronous ones. They are running
> >> in the context of the vcpu doing the call (like a syscall from userland is
> >> running in the process context).
> > (so time actually passes from the guest's pov?)
> 
> Yes.  And in principle it's wired into stolen time.

Hmm, that means that if the scheduler accounts for stolen time (see
update_rq_clock_task(), PARAVIRT_TIME_ACCOUNTING hunk) then it appears
as if no time has passed, which might affect preemption decisions.
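
The hunk in question, roughly (simplified; see update_rq_clock_task()
in kernel/sched/core.c for the real thing):

        #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
                if (static_key_false(&paravirt_steal_rq_enabled)) {
                        steal = paravirt_steal_clock(cpu_of(rq)) - rq->prev_steal_time_rq;
                        if (unlikely(steal > delta))
                                steal = delta;
                        rq->prev_steal_time_rq += steal;
                        delta -= steal;         /* time reported as stolen never
                                                 * reaches rq->clock_task */
                }
        #endif
                rq->clock_task += delta;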

Oh well, we'll think about it when it's shown to be a problem I suppose
:-)
Re: [PATCH] Updates to Xen hypercall preemption
Posted by Juergen Gross 10 months, 1 week ago
On 22.06.23 10:26, Peter Zijlstra wrote:
> On Thu, Jun 22, 2023 at 07:22:53AM +0200, Juergen Gross wrote:
> 
>> The hypercalls we are talking of are synchronous ones. They are running
>> in the context of the vcpu doing the call (like a syscall from userland is
>> running in the process context).
> 
> (so time actually passes from the guest's pov?)

Correct.

> 
>> The hypervisor will return to guest context from time to time by modifying
>> the registers such that the guest will do the hypercall again with different
>> input values for the hypervisor, resulting in a proper continuation of the
>> hypercall processing.
> 
> Eeeuw.. that's pretty terrible. And changing this isn't in the cards,
> like at all?

In the long run this should be possible, but not for already existing Xen
versions.

> 
> That is, why isn't this whole thing written like:
> 
> 	for (;;) {
> 		ret = hypercall(foo);
> 		if (ret == -EAGAIN) {
> 			cond_resched();
> 			continue;
> 		}
> 		break;
> 	}

The hypervisor doesn't return -EAGAIN for hysterical reasons.

This would be one of the options to change the interface. OTOH there are cases
where already existing hypercalls need to be modified in the hypervisor to do
preemption in the middle due to e.g. security reasons (avoiding cpu hogging in
special cases).

Additionally some of the hypercalls being subject to preemption are allowed in
unprivileged guests, too. Those are mostly hypercalls allowed for PV guests
only, but some are usable by all guests.

> 
>> It is an awful interface and I agree that switching to full preemption in
>> dom0 seems to be the route which we should try to take.
> 
> Well, I would very strongly suggest the route to take is to scrap the
> whole thing and invest in doing something saner so we don't have to jump
> through hoops like this.
> 
> This is quite possibly the worst possible interface for this Xen could
> have come up with -- awards material for sure.

Yes.

> 
>> The downside would be that some workloads might see worse performance
>> due to backend I/O handling getting preempted.
> 
> Is that an actual concern? Mark this a legacy interface and anybody who
> wants to get away from it updates.

It isn't that easy. See above.

> 
>> Just thinking - can full preemption be enabled per process?
> 
> Nope, that's a system wide thing. Preemption is something that's driven
> by the requirements of the tasks that preempt, not something by the
> tasks that get preempted.

Depends. If a task in a non-preempt system could switch itself to be
preemptable, we could do so around hypercalls without compromising the
general preemption setting. Disabling preemption in a preemptable system
should continue to be possible for short code paths only, of course.

> Andy's idea of having that thing intercepted as an exception (EXTABLE
> like) and relocating the IP to a place that does cond_resched() before
> going back is an option.. gross, but possibly better, dunno.
> 
> Quite the mess indeed :/

Yeah.


Juergen
Re: [PATCH] Updates to Xen hypercall preemption
Posted by Andy Lutomirski 10 months, 1 week ago
On Thu, Jun 22, 2023, at 3:33 AM, Juergen Gross wrote:
> On 22.06.23 10:26, Peter Zijlstra wrote:
>> On Thu, Jun 22, 2023 at 07:22:53AM +0200, Juergen Gross wrote:
>> 
>>> The hypercalls we are talking of are synchronous ones. They are running
>>> in the context of the vcpu doing the call (like a syscall from userland is
>>> running in the process context).
>> 
>> (so time actually passes from the guest's pov?)
>
> Correct.
>
>> 
>>> The hypervisor will return to guest context from time to time by modifying
>>> the registers such that the guest will do the hypercall again with different
>>> input values for the hypervisor, resulting in a proper continuation of the
>>> hypercall processing.
>> 
>> Eeeuw.. that's pretty terrible. And changing this isn't in the cards,
>> like at all?
>
> In the long run this should be possible, but not for already existing Xen
> versions.
>
>> 
>> That is, why isn't this whole thing written like:
>> 
>> 	for (;;) {
>> 		ret = hypercall(foo);
>> 		if (ret == -EAGAIN) {
>> 			cond_resched();
>> 			continue;
>> 		}
>> 		break;
>> 	}
>
> The hypervisor doesn't return -EAGAIN for hysterical reasons.
>
> This would be one of the options to change the interface. OTOH there are cases
> where already existing hypercalls need to be modified in the hypervisor to do
> preemption in the middle due to e.g. security reasons (avoiding cpu hogging in
> special cases).
>
> Additionally some of the hypercalls being subject to preemption are allowed in
> unprivileged guests, too. Those are mostly hypercalls allowed for PV guests
> only, but some are usable by all guests.
>
>> 
>>> It is an awful interface and I agree that switching to full preemption in
>>> dom0 seems to be the route which we should try to take.
>> 
>> Well, I would very strongly suggest the route to take is to scrap the
>> whole thing and invest in doing something saner so we don't have to jump
>> through hoops like this.
>> 
>> This is quite possibly the worst possible interface for this Xen could
>> have come up with -- awards material for sure.
>
> Yes.
>
>> 
>>> The downside would be that some workloads might see worse performance
>>> due to backend I/O handling getting preempted.
>> 
>> Is that an actual concern? Mark this a legacy interface and anybody who
>> wants to get away from it updates.
>
> It isn't that easy. See above.
>
>> 
>>> Just thinking - can full preemption be enabled per process?
>> 
>> Nope, that's a system wide thing. Preemption is something that's driven
>> by the requirements of the tasks that preempt, not something by the
>> tasks that get preempted.
>
> Depends. If a task in a non-preempt system could switch itself to be
> preemptable, we could do so around hypercalls without compromising the
> general preemption setting. Disabling preemption in a preemptable system
> should continue to be possible for short code paths only, of course.
>
>> Andy's idea of having that thing intercepted as an exception (EXTABLE
>> like) and relocating the IP to a place that does cond_resched() before
>> going back is an option.. gross, but possibly better, dunno.
>> 
>> Quite the mess indeed :/
>
> Yeah.

Having one implementation of interrupt handlers that schedule when they interrupt kernel code (the normal full preempt path) is one thing.  Having two of them (full preempt and super-special-Xen) is IMO quite a bit worse.  Especially since no one tests the latter very well.

Having a horrible Xen-specific extable-like thingy seems honestly rather less bad.  It could even have a little self-contained test that runs at boot, I bet.

But I'll bite on the performance impact issue.  What, exactly, is wrong with full preemption?  Full preemption has two sources of overhead, I think.  One is a bit of bookkeeping.  The other is the overhead inherent in actually rescheduling -- context switch cost, losing things from cache, etc.

The bookkeeping part should have quite low overhead.  The scheduling part sounds like it might just need some scheduler tuning if it's really a problem.

In any case, for backend IO, full preemption sounds like it should be a win, not a loss.  If I'm asking dom0 to do backend IO for me, I don't want it delayed because dom0 was busy doing something else boring.  IO is faster when the latency between requesting it and actually submitting it to hardware is lower.

Can anyone actually demonstrate full preemption being a loss on a real Xen PV workload?

--Andy
Re: [PATCH] Updates to Xen hypercall preemption
Posted by Juergen Gross 10 months, 1 week ago
On 22.06.23 18:39, Andy Lutomirski wrote:
> On Thu, Jun 22, 2023, at 3:33 AM, Juergen Gross wrote:
>> On 22.06.23 10:26, Peter Zijlstra wrote:
>>> On Thu, Jun 22, 2023 at 07:22:53AM +0200, Juergen Gross wrote:
>>>
>>>> The hypercalls we are talking of are synchronous ones. They are running
>>>> in the context of the vcpu doing the call (like a syscall from userland is
>>>> running in the process context).
>>>
>>> (so time actually passes from the guest's pov?)
>>
>> Correct.
>>
>>>
>>>> The hypervisor will return to guest context from time to time by modifying
>>>> the registers such that the guest will do the hypercall again with different
>>>> input values for the hypervisor, resulting in a proper continuation of the
>>>> hypercall processing.
>>>
>>> Eeeuw.. that's pretty terrible. And changing this isn't in the cards,
>>> like at all?
>>
>> In the long run this should be possible, but not for already existing Xen
>> versions.
>>
>>>
>>> That is, why isn't this whole thing written like:
>>>
>>> 	for (;;) {
>>> 		ret = hypercall(foo);
>>> 		if (ret == -EAGAIN) {
>>> 			cond_resched();
>>> 			continue;
>>> 		}
>>> 		break;
>>> 	}
>>
>> The hypervisor doesn't return -EAGAIN for hysterical reasons.
>>
>> This would be one of the options to change the interface. OTOH there are cases
>> where already existing hypercalls need to be modified in the hypervisor to do
>> preemption in the middle due to e.g. security reasons (avoiding cpu hogging in
>> special cases).
>>
>> Additionally some of the hypercalls being subject to preemption are allowed in
>> unprivileged guests, too. Those are mostly hypercalls allowed for PV guests
>> only, but some are usable by all guests.
>>
>>>
>>>> It is an awful interface and I agree that switching to full preemption in
>>>> dom0 seems to be the route which we should try to take.
>>>
>>> Well, I would very strongly suggest the route to take is to scrap the
>>> whole thing and invest in doing something saner so we don't have to jump
>>> through hoops like this.
>>>
>>> This is quite possibly the worst possible interface for this Xen could
>>> have come up with -- awards material for sure.
>>
>> Yes.
>>
>>>
>>>> The downside would be that some workloads might see worse performance
>>>> due to backend I/O handling getting preempted.
>>>
>>> Is that an actual concern? Mark this a legacy interface and anybody who
>>> wants to get away from it updates.
>>
>> It isn't that easy. See above.
>>
>>>
>>>> Just thinking - can full preemption be enabled per process?
>>>
>>> Nope, that's a system wide thing. Preemption is something that's driven
>>> by the requirements of the tasks that preempt, not something by the
>>> tasks that get preempted.
>>
>> Depends. If a task in a non-preempt system could switch itself to be
>> preemptable, we could do so around hypercalls without compromising the
>> general preemption setting. Disabling preemption in a preemptable system
>> should continue to be possible for short code paths only, of course.
>>
>>> Andy's idea of having that thing intercepted as an exception (EXTABLE
>>> like) and relocating the IP to a place that does cond_resched() before
>>> going back is an option.. gross, but possibly better, dunno.
>>>
>>> Quite the mess indeed :/
>>
>> Yeah.
> 
> Having one implementation of interrupt handlers that schedule when they interrupt kernel code (the normal full preempt path) is one thing.  Having two of them (full preempt and super-special-Xen) is IMO quite a bit worse.  Especially since no one tests the latter very well.
> 
> Having a horrible Xen-specific extable-like thingy seems honestly rather less bad.  It could even have a little self-contained test that runs at boot, I bet.
> 
> But I'll bite on the performance impact issue.  What, exactly, is wrong with full preemption?  Full preemption has two sources of overhead, I think.  One is a bit of bookkeeping.  The other is the overhead inherent in actually rescheduling -- context switch cost, losing things from cache, etc.
> 
> The bookkeeping part should have quite low overhead.  The scheduling part sounds like it might just need some scheduler tuning if it's really a problem.
> 
> In any case, for backend IO, full preemption sounds like it should be a win, not a loss.  If I'm asking dom0 to do backend IO for me, I don't want it delayed because dom0 was busy doing something else boring.  IO is faster when the latency between requesting it and actually submitting it to hardware is lower.

Maybe. I was assuming that full preemption would result in more context
switches, especially in case many guests are hammering dom0 with I/Os.
This means that more time is spent on switching instead of doing real
work, resulting in dom0 hitting 100% cpu sooner while doing less work.

IMHO the reason is similar to the reason why servers tend to be run
without preemption (higher throughput at the expense of higher latency).
Full preemption is preferred for systems being used interactively, like
workstations and laptops, as here latency does matter, as long as the
system isn't limited by cpu power most of the time.

I'm pretty sure Xen installations like in QubesOS will prefer to run the
guests fully preemptive for that very reason.

> Can anyone actually demonstrate full preemption being a loss on a real Xen PV workload?

Should be doable, but I think the above reasoning is pointing in the right
direction already.


Juergen
Re: [PATCH] Updates to Xen hypercall preemption
Posted by Andy Lutomirski 10 months, 1 week ago
On Thu, Jun 22, 2023, at 10:20 AM, Juergen Gross wrote:
> On 22.06.23 18:39, Andy Lutomirski wrote:
>> On Thu, Jun 22, 2023, at 3:33 AM, Juergen Gross wrote:
>>> On 22.06.23 10:26, Peter Zijlstra wrote:
>>>> On Thu, Jun 22, 2023 at 07:22:53AM +0200, Juergen Gross wrote:
>>>>
>>>>> The hypercalls we are talking of are synchronous ones. They are running
>>>>> in the context of the vcpu doing the call (like a syscall from userland is
>>>>> running in the process context).
>>>>
>>>> (so time actually passes from the guest's pov?)
>>>
>>> Correct.
>>>
>>>>
>>>>> The hypervisor will return to guest context from time to time by modifying
>>>>> the registers such that the guest will do the hypercall again with different
>>>>> input values for the hypervisor, resulting in a proper continuation of the
>>>>> hypercall processing.
>>>>
>>>> Eeeuw.. that's pretty terrible. And changing this isn't in the cards,
>>>> like at all?
>>>
>>> In the long run this should be possible, but not for already existing Xen
>>> versions.
>>>
>>>>
>>>> That is, why isn't this whole thing written like:
>>>>
>>>> 	for (;;) {
>>>> 		ret = hypercall(foo);
>>>> 		if (ret == -EAGAIN) {
>>>> 			cond_resched();
>>>> 			continue;
>>>> 		}
>>>> 		break;
>>>> 	}
>>>
>>> The hypervisor doesn't return -EAGAIN for hysterical reasons.
>>>
>>> This would be one of the options to change the interface. OTOH there are cases
>>> where already existing hypercalls need to be modified in the hypervisor to do
>>> preemption in the middle due to e.g. security reasons (avoiding cpu hogging in
>>> special cases).
>>>
>>> Additionally some of the hypercalls being subject to preemption are allowed in
>>> unprivileged guests, too. Those are mostly hypercalls allowed for PV guests
>>> only, but some are usable by all guests.
>>>
>>>>
>>>>> It is an awful interface and I agree that switching to full preemption in
>>>>> dom0 seems to be the route which we should try to take.
>>>>
>>>> Well, I would very strongly suggest the route to take is to scrap the
>>>> whole thing and invest in doing something saner so we don't have to jump
>>>> through hoops like this.
>>>>
>>>> This is quite possibly the worst possible interface for this Xen could
>>>> have come up with -- awards material for sure.
>>>
>>> Yes.
>>>
>>>>
>>>>> The downside would be that some workloads might see worse performance
>>>>> due to backend I/O handling getting preempted.
>>>>
>>>> Is that an actual concern? Mark this a legacy interface and anybody who
>>>> wants to get away from it updates.
>>>
>>> It isn't that easy. See above.
>>>
>>>>
>>>>> Just thinking - can full preemption be enabled per process?
>>>>
>>>> Nope, that's a system wide thing. Preemption is something that's driven
>>>> by the requirements of the tasks that preempt, not something by the
>>>> tasks that get preempted.
>>>
>>> Depends. If a task in a non-preempt system could switch itself to be
>>> preemptable, we could do so around hypercalls without compromising the
>>> general preemption setting. Disabling preemption in a preemptable system
>>> should continue to be possible for short code paths only, of course.
>>>
>>>> Andy's idea of having that thing intercepted as an exception (EXTABLE
>>>> like) and relocating the IP to a place that does cond_resched() before
>>>> going back is an option.. gross, but possibly better, dunno.
>>>>
>>>> Quite the mess indeed :/
>>>
>>> Yeah.
>> 
>> Having one implementation of interrupt handlers that schedule when they interrupt kernel code (the normal full preempt path) is one thing.  Having two of them (full preempt and super-special-Xen) is IMO quite a bit worse.  Especially since no one tests the latter very well.
>> 
>> Having a horrible Xen-specific extable-like thingy seems honestly rather less bad.  It could even have a little self-contained test that runs at boot, I bet.
>> 
>> But I'll bite on the performance impact issue.  What, exactly, is wrong with full preemption?  Full preemption has two sources of overhead, I think.  One is a bit of bookkeeping.  The other is the overhead inherent in actually rescheduling -- context switch cost, losing things from cache, etc.
>> 
>> The bookkeeping part should have quite low overhead.  The scheduling part sounds like it might just need some scheduler tuning if it's really a problem.
>> 
>> In any case, for backend IO, full preemption sounds like it should be a win, not a loss.  If I'm asking dom0 to do backend IO for me, I don't want it delayed because dom0 was busy doing something else boring.  IO is faster when the latency between requesting it and actually submitting it to hardware is lower.
>
> Maybe. I was assuming that full preemption would result in more context
> switches, especially in case many guests are hammering dom0 with I/Os.
> This means that more time is spent on switching instead of doing real
> work, resulting in dom0 hitting 100% cpu sooner while doing less work.

It ought to just result in context switches happening a bit earlier when the scheduler decides it wants one.  When a non-fully-preemptible kernel gets an interrupt and need_resched gets set, it will still schedule as soon as it hits a cond_resched() or a return to usermode or anything else that explicitly allows scheduling.

If you're hammering dom0 with IO and it's getting swamped by context switches, the problem is the code handling the IO (too many threads or something), not the preemption.
Re: [PATCH] Updates to Xen hypercall preemption
Posted by Peter Zijlstra 10 months, 1 week ago
On Thu, Jun 22, 2023 at 12:33:31PM +0200, Juergen Gross wrote:
> On 22.06.23 10:26, Peter Zijlstra wrote:

> > > The downside would be that some workloads might see worse performance
> > > due to backend I/O handling getting preempted.
> > 
> > Is that an actual concern? Mark this a legacy interface and anybody who
> > wants to get away from it updates.
> 
> It isn't that easy. See above.

Well, the old stuff gets to use full preemption on Dom0, then the new
stuff gets more shiny options.

> > > Just thinking - can full preemption be enabled per process?
> > 
> > Nope, that's a system wide thing. Preemption is something that's driven
> > by the requirements of the tasks that preempt, not something by the
> > tasks that get preempted.
> 
> Depends. If a task in a non-preempt system could switch itself to be
> preemptable, we could do so around hypercalls without compromising the
> general preemption setting. Disabling preemption in a preemptable system
> should continue to be possible for short code paths only, of course.

So something along those lines was suggested elsewhere, and I'm still
not entirely sure how I feel about it, but look here:

  https://lkml.kernel.org/r/20230403052233.1880567-1-ankur.a.arora@oracle.com

Specifically patches 7 and 8. It is very close to what you currently
do/want. Those patches are many moons old and I've not seen an update on
them, so I've no idea where they are.

It solves a similar problem except it is 'rep string' instructions
that are being interrupted.
Re: [PATCH] Updates to Xen hypercall preemption
Posted by Juergen Gross 10 months, 1 week ago
On 22.06.23 13:15, Peter Zijlstra wrote:
> On Thu, Jun 22, 2023 at 12:33:31PM +0200, Juergen Gross wrote:
>> On 22.06.23 10:26, Peter Zijlstra wrote:
> 
>>>> The downside would be that some workloads might see worse performance
>>>> due to backend I/O handling getting preempted.
>>>
>>> Is that an actual concern? Mark this a legacy interface and anybody who
>>> wants to get away from it updates.
>>
>> It isn't that easy. See above.
> 
> Well, the old stuff gets to use full preemption on Dom0, then the new
> stuff gets more shiny options.

Yeah, but what about the hypercalls from non-dom0 systems needing the same
handling? This would require running all guests that use such hypercalls
fully preemptively.

> 
>>>> Just thinking - can full preemption be enabled per process?
>>>
>>> Nope, that's a system wide thing. Preemption is something that's driven
>>> by the requirements of the tasks that preempt, not something by the
>>> tasks that get preempted.
>>
>> Depends. If a task in a non-preempt system could switch itself to be
>> preemptable, we could do so around hypercalls without compromising the
>> general preemption setting. Disabling preemption in a preemptable system
>> should continue to be possible for short code paths only, of course.
> 
> So something along those lines was suggested elsewhere, and I'm still
> not entirely sure how I feel about it, but look here:
> 
>    https://lkml.kernel.org/r/20230403052233.1880567-1-ankur.a.arora@oracle.com
> 
> Specifically patches 7 and 8. It is very close to what you currently
> do/want. Those patches are many moons old and I've not seen an update on
> them, so I've no idea where they are.
> 
> It solves a similar problem except it is 'rep string' instructions
> that are being interrupted.

Right. I'll ping Ankur.


Juergen
Re: [PATCH] Updates to Xen hypercall preemption
Posted by Andy Lutomirski 10 months, 1 week ago
On Wed, Jun 21, 2023, at 8:14 AM, Per Bilse wrote:
> Some Xen hypercalls issued by dom0 guests may run for many 10s of
> seconds, potentially causing watchdog timeouts and other problems.
> It's rare for this to happen, but it does in extreme circumstances,
> for instance when shutting down VMs with very large memory allocations
> (> 0.5 - 1TB).  These hypercalls are preemptible, but the fixes in the
> kernel to ensure preemption have fallen into a state of disrepair, and
> are currently ineffective.  This patch brings things up to date by way of:
>
> 1) Update general feature selection from XEN_PV to XEN_DOM0.
> The issue is unique to dom0 Xen guests, but isn't unique to PV dom0s,
> and will occur in future PVH dom0s.  XEN_DOM0 depends on either PV or PVH,
> as well as the appropriate details for dom0.
>
> 2) Update specific feature selection from !PREEMPTION to !PREEMPT.
> The following table shows the relationship between different preemption
> features and their indicators/selectors (Y = "=Y", N = "is not set",
> . = absent):
>
>                             | np-s | np-d | vp-s | vp-d | fp-s | fp-d
>     CONFIG_PREEMPT_DYNAMIC      N      Y      N      Y      N      Y
>          CONFIG_PREEMPTION      .      Y      .      Y      Y      Y
>             CONFIG_PREEMPT      N      N      N      N      Y      Y
>   CONFIG_PREEMPT_VOLUNTARY      N      N      Y      Y      N      N
>        CONFIG_PREEMPT_NONE      Y      Y      N      N      N      N
>
> Unless PREEMPT is set, we need to enable the fixes.

This code is a horrible mess, with and without your patches.  I think that, if this were new, there's no way it would make it into the kernel.

I propose one of two rather radical changes:

1. (preferred) Just delete all of it and make support for dom0 require either full or dynamic preempt, and make a dynamic preempt kernel booting as dom0 run as full preempt.

2. Forget about trying to preempt a hypercall in the sense of scheduling from an interrupt.  Instead teach the interrupt code to detect that it's in a preemptible hypercall and change RIP to a landing pad that does a cond_resched() and then resumes the hypercall.
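
A rough sketch of what 2 could look like (every name below is hypothetical,
and a real version would need an asm landing pad that preserves registers):

        static DEFINE_PER_CPU(unsigned long, hcall_resume_ip);

        /* Landing pad: reschedule, then resume the interrupted hypercall. */
        static void hcall_resched_landing_pad(void)
        {
                cond_resched();
                resume_at(this_cpu_read(hcall_resume_ip)); /* hypothetical jump helper */
        }

        /* On irq exit: divert the return IP if a preemptible hypercall was interrupted. */
        static void maybe_divert_to_landing_pad(struct pt_regs *regs)
        {
                if (!need_resched() || !ip_in_preemptible_hcall(regs->ip))
                        return;
                this_cpu_write(hcall_resume_ip, regs->ip);
                regs->ip = (unsigned long)hcall_resched_landing_pad;
        }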

I don't think the entry code should have a whole special preempt implementation just for this nasty special case.

--Andy
Re: [PATCH] Updates to Xen hypercall preemption
Posted by Per Bilse 10 months, 1 week ago
On 6/21/2023 5:27 PM, Andy Lutomirski wrote:
> This code is a horrible mess, with and without your patches.  I think that, if this were new, there's no way it would make it into the kernel.

Hi Andy, and many thanks for your frank assessments.  Generally, this
is indeed somewhat old code, first introduced in 2015 by way of commit
fdfd811ddde3.  There's more information in the notes to that, and it's
maybe worth noting that we're not trying to introduce anything new,
merely fix what various commits since then have broken.

> I propose one of two rather radical changes:
> 
> 1. (preferred) Just delete all of it and make support for dom0 require either full or dynamic preempt, and make a dynamic preempt kernel booting as dom0 run as full preempt.

Personally I think that's a good idea; a machine so limited in resources
that a fully preemptible dom0 kernel would be a problem wouldn't work as
a Xen server anyway.  Having said that, what to do about this isn't
really in my hands; the issues came to light because the kernel for
Citrix's XenServer product is being upgraded, and it was considered in
everybody's interest to upstream the fixes.  I'll see what I can do.

Best,

   -- Per

Re: [PATCH] Updates to Xen hypercall preemption
Posted by Andy Lutomirski 10 months, 1 week ago

On Wed, Jun 21, 2023, at 12:05 PM, Per Bilse wrote:
> On 6/21/2023 5:27 PM, Andy Lutomirski wrote:
>> This code is a horrible mess, with and without your patches.  I think that, if this were new, there's no way it would make it into the kernel.
>
> Hi Andy, and many thanks for your frank assessments.  Generally, this
> is indeed somewhat old code, first introduced in 2015 by way of commit
> fdfd811ddde3.  There's more information in the notes to that, and it's
> maybe worth noting that we're not trying to introduce anything new,
> merely fix what various commits since then have broken.
>
>> I propose one of two rather radical changes:
>> 
>> 1. (preferred) Just delete all of it and make support for dom0 require either full or dynamic preempt, and make a dynamic preempt kernel booting as dom0 run as full preempt.
>
> Personally I think that's a good idea; a machine so limited in resources
> that a fully preemptible dom0 kernel would be a problem wouldn't work as
> a Xen server anyway.  Having said that, what to do about this isn't
> really in my hands; the issues came to light because the kernel for
> Citrix's XenServer product is being upgraded, and it was considered in
> everybody's interest to upstream the fixes.  I'll see what I can do.

This isn’t actually a resource thing. It’s a distro thing.

Historically, full preempt was a config option only, and distros, for whatever reason, often shipped kernels with full preempt disabled.  (There were probably decent reasons. There may still be decent reasons.)  And Xen needed to work on these kernels.  Hence the mess.

But Linux recently gained the ability to build a kernel that, by default, is not full preempt but can be switched *at boot or runtime* to full preempt. And distros should ship *that* (or a kernel that does have full preempt by default).

So let’s just make Xen PV dom0 depend on this. It wasn’t an option a couple years ago. Now it is.

>
> Best,
>
>    -- Per