[PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads

Seth Forshee (DigitalOcean) posted 2 patches 2 years, 7 months ago
drivers/vhost/vhost.c         |  4 ++++
include/linux/livepatch.h     |  2 ++
kernel/livepatch/transition.c | 11 +++++++++++
3 files changed, 17 insertions(+)
[PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Seth Forshee (DigitalOcean) 2 years, 7 months ago
We've fairly regularly seen livepatches which cannot transition within kpatch's
timeout period due to busy vhost worker kthreads. In looking for a solution the
only answer I found was to call klp_update_patch_state() from a safe location.
I tried adding this call to vhost_worker(), and it works, but it creates the
potential for problems if a livepatch attempts to patch vhost_worker() itself.
Without a call to klp_update_patch_state(), fully loaded vhost kthreads can
never switch because vhost_worker() always appears on their stacks; with the
call, these kthreads can switch but will still be running the old version of
vhost_worker().

To avoid this situation I've added a new function, klp_switch_current(), which
switches the current task only if its stack does not include any function being
patched. This allows kthreads to safely attempt switching themselves if a patch
is pending. There is at least one downside, however: since there's no way for
the kthread to track whether it has already tried to switch for a pending patch,
it can end up calling klp_switch_current() repeatedly when it can never be
safely switched.
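
Roughly, the shape of the change is as follows (an illustrative sketch only,
not the actual patches; klp_try_switch_task() and klp_patch_pending() are
existing livepatch internals, and the real implementation also handles
locking and the no-transition case):

	/* kernel/livepatch/transition.c -- sketch of the new interface */
	void klp_switch_current(void)
	{
		/* Switch current only if no patched function is on its stack. */
		klp_try_switch_task(current);
	}

	/* drivers/vhost/vhost.c, vhost_worker() loop -- sketch of the call site */
	if (unlikely(klp_patch_pending(current)))
		klp_switch_current();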

I don't know whether this is the right solution, and I'm happy to try out other
suggestions. But in my testing these patches proved effective in consistently
switching heavily loaded vhost kthreads almost immediately.

To: Josh Poimboeuf <jpoimboe@kernel.org>
To: Jiri Kosina <jikos@kernel.org>
To: Miroslav Benes <mbenes@suse.cz>
To: Petr Mladek <pmladek@suse.com>
To: Joe Lawrence <joe.lawrence@redhat.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
To: Jason Wang <jasowang@redhat.com>
Cc: live-patching@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: netdev@vger.kernel.org
Signed-off-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>

---
Seth Forshee (DigitalOcean) (2):
      livepatch: add an interface for safely switching kthreads
      vhost: check for pending livepatches from vhost worker kthreads

 drivers/vhost/vhost.c         |  4 ++++
 include/linux/livepatch.h     |  2 ++
 kernel/livepatch/transition.c | 11 +++++++++++
 3 files changed, 17 insertions(+)
---
base-commit: 5dc4c995db9eb45f6373a956eb1f69460e69e6d4
change-id: 20230120-vhost-klp-switching-ba9a3ae38b8a

Best regards,
-- 
Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Petr Mladek 2 years, 7 months ago
On Fri 2023-01-20 16:12:20, Seth Forshee (DigitalOcean) wrote:
> We've fairly regularly seen livepatches which cannot transition within kpatch's
> timeout period due to busy vhost worker kthreads.

I have missed this detail. Miroslav told me that we have solved
something similar some time ago, see
https://lore.kernel.org/all/20220507174628.2086373-1-song@kernel.org/

Honestly, kpatch's 1 minute timeout looks incredibly low to me. Note
that the transition is tried only once per second. It means that there
are "only" 60 attempts.
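
(For reference, the once-per-second retry comes from klp_try_complete_transition()
re-arming the transition delayed work when some tasks could not be switched;
abridged from kernel/livepatch/transition.c:)

	if (!complete) {
		/*
		 * Some tasks could not be switched over.  Retry from the
		 * workqueue in another second.
		 */
		schedule_delayed_work(&klp_transition_work,
				      round_jiffies_relative(HZ));
		return;
	}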

Just by chance, does it help you to increase the timeout, please?

This low timeout might be useful for testing. But in practice, it does
not matter when the transition is lasting one hour or even longer.
It takes much longer time to prepare the livepatch.

Best Regards,
Petr
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Seth Forshee (DigitalOcean) 2 years, 7 months ago
On Thu, Jan 26, 2023 at 06:03:16PM +0100, Petr Mladek wrote:
> On Fri 2023-01-20 16:12:20, Seth Forshee (DigitalOcean) wrote:
> > We've fairly regularly seen livepatches which cannot transition within kpatch's
> > timeout period due to busy vhost worker kthreads.
> 
> I have missed this detail. Miroslav told me that we have solved
> something similar some time ago, see
> https://lore.kernel.org/all/20220507174628.2086373-1-song@kernel.org/

Interesting thread. I had thought about something along the lines of the
original patch, but there are some ideas in there that I hadn't
considered.

> Honestly, kpatch's 1 minute timeout looks incredibly low to me. Note
> that the transition is tried only once per second. It means that there
> are "only" 60 attempts.
> 
> Just by chance, does it help you to increase the timeout, please?

To be honest my test setup reproduces the problem well enough to make
KLP wait significant time due to vhost threads, but it seldom causes it
to hit kpatch's timeout.

Our system management software will try to load a patch tens of times in
a day, and we've seen real-world cases where patches couldn't load
within kpatch's timeout for multiple days. But I don't have such an
environment readily accessible for my own testing. I can try to refine
my test case and see if I can get it to that point.

> This low timeout might be useful for testing. But in practice, it does
> not matter when the transition is lasting one hour or even longer.
> It takes much longer time to prepare the livepatch.

Agreed. And to be clear, we cope with the fact that patches may take
hours or even days to get applied in some cases. The patches I sent are
just about improving the only case I've identified which has led to
kpatch failing to load a patch for a day or longer.

Thanks,
Seth
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Petr Mladek 2 years, 7 months ago
On Thu 2023-01-26 15:12:35, Seth Forshee (DigitalOcean) wrote:
> On Thu, Jan 26, 2023 at 06:03:16PM +0100, Petr Mladek wrote:
> > On Fri 2023-01-20 16:12:20, Seth Forshee (DigitalOcean) wrote:
> > > We've fairly regularly seen livepatches which cannot transition within kpatch's
> > > timeout period due to busy vhost worker kthreads.
> > 
> > I have missed this detail. Miroslav told me that we have solved
> > something similar some time ago, see
> > https://lore.kernel.org/all/20220507174628.2086373-1-song@kernel.org/
> 
> Interesting thread. I had thought about something along the lines of the
> original patch, but there are some ideas in there that I hadn't
> considered.

Could you please provide some more details about the test system?
Is there anything important to make it reproducible?

The following aspects come to my mind. It might require:

   + more workers running on the same system
   + have a dedicated CPU for the worker
   + livepatching the function called by work->fn()
   + running the same work again and again
   + huge and overloaded system


> > Honestly, kpatch's 1 minute timeout looks incredibly low to me. Note
> > that the transition is tried only once per second. It means that there
> > are "only" 60 attempts.
> > 
> > Just by chance, does it help you to increase the timeout, please?
> 
> To be honest my test setup reproduces the problem well enough to make
> KLP wait significant time due to vhost threads, but it seldom causes it
> to hit kpatch's timeout.
> 
> Our system management software will try to load a patch tens of times in
> a day, and we've seen real-world cases where patches couldn't load
> within kpatch's timeout for multiple days. But I don't have such an
> environment readily accessible for my own testing. I can try to refine
> my test case and see if I can get it to that point.

My understanding is that you try to load the patch repeatedly but
it always fails after the 1 minute timeout. It means that it always
starts from the beginning (no livepatched process).

Is there any chance to try it with a longer timeout, for example, one
hour? It should increase the chance if there are more problematic kthreads.

> > This low timeout might be useful for testing. But in practice, it does
> > not matter when the transition is lasting one hour or even longer.
> > It takes much longer time to prepare the livepatch.
> 
> Agreed. And to be clear, we cope with the fact that patches may take
> hours or even days to get applied in some cases. The patches I sent are
> just about improving the only case I've identified which has led to
> kpatch failing to load a patch for a day or longer.

If it is acceptable to wait hours or even days then the 1 minute
timeout is quite counterproductive. We actually do not use any timeout
at all in livepatches provided by SUSE.

Best Regards,
Petr
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Seth Forshee 2 years, 7 months ago
On Fri, Jan 27, 2023 at 12:19:03PM +0100, Petr Mladek wrote:
> On Thu 2023-01-26 15:12:35, Seth Forshee (DigitalOcean) wrote:
> > On Thu, Jan 26, 2023 at 06:03:16PM +0100, Petr Mladek wrote:
> > > On Fri 2023-01-20 16:12:20, Seth Forshee (DigitalOcean) wrote:
> > > > We've fairly regularly seen livepatches which cannot transition within kpatch's
> > > > timeout period due to busy vhost worker kthreads.
> > > 
> > > I have missed this detail. Miroslav told me that we have solved
> > > something similar some time ago, see
> > > https://lore.kernel.org/all/20220507174628.2086373-1-song@kernel.org/
> > 
> > Interesting thread. I had thought about something along the lines of the
> > original patch, but there are some ideas in there that I hadn't
> > considered.
> 
> Could you please provide some more details about the test system?
> Is there anything important to make it reproducible?
> 
> The following aspects come to my mind. It might require:
> 
>    + more workers running on the same system
>    + have a dedicated CPU for the worker
>    + livepatching the function called by work->fn()
>    + running the same work again and again
>    + huge and overloaded system

I'm isolating a CPU, starting a KVM guest with a virtio-net device, and
setting the affinity of the vhost worker thread to only the isolated
CPU. Thus the vhost-worker thread has a dedicated CPU, as you say. (I'll
note that in real-world cases the systems have many CPUs, and while the
vhost threads aren't each given a dedicated CPU, if the system load is
light enough a thread can end up with exclusive use of a CPU).

Then all I do is run iperf between the guest and the host with several
parallel streams. I seem to be hitting the limits of the guest vCPUs
before the vhost thread is fully saturated, as this gets it to about 90%
CPU utilization by the vhost thread.

> > > Honestly, kpatch's 1 minute timeout looks incredibly low to me. Note
> > > that the transition is tried only once per second. It means that there
> > > are "only" 60 attempts.
> > > 
> > > Just by chance, does it help you to increase the timeout, please?
> > 
> > To be honest my test setup reproduces the problem well enough to make
> > KLP wait significant time due to vhost threads, but it seldom causes it
> > to hit kpatch's timeout.
> > 
> > Our system management software will try to load a patch tens of times in
> > a day, and we've seen real-world cases where patches couldn't load
> > within kpatch's timeout for multiple days. But I don't have such an
> > environment readily accessible for my own testing. I can try to refine
> > my test case and see if I can get it to that point.
> 
> My understanding is that you try to load the patch repeatedly but
> it always fails after the 1 minute timeout. It means that it always
> starts from the beginning (no livepatched process).
> 
> Is there any chance to try it with a longer timeout, for example, one
> hour? It should increase the chance if there are more problematic kthreads.

Yes, I can try it. But I think I already mentioned that we are somewhat
limited by our system management software and how livepatch loading is
currently implemented there. I'd need to consult with others about how
long we could make the timeout, but 1 hour is definitely too long under
our current system.

> > > This low timeout might be useful for testing. But in practice, it does
> > > not matter when the transition is lasting one hour or even longer.
> > > It takes much longer time to prepare the livepatch.
> > 
> > Agreed. And to be clear, we cope with the fact that patches may take
> > hours or even days to get applied in some cases. The patches I sent are
> > just about improving the only case I've identified which has led to
> > kpatch failing to load a patch for a day or longer.
> 
> If it is acceptable to wait hours or even days then the 1 minute
> timeout is quite counterproductive. We actually do not use any timeout
> at all in livepatches provided by SUSE.

I agree, though I'd still prefer it didn't take days. Based on this
discussion I do plan to look at changing how we load livepatches to make
this possible, but it will take some time.

Thanks,
Seth
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Petr Mladek 2 years, 7 months ago
On Fri 2023-01-27 08:57:40, Seth Forshee wrote:
> On Fri, Jan 27, 2023 at 12:19:03PM +0100, Petr Mladek wrote:
> > Could you please provide some more details about the test system?
> > Is there anything important to make it reproducible?
> > 
> > The following aspects come to my mind. It might require:
> > 
> >    + more workers running on the same system
> >    + have a dedicated CPU for the worker
> >    + livepatching the function called by work->fn()
> >    + running the same work again and again
> >    + huge and overloaded system
> 
> I'm isolating a CPU, starting a KVM guest with a virtio-net device, and
> setting the affinity of the vhost worker thread to only the isolated
> CPU. Thus the vhost-worker thread has a dedicated CPU, as you say. (I'll
> note that in real-world cases the systems have many CPUs, and while the
> vhost threads aren't each given a dedicated CPU, if the system load is
> light enough a thread can end up with exclusive use of a CPU).
> 
> Then all I do is run iperf between the guest and the host with several
> parallel streams. I seem to be hitting the limits of the guest vCPUs
> before the vhost thread is fully saturated, as this gets it to about 90%
> CPU utilization by the vhost thread.

Thanks for the info!

> > > > Honestly, kpatch's 1 minute timeout looks incredibly low to me. Note
> > > > that the transition is tried only once per second. It means that there
> > > > are "only" 60 attempts.
> > > > 
> > > > Just by chance, does it help you to increase the timeout, please?
> > > 
> > > To be honest my test setup reproduces the problem well enough to make
> > > KLP wait significant time due to vhost threads, but it seldom causes it
> > > to hit kpatch's timeout.
> > > 
> > > Our system management software will try to load a patch tens of times in
> > > a day, and we've seen real-world cases where patches couldn't load
> > > within kpatch's timeout for multiple days. But I don't have such an
> > > environment readily accessible for my own testing. I can try to refine
> > > my test case and see if I can get it to that point.
> > 
> > My understanding is that you try to load the patch repeatedly but
> > it always fails after the 1 minute timeout. It means that it always
> > starts from the beginning (no livepatched process).
> > 
> > Is there any chance to try it with a longer timeout, for example, one
> > hour? It should increase the chance if there are more problematic kthreads.
> 
> Yes, I can try it. But I think I already mentioned that we are somewhat
> limited by our system management software and how livepatch loading is
> currently implemented there. I'd need to consult with others about how
> long we could make the timeout, but 1 hour is definitely too long under
> our current system.

Another possibility is not to wait at all. SUSE livepatch packages load
the livepatch module, remove no longer used livepatch modules, and are
done with it.

Note that the module is loaded quickly. The transition is finished
asynchronously using workqueues.

Of course, there is a risk that the transition will never finish.
It would prevent loading any newer livepatch. But it might be handled
when the newer livepatch is loaded. It might revert the pending
transition, ...

Of course, it would be great to make the transition more reliable.
It would be nice to add the hook into the scheduler as discussed
in another branch of this thread. But it might bring other problems,
for example, it might affect system performance. Well, it probably can
be optimized or ratelimited.

Anyway, I wanted to say that there is a way to get rid of the timeout
completely.

Best Regards,
Petr
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Josh Poimboeuf 2 years, 7 months ago
On Thu, Jan 26, 2023 at 03:12:35PM -0600, Seth Forshee (DigitalOcean) wrote:
> On Thu, Jan 26, 2023 at 06:03:16PM +0100, Petr Mladek wrote:
> > On Fri 2023-01-20 16:12:20, Seth Forshee (DigitalOcean) wrote:
> > > We've fairly regularly seen livepatches which cannot transition within kpatch's
> > > timeout period due to busy vhost worker kthreads.
> > 
> > I have missed this detail. Miroslav told me that we have solved
> > something similar some time ago, see
> > https://lore.kernel.org/all/20220507174628.2086373-1-song@kernel.org/
> 
> Interesting thread. I had thought about something along the lines of the
> original patch, but there are some ideas in there that I hadn't
> considered.

Here's another idea, have we considered this?  Have livepatch set
TIF_NEED_RESCHED on all kthreads to force them into schedule(), and then
have the scheduler call klp_try_switch_task() if TIF_PATCH_PENDING is
set.

Not sure how scheduler folks would feel about that ;-)

-- 
Josh
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Peter Zijlstra 2 years, 7 months ago
On Thu, Jan 26, 2023 at 08:43:55PM -0800, Josh Poimboeuf wrote:
> On Thu, Jan 26, 2023 at 03:12:35PM -0600, Seth Forshee (DigitalOcean) wrote:
> > On Thu, Jan 26, 2023 at 06:03:16PM +0100, Petr Mladek wrote:
> > > On Fri 2023-01-20 16:12:20, Seth Forshee (DigitalOcean) wrote:
> > > > We've fairly regularly seen livepatches which cannot transition within kpatch's
> > > > timeout period due to busy vhost worker kthreads.
> > > 
> > > I have missed this detail. Miroslav told me that we have solved
> > > something similar some time ago, see
> > > https://lore.kernel.org/all/20220507174628.2086373-1-song@kernel.org/
> > 
> > Interesting thread. I had thought about something along the lines of the
> > original patch, but there are some ideas in there that I hadn't
> > considered.
> 
> Here's another idea, have we considered this?  Have livepatch set
> TIF_NEED_RESCHED on all kthreads to force them into schedule(), and then
> have the scheduler call klp_try_switch_task() if TIF_PATCH_PENDING is
> set.
> 
> Not sure how scheduler folks would feel about that ;-)

So, let me try and page all that back in.... :-)

KLP needs to unwind the stack to see if any of the patched functions are
active, if not, flip task to new set.

Unwinding the stack of a task can be done when:

 - task is inactive (stable reg and stack) -- provided it stays inactive
   while unwinding etc..

 - task is current (guarantees stack doesn't dip below where we started
   due to being busy on top etc..)

Can NOT be done from interrupt context, because can hit in the middle of
setting up stack frames etc..

The issue at hand is that some tasks run for a long time without passing
through an explicit check.

The thread above tried sticking something in cond_resched() which is a
problem for PREEMPT=y since cond_resched() is a no-op.
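
(For context, with CONFIG_PREEMPTION=y and no PREEMPT_DYNAMIC the stub in
include/linux/sched.h is simply, abridged:)

	/* Nothing to hook into on a fully preemptible kernel. */
	static inline int _cond_resched(void) { return 0; }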

Preempt notifiers were raised, and those would actually be nice, except
you can only install a notifier on current and you need some memory
allocated per task, which makes it less than ideal. Plus ...

... putting something in finish_task_switch() wouldn't be the end of the
world I suppose, but then you still need to force schedule the task --
imagine it being the only runnable task on the CPU, there's nothing
going to make it actually switch.

Which then leads me to suggest something daft like this.. does that
help?


diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index f1b25ec581e0..06746095a724 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -9,6 +9,7 @@
 
 #include <linux/cpu.h>
 #include <linux/stacktrace.h>
+#include <linux/stop_machine.h>
 #include "core.h"
 #include "patch.h"
 #include "transition.h"
@@ -334,6 +335,16 @@ static bool klp_try_switch_task(struct task_struct *task)
 	return !ret;
 }
 
+static int __stop_try_switch(void *arg)
+{
+	return klp_try_switch_task(arg) ? 0 : -EBUSY;
+}
+
+static bool klp_try_switch_task_harder(struct task_struct *task)
+{
+	return !stop_one_cpu(task_cpu(task), __stop_try_switch, task);
+}
+
 /*
  * Sends a fake signal to all non-kthread tasks with TIF_PATCH_PENDING set.
  * Kthreads with TIF_PATCH_PENDING set are woken up.
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Seth Forshee 2 years, 7 months ago
On Fri, Jan 27, 2023 at 11:37:02AM +0100, Peter Zijlstra wrote:
> On Thu, Jan 26, 2023 at 08:43:55PM -0800, Josh Poimboeuf wrote:
> > On Thu, Jan 26, 2023 at 03:12:35PM -0600, Seth Forshee (DigitalOcean) wrote:
> > > On Thu, Jan 26, 2023 at 06:03:16PM +0100, Petr Mladek wrote:
> > > > On Fri 2023-01-20 16:12:20, Seth Forshee (DigitalOcean) wrote:
> > > > > We've fairly regularly seen livepatches which cannot transition within kpatch's
> > > > > timeout period due to busy vhost worker kthreads.
> > > > 
> > > > I have missed this detail. Miroslav told me that we have solved
> > > > something similar some time ago, see
> > > > https://lore.kernel.org/all/20220507174628.2086373-1-song@kernel.org/
> > > 
> > > Interesting thread. I had thought about something along the lines of the
> > > original patch, but there are some ideas in there that I hadn't
> > > considered.
> > 
> > Here's another idea, have we considered this?  Have livepatch set
> > TIF_NEED_RESCHED on all kthreads to force them into schedule(), and then
> > have the scheduler call klp_try_switch_task() if TIF_PATCH_PENDING is
> > set.
> > 
> > Not sure how scheduler folks would feel about that ;-)
> 
> So, let me try and page all that back in.... :-)
> 
> KLP needs to unwind the stack to see if any of the patched functions are
> active, if not, flip task to new set.
> 
> Unwinding the stack of a task can be done when:
> 
>  - task is inactive (stable reg and stack) -- provided it stays inactive
>    while unwinding etc..
> 
>  - task is current (guarantees stack doesn't dip below where we started
>    due to being busy on top etc..)
> 
> Can NOT be done from interrupt context, because can hit in the middle of
> setting up stack frames etc..
> 
> The issue at hand is that some tasks run for a long time without passing
> through an explicit check.
> 
> The thread above tried sticking something in cond_resched() which is a
> problem for PREEMPT=y since cond_resched() is a no-op.
> 
> Preempt notifiers were raised, and those would actually be nice, except
> you can only install a notifier on current and you need some memory
> allocated per task, which makes it less than ideal. Plus ...
> 
> ... putting something in finish_task_switch() wouldn't be the end of the
> world I suppose, but then you still need to force schedule the task --
> imagine it being the only runnable task on the CPU, there's nothing
> going to make it actually switch.
> 
> Which then leads me to suggest something daft like this.. does that
> help?

The changes below are working well for me. The context has a read lock
on tasklist_lock so it can't sleep, so I'm using stop_one_cpu_nowait()
with per-CPU state.

Based on Josh's comments it sounds like the klp_have_reliable_stack()
check probably isn't quite right, and we might want to add something
else for PREEMPT+!ORC. But I wanted to go ahead and see if this approach
seems reasonable to everyone.

Thanks,
Seth


diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index f1b25ec581e0..9f3898f02828 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -9,6 +9,7 @@
 
 #include <linux/cpu.h>
 #include <linux/stacktrace.h>
+#include <linux/stop_machine.h>
 #include "core.h"
 #include "patch.h"
 #include "transition.h"
@@ -334,9 +335,16 @@ static bool klp_try_switch_task(struct task_struct *task)
 	return !ret;
 }
 
+static int __try_switch_kthread(void *arg)
+{
+	return klp_try_switch_task(arg) ? 0 : -EBUSY;
+}
+
+static DEFINE_PER_CPU(struct cpu_stop_work, klp_stop_work);
+
 /*
  * Sends a fake signal to all non-kthread tasks with TIF_PATCH_PENDING set.
- * Kthreads with TIF_PATCH_PENDING set are woken up.
+ * Kthreads with TIF_PATCH_PENDING set are preempted or woken up.
  */
 static void klp_send_signals(void)
 {
@@ -357,11 +365,22 @@ static void klp_send_signals(void)
 		 * would be meaningless. It is not serious though.
 		 */
 		if (task->flags & PF_KTHREAD) {
-			/*
-			 * Wake up a kthread which sleeps interruptedly and
-			 * still has not been migrated.
-			 */
-			wake_up_state(task, TASK_INTERRUPTIBLE);
+			if (task_curr(task) && klp_have_reliable_stack()) {
+				/*
+				 * kthread is currently running on a CPU; try
+				 * to preempt it.
+				 */
+				stop_one_cpu_nowait(task_cpu(task),
+						    __try_switch_kthread,
+						    task,
+						    this_cpu_ptr(&klp_stop_work));
+			} else {
+				/*
+				 * Wake up a kthread which sleeps interruptedly
+				 * and still has not been migrated.
+				 */
+				wake_up_state(task, TASK_INTERRUPTIBLE);
+			}
 		} else {
 			/*
 			 * Send fake signal to all non-kthread tasks which are
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Josh Poimboeuf 2 years, 7 months ago
On Fri, Jan 27, 2023 at 11:37:02AM +0100, Peter Zijlstra wrote:
> On Thu, Jan 26, 2023 at 08:43:55PM -0800, Josh Poimboeuf wrote:
> > Here's another idea, have we considered this?  Have livepatch set
> > TIF_NEED_RESCHED on all kthreads to force them into schedule(), and then
> > have the scheduler call klp_try_switch_task() if TIF_PATCH_PENDING is
> > set.
> > 
> > Not sure how scheduler folks would feel about that ;-)

Hmmmm, with preemption I guess the above doesn't work for kthreads
calling cond_resched() instead of what vhost_worker() does (explicit
need_resched/schedule).
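
(For reference, the vhost_worker() loop in question looks roughly like this
today; abridged from drivers/vhost/vhost.c:)

	llist_for_each_entry_safe(work, work_next, node, node) {
		clear_bit(VHOST_WORK_QUEUED, &work->flags);
		__set_current_state(TASK_RUNNING);
		kcov_remote_start_common(dev->kcov_handle);
		work->fn(work);
		kcov_remote_stop();
		/* explicit reschedule check rather than cond_resched() */
		if (need_resched())
			schedule();
	}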

> diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> index f1b25ec581e0..06746095a724 100644
> --- a/kernel/livepatch/transition.c
> +++ b/kernel/livepatch/transition.c
> @@ -9,6 +9,7 @@
>  
>  #include <linux/cpu.h>
>  #include <linux/stacktrace.h>
> +#include <linux/stop_machine.h>
>  #include "core.h"
>  #include "patch.h"
>  #include "transition.h"
> @@ -334,6 +335,16 @@ static bool klp_try_switch_task(struct task_struct *task)
>  	return !ret;
>  }
>  
> +static int __stop_try_switch(void *arg)
> +{
> +	return klp_try_switch_task(arg) ? 0 : -EBUSY;
> +}
> +
> +static bool klp_try_switch_task_harder(struct task_struct *task)
> +{
> +	return !stop_one_cpu(task_cpu(task), __stop_try_switch, task);
> +}
> +
>  /*
>   * Sends a fake signal to all non-kthread tasks with TIF_PATCH_PENDING set.
>   * Kthreads with TIF_PATCH_PENDING set are woken up.

Doesn't work for PREEMPT+!ORC.  Non-ORC reliable unwinders will detect
preemption on the stack and automatically report unreliable.

-- 
Josh
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Josh Poimboeuf 2 years, 7 months ago
On Fri, Jan 27, 2023 at 08:52:38AM -0800, Josh Poimboeuf wrote:
> On Fri, Jan 27, 2023 at 11:37:02AM +0100, Peter Zijlstra wrote:
> > On Thu, Jan 26, 2023 at 08:43:55PM -0800, Josh Poimboeuf wrote:
> > > Here's another idea, have we considered this?  Have livepatch set
> > > TIF_NEED_RESCHED on all kthreads to force them into schedule(), and then
> > > have the scheduler call klp_try_switch_task() if TIF_PATCH_PENDING is
> > > set.
> > > 
> > > Not sure how scheduler folks would feel about that ;-)
> 
> Hmmmm, with preemption I guess the above doesn't work for kthreads
> calling cond_resched() instead of what vhost_worker() does (explicit
> need_resched/schedule).

Though I guess we could hook into cond_resched() too if we make it a
non-NOP for PREEMPT+LIVEPATCH?

-- 
Josh
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Josh Poimboeuf 2 years, 7 months ago
On Fri, Jan 27, 2023 at 09:09:48AM -0800, Josh Poimboeuf wrote:
> On Fri, Jan 27, 2023 at 08:52:38AM -0800, Josh Poimboeuf wrote:
> > On Fri, Jan 27, 2023 at 11:37:02AM +0100, Peter Zijlstra wrote:
> > > On Thu, Jan 26, 2023 at 08:43:55PM -0800, Josh Poimboeuf wrote:
> > > > Here's another idea, have we considered this?  Have livepatch set
> > > > TIF_NEED_RESCHED on all kthreads to force them into schedule(), and then
> > > > have the scheduler call klp_try_switch_task() if TIF_PATCH_PENDING is
> > > > set.
> > > > 
> > > > Not sure how scheduler folks would feel about that ;-)
> > 
> > Hmmmm, with preemption I guess the above doesn't work for kthreads
> > calling cond_resched() instead of what vhost_worker() does (explicit
> > need_resched/schedule).
> 
> Though I guess we could hook into cond_resched() too if we make it a
> non-NOP for PREEMPT+LIVEPATCH?

I discussed this idea with Peter on IRC and he didn't immediately shoot
it down.

It compiles...

Thoughts?

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 293e29960c6e..937816d0867c 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -14,6 +14,8 @@
 #include <linux/completion.h>
 #include <linux/list.h>
 
+#include <linux/livepatch_sched.h>
+
 #if IS_ENABLED(CONFIG_LIVEPATCH)
 
 /* task patch states */
diff --git a/include/linux/livepatch_sched.h b/include/linux/livepatch_sched.h
new file mode 100644
index 000000000000..3237bc6a5b01
--- /dev/null
+++ b/include/linux/livepatch_sched.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _LINUX_LIVEPATCH_SCHED_H_
+#define _LINUX_LIVEPATCH_SCHED_H_
+
+#include <linux/static_call_types.h>
+
+#ifdef CONFIG_LIVEPATCH
+
+void __klp_sched_try_switch(void);
+DECLARE_STATIC_CALL(klp_sched_try_switch, __klp_sched_try_switch);
+
+static __always_inline void klp_sched_try_switch(void)
+{
+	//FIXME need static_call_cond_mod() ?
+	static_call_mod(klp_sched_try_switch)();
+}
+
+#else /* !CONFIG_LIVEPATCH */
+static inline void klp_sched_try_switch(void) {}
+#endif /* CONFIG_LIVEPATCH */
+
+#endif /* _LINUX_LIVEPATCH_SCHED_H_ */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4df2b3e76b30..fbcd3acca25c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -36,6 +36,7 @@
 #include <linux/seqlock.h>
 #include <linux/kcsan.h>
 #include <linux/rv.h>
+#include <linux/livepatch_sched.h>
 #include <asm/kmap_size.h>
 
 /* task_struct member predeclarations (sorted alphabetically): */
@@ -2074,6 +2075,9 @@ DECLARE_STATIC_CALL(cond_resched, __cond_resched);
 
 static __always_inline int _cond_resched(void)
 {
+	//FIXME this is a bit redundant with preemption disabled
+	klp_sched_try_switch();
+
 	return static_call_mod(cond_resched)();
 }
 
diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index f1b25ec581e0..042e34c9389c 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -9,6 +9,7 @@
 
 #include <linux/cpu.h>
 #include <linux/stacktrace.h>
+#include <linux/static_call.h>
 #include "core.h"
 #include "patch.h"
 #include "transition.h"
@@ -24,6 +25,9 @@ static int klp_target_state = KLP_UNDEFINED;
 
 static unsigned int klp_signals_cnt;
 
+DEFINE_STATIC_CALL_NULL(klp_sched_try_switch, __klp_sched_try_switch);
+EXPORT_STATIC_CALL_TRAMP(klp_sched_try_switch);
+
 /*
  * This work can be performed periodically to finish patching or unpatching any
  * "straggler" tasks which failed to transition in the first attempt.
@@ -76,6 +80,8 @@ static void klp_complete_transition(void)
 		 klp_transition_patch->mod->name,
 		 klp_target_state == KLP_PATCHED ? "patching" : "unpatching");
 
+	static_call_update(klp_sched_try_switch, NULL);
+
 	if (klp_transition_patch->replace && klp_target_state == KLP_PATCHED) {
 		klp_unpatch_replaced_patches(klp_transition_patch);
 		klp_discard_nops(klp_transition_patch);
@@ -256,7 +262,8 @@ static int klp_check_stack(struct task_struct *task, const char **oldname)
 		klp_for_each_func(obj, func) {
 			ret = klp_check_stack_func(func, entries, nr_entries);
 			if (ret) {
-				*oldname = func->old_name;
+				if (oldname)
+					*oldname = func->old_name;
 				return -EADDRINUSE;
 			}
 		}
@@ -307,7 +314,11 @@ static bool klp_try_switch_task(struct task_struct *task)
 	 * functions.  If all goes well, switch the task to the target patch
 	 * state.
 	 */
-	ret = task_call_func(task, klp_check_and_switch_task, &old_name);
+	if (task == current)
+		ret = klp_check_and_switch_task(current, &old_name);
+	else
+		ret = task_call_func(task, klp_check_and_switch_task, &old_name);
+
 	switch (ret) {
 	case 0:		/* success */
 		break;
@@ -334,6 +345,15 @@ static bool klp_try_switch_task(struct task_struct *task)
 	return !ret;
 }
 
+void __klp_sched_try_switch(void)
+{
+	if (likely(!klp_patch_pending(current)))
+		return;
+
+	//FIXME locking
+	klp_try_switch_task(current);
+}
+
 /*
  * Sends a fake signal to all non-kthread tasks with TIF_PATCH_PENDING set.
  * Kthreads with TIF_PATCH_PENDING set are woken up.
@@ -401,8 +421,10 @@ void klp_try_complete_transition(void)
 	 */
 	read_lock(&tasklist_lock);
 	for_each_process_thread(g, task)
-		if (!klp_try_switch_task(task))
+		if (!klp_try_switch_task(task)) {
+			set_tsk_need_resched(task);
 			complete = false;
+		}
 	read_unlock(&tasklist_lock);
 
 	/*
@@ -413,6 +435,7 @@ void klp_try_complete_transition(void)
 		task = idle_task(cpu);
 		if (cpu_online(cpu)) {
 			if (!klp_try_switch_task(task)) {
+				set_tsk_need_resched(task);
 				complete = false;
 				/* Make idle task go through the main loop. */
 				wake_up_if_idle(cpu);
@@ -492,6 +515,8 @@ void klp_start_transition(void)
 			set_tsk_thread_flag(task, TIF_PATCH_PENDING);
 	}
 
+	static_call_update(klp_sched_try_switch, __klp_sched_try_switch);
+
 	klp_signals_cnt = 0;
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3a0ef2fefbd5..01e32d242ef6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6506,6 +6506,8 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 	struct rq *rq;
 	int cpu;
 
+	klp_sched_try_switch();
+
 	cpu = smp_processor_id();
 	rq = cpu_rq(cpu);
 	prev = rq->curr;
@@ -8500,8 +8502,10 @@ EXPORT_STATIC_CALL_TRAMP(might_resched);
 static DEFINE_STATIC_KEY_FALSE(sk_dynamic_cond_resched);
 int __sched dynamic_cond_resched(void)
 {
-	if (!static_branch_unlikely(&sk_dynamic_cond_resched))
+	if (!static_branch_unlikely(&sk_dynamic_cond_resched)) {
+		klp_sched_try_switch();
 		return 0;
+	}
 	return __cond_resched();
 }
 EXPORT_SYMBOL(dynamic_cond_resched);
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index e9ef66be2870..27ba93930584 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -308,9 +308,6 @@ static void do_idle(void)
 	 */
 	flush_smp_call_function_queue();
 	schedule_idle();
-
-	if (unlikely(klp_patch_pending(current)))
-		klp_update_patch_state(current);
 }
 
 bool cpu_in_idle(unsigned long pc)
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Peter Zijlstra 2 years, 7 months ago
On Fri, Jan 27, 2023 at 02:11:31PM -0800, Josh Poimboeuf wrote:


> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 4df2b3e76b30..fbcd3acca25c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -36,6 +36,7 @@
>  #include <linux/seqlock.h>
>  #include <linux/kcsan.h>
>  #include <linux/rv.h>
> +#include <linux/livepatch_sched.h>
>  #include <asm/kmap_size.h>
>  
>  /* task_struct member predeclarations (sorted alphabetically): */
> @@ -2074,6 +2075,9 @@ DECLARE_STATIC_CALL(cond_resched, __cond_resched);
>  
>  static __always_inline int _cond_resched(void)
>  {
> +	//FIXME this is a bit redundant with preemption disabled
> +	klp_sched_try_switch();
> +
>  	return static_call_mod(cond_resched)();
>  }

Right, I was thinking you'd do something like:

	static_call_update(cond_resched, klp_cond_resched);

With:

static int klp_cond_resched(void)
{
	klp_try_switch_task(current);
	return __cond_resched();
}

That would force cond_resched() into doing the transition thing,
irrespective of the preemption mode at hand. And then, when KLP is done,
re-run sched_dynamic_update() to reset it to whatever it ought to be.

> @@ -401,8 +421,10 @@ void klp_try_complete_transition(void)
>  	 */
>  	read_lock(&tasklist_lock);
>  	for_each_process_thread(g, task)
> -		if (!klp_try_switch_task(task))
> +		if (!klp_try_switch_task(task)) {
> +			set_tsk_need_resched(task);
>  			complete = false;
> +		}

Yeah, no, that's broken -- preemption state lives in more than just the
TIF bit.

>  	read_unlock(&tasklist_lock);
>  
>  	/*
> @@ -413,6 +435,7 @@ void klp_try_complete_transition(void)
>  		task = idle_task(cpu);
>  		if (cpu_online(cpu)) {
>  			if (!klp_try_switch_task(task)) {
> +				set_tsk_need_resched(task);
>  				complete = false;
>  				/* Make idle task go through the main loop. */
>  				wake_up_if_idle(cpu);

Idem.

Also, I don't see the point of this and the __schedule() hook here:

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3a0ef2fefbd5..01e32d242ef6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6506,6 +6506,8 @@ static void __sched notrace __schedule(unsigned int sched_mode)
>  	struct rq *rq;
>  	int cpu;
>  
> +	klp_sched_try_switch();
> +
>  	cpu = smp_processor_id();
>  	rq = cpu_rq(cpu);
>  	prev = rq->curr;

If it schedules, you'll get it with the normal switcheroo, because it'll
be inactive some of the time. If it doesn't schedule, it'll run into
cond_resched().

> @@ -8500,8 +8502,10 @@ EXPORT_STATIC_CALL_TRAMP(might_resched);
>  static DEFINE_STATIC_KEY_FALSE(sk_dynamic_cond_resched);
>  int __sched dynamic_cond_resched(void)
>  {
> -	if (!static_branch_unlikely(&sk_dynamic_cond_resched))
> +	if (!static_branch_unlikely(&sk_dynamic_cond_resched)) {
> +		klp_sched_try_switch();
>  		return 0;
> +	}
>  	return __cond_resched();
>  }
>  EXPORT_SYMBOL(dynamic_cond_resched);

I would make the klp_sched_try_switch() not depend on
sk_dynamic_cond_resched, because __cond_resched() is not a guaranteed
pass through __schedule().
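
(For reference, __cond_resched() only enters the scheduler when a reschedule
is actually due; abridged from kernel/sched/core.c, with the RCU quiescent
state reporting omitted:)

	int __sched __cond_resched(void)
	{
		if (should_resched(0)) {
			preempt_schedule_common();
			return 1;
		}
		return 0;
	}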

But you'll probably want to check with Mark here, this all might
generate crap code on arm64.

Both ways this seems to make KLP 'depend' (or at least work lots better)
when PREEMPT_DYNAMIC=y. Do we want a PREEMPT_DYNAMIC=n fallback for
_cond_resched() too?
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Josh Poimboeuf 2 years, 7 months ago
On Mon, Jan 30, 2023 at 01:40:18PM +0100, Peter Zijlstra wrote:
> Right, I was thinking you'd do something like:
> 
> 	static_call_update(cond_resched, klp_cond_resched);
> 
> With:
> 
> static int klp_cond_resched(void)
> {
> 	klp_try_switch_task(current);
> 	return __cond_resched();
> }

Something like this?

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index cbe72bfd2f1f..424c0c939f57 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -363,8 +363,7 @@ static int vhost_worker(void *data)
 			kcov_remote_start_common(dev->kcov_handle);
 			work->fn(work);
 			kcov_remote_stop();
-			if (need_resched())
-				schedule();
+			cond_resched();
 		}
 	}
 	kthread_unuse_mm(dev->mm);
diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 293e29960c6e..937816d0867c 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -14,6 +14,8 @@
 #include <linux/completion.h>
 #include <linux/list.h>
 
+#include <linux/livepatch_sched.h>
+
 #if IS_ENABLED(CONFIG_LIVEPATCH)
 
 /* task patch states */
diff --git a/include/linux/livepatch_sched.h b/include/linux/livepatch_sched.h
new file mode 100644
index 000000000000..3237bc6a5b01
--- /dev/null
+++ b/include/linux/livepatch_sched.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _LINUX_LIVEPATCH_SCHED_H_
+#define _LINUX_LIVEPATCH_SCHED_H_
+
+#include <linux/static_call_types.h>
+
+#ifdef CONFIG_LIVEPATCH
+
+void __klp_sched_try_switch(void);
+DECLARE_STATIC_CALL(klp_sched_try_switch, __klp_sched_try_switch);
+
+static __always_inline void klp_sched_try_switch(void)
+{
+	//FIXME need static_call_cond_mod() ?
+	static_call_mod(klp_sched_try_switch)();
+}
+
+#else /* !CONFIG_LIVEPATCH */
+static inline void klp_sched_try_switch(void) {}
+#endif /* CONFIG_LIVEPATCH */
+
+#endif /* _LINUX_LIVEPATCH_SCHED_H_ */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4df2b3e76b30..a7acf9ae9b90 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -36,6 +36,7 @@
 #include <linux/seqlock.h>
 #include <linux/kcsan.h>
 #include <linux/rv.h>
+#include <linux/livepatch_sched.h>
 #include <asm/kmap_size.h>
 
 /* task_struct member predeclarations (sorted alphabetically): */
@@ -2077,11 +2078,15 @@ static __always_inline int _cond_resched(void)
 	return static_call_mod(cond_resched)();
 }
 
+void sched_dynamic_klp_enable(void);
+void sched_dynamic_klp_disable(void);
+
 #elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
 extern int dynamic_cond_resched(void);
 
 static __always_inline int _cond_resched(void)
 {
+	klp_sched_try_switch();
 	return dynamic_cond_resched();
 }
 
@@ -2089,6 +2094,7 @@ static __always_inline int _cond_resched(void)
 
 static inline int _cond_resched(void)
 {
+	klp_sched_try_switch();
 	return __cond_resched();
 }
 
@@ -2096,7 +2102,10 @@ static inline int _cond_resched(void)
 
 #else
 
-static inline int _cond_resched(void) { return 0; }
+static inline int _cond_resched(void) {
+	klp_sched_try_switch();
+	return 0;
+}
 
 #endif /* !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) */
 
diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index f1b25ec581e0..3cc4e0a24dc6 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -9,6 +9,7 @@
 
 #include <linux/cpu.h>
 #include <linux/stacktrace.h>
+#include <linux/static_call.h>
 #include "core.h"
 #include "patch.h"
 #include "transition.h"
@@ -24,6 +25,9 @@ static int klp_target_state = KLP_UNDEFINED;
 
 static unsigned int klp_signals_cnt;
 
+DEFINE_STATIC_CALL_NULL(klp_sched_try_switch, __klp_sched_try_switch);
+EXPORT_STATIC_CALL_TRAMP(klp_sched_try_switch);
+
 /*
  * This work can be performed periodically to finish patching or unpatching any
  * "straggler" tasks which failed to transition in the first attempt.
@@ -61,6 +65,28 @@ static void klp_synchronize_transition(void)
 	schedule_on_each_cpu(klp_sync);
 }
 
+/*
+ * Enable the klp hooks in cond_resched() while livepatching is in progress.
+ * This helps CPU-bound kthreads get patched.
+ */
+static void klp_sched_hook_enable(void)
+{
+#if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
+	sched_dynamic_klp_enable();
+#else
+	static_call_update(klp_sched_try_switch, __klp_sched_try_switch);
+#endif
+}
+
+static void klp_sched_hook_disable(void)
+{
+#if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
+	sched_dynamic_klp_disable();
+#else
+	static_call_update(klp_sched_try_switch, NULL);
+#endif
+}
+
 /*
  * The transition to the target patch state is complete.  Clean up the data
  * structures.
@@ -76,6 +102,8 @@ static void klp_complete_transition(void)
 		 klp_transition_patch->mod->name,
 		 klp_target_state == KLP_PATCHED ? "patching" : "unpatching");
 
+	klp_sched_hook_disable();
+
 	if (klp_transition_patch->replace && klp_target_state == KLP_PATCHED) {
 		klp_unpatch_replaced_patches(klp_transition_patch);
 		klp_discard_nops(klp_transition_patch);
@@ -307,7 +335,11 @@ static bool klp_try_switch_task(struct task_struct *task)
 	 * functions.  If all goes well, switch the task to the target patch
 	 * state.
 	 */
-	ret = task_call_func(task, klp_check_and_switch_task, &old_name);
+	if (task == current)
+		ret = klp_check_and_switch_task(current, &old_name);
+	else
+		ret = task_call_func(task, klp_check_and_switch_task, &old_name);
+
 	switch (ret) {
 	case 0:		/* success */
 		break;
@@ -334,6 +366,15 @@ static bool klp_try_switch_task(struct task_struct *task)
 	return !ret;
 }
 
+void __klp_sched_try_switch(void)
+{
+	if (likely(!klp_patch_pending(current)))
+		return;
+
+	//FIXME locking
+	klp_try_switch_task(current);
+}
+
 /*
  * Sends a fake signal to all non-kthread tasks with TIF_PATCH_PENDING set.
  * Kthreads with TIF_PATCH_PENDING set are woken up.
@@ -492,6 +533,8 @@ void klp_start_transition(void)
 			set_tsk_thread_flag(task, TIF_PATCH_PENDING);
 	}
 
+	klp_sched_hook_enable();
+
 	klp_signals_cnt = 0;
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3a0ef2fefbd5..4fbf70b05576 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8648,13 +8648,16 @@ int sched_dynamic_mode(const char *str)
 #error "Unsupported PREEMPT_DYNAMIC mechanism"
 #endif
 
+static bool klp_override;
+
 void sched_dynamic_update(int mode)
 {
 	/*
 	 * Avoid {NONE,VOLUNTARY} -> FULL transitions from ever ending up in
 	 * the ZERO state, which is invalid.
 	 */
-	preempt_dynamic_enable(cond_resched);
+	if (!klp_override)
+		preempt_dynamic_enable(cond_resched);
 	preempt_dynamic_enable(might_resched);
 	preempt_dynamic_enable(preempt_schedule);
 	preempt_dynamic_enable(preempt_schedule_notrace);
@@ -8662,16 +8665,19 @@ void sched_dynamic_update(int mode)
 
 	switch (mode) {
 	case preempt_dynamic_none:
-		preempt_dynamic_enable(cond_resched);
+		if (!klp_override)
+			preempt_dynamic_enable(cond_resched);
 		preempt_dynamic_disable(might_resched);
 		preempt_dynamic_disable(preempt_schedule);
 		preempt_dynamic_disable(preempt_schedule_notrace);
 		preempt_dynamic_disable(irqentry_exit_cond_resched);
+		//FIXME avoid printk for klp restore
 		pr_info("Dynamic Preempt: none\n");
 		break;
 
 	case preempt_dynamic_voluntary:
-		preempt_dynamic_enable(cond_resched);
+		if (!klp_override)
+			preempt_dynamic_enable(cond_resched);
 		preempt_dynamic_enable(might_resched);
 		preempt_dynamic_disable(preempt_schedule);
 		preempt_dynamic_disable(preempt_schedule_notrace);
@@ -8680,7 +8686,8 @@ void sched_dynamic_update(int mode)
 		break;
 
 	case preempt_dynamic_full:
-		preempt_dynamic_disable(cond_resched);
+		if (!klp_override)
+			preempt_dynamic_disable(cond_resched);
 		preempt_dynamic_disable(might_resched);
 		preempt_dynamic_enable(preempt_schedule);
 		preempt_dynamic_enable(preempt_schedule_notrace);
@@ -8692,6 +8699,28 @@ void sched_dynamic_update(int mode)
 	preempt_dynamic_mode = mode;
 }
 
+#ifdef CONFIG_HAVE_PREEMPT_DYNAMIC_CALL
+static int klp_cond_resched(void)
+{
+	__klp_sched_try_switch();
+	return __cond_resched();
+}
+
+void sched_dynamic_klp_enable(void)
+{
+	//FIXME locking
+	klp_override = true;
+	static_call_update(cond_resched, klp_cond_resched);
+}
+
+void sched_dynamic_klp_disable(void)
+{
+	//FIXME locking
+	klp_override = false;
+	sched_dynamic_update(preempt_dynamic_mode);
+}
+#endif /* CONFIG_HAVE_PREEMPT_DYNAMIC_CALL*/
+
 static int __init setup_preempt_mode(char *str)
 {
 	int mode = sched_dynamic_mode(str);
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Peter Zijlstra 2 years, 7 months ago
On Mon, Jan 30, 2023 at 11:59:30AM -0800, Josh Poimboeuf wrote:

> @@ -8662,16 +8665,19 @@ void sched_dynamic_update(int mode)
>  
>  	switch (mode) {
>  	case preempt_dynamic_none:
> -		preempt_dynamic_enable(cond_resched);
> +		if (!klp_override)
> +			preempt_dynamic_enable(cond_resched);
>  		preempt_dynamic_disable(might_resched);
>  		preempt_dynamic_disable(preempt_schedule);
>  		preempt_dynamic_disable(preempt_schedule_notrace);
>  		preempt_dynamic_disable(irqentry_exit_cond_resched);
> +		//FIXME avoid printk for klp restore

		if (mode != preempt_dynamic_mode)

>  		pr_info("Dynamic Preempt: none\n");
>  		break;
>  
>  	case preempt_dynamic_voluntary:
> -		preempt_dynamic_enable(cond_resched);
> +		if (!klp_override)
> +			preempt_dynamic_enable(cond_resched);
>  		preempt_dynamic_enable(might_resched);
>  		preempt_dynamic_disable(preempt_schedule);
>  		preempt_dynamic_disable(preempt_schedule_notrace);
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Mark Rutland 2 years, 7 months ago
On Mon, Jan 30, 2023 at 01:40:18PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 27, 2023 at 02:11:31PM -0800, Josh Poimboeuf wrote:
> > @@ -8500,8 +8502,10 @@ EXPORT_STATIC_CALL_TRAMP(might_resched);
> >  static DEFINE_STATIC_KEY_FALSE(sk_dynamic_cond_resched);
> >  int __sched dynamic_cond_resched(void)
> >  {
> > -	if (!static_branch_unlikely(&sk_dynamic_cond_resched))
> > +	if (!static_branch_unlikely(&sk_dynamic_cond_resched)) {
> > +		klp_sched_try_switch();
> >  		return 0;
> > +	}
> >  	return __cond_resched();
> >  }
> >  EXPORT_SYMBOL(dynamic_cond_resched);
> 
> I would make the klp_sched_try_switch() not depend on
> sk_dynamic_cond_resched, because __cond_resched() is not a guaranteed
> pass through __schedule().
> 
> But you'll probably want to check with Mark here, this all might
> generate crap code on arm64.

IIUC here klp_sched_try_switch() is a static call, so on arm64 this'll generate
at least a load, a conditional branch, and an indirect branch. That's not
ideal, but I'd have to benchmark it to find out whether it's a significant
overhead relative to the baseline of PREEMPT_DYNAMIC.

For arm64 it'd be a bit nicer to have another static key check, and a call to
__klp_sched_try_switch(). That way the static key check gets turned into a NOP
in the common case, and the call to __klp_sched_try_switch() can be a direct
call (potentially a tail-call if we made it return 0).
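
Roughly (a sketch only; the key name here is invented, and
__klp_sched_try_switch() is the function from the patch above):

	/* Declared alongside __klp_sched_try_switch(); livepatch would flip
	 * the key at the start/end of a transition. */
	DECLARE_STATIC_KEY_FALSE(klp_sched_try_switch_key);

	static __always_inline void klp_sched_try_switch(void)
	{
		/* NOP in the common case; a direct call while patching. */
		if (static_branch_unlikely(&klp_sched_try_switch_key))
			__klp_sched_try_switch();
	}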

Thanks,
Mark.
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Josh Poimboeuf 2 years, 7 months ago
On Mon, Jan 30, 2023 at 06:36:32PM +0000, Mark Rutland wrote:
> On Mon, Jan 30, 2023 at 01:40:18PM +0100, Peter Zijlstra wrote:
> > On Fri, Jan 27, 2023 at 02:11:31PM -0800, Josh Poimboeuf wrote:
> > > @@ -8500,8 +8502,10 @@ EXPORT_STATIC_CALL_TRAMP(might_resched);
> > >  static DEFINE_STATIC_KEY_FALSE(sk_dynamic_cond_resched);
> > >  int __sched dynamic_cond_resched(void)
> > >  {
> > > -	if (!static_branch_unlikely(&sk_dynamic_cond_resched))
> > > +	if (!static_branch_unlikely(&sk_dynamic_cond_resched)) {
> > > +		klp_sched_try_switch();
> > >  		return 0;
> > > +	}
> > >  	return __cond_resched();
> > >  }
> > >  EXPORT_SYMBOL(dynamic_cond_resched);
> > 
> > I would make the klp_sched_try_switch() not depend on
> > sk_dynamic_cond_resched, because __cond_resched() is not a guaranteed
> > pass through __schedule().
> > 
> > But you'll probably want to check with Mark here, this all might
> > generate crap code on arm64.
> 
> IIUC here klp_sched_try_switch() is a static call, so on arm64 this'll generate
> at least a load, a conditional branch, and an indirect branch. That's not
> ideal, but I'd have to benchmark it to find out whether it's a significant
> overhead relative to the baseline of PREEMPT_DYNAMIC.
> 
> For arm64 it'd be a bit nicer to have another static key check, and a call to
> __klp_sched_try_switch(). That way the static key check gets turned into a NOP
> in the common case, and the call to __klp_sched_try_switch() can be a direct
> call (potentially a tail-call if we made it return 0).

Hm, it might be nice if our out-of-line static call implementation would
automatically do a static key check as part of static_call_cond() for
NULL-type static calls.

But the best answer is probably to just add inline static calls to
arm64.  Is the lack of objtool the only thing blocking that?

Objtool is now modular, so all the controversial CFG reverse engineering
is now optional, so it shouldn't be too hard to just enable objtool for
static call inlines.

-- 
Josh
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Mark Rutland 2 years, 7 months ago
On Mon, Jan 30, 2023 at 11:48:23AM -0800, Josh Poimboeuf wrote:
> On Mon, Jan 30, 2023 at 06:36:32PM +0000, Mark Rutland wrote:
> > On Mon, Jan 30, 2023 at 01:40:18PM +0100, Peter Zijlstra wrote:
> > > On Fri, Jan 27, 2023 at 02:11:31PM -0800, Josh Poimboeuf wrote:
> > > > @@ -8500,8 +8502,10 @@ EXPORT_STATIC_CALL_TRAMP(might_resched);
> > > >  static DEFINE_STATIC_KEY_FALSE(sk_dynamic_cond_resched);
> > > >  int __sched dynamic_cond_resched(void)
> > > >  {
> > > > -	if (!static_branch_unlikely(&sk_dynamic_cond_resched))
> > > > +	if (!static_branch_unlikely(&sk_dynamic_cond_resched)) {
> > > > +		klp_sched_try_switch();
> > > >  		return 0;
> > > > +	}
> > > >  	return __cond_resched();
> > > >  }
> > > >  EXPORT_SYMBOL(dynamic_cond_resched);
> > > 
> > > I would make the klp_sched_try_switch() not depend on
> > > sk_dynamic_cond_resched, because __cond_resched() is not a guaranteed
> > > pass through __schedule().
> > > 
> > > But you'll probably want to check with Mark here, this all might
> > > generate crap code on arm64.
> > 
> > IIUC here klp_sched_try_switch() is a static call, so on arm64 this'll generate
> > at least a load, a conditional branch, and an indirect branch. That's not
> > ideal, but I'd have to benchmark it to find out whether it's a significant
> > overhead relative to the baseline of PREEMPT_DYNAMIC.
> > 
> > For arm64 it'd be a bit nicer to have another static key check, and a call to
> > __klp_sched_try_switch(). That way the static key check gets turned into a NOP
> > in the common case, and the call to __klp_sched_try_switch() can be a direct
> > call (potentially a tail-call if we made it return 0).
> 
> Hm, it might be nice if our out-of-line static call implementation would
> automatically do a static key check as part of static_call_cond() for
> NULL-type static calls.
> 
> But the best answer is probably to just add inline static calls to
> arm64.  Is the lack of objtool the only thing blocking that?

The major issues were branch range limitations (and needing the linker to add
PLTs), and painful instruction patching requirements (e.g. the architecture's
"CMODX" rules for Concurrent MODification and eXecution of instructions). We
went with the static key scheme above because that was what our assembled code
generation would devolve to anyway.

If we knew each call-site would only call a particular function or skip the
call, then we could do better (and would probably need something like objtool
to NOP that out at compile time), but since we don't know the callee at build
time we can't ensure we have a PLT in range when necessary.

> Objtool is now modular, so all the controversial CFG reverse engineering
> is now optional, so it shouldn't be too hard to just enable objtool for
> static call inlines.

Funnily enough, I spent some time yesterday looking at enabling a trivial
objtool for arm64 as I wanted some basic ELF rewriting functionality (to
manipulate the mcount_loc table). So I'll likely be looking at that soon
regardless of static calls. :)

Thanks,
Mark.
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Josh Poimboeuf 2 years, 7 months ago
On Tue, Jan 31, 2023 at 10:22:09AM +0000, Mark Rutland wrote:
> > Hm, it might be nice if our out-of-line static call implementation would
> > automatically do a static key check as part of static_call_cond() for
> > NULL-type static calls.
> > 
> > But the best answer is probably to just add inline static calls to
> > arm64.  Is the lack of objtool the only thing blocking that?
> 
> The major issues were branch range limitations (and needing the linker to add
> PLTs),

Does the compiler do the right thing (e.g., force PLT) if the branch
target is outside the translation unit?  I'm wondering if we could for
example use objtool to help enforce such rules at the call site.

> and painful instruction patching requirements (e.g. the architecture's
> "CMODX" rules for Concurrent MODification and eXecution of instructions). We
> went with the static key scheme above because that was what our assembled code
> generation would devolve to anyway.
> 
> If we knew each call-site would only call a particular function or skip the
> call, then we could do better (and would probably need something like objtool
> to NOP that out at compile time), but since we don't know the callee at build
> time we can't ensure we have a PLT in range when necessary.

Unfortunately most static calls have multiple destinations.  And most
don't have the option of being NULL.

-- 
Josh
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Mark Rutland 2 years, 7 months ago
On Tue, Jan 31, 2023 at 08:38:32AM -0800, Josh Poimboeuf wrote:
> On Tue, Jan 31, 2023 at 10:22:09AM +0000, Mark Rutland wrote:
> > > Hm, it might be nice if our out-of-line static call implementation would
> > > automatically do a static key check as part of static_call_cond() for
> > > NULL-type static calls.
> > > 
> > > But the best answer is probably to just add inline static calls to
> > > arm64.  Is the lack of objtool the only thing blocking that?
> > 
> > The major issues were branch range limitations (and needing the linker to add
> > PLTs),
> 
> Does the compiler do the right thing (e.g., force PLT) if the branch
> target is outside the translation unit?  I'm wondering if we could for
> example use objtool to help enforce such rules at the call site.

It's the linker (rather than the compiler) that'll generate the PLT if the
caller and callee are out of range at link time. There are a few other issues
too (e.g. no guarantee that the PLT isn't used by multiple distinct callers,
CMODX patching requirements), so we'd have to generate a pseudo-PLT ourselves
at build time with a patching-friendly code sequence. Ard had a prototype for
that:

  https://lore.kernel.org/linux-arm-kernel/20211105145917.2828911-1-ardb@kernel.org/

... but that was sufficiently painful that we went with the current static key
approach:

  https://lore.kernel.org/all/20211109172408.49641-1-mark.rutland@arm.com/

> > and painful instruction patching requirements (e.g. the architecture's
> > "CMODX" rules for Concurrent MODification and eXecution of instructions). We
> > went with the static key scheme above because that was what our assembled code
> > generation would devolve to anyway.
> > 
> > If we knew each call-site would only call a particular function or skip the
> > call, then we could do better (and would probably need something like objtool
> > to NOP that out at compile time), but since we don't know the callee at build
> > time we can't ensure we have a PLT in range when necessary.
> 
> Unfortunately most static calls have multiple destinations.

Sure, but here we're just enabling/disabling a call, which we could treat
differently, or wrap at a different level within the scheduler code. I'm happy
to take a look at that.

> And most don't have the option of being NULL.

Oh, I was under the impression that all could be disabled/skipped, which is
what a NULL target implied.

Thanks,
Mark.
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Josh Poimboeuf 2 years, 7 months ago
On Wed, Feb 01, 2023 at 11:10:20AM +0000, Mark Rutland wrote:
> On Tue, Jan 31, 2023 at 08:38:32AM -0800, Josh Poimboeuf wrote:
> > On Tue, Jan 31, 2023 at 10:22:09AM +0000, Mark Rutland wrote:
> > > > Hm, it might be nice if our out-of-line static call implementation would
> > > > automatically do a static key check as part of static_call_cond() for
> > > > NULL-type static calls.
> > > > 
> > > > But the best answer is probably to just add inline static calls to
> > > > arm64.  Is the lack of objtool the only thing blocking that?
> > > 
> > > The major issues were branch range limitations (and needing the linker to add
> > > PLTs),
> > 
> > Does the compiler do the right thing (e.g., force PLT) if the branch
> > target is outside the translation unit?  I'm wondering if we could for
> > example use objtool to help enforce such rules at the call site.
> 
> It's the linker (rather than the compiler) that'll generate the PLT if the
> caller and callee are out of range at link time. There are a few other issues
> > too (e.g. no guarantee that the PLT isn't used by multiple distinct callers,
> CMODX patching requirements), so we'd have to generate a pseudo-PLT ourselves
> at build time with a patching-friendly code sequence. Ard had a prototype for
> that:
> 
>   https://lore.kernel.org/linux-arm-kernel/20211105145917.2828911-1-ardb@kernel.org/
> 
> ... but that was sufficiently painful that we went with the current static key
> approach:
> 
>   https://lore.kernel.org/all/20211109172408.49641-1-mark.rutland@arm.com/

Thanks for the background.

Was there a reason for putting it out-of-line rather than directly in
_cond_resched()?

If it were inline then it wouldn't be that much different from the
static called version and I wonder if we could simplify by just using
the static key for all PREEMPT_DYNAMIC configs.
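
Roughly something like this, I mean (sketch only, assuming
sk_dynamic_cond_resched could be made visible to sched.h):

static __always_inline int _cond_resched(void)
{
	if (!static_branch_unlikely(&sk_dynamic_cond_resched)) {
		klp_sched_try_switch();
		return 0;
	}
	return __cond_resched();
}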

> > > If we knew each call-site would only call a particular function or skip the
> > > call, then we could do better (and would probably need something like objtool
> > > to NOP that out at compile time), but since we don't know the callee at build
> > > time we can't ensure we have a PLT in range when necessary.
> > 
> > Unfortunately most static calls have multiple destinations.
> 
> Sure, but here we're just enabling/disabling a call, which we could treat
> differently, or wrap at a different level within the scheduler code. I'm happy
> to take a look at that.

I can try to emulate what you did for PREEMPT_DYNAMIC.  I'll Cc you on
my actual patch to come soon-ish.

> > And most don't have the option of being NULL.
> 
> Oh, I was under the impression that all could be disabled/skipped, which is
> what a NULL target implied.

I guess what I was trying to say is that if the target can be NULL, the
call site has to use static_call_cond() to not break the
!HAVE_STATIC_CALL case.  But most call sites use static_call().
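
i.e. for a possibly-NULL target the call site ends up looking like this
(sketch, using a hypothetical klp_sched_try_switch static call):

	/* fine on !HAVE_STATIC_CALL even while the target is NULL */
	static_call_cond(klp_sched_try_switch)();

	/* only valid when the target is guaranteed to be non-NULL */
	static_call(klp_sched_try_switch)();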

-- 
Josh
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Mark Rutland 2 years, 7 months ago
On Wed, Feb 01, 2023 at 08:57:27AM -0800, Josh Poimboeuf wrote:
> On Wed, Feb 01, 2023 at 11:10:20AM +0000, Mark Rutland wrote:
> > On Tue, Jan 31, 2023 at 08:38:32AM -0800, Josh Poimboeuf wrote:
> > > On Tue, Jan 31, 2023 at 10:22:09AM +0000, Mark Rutland wrote:
> > > > > Hm, it might be nice if our out-of-line static call implementation would
> > > > > automatically do a static key check as part of static_call_cond() for
> > > > > NULL-type static calls.
> > > > > 
> > > > > But the best answer is probably to just add inline static calls to
> > > > > arm64.  Is the lack of objtool the only thing blocking that?
> > > > 
> > > > The major issues were branch range limitations (and needing the linker to add
> > > > PLTs),
> > > 
> > > Does the compiler do the right thing (e.g., force PLT) if the branch
> > > target is outside the translation unit?  I'm wondering if we could for
> > > example use objtool to help enforce such rules at the call site.
> > 
> > It's the linker (rather than the compiler) that'll generate the PLT if the
> > caller and callee are out of range at link time. There are a few other issues
> > > too (e.g. no guarantee that the PLT isn't used by multiple distinct callers,
> > CMODX patching requirements), so we'd have to generate a pseudo-PLT ourselves
> > at build time with a patching-friendly code sequence. Ard had a prototype for
> > that:
> > 
> >   https://lore.kernel.org/linux-arm-kernel/20211105145917.2828911-1-ardb@kernel.org/
> > 
> > ... but that was sufficiently painful that we went with the current static key
> > approach:
> > 
> >   https://lore.kernel.org/all/20211109172408.49641-1-mark.rutland@arm.com/
> 
> Thanks for the background.
> 
> Was there a reason for putting it out-of-line rather than directly in
> _cond_resched()?

I think that's mostly a historical accident; I'm not aware of a reason we
can't put that directly in _cond_resched().

Since we started from out-of-line static call trampolines, even the out-of-line
static key check looked nicer, and I think we just never considered moving the
static key check inline.

> If it were inline then it wouldn't be that much different from the
> static called version and I wonder if we could simplify by just using
> the static key for all PREEMPT_DYNAMIC configs.

That would be nice!

> > > > If we knew each call-site would only call a particular function or skip the
> > > > call, then we could do better (and would probably need something like objtool
> > > > to NOP that out at compile time), but since we don't know the callee at build
> > > > time we can't ensure we have a PLT in range when necessary.
> > > 
> > > Unfortunately most static calls have multiple destinations.
> > 
> > Sure, but here we're just enabling/disabling a call, which we could treat
> > differently, or wrap at a different level within the scheduler code. I'm happy
> > to take a look at that.
> 
> I can try to emulate what you did for PREEMPT_DYNAMIC.  I'll Cc you on
> my actual patch to come soon-ish.

I look forward to it! :)

> > > And most don't have the option of being NULL.
> > 
> > Oh, I was under the impression that all could be disabled/skipped, which is
> > what a NULL target implied.
> 
> I guess what I was trying to say is that if the target can be NULL, the
> call site has to use static_call_cond() to not break the
> !HAVE_STATIC_CALL case.  But most call sites use static_call().

Ah, sorry -- I had missed that we had distinct static_call_cond() and
static_call() helpers.

Thanks,
Mark.
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Song Liu 2 years, 7 months ago
On Mon, Jan 30, 2023 at 11:48 AM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Mon, Jan 30, 2023 at 06:36:32PM +0000, Mark Rutland wrote:
> > On Mon, Jan 30, 2023 at 01:40:18PM +0100, Peter Zijlstra wrote:
> > > On Fri, Jan 27, 2023 at 02:11:31PM -0800, Josh Poimboeuf wrote:
> > > > @@ -8500,8 +8502,10 @@ EXPORT_STATIC_CALL_TRAMP(might_resched);
> > > >  static DEFINE_STATIC_KEY_FALSE(sk_dynamic_cond_resched);
> > > >  int __sched dynamic_cond_resched(void)
> > > >  {
> > > > - if (!static_branch_unlikely(&sk_dynamic_cond_resched))
> > > > + if (!static_branch_unlikely(&sk_dynamic_cond_resched)) {
> > > > +         klp_sched_try_switch();
> > > >           return 0;
> > > > + }
> > > >   return __cond_resched();
> > > >  }
> > > >  EXPORT_SYMBOL(dynamic_cond_resched);
> > >
> > > I would make the klp_sched_try_switch() not depend on
> > > sk_dynamic_cond_resched, because __cond_resched() is not a guaranteed
> > > pass through __schedule().
> > >
> > > But you'll probably want to check with Mark here, this all might
> > > generate crap code on arm64.
> >
> > IIUC here klp_sched_try_switch() is a static call, so on arm64 this'll generate
> > at least a load, a conditional branch, and an indirect branch. That's not
> > ideal, but I'd have to benchmark it to find out whether it's a significant
> > overhead relative to the baseline of PREEMPT_DYNAMIC.
> >
> > For arm64 it'd be a bit nicer to have another static key check, and a call to
> > __klp_sched_try_switch(). That way the static key check gets turned into a NOP
> > in the common case, and the call to __klp_sched_try_switch() can be a direct
> > call (potentially a tail-call if we made it return 0).
>
> Hm, it might be nice if our out-of-line static call implementation would
> automatically do a static key check as part of static_call_cond() for
> NULL-type static calls.
>
> But the best answer is probably to just add inline static calls to
> arm64.  Is the lack of objtool the only thing blocking that?
>
> Objtool is now modular, so all the controversial CFG reverse engineering
> is now optional, so it shouldn't be too hard to just enable objtool for
> static call inlines.

This might be a little off topic, and maybe I missed some threads:
How far are we from officially supporting livepatch on arm64?

IIUC, stable stack unwinding is the missing piece at the moment?

Thanks,
Song
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Josh Poimboeuf 2 years, 7 months ago
On Mon, Jan 30, 2023 at 01:40:18PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 27, 2023 at 02:11:31PM -0800, Josh Poimboeuf wrote:
> 
> 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 4df2b3e76b30..fbcd3acca25c 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -36,6 +36,7 @@
> >  #include <linux/seqlock.h>
> >  #include <linux/kcsan.h>
> >  #include <linux/rv.h>
> > +#include <linux/livepatch_sched.h>
> >  #include <asm/kmap_size.h>
> >  
> >  /* task_struct member predeclarations (sorted alphabetically): */
> > @@ -2074,6 +2075,9 @@ DECLARE_STATIC_CALL(cond_resched, __cond_resched);
> >  
> >  static __always_inline int _cond_resched(void)
> >  {
> > +	//FIXME this is a bit redundant with preemption disabled
> > +	klp_sched_try_switch();
> > +
> >  	return static_call_mod(cond_resched)();
> >  }
> 
> Right, I was thinking you'd do something like:
> 
> 	static_call_update(cond_resched, klp_cond_resched);
> 
> With:
> 
> static int klp_cond_resched(void)
> {
> 	klp_try_switch_task(current);
> 	return __cond_resched();
> }
> 
> That would force cond_resched() into doing the transition thing,
> irrespective of the preemption mode at hand. And then, when KLP be done,
> re-run sched_dynamic_update() to reset it to whatever it ought to be.

Ok, makes sense.
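
So roughly this flow (sketch; where the override/restore hooks live and
how the saved preempt_dynamic_mode is accessed are hand-waved):

	/* transition start: force everyone through the KLP path */
	static_call_update(cond_resched, klp_cond_resched);

	/* transition done: back to whatever mode was selected */
	sched_dynamic_update(preempt_dynamic_mode);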

> 
> > @@ -401,8 +421,10 @@ void klp_try_complete_transition(void)
> >  	 */
> >  	read_lock(&tasklist_lock);
> >  	for_each_process_thread(g, task)
> > -		if (!klp_try_switch_task(task))
> > +		if (!klp_try_switch_task(task)) {
> > +			set_tsk_need_resched(task);
> >  			complete = false;
> > +		}
> 
> Yeah, no, that's broken -- preemption state live in more than just the
> TIF bit.

Oops.

> >  	read_unlock(&tasklist_lock);
> >  
> >  	/*
> > @@ -413,6 +435,7 @@ void klp_try_complete_transition(void)
> >  		task = idle_task(cpu);
> >  		if (cpu_online(cpu)) {
> >  			if (!klp_try_switch_task(task)) {
> > +				set_tsk_need_resched(task);
> >  				complete = false;
> >  				/* Make idle task go through the main loop. */
> >  				wake_up_if_idle(cpu);
> 
> Idem.
> 
> Also, I don't see the point of this and the __schedule() hook here:

The (poorly executed) idea was to catch kthreads which do

	if (need_resched())
		schedule();

but I guess we can just replace those usages with cond_resched()?

> > @@ -8500,8 +8502,10 @@ EXPORT_STATIC_CALL_TRAMP(might_resched);
> >  static DEFINE_STATIC_KEY_FALSE(sk_dynamic_cond_resched);
> >  int __sched dynamic_cond_resched(void)
> >  {
> > -	if (!static_branch_unlikely(&sk_dynamic_cond_resched))
> > +	if (!static_branch_unlikely(&sk_dynamic_cond_resched)) {
> > +		klp_sched_try_switch();
> >  		return 0;
> > +	}
> >  	return __cond_resched();
> >  }
> >  EXPORT_SYMBOL(dynamic_cond_resched);
> 
> I would make the klp_sched_try_switch() not depend on
> sk_dynamic_cond_resched, because __cond_resched() is not a guaranteed
> pass through __schedule().
> 
> But you'll probably want to check with Mark here, this all might
> generate crap code on arm64.
> 
> Both ways this seems to make KLP 'depend' (or at least work lots better)
> when PREEMPT_DYNAMIC=y. Do we want a PREEMPT_DYNAMIC=n fallback for
> _cond_resched() too?

That was the intent but I obviously failed.  Let me go rework it a bit.

-- 
Josh
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Seth Forshee 2 years, 7 months ago
On Mon, Jan 30, 2023 at 01:40:18PM +0100, Peter Zijlstra wrote:
> Both ways this seems to make KLP 'depend' (or at least work lots better)
> when PREEMPT_DYNAMIC=y. Do we want a PREEMPT_DYNAMIC=n fallback for
> _cond_resched() too?

I would like this, as we have no other need for PREEMPT_DYNAMIC=y and
currently have it disabled.

Thanks,
Seth
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Petr Mladek 2 years, 7 months ago
On Fri 2023-01-27 11:37:02, Peter Zijlstra wrote:
> On Thu, Jan 26, 2023 at 08:43:55PM -0800, Josh Poimboeuf wrote:
> > On Thu, Jan 26, 2023 at 03:12:35PM -0600, Seth Forshee (DigitalOcean) wrote:
> > > On Thu, Jan 26, 2023 at 06:03:16PM +0100, Petr Mladek wrote:
> > > > On Fri 2023-01-20 16:12:20, Seth Forshee (DigitalOcean) wrote:
> > > > > We've fairly regularaly seen liveptches which cannot transition within kpatch's
> > > > > timeout period due to busy vhost worker kthreads.
> > > > 
> > > > I have missed this detail. Miroslav told me that we have solved
> > > > something similar some time ago, see
> > > > https://lore.kernel.org/all/20220507174628.2086373-1-song@kernel.org/
> > > 
> > > Interesting thread. I had thought about something along the lines of the
> > > original patch, but there are some ideas in there that I hadn't
> > > considered.
> > 
> > Here's another idea, have we considered this?  Have livepatch set
> > TIF_NEED_RESCHED on all kthreads to force them into schedule(), and then
> > have the scheduler call klp_try_switch_task() if TIF_PATCH_PENDING is
> > set.
> > 
> > Not sure how scheduler folks would feel about that ;-)
> 
> So, let me try and page all that back in.... :-)
> 
> KLP needs to unwind the stack to see if any of the patched functions are
> active, if not, flip task to new set.
> 
> Unwinding the stack of a task can be done when:
> 
>  - task is inactive (stable reg and stack) -- provided it stays inactive
>    while unwinding etc..
> 
>  - task is current (guarantees stack doesn't dip below where we started
>    due to being busy on top etc..)
> 
> Can NOT be done from interrupt context, because can hit in the middle of
> setting up stack frames etc..

All the above seems correct.

> The issue at hand is that some tasks run for a long time without passing
> through an explicit check.

There might actually be two possibilities why the transition fails
too often:

1. The task might be in the running state most of the time. Therefore
   the backtrace is not reliable most of the time.

   In this case, some cooperation with the scheduler would really
   help. We would need to stop the task and check the stack
   when it is stopped. Something like the patch you proposed.


2. The task might be sleeping but almost always in a livepatched
   function. Therefore it could not be transitioned.

   It might be the case with vhost_worker(). The main loop is "tiny".
   The kthread probably spends most of its time processing
   a vhost_work. And if the "works" are livepatched...

   In this case, it would help to call klp_try_switch_task(current)
   in the main loop in vhost_worker(). It would always succeed
   when vhost_worker() is not livepatched on its own.

   Note that even this would not help with kpatch when a single
   vhost_work needs more than the 1 minute timeout to get processed.
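
Just to illustrate point 2, the main loop would then look something
like this (a minimal sketch, not the real vhost code;
klp_switch_current() being the stack-checking helper proposed in this
series):

	for (;;) {
		/* ... run one queued vhost_work, or sleep when idle ... */

		/*
		 * Safe point between works: klp_switch_current() only
		 * transitions the task when no patched function is on
		 * its stack, so this is safe even if the worker
		 * function itself is being livepatched.
		 */
		if (klp_patch_pending(current))
			klp_switch_current();
	}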

> diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> index f1b25ec581e0..06746095a724 100644
> --- a/kernel/livepatch/transition.c
> +++ b/kernel/livepatch/transition.c
> @@ -9,6 +9,7 @@
>  
>  #include <linux/cpu.h>
>  #include <linux/stacktrace.h>
> +#include <linux/stop_machine.h>
>  #include "core.h"
>  #include "patch.h"
>  #include "transition.h"
> @@ -334,6 +335,16 @@ static bool klp_try_switch_task(struct task_struct *task)
>  	return !ret;
>  }
>  
> +static int __stop_try_switch(void *arg)
> +{
> +	return klp_try_switch_task(arg) ? 0 : -EBUSY;
> +}
> +
> +static bool klp_try_switch_task_harder(struct task_struct *task)
> +{
> +	return !stop_one_cpu(task_cpu(task), __stop_try_switch, task);
> +}
> +
>  /*
>   * Sends a fake signal to all non-kthread tasks with TIF_PATCH_PENDING set.
>   * Kthreads with TIF_PATCH_PENDING set are woken up.

Nice. I am surprised that it can be implemented so easily.

Best Regards,
Petr
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Seth Forshee 2 years, 7 months ago
On Fri, Jan 27, 2023 at 01:09:02PM +0100, Petr Mladek wrote:
> There might actually be two possibilities why the transition fails
> too often:
> 
> 1. The task might be in the running state most of the time. Therefore
>    the backtrace is not reliable most of the time.
> 
>    In this case, some cooperation with the scheduler would really
>    help. We would need to stop the task and check the stack
>    when it is stopped. Something like the patch you proposed.

This is the situation we are encountering.

> 2. The task might be sleeping but almost always in a livepatched
>    function. Therefore it could not be transitioned.
> 
>    It might be the case with vhost_worker(). The main loop is "tiny".
>    The kthread probably spends most of its time processing
>    a vhost_work. And if the "works" are livepatched...
> 
>    In this case, it would help to call klp_try_switch_task(current)
>    in the main loop in vhost_worker(). It would always succeed
>    when vhost_worker() is not livepatched on its own.
> 
>    Note that even this would not help with kpatch when a single
>    vhost_work needs more than the 1 minute timeout to get processed.
> 
> > diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> > index f1b25ec581e0..06746095a724 100644
> > --- a/kernel/livepatch/transition.c
> > +++ b/kernel/livepatch/transition.c
> > @@ -9,6 +9,7 @@
> >  
> >  #include <linux/cpu.h>
> >  #include <linux/stacktrace.h>
> > +#include <linux/stop_machine.h>
> >  #include "core.h"
> >  #include "patch.h"
> >  #include "transition.h"
> > @@ -334,6 +335,16 @@ static bool klp_try_switch_task(struct task_struct *task)
> >  	return !ret;
> >  }
> >  
> > +static int __stop_try_switch(void *arg)
> > +{
> > +	return klp_try_switch_task(arg) ? 0 : -EBUSY;
> > +}
> > +
> > +static bool klp_try_switch_task_harder(struct task_struct *task)
> > +{
> > +	return !stop_one_cpu(task_cpu(task), __stop_try_switch, task);
> > +}
> > +
> >  /*
> >   * Sends a fake signal to all non-kthread tasks with TIF_PATCH_PENDING set.
> >   * Kthreads with TIF_PATCH_PENDING set are woken up.
> 
> Nice. I am surprised that it can be implemented so easily.

Yes, that's a neat solution. I will give it a try.

AIUI this still doesn't help for architectures without a reliable
stacktrace though, right? So we probably should only try this for
architectures which do have reliable stacktraces.
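
Something like this in klp_try_complete_transition(), I suppose
(sketch; klp_try_switch_task_harder() being Peter's helper from the
quoted diff):

		if (!klp_try_switch_task(task)) {
			if (!klp_have_reliable_stack() ||
			    !klp_try_switch_task_harder(task))
				complete = false;
		}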

Thanks,
Seth
Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads
Posted by Michael S. Tsirkin 2 years, 7 months ago
On Fri, Jan 20, 2023 at 04:12:20PM -0600, Seth Forshee (DigitalOcean) wrote:
> We've fairly regularaly seen liveptches which cannot transition within kpatch's
> timeout period due to busy vhost worker kthreads. In looking for a solution the
> only answer I found was to call klp_update_patch_state() from a safe location.
> I tried adding this call to vhost_worker(), and it works, but this creates the
> potential for problems if a livepatch attempted to patch vhost_worker().
> Without a call to klp_update_patch_state() fully loaded vhost kthreads can
> never switch because vhost_worker() will always appear on the stack, but with
> the call these kthreads can switch but will still be running the old version of
> vhost_worker().
> 
> To avoid this situation I've added a new function, klp_switch_current(), which
> switches the current task only if its stack does not include any function being
> patched. This allows kthreads to safely attempt switching themselves if a patch
> is pending. There is at least one downside, however. Since there's no way for
> the kthread to track whether it has already tried to switch for a pending patch
> it can end up calling klp_switch_current() repeatedly when it can never be
> safely switched.
> 
> I don't know whether this is the right solution, and I'm happy to try out other
> suggestions. But in my testing these patches proved effective in consistently
> switching heavily loaded vhost kthreads almost immediately.
> 
> To: Josh Poimboeuf <jpoimboe@kernel.org>
> To: Jiri Kosina <jikos@kernel.org>
> To: Miroslav Benes <mbenes@suse.cz>
> To: Petr Mladek <pmladek@suse.com>
> To: Joe Lawrence <joe.lawrence@redhat.com>
> To: "Michael S. Tsirkin" <mst@redhat.com>
> To: Jason Wang <jasowang@redhat.com>
> Cc: live-patching@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: kvm@vger.kernel.org
> Cc: virtualization@lists.linux-foundation.org
> Cc: netdev@vger.kernel.org
> Signed-off-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>

Don't know enough about live patching to judge this.

I'll let livepatch maintainers judge this, and merge through
the livepatch tree if appropriate. For that:
Acked-by: Michael S. Tsirkin <mst@redhat.com>
but please understand this is more of a 'looks ok superficially and
I don't have better ideas' than an 'I have reviewed this thoroughly'.

> 
> ---
> Seth Forshee (DigitalOcean) (2):
>       livepatch: add an interface for safely switching kthreads
>       vhost: check for pending livepatches from vhost worker kthreads
> 
>  drivers/vhost/vhost.c         |  4 ++++
>  include/linux/livepatch.h     |  2 ++
>  kernel/livepatch/transition.c | 11 +++++++++++
>  3 files changed, 17 insertions(+)
> ---
> base-commit: 5dc4c995db9eb45f6373a956eb1f69460e69e6d4
> change-id: 20230120-vhost-klp-switching-ba9a3ae38b8a
> 
> Best regards,
> -- 
> Seth Forshee (DigitalOcean) <sforshee@kernel.org>