[RFC PATCH 2/2] livepatch: Implement livepatch hybrid mode

Yafang Shao posted 2 patches 3 days, 12 hours ago
[RFC PATCH 2/2] livepatch: Implement livepatch hybrid mode
Posted by Yafang Shao 3 days, 12 hours ago
The atomic replace livepatch mechanism was introduced to handle scenarios
where we want to unload a specific livepatch without unloading others.
However, its current implementation has significant shortcomings, making
it less than ideal in practice. Below are the key downsides:

- It is expensive

  During testing with frequent replacements of an old livepatch, random RCU
  warnings were observed:

  [19578271.779605] rcu_tasks_wait_gp: rcu_tasks grace period 642409 is 10024 jiffies old.
  [19578390.073790] rcu_tasks_wait_gp: rcu_tasks grace period 642417 is 10185 jiffies old.
  [19578423.034065] rcu_tasks_wait_gp: rcu_tasks grace period 642421 is 10150 jiffies old.
  [19578564.144591] rcu_tasks_wait_gp: rcu_tasks grace period 642449 is 10174 jiffies old.
  [19578601.064614] rcu_tasks_wait_gp: rcu_tasks grace period 642453 is 10168 jiffies old.
  [19578663.920123] rcu_tasks_wait_gp: rcu_tasks grace period 642469 is 10167 jiffies old.
  [19578872.990496] rcu_tasks_wait_gp: rcu_tasks grace period 642529 is 10215 jiffies old.
  [19578903.190292] rcu_tasks_wait_gp: rcu_tasks grace period 642529 is 40415 jiffies old.
  [19579017.965500] rcu_tasks_wait_gp: rcu_tasks grace period 642577 is 10174 jiffies old.
  [19579033.981425] rcu_tasks_wait_gp: rcu_tasks grace period 642581 is 10143 jiffies old.
  [19579153.092599] rcu_tasks_wait_gp: rcu_tasks grace period 642625 is 10188 jiffies old.
  
  This indicates that atomic replacement can cause performance issues,
  particularly with RCU synchronization under frequent use.

- Potential Risks During Replacement 

  One known issue involves replacing livepatched versions of critical
  functions such as do_exit(). During the replacement process, a panic
  might occur, as highlighted in [0]. Other potential risks may also arise
  due to inconsistencies or race conditions during transitions.

- Temporary Loss of Patching 

  During the replacement process, the old patch is set to a NOP (no-operation)
  before the new patch is fully applied. This creates a window where the
  function temporarily reverts to its original, unpatched state. If the old
  patch fixed a critical issue (e.g., one that prevented a system panic), the
  system could become vulnerable to that issue during the transition.

The current atomic replacement approach replaces all old livepatches,
even when such a sweeping change is unnecessary. This can be improved
by introducing a hybrid mode, which allows the coexistence of both
atomic replace and non atomic replace livepatches.

In the hybrid mode:

- Specific livepatches can be marked as "non-replaceable" to ensure they
  remain active and unaffected during replacements.

- Other livepatches can be marked as "replaceable," allowing targeted
  replacements of only those patches.

This selective approach would reduce unnecessary transitions, lower the
risk of temporary patch loss, and mitigate performance issues during
livepatch replacement.

Link: https://lore.kernel.org/live-patching/CALOAHbA9WHPjeZKUcUkwULagQjTMfqAdAg+akqPzbZ7Byc=qrw@mail.gmail.com/ [0]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/livepatch/core.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 5e0c2caa0af8..f820b50c1b26 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -658,6 +658,8 @@ static int klp_add_nops(struct klp_patch *patch)
 		klp_for_each_object(old_patch, old_obj) {
 			int err;
 
+			if (!old_patch->replaceable)
+				continue;
 			err = klp_add_object_nops(patch, old_obj);
 			if (err)
 				return err;
@@ -830,6 +832,8 @@ void klp_free_replaced_patches_async(struct klp_patch *new_patch)
 	klp_for_each_patch_safe(old_patch, tmp_patch) {
 		if (old_patch == new_patch)
 			return;
+		if (!old_patch->replaceable)
+			continue;
 		klp_free_patch_async(old_patch);
 	}
 }
@@ -1232,6 +1236,8 @@ void klp_unpatch_replaced_patches(struct klp_patch *new_patch)
 		if (old_patch == new_patch)
 			return;
 
+		if (!old_patch->replaceable)
+			continue;
 		old_patch->enabled = false;
 		klp_unpatch_objects(old_patch);
 	}
-- 
2.43.5
Re: [RFC PATCH 2/2] livepatch: Implement livepatch hybrid mode
Posted by Petr Mladek 3 days, 4 hours ago
On Mon 2025-01-27 14:35:26, Yafang Shao wrote:
> The atomic replace livepatch mechanism was introduced to handle scenarios
> where we want to unload a specific livepatch without unloading others.
> However, its current implementation has significant shortcomings, making
> it less than ideal in practice. Below are the key downsides:

[...]

> In the hybrid mode:
> 
> - Specific livepatches can be marked as "non-replaceable" to ensure they
>   remain active and unaffected during replacements.
> 
> - Other livepatches can be marked as "replaceable," allowing targeted
>   replacements of only those patches.
> 
> This selective approach would reduce unnecessary transitions, lower the
> risk of temporary patch loss, and mitigate performance issues during
> livepatch replacement.
> 
> --- a/kernel/livepatch/core.c
> +++ b/kernel/livepatch/core.c
> @@ -658,6 +658,8 @@ static int klp_add_nops(struct klp_patch *patch)
>  		klp_for_each_object(old_patch, old_obj) {
>  			int err;
>  
> +			if (!old_patch->replaceable)
> +				continue;

This is one example where things might get very complicated.

The same function might be livepatched by more livepatches, see
ops->func_stack. For example, let's have funcA and three livepatches:
a
  + lp1:
	.replace = false,
	.non-replace = true,
	.func =	{
			.old_name = "funcA",
			.new_func = lp1_funcA,
		}, { }

  + lp2:
	.replace = false,
	.non-replace = false,
	.func =	{
			.old_name = "funcA",
			.new_func = lp2_funcA,
		},{
			.old_name = "funcB",
			.new_func = lp2_funcB,
		}, { }


  + lp3:
	.replace = true,
	.non-replace = false,
	.func =	{
			.old_name = "funcB",
			.new_func = lp3_funcB,
		}, { }


Now, apply lp1:

      + funcA() gets redirected to lp1_funcA()

Then, apply lp2

      + funcA() gets redirected to lp2_funcA()

Finally, apply lp3:

      + The proposed code would add "nop()" for
	funcA() because	it exists in lp2 and does not exist in lp3.

      + funcA() will get redirected to the original code
	because of the nop() during transition

      + nop() will get removed in klp_complete_transition() and
	funcA() will get suddenly redirected to lp1_funcA()
	because it will still be on ops->func_stack even
	after the "nop" and lp2_funcA() gets removed.

	   => The entire system will start using another funcA
	      implementation at some random time

	   => this would violate the consistency model


The proper solution might be tricky:

1. We would need to detect this situation and do _not_ add
   the "nop" for lp3 and funcA().

2. We would need a more complicate code for handling the task states.

   klp_update_patch_state() sets task->patch_state using
   the global "klp_target_state". But in the above example,
   when enabling lp3:

    + funcA would need to get transitioned _backward_:
	 KLP_TRANSITION_PATCHED -> KLP_TRANSITION_UNPATCHED
      , so that it goes on ops->func_stack:
	 lp2_funcA -> lp1->funcA

   while:

    + funcA would need to get transitioned forward:
	 KLP_TRANSITION_UNPATCHED -> KLP_TRANSITION_PATCHED
      , so that it goes on ops->func_stack:
	 lp2_funcB -> lp3->funcB


=> the hybrid mode would complicate the life for both livepatch
   creators/maintainers and kernel code developers/maintainers.

   I am afraid that this complexity is not acceptable if there are
   better solutions for the original problem.

>  			err = klp_add_object_nops(patch, old_obj);
>  			if (err)
>  				return err;

I am sorry but I am quite strongly against this approach!

Best Regards,
Petr
Re: [RFC PATCH 2/2] livepatch: Implement livepatch hybrid mode
Posted by Yafang Shao 3 days, 3 hours ago
On Mon, Jan 27, 2025 at 10:31 PM Petr Mladek <pmladek@suse.com> wrote:
>
> On Mon 2025-01-27 14:35:26, Yafang Shao wrote:
> > The atomic replace livepatch mechanism was introduced to handle scenarios
> > where we want to unload a specific livepatch without unloading others.
> > However, its current implementation has significant shortcomings, making
> > it less than ideal in practice. Below are the key downsides:
>
> [...]
>
> > In the hybrid mode:
> >
> > - Specific livepatches can be marked as "non-replaceable" to ensure they
> >   remain active and unaffected during replacements.
> >
> > - Other livepatches can be marked as "replaceable," allowing targeted
> >   replacements of only those patches.
> >
> > This selective approach would reduce unnecessary transitions, lower the
> > risk of temporary patch loss, and mitigate performance issues during
> > livepatch replacement.
> >
> > --- a/kernel/livepatch/core.c
> > +++ b/kernel/livepatch/core.c
> > @@ -658,6 +658,8 @@ static int klp_add_nops(struct klp_patch *patch)
> >               klp_for_each_object(old_patch, old_obj) {
> >                       int err;
> >
> > +                     if (!old_patch->replaceable)
> > +                             continue;
>
> This is one example where things might get very complicated.

Why does this example even exist in the first place?
If hybrid mode is enabled, this scenario could have been avoided entirely.

>
> The same function might be livepatched by more livepatches, see
> ops->func_stack. For example, let's have funcA and three livepatches:
> a
>   + lp1:
>         .replace = false,
>         .non-replace = true,
>         .func = {
>                         .old_name = "funcA",
>                         .new_func = lp1_funcA,
>                 }, { }
>
>   + lp2:
>         .replace = false,
>         .non-replace = false,
>         .func = {
>                         .old_name = "funcA",
>                         .new_func = lp2_funcA,
>                 },{
>                         .old_name = "funcB",
>                         .new_func = lp2_funcB,
>                 }, { }
>
>
>   + lp3:
>         .replace = true,
>         .non-replace = false,
>         .func = {
>                         .old_name = "funcB",
>                         .new_func = lp3_funcB,
>                 }, { }
>
>
> Now, apply lp1:
>
>       + funcA() gets redirected to lp1_funcA()
>
> Then, apply lp2
>
>       + funcA() gets redirected to lp2_funcA()
>
> Finally, apply lp3:
>
>       + The proposed code would add "nop()" for
>         funcA() because it exists in lp2 and does not exist in lp3.
>
>       + funcA() will get redirected to the original code
>         because of the nop() during transition
>
>       + nop() will get removed in klp_complete_transition() and
>         funcA() will get suddenly redirected to lp1_funcA()
>         because it will still be on ops->func_stack even
>         after the "nop" and lp2_funcA() gets removed.
>
>            => The entire system will start using another funcA
>               implementation at some random time
>
>            => this would violate the consistency model
>
>
> The proper solution might be tricky:
>
> 1. We would need to detect this situation and do _not_ add
>    the "nop" for lp3 and funcA().
>
> 2. We would need a more complicate code for handling the task states.
>
>    klp_update_patch_state() sets task->patch_state using
>    the global "klp_target_state". But in the above example,
>    when enabling lp3:
>
>     + funcA would need to get transitioned _backward_:
>          KLP_TRANSITION_PATCHED -> KLP_TRANSITION_UNPATCHED
>       , so that it goes on ops->func_stack:
>          lp2_funcA -> lp1->funcA
>
>    while:
>
>     + funcA would need to get transitioned forward:
>          KLP_TRANSITION_UNPATCHED -> KLP_TRANSITION_PATCHED
>       , so that it goes on ops->func_stack:
>          lp2_funcB -> lp3->funcB
>
>
> => the hybrid mode would complicate the life for both livepatch
>    creators/maintainers and kernel code developers/maintainers.
>

I don’t believe they should be held responsible for poor
configurations. These settings could have been avoided from the start.
There are countless tunable knobs in the system—should we try to
accommodate every possible combination? No, that’s not how systems are
designed to operate. Instead, we should identify and follow best
practices to ensure optimal functionality.

>    I am afraid that this complexity is not acceptable if there are
>    better solutions for the original problem.
>
> >                       err = klp_add_object_nops(patch, old_obj);
> >                       if (err)
> >                               return err;
>
> I am sorry but I am quite strongly against this approach!
>
> Best Regards,
> Petr



--
Regards
Yafang