The atomic replace livepatch mechanism was introduced to handle scenarios
where we want to unload a specific livepatch without unloading others.
However, its current implementation has significant shortcomings, making
it less than ideal in practice. Below are the key downsides:
- It is expensive
During testing with frequent replacements of an old livepatch, random RCU
warnings were observed:
[19578271.779605] rcu_tasks_wait_gp: rcu_tasks grace period 642409 is 10024 jiffies old.
[19578390.073790] rcu_tasks_wait_gp: rcu_tasks grace period 642417 is 10185 jiffies old.
[19578423.034065] rcu_tasks_wait_gp: rcu_tasks grace period 642421 is 10150 jiffies old.
[19578564.144591] rcu_tasks_wait_gp: rcu_tasks grace period 642449 is 10174 jiffies old.
[19578601.064614] rcu_tasks_wait_gp: rcu_tasks grace period 642453 is 10168 jiffies old.
[19578663.920123] rcu_tasks_wait_gp: rcu_tasks grace period 642469 is 10167 jiffies old.
[19578872.990496] rcu_tasks_wait_gp: rcu_tasks grace period 642529 is 10215 jiffies old.
[19578903.190292] rcu_tasks_wait_gp: rcu_tasks grace period 642529 is 40415 jiffies old.
[19579017.965500] rcu_tasks_wait_gp: rcu_tasks grace period 642577 is 10174 jiffies old.
[19579033.981425] rcu_tasks_wait_gp: rcu_tasks grace period 642581 is 10143 jiffies old.
[19579153.092599] rcu_tasks_wait_gp: rcu_tasks grace period 642625 is 10188 jiffies old.
This indicates that atomic replacement can cause performance issues,
particularly with RCU synchronization under frequent use.
- Potential Risks During Replacement
One known issue involves replacing livepatched versions of critical
functions such as do_exit(). During the replacement process, a panic
might occur, as highlighted in [0]. Other potential risks may also arise
due to inconsistencies or race conditions during transitions.
- Temporary Loss of Patching
During the replacement process, the old patch is set to a NOP (no-operation)
before the new patch is fully applied. This creates a window where the
function temporarily reverts to its original, unpatched state. If the old
patch fixed a critical issue (e.g., one that prevented a system panic), the
system could become vulnerable to that issue during the transition.
The current atomic replace approach replaces all old livepatches, even
when such a sweeping change is unnecessary. This can be improved by
introducing a hybrid mode, which allows atomic-replace and
non-atomic-replace livepatches to coexist.
In the hybrid mode:
- Specific livepatches can be marked as "non-replaceable" to ensure they
remain active and unaffected during replacements.
- Other livepatches can be marked as "replaceable," allowing targeted
replacements of only those patches.
This selective approach would reduce unnecessary transitions, lower the
risk of temporary patch loss, and mitigate performance issues during
livepatch replacement.
Link: https://lore.kernel.org/live-patching/CALOAHbA9WHPjeZKUcUkwULagQjTMfqAdAg+akqPzbZ7Byc=qrw@mail.gmail.com/ [0]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
kernel/livepatch/core.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 5e0c2caa0af8..f820b50c1b26 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -658,6 +658,8 @@ static int klp_add_nops(struct klp_patch *patch)
 	klp_for_each_object(old_patch, old_obj) {
 		int err;
 
+		if (!old_patch->replaceable)
+			continue;
 		err = klp_add_object_nops(patch, old_obj);
 		if (err)
 			return err;
@@ -830,6 +832,8 @@ void klp_free_replaced_patches_async(struct klp_patch *new_patch)
 	klp_for_each_patch_safe(old_patch, tmp_patch) {
 		if (old_patch == new_patch)
 			return;
+		if (!old_patch->replaceable)
+			continue;
 		klp_free_patch_async(old_patch);
 	}
 }
@@ -1232,6 +1236,8 @@ void klp_unpatch_replaced_patches(struct klp_patch *new_patch)
 		if (old_patch == new_patch)
 			return;
 
+		if (!old_patch->replaceable)
+			continue;
 		old_patch->enabled = false;
 		klp_unpatch_objects(old_patch);
 	}
--
2.43.5
On Mon, Jan 27, 2025 at 02:35:26PM +0800, Yafang Shao wrote:
> The atomic replace livepatch mechanism was introduced to handle scenarios
> where we want to unload a specific livepatch without unloading others.
> However, its current implementation has significant shortcomings, making
> it less than ideal in practice. Below are the key downsides:
>
> - It is expensive
>
> During testing with frequent replacements of an old livepatch, random RCU
> warnings were observed:
>
> [19578271.779605] rcu_tasks_wait_gp: rcu_tasks grace period 642409 is 10024 jiffies old.
> [19578390.073790] rcu_tasks_wait_gp: rcu_tasks grace period 642417 is 10185 jiffies old.
> [19578423.034065] rcu_tasks_wait_gp: rcu_tasks grace period 642421 is 10150 jiffies old.
> [19578564.144591] rcu_tasks_wait_gp: rcu_tasks grace period 642449 is 10174 jiffies old.
> [19578601.064614] rcu_tasks_wait_gp: rcu_tasks grace period 642453 is 10168 jiffies old.
> [19578663.920123] rcu_tasks_wait_gp: rcu_tasks grace period 642469 is 10167 jiffies old.
> [19578872.990496] rcu_tasks_wait_gp: rcu_tasks grace period 642529 is 10215 jiffies old.
> [19578903.190292] rcu_tasks_wait_gp: rcu_tasks grace period 642529 is 40415 jiffies old.
> [19579017.965500] rcu_tasks_wait_gp: rcu_tasks grace period 642577 is 10174 jiffies old.
> [19579033.981425] rcu_tasks_wait_gp: rcu_tasks grace period 642581 is 10143 jiffies old.
> [19579153.092599] rcu_tasks_wait_gp: rcu_tasks grace period 642625 is 10188 jiffies old.
>
> This indicates that atomic replacement can cause performance issues,
> particularly with RCU synchronization under frequent use.

Why does this happen?

> - Potential Risks During Replacement
>
> One known issue involves replacing livepatched versions of critical
> functions such as do_exit(). During the replacement process, a panic
> might occur, as highlighted in [0]. Other potential risks may also arise
> due to inconsistencies or race conditions during transitions.

That needs to be fixed.
> - Temporary Loss of Patching
>
> During the replacement process, the old patch is set to a NOP (no-operation)
> before the new patch is fully applied. This creates a window where the
> function temporarily reverts to its original, unpatched state. If the old
> patch fixed a critical issue (e.g., one that prevented a system panic), the
> system could become vulnerable to that issue during the transition.

Are you saying that atomic replace is not atomic? If so, this sounds
like another bug.

--
Josh
On Fri, Feb 7, 2025 at 10:31 AM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Mon, Jan 27, 2025 at 02:35:26PM +0800, Yafang Shao wrote:
> > The atomic replace livepatch mechanism was introduced to handle scenarios
> > where we want to unload a specific livepatch without unloading others.
> > However, its current implementation has significant shortcomings, making
> > it less than ideal in practice. Below are the key downsides:
> >
> > - It is expensive
> >
> > During testing with frequent replacements of an old livepatch, random RCU
> > warnings were observed:
> >
> > [19578271.779605] rcu_tasks_wait_gp: rcu_tasks grace period 642409 is 10024 jiffies old.
> > [19578390.073790] rcu_tasks_wait_gp: rcu_tasks grace period 642417 is 10185 jiffies old.
> > [19578423.034065] rcu_tasks_wait_gp: rcu_tasks grace period 642421 is 10150 jiffies old.
> > [19578564.144591] rcu_tasks_wait_gp: rcu_tasks grace period 642449 is 10174 jiffies old.
> > [19578601.064614] rcu_tasks_wait_gp: rcu_tasks grace period 642453 is 10168 jiffies old.
> > [19578663.920123] rcu_tasks_wait_gp: rcu_tasks grace period 642469 is 10167 jiffies old.
> > [19578872.990496] rcu_tasks_wait_gp: rcu_tasks grace period 642529 is 10215 jiffies old.
> > [19578903.190292] rcu_tasks_wait_gp: rcu_tasks grace period 642529 is 40415 jiffies old.
> > [19579017.965500] rcu_tasks_wait_gp: rcu_tasks grace period 642577 is 10174 jiffies old.
> > [19579033.981425] rcu_tasks_wait_gp: rcu_tasks grace period 642581 is 10143 jiffies old.
> > [19579153.092599] rcu_tasks_wait_gp: rcu_tasks grace period 642625 is 10188 jiffies old.
> >
> > This indicates that atomic replacement can cause performance issues,
> > particularly with RCU synchronization under frequent use.
>
> Why does this happen?
It occurs during the KLP transition. It seems like the KLP transition
is taking too long.
[20329703.332453] livepatch: enabling patch 'livepatch_61_release6'
[20329703.340417] livepatch: 'livepatch_61_release6': starting patching transition
[20329715.314215] rcu_tasks_wait_gp: rcu_tasks grace period 1109765 is 10166 jiffies old.
[20329737.126207] rcu_tasks_wait_gp: rcu_tasks grace period 1109769 is 10219 jiffies old.
[20329752.018236] rcu_tasks_wait_gp: rcu_tasks grace period 1109773 is 10199 jiffies old.
[20329754.848036] livepatch: 'livepatch_61_release6': patching complete
>
> > - Potential Risks During Replacement
> >
> > One known issue involves replacing livepatched versions of critical
> > functions such as do_exit(). During the replacement process, a panic
> > might occur, as highlighted in [0]. Other potential risks may also arise
> > due to inconsistencies or race conditions during transitions.
>
> That needs to be fixed.
>
> > - Temporary Loss of Patching
> >
> > During the replacement process, the old patch is set to a NOP (no-operation)
> > before the new patch is fully applied. This creates a window where the
> > function temporarily reverts to its original, unpatched state. If the old
> > patch fixed a critical issue (e.g., one that prevented a system panic), the
> > system could become vulnerable to that issue during the transition.
>
> Are you saying that atomic replace is not atomic? If so, this sounds
> like another bug.
From my understanding, there’s a window where the original function is
not patched.
klp_enable_patch
+ klp_init_patch
+ if (patch->replace)
klp_add_nops(patch); <<<< set all old patches to nop
+ __klp_enable_patch
+ klp_patch_object
+ klp_patch_func
+ ops = klp_find_ops(func->old_func);
+ if (ops)
// add the new patch to the func_stack list
list_add_rcu(&func->stack_node, &ops->func_stack);
klp_ftrace_handler
+ func = list_first_or_null_rcu(&ops->func_stack, struct klp_func
+ if (func->nop)
goto unlock;
+ ftrace_regs_set_instruction_pointer(fregs, (unsigned long)func->new_func);
Before the new atomic replace patch is added to the func_stack list,
the old patch is already set to nop. If klp_ftrace_handler() is
triggered at this point, it will effectively do nothing—in other
words, it will execute the original function.
I might be wrong.
--
Regards
Yafang
On Fri, Feb 07, 2025 at 11:16:45AM +0800, Yafang Shao wrote:
> On Fri, Feb 7, 2025 at 10:31 AM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > Why does this happen?
>
> It occurs during the KLP transition. It seems like the KLP transition
> is taking too long.
>
> [20329703.332453] livepatch: enabling patch 'livepatch_61_release6'
> [20329703.340417] livepatch: 'livepatch_61_release6': starting patching transition
> [20329715.314215] rcu_tasks_wait_gp: rcu_tasks grace period 1109765 is 10166 jiffies old.
> [20329737.126207] rcu_tasks_wait_gp: rcu_tasks grace period 1109769 is 10219 jiffies old.
> [20329752.018236] rcu_tasks_wait_gp: rcu_tasks grace period 1109773 is 10199 jiffies old.
> [20329754.848036] livepatch: 'livepatch_61_release6': patching complete

How specifically does the KLP transition trigger rcu_tasks warnings?

> Before the new atomic replace patch is added to the func_stack list,
> the old patch is already set to nop. If klp_ftrace_handler() is
> triggered at this point, it will effectively do nothing—in other
> words, it will execute the original function.
> I might be wrong.

That's not actually how it works. klp_add_nops() probably needs some
better comments.

It adds nops to the *new* patch so that all the functions in the old
patch(es) get replaced, even those which don't have a corresponding
function in the new patch.

The justification for your patch seems to be "here are some bugs, this
patch helps work around them", which isn't very convincing. Instead we
need to understand the original bugs and fix them.

--
Josh
On Sat, Feb 8, 2025 at 12:59 AM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Fri, Feb 07, 2025 at 11:16:45AM +0800, Yafang Shao wrote:
> > On Fri, Feb 7, 2025 at 10:31 AM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > > Why does this happen?
> >
> > It occurs during the KLP transition. It seems like the KLP transition
> > is taking too long.
> >
> > [20329703.332453] livepatch: enabling patch 'livepatch_61_release6'
> > [20329703.340417] livepatch: 'livepatch_61_release6': starting patching transition
> > [20329715.314215] rcu_tasks_wait_gp: rcu_tasks grace period 1109765 is 10166 jiffies old.
> > [20329737.126207] rcu_tasks_wait_gp: rcu_tasks grace period 1109769 is 10219 jiffies old.
> > [20329752.018236] rcu_tasks_wait_gp: rcu_tasks grace period 1109773 is 10199 jiffies old.
> > [20329754.848036] livepatch: 'livepatch_61_release6': patching complete
>
> How specifically does the KLP transition trigger rcu_tasks warnings?
I believe the reason is that the livepatch transition holds the
tasklist_lock spinlock for too long, though I haven't tried to prove it
yet.
>
> > Before the new atomic replace patch is added to the func_stack list,
> > the old patch is already set to nop. If klp_ftrace_handler() is
> > triggered at this point, it will effectively do nothing—in other
> > words, it will execute the original function.
> > I might be wrong.
>
> That's not actually how it works. klp_add_nops() probably needs some
> better comments.
With Petr's help, I now understand how it works.
What do you think about adding the following comments? These comments
are copied from Petr's reply [0].
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index f2071a20e5f0..64a026af53e1 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -559,6 +559,9 @@ static int klp_add_object_nops(struct klp_patch *patch,
  * Add 'nop' functions which simply return to the caller to run
  * the original function. The 'nop' functions are added to a
  * patch to facilitate a 'replace' mode.
+ *
+ * The nop entries are added only for functions which are currently
+ * livepatched but are no longer livepatched in the new livepatch.
  */
 static int klp_add_nops(struct klp_patch *patch)
 {
[0]. https://lore.kernel.org/live-patching/CALOAHbBs5sfAxSw4HA6KwjWbH3GmhkHJqcni0d4iB7oVZ_3vjw@mail.gmail.com/T/#m96263bd4e0b2a781e5847aee4fe74f7a17ed186c
>
> It adds nops to the *new* patch so that all the functions in the old
> patch(es) get replaced, even those which don't have a corresponding
> function in the new patch.
>
> The justification for your patch seems to be "here are some bugs, this
> patch helps work around them", which isn't very convincing.
The statement "here are some bugs, the hybrid model can work around
them" is correct. However, the most important part is "This selective
approach would reduce unnecessary transitions," which will be a
valuable improvement to the livepatch system.
In the old 4.19 kernel, we faced an issue where we had to unload an
already-loaded livepatch and replace it with a new one. Without atomic
replace, we had to first unload the old livepatch (along with others)
and then load the new one, which was less than ideal. In the 6.1
kernel, atomic replace is supported, and it works perfectly for that
situation. This is why we prefer using the atomic replace model over
the old one.
However, atomic replace is not entirely ideal, as it replaces all old
livepatches, even if they are not relevant to the new changes. In
reality, we should not replace livepatches unless they are relevant.
We only need to replace the ones that conflict with the new livepatch.
This is where the hybrid model excels, allowing us to achieve this
selective replacement.
> Instead we
> need to understand the original bugs and fix them.
Yes, we should continue working on fixing them.
--
Regards
Yafang
On Fri 2025-02-07 11:16:45, Yafang Shao wrote:
> On Fri, Feb 7, 2025 at 10:31 AM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > On Mon, Jan 27, 2025 at 02:35:26PM +0800, Yafang Shao wrote:
> > > - Temporary Loss of Patching
> > >
> > > During the replacement process, the old patch is set to a NOP (no-operation)
> > > before the new patch is fully applied. This creates a window where the
> > > function temporarily reverts to its original, unpatched state. If the old
> > > patch fixed a critical issue (e.g., one that prevented a system panic), the
> > > system could become vulnerable to that issue during the transition.
> >
> > Are you saying that atomic replace is not atomic? If so, this sounds
> > like another bug.
>
> From my understanding, there’s a window where the original function is
> not patched.
This is a misunderstanding.
> klp_enable_patch
> + klp_init_patch
> + if (patch->replace)
> klp_add_nops(patch); <<<< set all old patches to nop
1. The "nop" entry is added into the _new_ (to-be-enabled) livepatch,
see klp_add_nops(patch). The parameter is the _newly_ enabled patch.
2. The "nop" entries are added only for functions which are currently
livepatched but are no longer livepatched in the new
livepatch, see:
static int klp_add_object_nops(struct klp_patch *patch,
struct klp_object *old_obj)
{
[...]
klp_for_each_func(old_obj, old_func) {
func = klp_find_func(obj, old_func);
if (func)
continue; <------ Do not allocate a nop
when the function is
implemented in the new
livepatch.
func = klp_alloc_func_nop(old_func, obj);
if (!func)
return -ENOMEM;
}
return 0;
}
> + __klp_enable_patch
> + klp_patch_object
> + klp_patch_func
> + ops = klp_find_ops(func->old_func);
> + if (ops)
> // add the new patch to the func_stack list
> list_add_rcu(&func->stack_node, &ops->func_stack);
>
>
> klp_ftrace_handler
> + func = list_first_or_null_rcu(&ops->func_stack, struct klp_func
3. You omitted this important part of the code:
if (unlikely(func->transition)) {
patch_state = current->patch_state;
if (patch_state == KLP_TRANSITION_UNPATCHED) {
/*
----> * Use the previously patched version of the function.
----> * If no previous patches exist, continue with the
----> * original function.
*/
func = list_entry_rcu(func->stack_node.next,
struct klp_func, stack_node);
The condition "patch_state == KLP_TRANSITION_UNPATCHED" might
be a bit misleading.
The state "KLP_TRANSITION_UNPATCHED" means that it can't use
the code from the "new" livepatch => it has to fallback
to the previously used code => previous livepatch.
> + if (func->nop)
> goto unlock;
> + ftrace_regs_set_instruction_pointer(fregs, (unsigned long)func->new_func);
> Before the new atomic replace patch is added to the func_stack list,
> the old patch is already set to nop.
^^^
The nops are set in the _new_ patch for functions which will
no longer get livepatched, see the commit e1452b607c48c642
("livepatch: Add atomic replace") for more details.
> If klp_ftrace_handler() is
> triggered at this point, it will effectively do nothing—in other
> words, it will execute the original function.
> I might be wrong.
Fortunately, you are wrong. This would be a serious violation of
the consistency model and livepatches modifying some semantic would
blow up systems.
Best Regards,
Petr
On Fri, Feb 7, 2025 at 5:36 PM Petr Mladek <pmladek@suse.com> wrote:
>
> On Fri 2025-02-07 11:16:45, Yafang Shao wrote:
> > On Fri, Feb 7, 2025 at 10:31 AM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > > On Mon, Jan 27, 2025 at 02:35:26PM +0800, Yafang Shao wrote:
> > > > - Temporary Loss of Patching
> > > >
> > > > During the replacement process, the old patch is set to a NOP (no-operation)
> > > > before the new patch is fully applied. This creates a window where the
> > > > function temporarily reverts to its original, unpatched state. If the old
> > > > patch fixed a critical issue (e.g., one that prevented a system panic), the
> > > > system could become vulnerable to that issue during the transition.
> > >
> > > Are you saying that atomic replace is not atomic? If so, this sounds
> > > like another bug.
> >
> > From my understanding, there’s a window where the original function is
> > not patched.
>
> This is a misunderstanding.
>
> > klp_enable_patch
> > + klp_init_patch
> > + if (patch->replace)
> > klp_add_nops(patch); <<<< set all old patches to nop
>
> 1. The "nop" entry is added into the _new_ (to-be-enabled) livepatch,
> see klp_add_nops(patch). The parameter is the _newly_ enabled patch.
>
> 2. The "nop" entries are added only for functions which are currently
> livepatched but are no longer livepatched in the new
> livepatch, see:
>
> static int klp_add_object_nops(struct klp_patch *patch,
> struct klp_object *old_obj)
> {
> [...]
> klp_for_each_func(old_obj, old_func) {
> func = klp_find_func(obj, old_func);
> if (func)
> continue; <------ Do not allocate a nop
> when the function is
> implemented in the new
> livepatch.
>
> func = klp_alloc_func_nop(old_func, obj);
> if (!func)
> return -ENOMEM;
> }
>
> return 0;
> }
Thanks for your explanation.
>
>
> > + __klp_enable_patch
> > + klp_patch_object
> > + klp_patch_func
> > + ops = klp_find_ops(func->old_func);
> > + if (ops)
> > // add the new patch to the func_stack list
> > list_add_rcu(&func->stack_node, &ops->func_stack);
> >
> >
> > klp_ftrace_handler
> > + func = list_first_or_null_rcu(&ops->func_stack, struct klp_func
>
> 3. You omitted this important part of the code:
>
> if (unlikely(func->transition)) {
> patch_state = current->patch_state;
> if (patch_state == KLP_TRANSITION_UNPATCHED) {
> /*
> ----> * Use the previously patched version of the function.
> ----> * If no previous patches exist, continue with the
> ----> * original function.
> */
> func = list_entry_rcu(func->stack_node.next,
> struct klp_func, stack_node);
>
>
> The condition "patch_state == KLP_TRANSITION_UNPATCHED" might
> be a bit misleading.
>
> The state "KLP_TRANSITION_UNPATCHED" means that it can't use
> the code from the "new" livepatch => it has to fallback
> to the previously used code => previous livepatch.
Understood.
>
>
> > + if (func->nop)
> > goto unlock;
> > + ftrace_regs_set_instruction_pointer(fregs, (unsigned long)func->new_func);
>
> > Before the new atomic replace patch is added to the func_stack list,
> > the old patch is already set to nop.
> ^^^
>
> The nops are set in the _new_ patch for functions which will
> no longer get livepatched, see the commit e1452b607c48c642
> ("livepatch: Add atomic replace") for more details.
>
> > If klp_ftrace_handler() is
> > triggered at this point, it will effectively do nothing—in other
> > words, it will execute the original function.
> > I might be wrong.
>
> Fortunately, you are wrong. This would be a serious violation of
> the consistency model and livepatches modifying some semantic would
> blow up systems.
That is great. Thanks for your help.
--
Regards
Yafang
On Mon 2025-01-27 14:35:26, Yafang Shao wrote:
> The atomic replace livepatch mechanism was introduced to handle scenarios
> where we want to unload a specific livepatch without unloading others.
> However, its current implementation has significant shortcomings, making
> it less than ideal in practice. Below are the key downsides:
[...]
> In the hybrid mode:
>
> - Specific livepatches can be marked as "non-replaceable" to ensure they
> remain active and unaffected during replacements.
>
> - Other livepatches can be marked as "replaceable," allowing targeted
> replacements of only those patches.
>
> This selective approach would reduce unnecessary transitions, lower the
> risk of temporary patch loss, and mitigate performance issues during
> livepatch replacement.
>
> --- a/kernel/livepatch/core.c
> +++ b/kernel/livepatch/core.c
> @@ -658,6 +658,8 @@ static int klp_add_nops(struct klp_patch *patch)
> klp_for_each_object(old_patch, old_obj) {
> int err;
>
> + if (!old_patch->replaceable)
> + continue;
This is one example where things might get very complicated.
The same function might be livepatched by more livepatches, see
ops->func_stack. For example, let's have funcA and three livepatches:
+ lp1:
.replace = false,
.non-replace = true,
.func = {
.old_name = "funcA",
.new_func = lp1_funcA,
}, { }
+ lp2:
.replace = false,
.non-replace = false,
.func = {
.old_name = "funcA",
.new_func = lp2_funcA,
},{
.old_name = "funcB",
.new_func = lp2_funcB,
}, { }
+ lp3:
.replace = true,
.non-replace = false,
.func = {
.old_name = "funcB",
.new_func = lp3_funcB,
}, { }
Now, apply lp1:
+ funcA() gets redirected to lp1_funcA()
Then, apply lp2
+ funcA() gets redirected to lp2_funcA()
Finally, apply lp3:
+ The proposed code would add "nop()" for
funcA() because it exists in lp2 and does not exist in lp3.
+ funcA() will get redirected to the original code
because of the nop() during transition
+ nop() will get removed in klp_complete_transition() and
funcA() will get suddenly redirected to lp1_funcA()
because it will still be on ops->func_stack even
after the "nop" and lp2_funcA() gets removed.
=> The entire system will start using another funcA
implementation at some random time
=> this would violate the consistency model
The proper solution might be tricky:
1. We would need to detect this situation and do _not_ add
the "nop" for lp3 and funcA().
2. We would need more complicated code for handling the task states.
klp_update_patch_state() sets task->patch_state using
the global "klp_target_state". But in the above example,
when enabling lp3:
+ funcA would need to get transitioned _backward_:
KLP_TRANSITION_PATCHED -> KLP_TRANSITION_UNPATCHED
, so that it goes on ops->func_stack:
lp2_funcA -> lp1_funcA
while:
+ funcB would need to get transitioned forward:
KLP_TRANSITION_UNPATCHED -> KLP_TRANSITION_PATCHED
, so that it goes on ops->func_stack:
lp2_funcB -> lp3_funcB
=> the hybrid mode would complicate life for both livepatch
creators/maintainers and kernel code developers/maintainers.
I am afraid that this complexity is not acceptable if there are
better solutions for the original problem.
> err = klp_add_object_nops(patch, old_obj);
> if (err)
> return err;
I am sorry but I am quite strongly against this approach!
Best Regards,
Petr
On Mon, Jan 27, 2025 at 10:31 PM Petr Mladek <pmladek@suse.com> wrote:
>
> On Mon 2025-01-27 14:35:26, Yafang Shao wrote:
> > The atomic replace livepatch mechanism was introduced to handle scenarios
> > where we want to unload a specific livepatch without unloading others.
> > However, its current implementation has significant shortcomings, making
> > it less than ideal in practice. Below are the key downsides:
>
> [...]
>
> > In the hybrid mode:
> >
> > - Specific livepatches can be marked as "non-replaceable" to ensure they
> > remain active and unaffected during replacements.
> >
> > - Other livepatches can be marked as "replaceable," allowing targeted
> > replacements of only those patches.
> >
> > This selective approach would reduce unnecessary transitions, lower the
> > risk of temporary patch loss, and mitigate performance issues during
> > livepatch replacement.
> >
> > --- a/kernel/livepatch/core.c
> > +++ b/kernel/livepatch/core.c
> > @@ -658,6 +658,8 @@ static int klp_add_nops(struct klp_patch *patch)
> > klp_for_each_object(old_patch, old_obj) {
> > int err;
> >
> > + if (!old_patch->replaceable)
> > + continue;
>
> This is one example where things might get very complicated.
Why does this example even exist in the first place?
If hybrid mode is enabled, this scenario could have been avoided entirely.
>
> The same function might be livepatched by more livepatches, see
> ops->func_stack. For example, let's have funcA and three livepatches:
> + lp1:
> .replace = false,
> .non-replace = true,
> .func = {
> .old_name = "funcA",
> .new_func = lp1_funcA,
> }, { }
>
> + lp2:
> .replace = false,
> .non-replace = false,
> .func = {
> .old_name = "funcA",
> .new_func = lp2_funcA,
> },{
> .old_name = "funcB",
> .new_func = lp2_funcB,
> }, { }
>
>
> + lp3:
> .replace = true,
> .non-replace = false,
> .func = {
> .old_name = "funcB",
> .new_func = lp3_funcB,
> }, { }
>
>
> Now, apply lp1:
>
> + funcA() gets redirected to lp1_funcA()
>
> Then, apply lp2
>
> + funcA() gets redirected to lp2_funcA()
>
> Finally, apply lp3:
>
> + The proposed code would add "nop()" for
> funcA() because it exists in lp2 and does not exist in lp3.
>
> + funcA() will get redirected to the original code
> because of the nop() during transition
>
> + nop() will get removed in klp_complete_transition() and
> funcA() will get suddenly redirected to lp1_funcA()
> because it will still be on ops->func_stack even
> after the "nop" and lp2_funcA() gets removed.
>
> => The entire system will start using another funcA
> implementation at some random time
>
> => this would violate the consistency model
>
>
> The proper solution might be tricky:
>
> 1. We would need to detect this situation and do _not_ add
> the "nop" for lp3 and funcA().
>
> 2. We would need more complicated code for handling the task states.
>
> klp_update_patch_state() sets task->patch_state using
> the global "klp_target_state". But in the above example,
> when enabling lp3:
>
> + funcA would need to get transitioned _backward_:
> KLP_TRANSITION_PATCHED -> KLP_TRANSITION_UNPATCHED
> , so that it goes on ops->func_stack:
> lp2_funcA -> lp1_funcA
>
> while:
>
> + funcB would need to get transitioned forward:
> KLP_TRANSITION_UNPATCHED -> KLP_TRANSITION_PATCHED
> , so that it goes on ops->func_stack:
> lp2_funcB -> lp3_funcB
>
>
> => the hybrid mode would complicate life for both livepatch
> creators/maintainers and kernel code developers/maintainers.
>
I don’t believe they should be held responsible for poor
configurations. These settings could have been avoided from the start.
There are countless tunable knobs in the system—should we try to
accommodate every possible combination? No, that’s not how systems are
designed to operate. Instead, we should identify and follow best
practices to ensure optimal functionality.
> I am afraid that this complexity is not acceptable if there are
> better solutions for the original problem.
>
> > err = klp_add_object_nops(patch, old_obj);
> > if (err)
> > return err;
>
> I am sorry but I am quite strongly against this approach!
>
> Best Regards,
> Petr
--
Regards
Yafang
On Mon 2025-01-27 23:34:50, Yafang Shao wrote:
> On Mon, Jan 27, 2025 at 10:31 PM Petr Mladek <pmladek@suse.com> wrote:
> >
> > On Mon 2025-01-27 14:35:26, Yafang Shao wrote:
> > > The atomic replace livepatch mechanism was introduced to handle scenarios
> > > where we want to unload a specific livepatch without unloading others.
> > > However, its current implementation has significant shortcomings, making
> > > it less than ideal in practice. Below are the key downsides:
> >
> > [...]
> >
> > > In the hybrid mode:
> > >
> > > - Specific livepatches can be marked as "non-replaceable" to ensure they
> > > remain active and unaffected during replacements.
> > >
> > > - Other livepatches can be marked as "replaceable," allowing targeted
> > > replacements of only those patches.
> > >
> > > This selective approach would reduce unnecessary transitions, lower the
> > > risk of temporary patch loss, and mitigate performance issues during
> > > livepatch replacement.
> > >
> > > --- a/kernel/livepatch/core.c
> > > +++ b/kernel/livepatch/core.c
> > > @@ -658,6 +658,8 @@ static int klp_add_nops(struct klp_patch *patch)
> > > klp_for_each_object(old_patch, old_obj) {
> > > int err;
> > >
> > > + if (!old_patch->replaceable)
> > > + continue;
> >
> > This is one example where things might get very complicated.
>
> Why does this example even exist in the first place?
> If hybrid mode is enabled, this scenario could have been avoided entirely.
How exactly could this scenario be avoided, please?
In the real life, livepatches are used to fix many bugs.
And every bug is fixed by livepatching several functions.
Fix1 livepatches: funcA
Fix2 livepatches: funcA, funcB
Fix3 livepatches: funcB
How would you handle this with the hybrid model?
Which fix will be handled by livepatches installed in parallel?
Which fix will be handled by atomic replace?
How would you keep it consistent?
How would it work when there are hundreds of fixes and thousands
of livepatched functions?
Where exactly is the advantage of the hybrid model?
> >
> > The same function might be livepatched by more livepatches, see
> > ops->func_stack. For example, let's have funcA and three livepatches:
> > + lp1:
> > .replace = false,
> > .non-replace = true,
> > .func = {
> > .old_name = "funcA",
> > .new_func = lp1_funcA,
> > }, { }
> >
> > + lp2:
> > .replace = false,
> > .non-replace = false,
> > .func = {
> > .old_name = "funcA",
> > .new_func = lp2_funcA,
> > },{
> > .old_name = "funcB",
> > .new_func = lp2_funcB,
> > }, { }
> >
> >
> > + lp3:
> > .replace = true,
> > .non-replace = false,
> > .func = {
> > .old_name = "funcB",
> > .new_func = lp3_funcB,
> > }, { }
> >
> >
> > Now, apply lp1:
> >
> > + funcA() gets redirected to lp1_funcA()
> >
> > Then, apply lp2
> >
> > + funcA() gets redirected to lp2_funcA()
> >
> > Finally, apply lp3:
> >
> > + The proposed code would add "nop()" for
> > funcA() because it exists in lp2 and does not exist in lp3.
> >
> > + funcA() will get redirected to the original code
> > because of the nop() during transition
> >
> > + nop() will get removed in klp_complete_transition() and
> > funcA() will get suddenly redirected to lp1_funcA()
> > because it will still be on ops->func_stack even
> > after the "nop" and lp2_funcA() gets removed.
> >
> > => The entire system will start using another funcA
> > implementation at some random time
> >
> > => this would violate the consistency model
> >
> >
> > The proper solution might be tricky:
> >
> > 1. We would need to detect this situation and do _not_ add
> > the "nop" for lp3 and funcA().
> >
> > 2. We would need more complicated code for handling the task states.
> >
> > klp_update_patch_state() sets task->patch_state using
> > the global "klp_target_state". But in the above example,
> > when enabling lp3:
> >
> > + funcA would need to get transitioned _backward_:
> > KLP_TRANSITION_PATCHED -> KLP_TRANSITION_UNPATCHED
> > , so that it goes on ops->func_stack:
> > lp2_funcA -> lp1->funcA
> >
> > while:
> >
> > + funcA would need to get transitioned forward:
> > KLP_TRANSITION_UNPATCHED -> KLP_TRANSITION_PATCHED
> > , so that it goes on ops->func_stack:
> > lp2_funcB -> lp3->funcB
> >
> >
> > => the hybrid mode would complicate the life for both livepatch
> > creators/maintainers and kernel code developers/maintainers.
> >
>
> I don’t believe they should be held responsible for poor
> configurations. These settings could have been avoided from the start.
> There are countless tunable knobs in the system—should we try to
> accommodate every possible combination? No, that’s not how systems are
> designed to operate. Instead, we should identify and follow best
> practices to ensure optimal functionality.
I am sorry but I do not understand the above paragraph at all.
Livepatches modify functions.
How is it related to system configuration or tunable knobs?
What best practices are you talking about, please?
Best Regards,
Petr
On Tue, Feb 4, 2025 at 9:21 PM Petr Mladek <pmladek@suse.com> wrote:
>
> On Mon 2025-01-27 23:34:50, Yafang Shao wrote:
> > On Mon, Jan 27, 2025 at 10:31 PM Petr Mladek <pmladek@suse.com> wrote:
> > >
> > > On Mon 2025-01-27 14:35:26, Yafang Shao wrote:
> > > > The atomic replace livepatch mechanism was introduced to handle scenarios
> > > > where we want to unload a specific livepatch without unloading others.
> > > > However, its current implementation has significant shortcomings, making
> > > > it less than ideal in practice. Below are the key downsides:
> > >
> > > [...]
> > >
> > > > In the hybrid mode:
> > > >
> > > > - Specific livepatches can be marked as "non-replaceable" to ensure they
> > > > remain active and unaffected during replacements.
> > > >
> > > > - Other livepatches can be marked as "replaceable," allowing targeted
> > > > replacements of only those patches.
> > > >
> > > > This selective approach would reduce unnecessary transitions, lower the
> > > > risk of temporary patch loss, and mitigate performance issues during
> > > > livepatch replacement.
> > > >
> > > > --- a/kernel/livepatch/core.c
> > > > +++ b/kernel/livepatch/core.c
> > > > @@ -658,6 +658,8 @@ static int klp_add_nops(struct klp_patch *patch)
> > > > klp_for_each_object(old_patch, old_obj) {
> > > > int err;
> > > >
> > > > + if (!old_patch->replaceable)
> > > > + continue;
> > >
> > > This is one example where things might get very complicated.
> >
> > Why does this example even exist in the first place?
> > If hybrid mode is enabled, this scenario could have been avoided entirely.
>
You have many questions, but it seems you haven’t taken the time to read even
a single line of this patchset. I’ll try to address them to save you some time.
> How exactly this scenario could be avoided, please?
>
> In the real life, livepatches are used to fix many bugs.
> And every bug is fixed by livepatching several functions.
>
> Fix1 livepatches: funcA
> Fix2 livepatches: funcA, funcB
> Fix3 livepatches: funcB
>
> How would you handle this with the hybrid model?
In your scenario, each Fix will replace the applied livepatches, so
they must be set to replaceable.
To clarify the hybrid model, I'll illustrate it with the following Fixes:
Fix1 livepatches: funcA
Fix2 livepatches: funcC
Fix3 livepatches: funcA, funcB
Fix4 livepatches: funcD
Fix5 livepatches: funcB
Each FixN represents a single /sys/kernel/livepatch/FixN.
By default, all Fixes are set to non-replaceable.
Step-by-step process:
1. Loading Fix1
Loaded Fixes: Fix1
2. Loading Fix2
Loaded Fixes: Fix1 Fix2
3. Loading Fix3
Before loading, set Fix1 to replaceable and replace it with Fix3,
which is a cumulative patch of Fix1 and Fix3.
Loaded Fixes: Fix2 Fix3
4. Loading Fix4
Loaded Fixes: Fix2 Fix3 Fix4
5. Loading Fix5
Similar to Step 3, set Fix3 to replaceable and replace it with Fix5.
Loaded Fixes: Fix2 Fix4 Fix5
This hybrid model ensures that irrelevant Fixes remain unaffected
during replacements.
>
> Which fix will be handled by livepatches installed in parallel?
> Which fix will be handled by atomic replace?
> How would you keep it consistent?
>
> How would it work when there are hundreds of fixes and thousands
> of livepatched functions?
The key point is that if a new Fix modifies a function already changed
by previous Fixes, all the affected old Fixes should be set to
replaceable, merged into the new Fix, and then the old Fixes should be
replaced with the new one.
>
> Where exactly is the advantage of the hybrid model?
That can be seen as a form of "deferred replacement"—in other words,
only replacing the old Fixes when absolutely necessary. This approach
can significantly reduce unnecessary work.
>
> > >
> > > The same function might be livepatched by more livepatches, see
> > > ops->func_stack. For example, let's have funcA and three livepatches:
> > > + lp1:
> > > .replace = false,
> > > .non-replace = true,
> > > .func = {
> > > .old_name = "funcA",
> > > .new_func = lp1_funcA,
> > > }, { }
> > >
> > > + lp2:
> > > .replace = false,
> > > .non-replace = false,
> > > .func = {
> > > .old_name = "funcA",
> > > .new_func = lp2_funcA,
> > > },{
> > > .old_name = "funcB",
> > > .new_func = lp2_funcB,
> > > }, { }
> > >
> > >
> > > + lp3:
> > > .replace = true,
> > > .non-replace = false,
> > > .func = {
> > > .old_name = "funcB",
> > > .new_func = lp3_funcB,
> > > }, { }
> > >
> > >
> > > Now, apply lp1:
> > >
> > > + funcA() gets redirected to lp1_funcA()
> > >
> > > Then, apply lp2
> > >
> > > + funcA() gets redirected to lp2_funcA()
> > >
> > > Finally, apply lp3:
> > >
> > > + The proposed code would add "nop()" for
> > > funcA() because it exists in lp2 and does not exist in lp3.
> > >
> > > + funcA() will get redirected to the original code
> > > because of the nop() during transition
> > >
> > > + nop() will get removed in klp_complete_transition() and
> > > funcA() will get suddenly redirected to lp1_funcA()
> > > because it will still be on ops->func_stack even
> > > after the "nop" and lp2_funcA() gets removed.
> > >
> > > => The entire system will start using another funcA
> > > implementation at some random time
> > >
> > > => this would violate the consistency model
> > >
> > >
> > > The proper solution might be tricky:
> > >
> > > 1. We would need to detect this situation and do _not_ add
> > > the "nop" for lp3 and funcA().
> > >
> > > 2. We would need more complicated code for handling the task states.
> > >
> > > klp_update_patch_state() sets task->patch_state using
> > > the global "klp_target_state". But in the above example,
> > > when enabling lp3:
> > >
> > > + funcA would need to get transitioned _backward_:
> > > KLP_TRANSITION_PATCHED -> KLP_TRANSITION_UNPATCHED
> > > , so that it goes on ops->func_stack:
> > > lp2_funcA -> lp1->funcA
> > >
> > > while:
> > >
> > > + funcA would need to get transitioned forward:
> > > KLP_TRANSITION_UNPATCHED -> KLP_TRANSITION_PATCHED
> > > , so that it goes on ops->func_stack:
> > > lp2_funcB -> lp3->funcB
> > >
> > >
> > > => the hybrid mode would complicate the life for both livepatch
> > > creators/maintainers and kernel code developers/maintainers.
> > >
> >
> > I don’t believe they should be held responsible for poor
> > configurations. These settings could have been avoided from the start.
> > There are countless tunable knobs in the system—should we try to
> > accommodate every possible combination? No, that’s not how systems are
> > designed to operate. Instead, we should identify and follow best
> > practices to ensure optimal functionality.
>
> I am sorry but I do not understand the above paragraph at all.
>
> Livepatches modify functions.
> How is it related to system configuration or tunable knobs?
/sys/kernel/livepatch/FixN/replaceable is a tunable knob.
+static struct kobj_attribute replaceable_kobj_attr = __ATTR_RW(replaceable);
> What best practices are you talking about, please?
As mentioned earlier.
--
Regards
Yafang
On Wed 2025-02-05 10:54:47, Yafang Shao wrote:
> On Tue, Feb 4, 2025 at 9:21 PM Petr Mladek <pmladek@suse.com> wrote:
> >
> > On Mon 2025-01-27 23:34:50, Yafang Shao wrote:
> > > On Mon, Jan 27, 2025 at 10:31 PM Petr Mladek <pmladek@suse.com> wrote:
> > > >
> > > > On Mon 2025-01-27 14:35:26, Yafang Shao wrote:
> > > > > The atomic replace livepatch mechanism was introduced to handle scenarios
> > > > > where we want to unload a specific livepatch without unloading others.
> > > > > However, its current implementation has significant shortcomings, making
> > > > > it less than ideal in practice. Below are the key downsides:
> > > >
> > > > [...]
> > > >
> > > > > In the hybrid mode:
> > > > >
> > > > > - Specific livepatches can be marked as "non-replaceable" to ensure they
> > > > > remain active and unaffected during replacements.
> > > > >
> > > > > - Other livepatches can be marked as "replaceable," allowing targeted
> > > > > replacements of only those patches.
> > > > >
> > > > > This selective approach would reduce unnecessary transitions, lower the
> > > > > risk of temporary patch loss, and mitigate performance issues during
> > > > > livepatch replacement.
> > > > >
> > > > > --- a/kernel/livepatch/core.c
> > > > > +++ b/kernel/livepatch/core.c
> > > > > @@ -658,6 +658,8 @@ static int klp_add_nops(struct klp_patch *patch)
> > > > > klp_for_each_object(old_patch, old_obj) {
> > > > > int err;
> > > > >
> > > > > + if (!old_patch->replaceable)
> > > > > + continue;
> > > >
> > > > This is one example where things might get very complicated.
> > >
> > > Why does this example even exist in the first place?
> > > If hybrid mode is enabled, this scenario could have been avoided entirely.
> >
>
> You have many questions, but it seems you haven’t taken the time to read even
> a single line of this patchset. I’ll try to address them to save you some time.
What?
> > How exactly this scenario could be avoided, please?
> >
> > In the real life, livepatches are used to fix many bugs.
> > And every bug is fixed by livepatching several functions.
> >
> > Fix1 livepatches: funcA
> > Fix2 livepatches: funcA, funcB
> > Fix3 livepatches: funcB
> >
> > How would you handle this with the hybrid model?
>
> In your scenario, each Fix will replace the applied livepatches, so
> they must be set to replaceable.
> To clarify the hybrid model, I'll illustrate it with the following Fixes:
>
> Fix1 livepatches: funcA
> Fix2 livepatches: funcC
> Fix3 livepatches: funcA, funcB
How does the livepatched funcA look here?
Does it contain changes only from Fix3?
Or is it cumulative, i.e. the livepatched funcA includes both Fix1 + Fix3?
> Fix4 livepatches: funcD
> Fix5 livepatches: funcB
>
> Each FixN represents a single /sys/kernel/livepatch/FixN.
> By default, all Fixes are set to non-replaceable.
>
> Step-by-step process:
> 1. Loading Fix1
> Loaded Fixes: Fix1
> 2. Loading Fix2
> Loaded Fixes: Fix1 Fix2
> 3. Loading Fix3
> Before loading, set Fix1 to replaceable and replace it with Fix3,
> which is a cumulative patch of Fix1 and Fix3.
Who will modify the "replaceable" flag of "Fix1"? Userspace or kernel?
How the livepatch modules would get installed/removed to/from the
filesystem?
We (SUSE) distribute the livepatches using RPM packages. Every new
version of the livepatch module is packaged in an RPM package with
the same name and a higher version. A post-install script loads
the module into the kernel and removes disabled livepatch modules.
The package update causes the new version of the livepatch module to
be enabled and the old version to be removed. The old version of the
module is also removed from the filesystem together with the old
version of the RPM package.
This matches the behavior of the atomic replace. There is always only
one version of the livepatch RPM package installed and only one
livepatch module loaded/enabled. And when it works correctly, then
the version of the installed package matches the version of the loaded
livepatch module.
This might be hard to achieve with the hybrid model. Every livepatch
module would need to be packaged in a separate (different name)
RPM package. And some userspace service would need to clean up both
kernel modules and livepatch RPM packages (remove the unused ones).
This might add a lot of extra complexity.
> Loaded Fixes: Fix2 Fix3
> 4. Loading Fix4
> Loaded Fixes: Fix2 Fix3 Fix4
> 5. Loading Fix5
> Similar to Step 3, set Fix3 to replaceable and replace it with Fix5.
> Loaded Fixes: Fix2 Fix4 Fix5
Let's imagine another situation:
Fix1 livepatches: funcA, funcB
Fix2 livepatches: funcB, funcC
Fix3 livepatches: funcA, funcC
Variant A:
1. Loading Fix1
Loaded Fixes: Fix1
In use:: funcA_fix1, funcB_fix1
2. Loading Fix2
Loaded Fixes: Fix1 Fix2
In use: funcA_fix1, funcB_fix2, funcC_fix2
3. Loading Fix3
Loaded Fixes: Fix2 Fix3
In use: funcA_fix3, funcB_fix2, funcC_fix3
This will be correct only when:
+ funcA_fix3 contains both changes from Fix1 and Fix3
+ funcC_fix3 contains both changes from Fix2 and Fix3
Variant B:
1. Loading Fix1
Loaded Fixes: Fix1
In use:: funcA_fix1, funcB_fix1
2. Loading Fix2 (fails from some reason or is skipped)
Loaded Fixes: Fix1
In use:: funcA_fix1, funcB_fix1
3. Loading Fix3
Loaded Fixes: Fix1 Fix3
In use: funcA_fix3, funcB_fix1, funcC_fix3
This will be correct only when:
+ funcA_fix3 contains both changes from Fix1 and Fix3
and stays funcB_fix1 is compatible with funcA_fix3
+ funcC_fix3 contains changes only from Fix3,
it must be compatible with the original funcC because
Fix2 was never applied.
I want to say that this would work only when all livepatches
are loaded in the right order. Otherwise, it might break
the system.
How do you want to ensure this?
Is it really that easy?
> 3. Loading Fix3
> Before loading, set Fix1 to replaceable and replace it with Fix3,
> This hybrid model ensures that irrelevant Fixes remain unaffected
> during replacements.
It makes some sense now. But IMHO, it is not as easy as it looks.
The complexity is in the details.
> >
> > Which fix will be handled by livepatches installed in parallel?
> > Which fix will be handled by atomic replace?
> > How would you keep it consistent?
> >
> > How would it work when there are hundreds of fixes and thousands
> > of livepatched functions?
>
> The key point is that if a new Fix modifies a function already changed
> by previous Fixes, all the affected old Fixes should be set to
> replaceable, merged into the new Fix, and then the old Fixes should be
> replaced with the new one.
As I tried to explain above, this might be hard to use in practice.
We would either need to enforce loading all livepatches in the right
order, which might be hard to make user friendly.
Or it would need a lot of extra code which would ensure that only
compatible livepatches can be loaded.
Best Regards,
Petr
On Thu, Feb 6, 2025 at 12:03 AM Petr Mladek <pmladek@suse.com> wrote:
>
> On Wed 2025-02-05 10:54:47, Yafang Shao wrote:
> > On Tue, Feb 4, 2025 at 9:21 PM Petr Mladek <pmladek@suse.com> wrote:
> > >
> > > On Mon 2025-01-27 23:34:50, Yafang Shao wrote:
> > > > On Mon, Jan 27, 2025 at 10:31 PM Petr Mladek <pmladek@suse.com> wrote:
> > > > >
> > > > > On Mon 2025-01-27 14:35:26, Yafang Shao wrote:
> > > > > > The atomic replace livepatch mechanism was introduced to handle scenarios
> > > > > > where we want to unload a specific livepatch without unloading others.
> > > > > > However, its current implementation has significant shortcomings, making
> > > > > > it less than ideal in practice. Below are the key downsides:
> > > > >
> > > > > [...]
> > > > >
> > > > > > In the hybrid mode:
> > > > > >
> > > > > > - Specific livepatches can be marked as "non-replaceable" to ensure they
> > > > > > remain active and unaffected during replacements.
> > > > > >
> > > > > > - Other livepatches can be marked as "replaceable," allowing targeted
> > > > > > replacements of only those patches.
> > > > > >
> > > > > > This selective approach would reduce unnecessary transitions, lower the
> > > > > > risk of temporary patch loss, and mitigate performance issues during
> > > > > > livepatch replacement.
> > > > > >
> > > > > > --- a/kernel/livepatch/core.c
> > > > > > +++ b/kernel/livepatch/core.c
> > > > > > @@ -658,6 +658,8 @@ static int klp_add_nops(struct klp_patch *patch)
> > > > > > klp_for_each_object(old_patch, old_obj) {
> > > > > > int err;
> > > > > >
> > > > > > + if (!old_patch->replaceable)
> > > > > > + continue;
> > > > >
> > > > > This is one example where things might get very complicated.
> > > >
> > > > Why does this example even exist in the first place?
> > > > If hybrid mode is enabled, this scenario could have been avoided entirely.
> > >
> >
> > You have many questions, but it seems you haven’t taken the time to read even
> > a single line of this patchset. I’ll try to address them to save you some time.
>
> What?
Apologies if my words offended you.
>
> > > How exactly this scenario could be avoided, please?
> > >
> > > In the real life, livepatches are used to fix many bugs.
> > > And every bug is fixed by livepatching several functions.
> > >
> > > Fix1 livepatches: funcA
> > > Fix2 livepatches: funcA, funcB
> > > Fix3 livepatches: funcB
> > >
> > > How would you handle this with the hybrid model?
> >
> > In your scenario, each Fix will replace the applied livepatches, so
> > they must be set to replaceable.
> > To clarify the hybrid model, I'll illustrate it with the following Fixes:
> >
> > Fix1 livepatches: funcA
> > Fix2 livepatches: funcC
> > Fix3 livepatches: funcA, funcB
>
> How does the livepatched funcA look here?
> Does it contain changes only from Fix3?
> Or is it cumulative, i.e. the livepatched funcA includes both Fix1 + Fix3?
It is cumulative; the livepatched funcA includes both Fix1 + Fix3.
>
> > Fix4 livepatches: funcD
> > Fix5 livepatches: funcB
> >
> > Each FixN represents a single /sys/kernel/livepatch/FixN.
> > By default, all Fixes are set to non-replaceable.
> >
> > Step-by-step process:
> > 1. Loading Fix1
> > Loaded Fixes: Fix1
> > 2. Loading Fix2
> > Loaded Fixes: Fix1 Fix2
> > 3. Loading Fix3
> > Before loading, set Fix1 to replaceable and replace it with Fix3,
> > which is a cumulative patch of Fix1 and Fix3.
>
> Who will modify the "replaceable" flag of "Fix1"? Userspace or kernel?
Userspace scripts. Modify it before loading a new one.
>
> How the livepatch modules would get installed/removed to/from the
> filesystem?
It doesn't make any difference.
>
> We (SUSE) distribute the livepatches using RPM packages. Every new
> version of the livepatch module is packaged in a RPM package with
> the same name and higher version. A post install script loads
> the module into the kernel and removes disabled livepatch modules.
Similar to us.
>
> The package update causes the new version of the livepatch module to
> be enabled and the old version to be removed. The old version of the
> module is also removed from the filesystem together with the old
> version of the RPM package.
>
> This matches the behavior of the atomic replace. There is always only
> one version of the livepatch RPM package installed and only one
> livepatch module loaded/enabled. And when it works correctly, then
> the version of the installed package matches the version of the loaded
> livepatch module.
>
> This might be hard to achieve with the hybrid model. Every livepatch
> module would need to be packaged in a separate (different name)
> RPM package.
Before switching to atomic replace, we packaged all the livepatch
modules into a single RPM. The RPM installation handled them quite
well.
> And some userspace service would need to clean up both
> kernel modules and livepatch RPM packages (remove the unused ones).
>
> This might add a lot of extra complexity.
It's not that complex—just like what we did before atomic replacement
was enabled.
>
> > Loaded Fixes: Fix2 Fix3
> > 4. Loading Fix4
> > Loaded Fixes: Fix2 Fix3 Fix4
> > 5. Loading Fix5
> > Similar to Step 3, set Fix3 to replaceable and replace it with Fix5.
> > Loaded Fixes: Fix2 Fix4 Fix5
>
> Let's imagine another situation:
>
> Fix1 livepatches: funcA, funcB
> Fix2 livepatches: funcB, funcC
> Fix3 livepatches: funcA, funcC
>
> Variant A:
>
> 1. Loading Fix1
> Loaded Fixes: Fix1
> In use:: funcA_fix1, funcB_fix1
>
> 2. Loading Fix2
> Loaded Fixes: Fix1 Fix2
> In use: funcA_fix1, funcB_fix2, funcC_fix2
>
> 3. Loading Fix3
> Loaded Fixes: Fix2 Fix3
> In use: funcA_fix3, funcB_fix2, funcC_fix3
>
> This will be correct only when:
>
> + funcA_fix3 contains both changes from Fix1 and Fix3
> + funcC_fix3 contains both changes from Fix2 and Fix3
>
>
> Variant B:
>
> 1. Loading Fix1
> Loaded Fixes: Fix1
> In use:: funcA_fix1, funcB_fix1
>
> 2. Loading Fix2 (fails from some reason or is skipped)
> Loaded Fixes: Fix1
> In use:: funcA_fix1, funcB_fix1
>
> 3. Loading Fix3
> Loaded Fixes: Fix1 Fix3
> In use: funcA_fix3, funcB_fix1, funcC_fix3
>
> This will be correct only when:
>
> + funcA_fix3 contains both changes from Fix1 and Fix3
> and stays funcB_fix1 is compatible with funcA_fix3
> + funcC_fix3 contains changes only from Fix3,
> it must be compatible with the original funcB because
>
> I want to say that this would work only when all livepatches
> are loaded in the right order. Otherwise, it might break
> the system.
>
> How do you want to ensure this?
As we discussed earlier, if there’s an overlap between two
livepatches, we merge them into a cumulative patch and replace the old
one.
>
> Is it really that easy?
>
> > 3. Loading Fix3
> > Before loading, set Fix1 to replaceable and replace it with Fix3,
>
> > This hybrid model ensures that irrelevant Fixes remain unaffected
> > during replacements.
>
> It makes some sense now. But IMHO, it is not as easy as it looks.
> The complexity is in the details.
>
> > >
> > > Which fix will be handled by livepatches installed in parallel?
> > > Which fix will be handled by atomic replace?
> > > How would you keep it consistent?
> > >
> > > How would it work when there are hundreds of fixes and thousands
> > > of livepatched functions?
> >
> > The key point is that if a new Fix modifies a function already changed
> > by previous Fixes, all the affected old Fixes should be set to
> > replaceable, merged into the new Fix, and then the old Fixes should be
> > replaced with the new one.
>
> As I tried to explain above. This might be hard to use in practice.
>
> We would either need to enforce loading all livepatches in the right
> order. It might be hard to make it user friendly.
>
> Or it would need a lot of extra code which would ensure that only
> compatible livepatches can be loaded.
That’s related to the configuration. If you decide to use it, you
should familiarize yourself with the best practices. Once you
understand those, it becomes much simpler and less complex.
BTW, based on my experience, the likelihood of overlapping between two
livepatches is very low in real production environments. I haven’t
encountered that case so far in our production servers.
--
Regards
Yafang
On Thu 2025-02-06 10:35:11, Yafang Shao wrote:
> On Thu, Feb 6, 2025 at 12:03 AM Petr Mladek <pmladek@suse.com> wrote:
> >
> > On Wed 2025-02-05 10:54:47, Yafang Shao wrote:
> > > On Tue, Feb 4, 2025 at 9:21 PM Petr Mladek <pmladek@suse.com> wrote:
> > > >
> > > > On Mon 2025-01-27 23:34:50, Yafang Shao wrote:
I am not sure if you would still want the hybrid model.
It is possible that the timeout in klp_try_complete_transition()
would remove the hardlockup watchdog warnings, see
https://lore.kernel.org/r/Z6Tmqro6CSm0h-E3@pathway.suse.cz
I reply to this mail just for the record because there were some
unanswered questions...
> > > To clarify the hybrid model, I'll illustrate it with the following Fixes:
> > >
> > > Fix1 livepatches: funcA
> > > Fix2 livepatches: funcC
> > > Fix3 livepatches: funcA, funcB
> >
> > How does the livepatched funcA look here?
> > Does it contain changes only from Fix3?
> > Or is it cumulative, i.e. the livepatched funcA includes both Fix1 + Fix3?
>
> It is cumulative; the livepatched funcA includes both Fix1 + Fix3.
It makes sense.
I have missed this in the previous mail. I see it there now (after
re-reading). But the trick was somehow buried in long sentences.
> > > Fix4 livepatches: funcD
> > > Fix5 livepatches: funcB
> > >
> > > Each FixN represents a single /sys/kernel/livepatch/FixN.
> > > By default, all Fixes are set to non-replaceable.
> > >
> > > Step-by-step process:
> > > 1. Loading Fix1
> > > Loaded Fixes: Fix1
> > > 2. Loading Fix2
> > > Loaded Fixes: Fix1 Fix2
> > > 3. Loading Fix3
> > > Before loading, set Fix1 to replaceable and replace it with Fix3,
> > > which is a cumulative patch of Fix1 and Fix3.
> >
> > Who will modify the "replaceable" flag of "Fix1"? Userspace or kernel?
>
> Userspace scripts. Modify it before loading a new one.
This is one of my concerns. Such a userspace script would be
more complex for the hybrid model than for cumulative livepatches.
Any user of the hybrid model would have their own scripts
and their own eventual bugs.
Anyway, the more possibilities, the more ways to break things
and the more complicated it is to debug eventual problems.
If anyone would still like the hybrid model then I would like
to enforce some safe behavior from the kernel. I mean to do
as much as possible in the common (kernel) code.
I have the following ideas:
1. Allow to load a livepatch with .replace enabled only when
all conflicting[*] livepatches are allowed to be replaced.
2. Automatically replace all conflicting[*] livepatches.
3. Allow to define a list of to-be-replaced livepatches
in struct klp_patch.
The 1st or 2nd idea would make the hybrid model safer.
The 2nd idea would even work without the .replaceable flag.
The 3rd idea would allow replacing even non-conflicting[*]
patches.
[*] I would define that two livepatches are "conflicting" when
at least one function is modified by both of them. e.g.
+ Patch1: funcA, funcB \
+ Patch2: funcC, funcD - non-conflicting
+ Patch1: funcA, funcB \
+ Patch2: funcB, funcC - conflicting
Or a bit weaker definition. Two patches are "conflicting"
when the new livepatch provides only partial replacement
for already livepatched functions, e.g.
+ Patch1: funcA, funcB \
+ Patch2: funcC, funcD - non-conflicting anyway
+ Patch1: funcA, funcB - Patch1 can't replace Patch2 (conflict)
+ Patch2: funcA, funcB, funcC - Patch2 can replace Patch1 (no conflict)
+ Patch1: funcA, funcB \
+ Patch2: funcB, funcC - conflicting anyway
Especially, the automatic replacement of conflicting patches might
make the hybrid model safe and easier to use. And it would resolve
most of my fears with this approach.
Best Regards,
Petr
On Fri, Feb 7, 2025 at 9:58 PM Petr Mladek <pmladek@suse.com> wrote:
>
> On Thu 2025-02-06 10:35:11, Yafang Shao wrote:
> > On Thu, Feb 6, 2025 at 12:03 AM Petr Mladek <pmladek@suse.com> wrote:
> > >
> > > On Wed 2025-02-05 10:54:47, Yafang Shao wrote:
> > > > On Tue, Feb 4, 2025 at 9:21 PM Petr Mladek <pmladek@suse.com> wrote:
> > > > >
> > > > > On Mon 2025-01-27 23:34:50, Yafang Shao wrote:
>
> I am not sure if you would still want the hybrid model.

I believe the ability to selectively replace specific livepatches will
be a highly valuable feature.

> It is possible that the timeout in klp_try_complete_transition()
> would remove the hardlockup watchdog warnings, see
> https://lore.kernel.org/r/Z6Tmqro6CSm0h-E3@pathway.suse.cz
>
> I reply to this mail just for the record because there were some
> unanswered questions...
>
> > > > To clarify the hybrid model, I'll illustrate it with the following Fixes:
> > > >
> > > > Fix1 livepatches: funcA
> > > > Fix2 livepatches: funcC
> > > > Fix3 livepatches: funcA, funcB
> > >
> > > How does the livepatched funcA look here?
> > > Does it contain changes only from Fix3?
> > > Or is it cumulative, i.e. the livepatched funcA includes both Fix1 + Fix3?
> >
> > It is cumulative; the livepatched funcA includes both Fix1 + Fix3.
>
> It makes sense.
>
> I have missed this in the previous mail. I see it there now (after
> re-reading). But the trick was somehow buried in long sentences.
>
> > > > Fix4 livepatches: funcD
> > > > Fix5 livepatches: funcB
> > > >
> > > > Each FixN represents a single /sys/kernel/livepatch/FixN.
> > > > By default, all Fixes are set to non-replaceable.
> > > >
> > > > Step-by-step process:
> > > > 1. Loading Fix1
> > > >    Loaded Fixes: Fix1
> > > > 2. Loading Fix2
> > > >    Loaded Fixes: Fix1 Fix2
> > > > 3. Loading Fix3
> > > >    Before loading, set Fix1 to replaceable and replace it with Fix3,
> > > >    which is a cumulative patch of Fix1 and Fix3.
> > >
> > > Who will modify the "replaceable" flag of "Fix1"? Userspace or kernel?
> >
> > Userspace scripts. Modify it before loading a new one.
>
> This is one of my concerns. Such a userspace script would be
> more complex for the hybrid model than for cumulative livepatches.
> Any user of the hybrid model would have their own scripts
> and their own eventual bugs.

In the old model, we maintained a large script to deploy individual
livepatches. In the atomic replace model, we maintain another complex
script to deploy cumulative livepatches. We always end up creating our
own scripts to adapt to the complexities of the production environment.

>
> Anyway, the more possibilities, the more ways to break things
> and the more complicated it is to debug eventual problems.

We still have some servers running the old 4.19 kernel, with over 20
livepatches deployed. These livepatches are managed using the old model
since the atomic replace model isn't supported yet. The maintenance of
these livepatches has been running smoothly with user scripts. Things
would be much simpler if we could rely on user scripts to handle this
process.

>
> If anyone would still like the hybrid model then I would like
> to enforce some safe behavior from the kernel. I mean to do
> as much as possible in the common (kernel) code.
>
> I have the following ideas:
>
> 1. Allow to load a livepatch with .replace enabled only when
>    all conflicting[*] livepatches are allowed to be replaced.

Makes sense.

> 2. Automatically replace all conflicting[*] livepatches.

Makes sense.

> 3. Allow to define a list of to-be-replaced livepatches
>    in struct klp_patch.

Seems like a great idea.

> The 1st or 2nd idea would make the hybrid model safer.
>
> The 2nd idea would even work without the .replaceable flag.
>
> The 3rd idea would allow replacing even non-conflicting[*]
> patches.
>
> [*] I would define that two livepatches are "conflicting" when
>     at least one function is modified by both of them, e.g.
>
>     + Patch1: funcA, funcB \
>     + Patch2: funcC, funcD - non-conflicting
>
>     + Patch1: funcA, funcB \
>     + Patch2: funcB, funcC - conflicting
>
>     Or a bit weaker definition. Two patches are "conflicting"
>     when the new livepatch provides only partial replacement
>     for already livepatched functions, e.g.
>
>     + Patch1: funcA, funcB \
>     + Patch2: funcC, funcD - non-conflicting anyway
>
>     + Patch1: funcA, funcB - Patch1 can't replace Patch2 (conflict)
>     + Patch2: funcA, funcB, funcC - Patch2 can replace Patch1 (no conflict)
>
>     + Patch1: funcA, funcB \
>     + Patch2: funcB, funcC - conflicting anyway
>
> Especially, the automatic replacement of conflicting patches might
> make the hybrid model safe and easier to use. And it would resolve
> most of my fears with this approach.

Many thanks for your suggestion. It’s been incredibly helpful. I’ll
definitely give it some thought.

--
Regards
Yafang