[v2] KSTATE: a mechanism to migrate some part of the kernel state across kexec

[PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec

Posted by Andrey Ryabinin 11 months ago

 Main changes from v1 [1]:
  - Get rid of abusing crashkernel and implement a proper way to pass memory to the new kernel
  - Lots of misc cleanups/refactorings.

kstate (kernel state) is a mechanism to describe some part of the internal
kernel state, save it into memory, and restore it in the new kernel
after kexec.

The end goal and the main use case for this is to be able to
update the host kernel underneath VMs with VFIO pass-through devices
running on that host. Since we are still pretty far from that end goal,
this only establishes some basic infrastructure to describe and migrate
complex in-kernel states.

The idea behind KSTATE resembles QEMU's migration framework [2], which
solves a quite similar problem: migrating the state of VMs/emulated
devices across different versions of QEMU.

This is an alternative to Kexec Hand Over (KHO [3]).

So, why not KHO?

 - The main reason is that KHO doesn't provide a simple and convenient internal
    API for drivers/subsystems to preserve internal data.
    E.g. let's say we have some variable of type 'struct a'
    that needs to be preserved:
	struct a {
		int i;
		unsigned long *p_ulong;
		char s[10];
		struct page *page;
	};

     The KHO way requires the driver/subsystem to carry a bunch of code
     dealing with FDT stuff, something like

     a_kho_write()
     {
	     ...
	     err = fdt_property(fdt, "i", &a.i, sizeof(a.i));
	     err |= fdt_property(fdt, "ulong", a.p_ulong, sizeof(*a.p_ulong));
	     err |= fdt_property(fdt, "s", &a.s, sizeof(a.s));
	     if (err)
	     ...
     }
     a_kho_restore()
     {
	     ...
	     /* fdt_getprop() returns a pointer to the property value */
	     p = fdt_getprop(fdt, offset, "i", &len);
	     if (!p || len != sizeof(a.i))
		goto err;
	     a.i = *(const int *)p;
	     *a.p_ulong = fdt_getprop....
     }

    Each driver/subsystem has to solve this problem in its own way.
    Also, if we use FDT properties for individual fields, that might be wasteful
    in terms of memory, as these properties use strings as keys.

   KSTATE solves the same problem in a more elegant way, with this:
	struct kstate_description a_state = {
		.name = "a_struct",
		.version_id = 1,
		.id = KSTATE_TEST_ID,
		.state_list = LIST_HEAD_INIT(a_state.state_list),
		.fields = (const struct kstate_field[]) {
			KSTATE_BASE_TYPE(i, struct a, int),
			KSTATE_BASE_TYPE(s, struct a, char [10]),
			KSTATE_POINTER(p_ulong, struct a),
			KSTATE_PAGE(page, struct a),
			KSTATE_END_OF_LIST()
		},
	};


	{
		static unsigned long ulong;
		static struct a a_data = { .p_ulong = &ulong };

		kstate_register(&a_state, &a_data);
	}

       The driver only needs a proper 'kstate_description' and a call to kstate_register()
       to save/restore a_data.
       Basically, 'struct kstate_description' provides instructions on how to save/restore 'struct a',
       and kstate_register() does all the save/restore work under the hood.

 - Another bonus: kstate can preserve migratable memory, which is required
    to preserve guest memory.


Now, on to how this works.

The state of kernel data (usually some struct) is described by a
'struct kstate_description' containing an array of individual
field descriptions - 'struct kstate_field'. Each field
has a set of bits in ->flags which tell how to save/restore
that field of the struct. E.g.:
  - KS_BASE_TYPE - the field can simply be copied by value.

  - KS_POINTER - the struct member is a pointer to the actual
     data, so it needs to be dereferenced before saving/restoring data
     to/from the kstate data stream.

  - KS_STRUCT - the field contains another struct; field->ksd must point to
      another 'struct kstate_description'.

  - KS_CUSTOM - a non-trivial field that requires custom kstate_field->save()
               and ->restore() callbacks to save/restore data.

  - KS_ARRAY_OF_POINTER - an array of pointers; the size of the array is
                         determined by the field->count() callback.

  - KS_ADDRESS - the field is a pointer to either the vmemmap area (struct page)
     or a linear address. Store the offset.

  - KS_END - special flag indicating the end of migration stream data.
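
To make the flag semantics above concrete, here is a rough userspace
model. It is not code from this series: 'struct field_model', the flag
values and the model_save()/model_restore() helpers are all hypothetical,
and only KS_BASE_TYPE, KS_POINTER and KS_END are modelled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flag values; the real KS_* constants may differ. */
enum { KS_BASE_TYPE = 1 << 0, KS_POINTER = 1 << 1, KS_END = 1 << 2 };

/* Hypothetical stand-in for 'struct kstate_field'. */
struct field_model {
	const char *name;
	size_t offset;		/* offset of the member inside the object */
	size_t size;		/* number of bytes to copy */
	unsigned int flags;
};

struct a_model {
	int i;
	unsigned long *p_ulong;
	char s[10];
};

static const struct field_model a_fields[] = {
	{ "i", offsetof(struct a_model, i), sizeof(int), KS_BASE_TYPE },
	{ "s", offsetof(struct a_model, s), sizeof(char[10]), KS_BASE_TYPE },
	{ "p_ulong", offsetof(struct a_model, p_ulong),
	  sizeof(unsigned long), KS_POINTER },
	{ NULL, 0, 0, KS_END },
};

/* Return the address of the bytes a field description refers to. */
static void *field_data(const struct field_model *f, void *obj)
{
	char *member = (char *)obj + f->offset;
	void *target;

	if (!(f->flags & KS_POINTER))
		return member;

	/* The member is a pointer: follow it to the actual data. */
	memcpy(&target, member, sizeof(target));
	return target;
}

/* Copy every described field into the stream; returns bytes written. */
static size_t model_save(const struct field_model *f, void *obj, char *stream)
{
	size_t pos = 0;

	for (; !(f->flags & KS_END); f++) {
		memcpy(stream + pos, field_data(f, obj), f->size);
		pos += f->size;
	}
	return pos;
}

/* Copy the fields back in the same order; returns bytes consumed. */
static size_t model_restore(const struct field_model *f, void *obj,
			    const char *stream)
{
	size_t pos = 0;

	for (; !(f->flags & KS_END); f++) {
		memcpy(field_data(f, obj), stream + pos, f->size);
		pos += f->size;
	}
	return pos;
}
```

In this model the pointer itself is never written to the stream, only
the data it points to, so the restoring side must already have a valid
target for p_ulong.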

kstate_register() call accepts kstate_description along with an instance
of an object and registers it in the global 'states' list.

During the kexec reboot phase we go through the list of 'kstate_description's,
and each instance of kstate_description forms a 'struct kstate_entry'
which is saved into kstate's data stream.

The 'kstate_entry' contains information like the ID of the kstate_description,
its version, the size of the migration data and the data itself. The ->data is
formed in accordance with the kstate_fields of the corresponding kstate_description.

After the reboot, when kstate_register() is called, it parses the migration
stream, finds the appropriate 'kstate_entry' and restores the contents of
the object in accordance with the kstate_description and its ->fields.
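
The entry format and the lookup can be pictured with another small
userspace model. Again, this is only a sketch under assumptions: the
header layout in 'struct entry_model', the terminator ID and the
entry_append()/entry_terminate()/entry_find() helpers are hypothetical
and are not taken from the patches.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header; the real 'struct kstate_entry' may differ. */
struct entry_model {
	uint32_t id;		/* ID of the description */
	uint32_t version;	/* version of the description */
	uint32_t size;		/* payload bytes following the header */
};

#define ENTRY_END_ID 0u		/* hypothetical end-of-stream marker */

/* Append one entry (header + payload); returns the new write position. */
static size_t entry_append(char *stream, size_t pos, uint32_t id,
			   uint32_t version, const void *data, uint32_t size)
{
	struct entry_model hdr = { id, version, size };

	memcpy(stream + pos, &hdr, sizeof(hdr));
	memcpy(stream + pos + sizeof(hdr), data, size);
	return pos + sizeof(hdr) + size;
}

/* Write the end marker; returns the new write position. */
static size_t entry_terminate(char *stream, size_t pos)
{
	struct entry_model hdr = { ENTRY_END_ID, 0, 0 };

	memcpy(stream + pos, &hdr, sizeof(hdr));
	return pos + sizeof(hdr);
}

/*
 * Walk the stream from the start and return a pointer to the payload of
 * the first entry with a matching ID, or NULL if the end marker is hit.
 */
static const char *entry_find(const char *stream, uint32_t id,
			      struct entry_model *out)
{
	size_t pos = 0;

	for (;;) {
		struct entry_model hdr;

		memcpy(&hdr, stream + pos, sizeof(hdr));
		if (hdr.id == ENTRY_END_ID)
			return NULL;
		if (hdr.id == id) {
			*out = hdr;
			return stream + pos + sizeof(hdr);
		}
		pos += sizeof(hdr) + hdr.size;
	}
}
```

Because each header carries the payload size, the reader can skip
entries it does not care about without understanding their contents.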

 [1] https://lkml.kernel.org/r/20241002160722.20025-1-arbn@yandex-team.com
 [2] https://www.qemu.org/docs/master/devel/migration/main.html#vmstate
 [3] https://lkml.kernel.org/r/20250206132754.2596694-1-rppt@kernel.org

Andrey Ryabinin (7):
  kstate: Add kstate - a mechanism to describe and migrate kernel state
    across kexec
  kstate, kexec, x86: transfer kstate data across kexec
  kexec: exclude control pages from the destination addresses
  kexec, kstate: delay loading of kexec segments
  x86, kstate: Add the ability to preserve memory pages across kexec.
  kexec, kstate: save kstate data before kexec'ing
  kstate, test: add test module for testing kstate subsystem.

 arch/x86/Kconfig                  |   1 +
 arch/x86/kernel/kexec-bzimage64.c |   4 +
 arch/x86/kernel/setup.c           |   2 +
 include/linux/kexec.h             |   3 +
 include/linux/kstate.h            | 216 ++++++++++++++
 kernel/Kconfig.kexec              |  13 +
 kernel/Makefile                   |   1 +
 kernel/kexec_core.c               |  30 ++
 kernel/kexec_file.c               | 159 +++++++----
 kernel/kexec_internal.h           |   9 +
 kernel/kstate.c                   | 458 ++++++++++++++++++++++++++++++
 lib/Makefile                      |   2 +
 lib/test_kstate.c                 |  86 ++++++
 13 files changed, 925 insertions(+), 59 deletions(-)
 create mode 100644 include/linux/kstate.h
 create mode 100644 kernel/kstate.c
 create mode 100644 lib/test_kstate.c

-- 
2.45.3

Re: [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec

Posted by Chris Li 9 months, 2 weeks ago

Hi Andrey,

I am working on the PCI portion of the live update and looking at
using KSTATE as an alternative to the FDT. Here is some high-level
feedback.

On Mon, Mar 10, 2025 at 5:04 AM Andrey Ryabinin <arbn@yandex-team.com> wrote:
>
>  Main changes from v1 [1]:
>   - Get rid of abusing crashkernel and implent proper way to pass memory to new kernel
>   - Lots of misc cleanups/refactorings.
>
> kstate (kernel state) is a mechanism to describe internal some part of the
> kernel state, save it into the memory and restore the state after kexec
> in the new kernel.
>
> The end goal here and the main use case for this is to be able to
> update host kernel under VMs with VFIO pass-through devices running
> on that host. Since we are pretty far from that end goal yet, this
> only establishes some basic infrastructure to describe and migrate complex
> in-kernel states.
>
> The idea behind KSTATE resembles QEMU's migration framework [1], which
> solves quite similar problem - migrate state of VM/emulated devices
> across different versions of QEMU.
>
> This is an altenative to Kexec Hand Over (KHO [3]).
>
> So, why not KHO?
>

KHO does more than just serializing/deserializing. It also has scratch
areas etc. to allow early allocations to be performed safely without
stepping on the preserved memory. I see KSTATE as an alternative to
libFDT as a way of serializing the preserved memory, not as a
replacement for KHO.

With that, it would be great to see KSTATE built on top of the
current version of KHO. The V6 version of KHO uses a recursive FDT
object. I think a recursive FDT can map nicely to a C struct description
similar to the KSTATE field description. However, that will
require KSTATE to make some considerable changes to embrace KHO
V6. For example, KSTATE uses one contiguous stream buffer, while KHO
V6 uses many recursive physical address object pointers for different
objects.  Maybe a KSTATE V3?

>  - The main reason is KHO doesn't provide simple and convenient internal
>     API for the drivers/subsystems to preserve internal data.
>     E.g. lets consider we have some variable of type 'struct a'
>     that needs to be preserved:
>         struct a {
>                 int i;
>                 unsigned long *p_ulong;
>                 char s[10];
>                 struct page *page;
>         };
>
>      The KHO-way requires driver/subsystem to have a bunch of code
>      dealing with FDT stuff, something like
>
>      a_kho_write()
>      {
>              ...
>              fdt_property(fdt, "i", &a.i, sizeof(a.i));
>              fdt_property(fdt, "ulong", a.p_ulong, sizeof(*a.p_ulong));
>              fdt_property(fdt, "s", &a.s, sizeof(a.s));
>              if (err)
>              ...
>      }

I can add more pain points of using FDT as the data format for
saving/restoring state. It is not easy to determine up front how much
memory the FDT serialization is going to use. We want to do all the
memory allocation in the KHO PREPARE phase, so that after the PREPARE
phase no KHO failure can happen because memory cannot be allocated.
The current KHO V6 does not handle the case where the recursive FDT
goes beyond a 4K page. This is a feature gap: the PCI subsystem
will likely save state for a list of PCI devices, and that FDT can
possibly grow beyond 4K.

FDT also does not save the type of the object buffer, only the size.
There is an implicit contract about what the object points to. The
KSTATE description table can be extended to be more expressive than
FDT, e.g. to cover optional minimum and maximum allowed values.

>      a_kho_restore()
>      {
>              ...
>              a.i = fdt_getprop(fdt, offset, "i", &len);
>              if (!a.i || len != sizeof(a.i))
>                 goto err
>              *a.p_ulong = fdt_getprop....
>      }
>
>     Each driver/subsystem has to solve this problem in their own way.
>     Also if we use fdt properties for individual fields, that might be wastefull
>     in terms of used memory, as these properties use strings as keys.

Right, I need to write a lot of boilerplate code to do the per-property
save/restore. I am not worried too much about memory usage. A
lot of string keys are not much longer than 8 bytes, so the memory saved
by converting them to a binary index is not huge. I would actually
suggest adding the string version of the field name to the description
table, so that we can dump the state in KSTATE just like the YAML FDT
output for debugging purposes. Dumping the currently saved state in
human-readable form is a very useful feature of FDT. KSTATE can have
the same feature added.

>
>    While with KSTATE solves the same problem in more elegant way, with this:
>         struct kstate_description a_state = {
>                 .name = "a_struct",
>                 .version_id = 1,
>                 .id = KSTATE_TEST_ID,
>                 .state_list = LIST_HEAD_INIT(test_state.state_list),
>                 .fields = (const struct kstate_field[]) {
>                         KSTATE_BASE_TYPE(i, struct a, int),
>                         KSTATE_BASE_TYPE(s, struct a, char [10]),
>                         KSTATE_POINTER(p_ulong, struct a),
>                         KSTATE_PAGE(page, struct a),
>                         KSTATE_END_OF_LIST()
>                 },
>         };
>
>
>         {
>                 static unsigned long ulong
>                 static struct a a_data = { .p_ulong = &ulong };
>
>                 kstate_register(&test_state, &a_data);
>         }
>
>        The driver needs only to have a proper 'kstate_description' and call kstate_register()
>        to save/restore a_data.
>        Basically 'struct kstate_description' provides instructions how to save/restore 'struct a'.
>        And kstate_register() does all this save/restore stuff under the hood.

It seems KSTATE uses one contiguous stream and objects have to
be loaded in the order they were saved. For the PCI code, device
scanning and probing might cause devices to be loaded out of the order
in which they were saved. (PCI probing is actually the reverse order of
saving.) This kstate_register() might pose restrictions on the restore
order. PCI will need to look up and find the device state based on the
PCI device ID. Other subsystems will likely need to look up their own
saved state as well.
I think KSTATE can be extended to support that.

>  - Another bonus point - kstate can preserve migratable memory, which is required
>     to preserve guest memory
>
>
> So now to the part how this works.
>
> State of kernel data (usually it's some struct) is described by the
> 'struct kstate_description' containing the array of individual
> fields descpriptions - 'struct kstate_field'. Each field
> has set of bits in ->flags which instructs how to save/restore
> a certain field of the struct. E.g.:
>   - KS_BASE_TYPE flag tells that field can be just copied by value,
>
>   - KS_POINTER means that the struct member is a pointer to the actual
>      data, so it needs to be dereference before saving/restoring data
>      to/from kstate data steam.
>
>   - KS_STRUCT - contains another struct,  field->ksd must point to
>       another 'struct kstate_dscription'

A field can't have both the KS_BASE_TYPE and KS_STRUCT bits set,
right? Some of these flag combinations do not make sense. This
part might need more careful planning to keep it simple. Maybe some of
the flag bits should be an enum.

>
>   - KS_CUSTOM - Some non-trivial field that requires custom kstate_field->save()
>                ->restore() callbacks to save/restore data.
>
>   - KS_ARRAY_OF_POINTER - array of pointers, the size of array determined by the
>                          field->count() callback
>   - KS_ADDRESS - field is a pointer to either vmemmap area (struct page) or
>      linear address. Store offset

I think we want to describe different stream types.
For example, the simplest stream container is just a contiguous
buffer with a start address and size.
A more complex one might be a size followed by an array of page
pointers. Together those pages make up a buffer that describes a saved
KSTATE which is larger than 4K and spread across an array of pages.
Those pages don't need to be contiguous. Such a page array buffer stores
the KSTATE entry described by a separate description table.
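
As a rough illustration of the page-array container, here is a small
userspace model. Everything in it is hypothetical (the 'struct
page_array_model' layout, MODEL_PAGE_SIZE and the helper names); it only
shows the scatter/gather idea, not a proposed kernel interface.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u	/* stand-in for the kernel's PAGE_SIZE */

/* Hypothetical container: a total size plus an array of page pointers. */
struct page_array_model {
	size_t size;		/* total payload bytes stored */
	size_t nr_pages;	/* number of entries in pages[] */
	char **pages;		/* each points to MODEL_PAGE_SIZE bytes */
};

/* Number of pages needed to hold 'size' bytes. */
static size_t pages_needed(size_t size)
{
	return (size + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}

/* Scatter a linear buffer across the pages; returns 0 or -1 if too big. */
static int page_array_write(struct page_array_model *pa, const char *src,
			    size_t size)
{
	size_t left = size;
	size_t i;

	if (pages_needed(size) > pa->nr_pages)
		return -1;

	for (i = 0; left > 0; i++) {
		size_t chunk = left < MODEL_PAGE_SIZE ? left : MODEL_PAGE_SIZE;

		memcpy(pa->pages[i], src, chunk);
		src += chunk;
		left -= chunk;
	}
	pa->size = size;
	return 0;
}

/* Gather the stored bytes back into a linear buffer of pa->size bytes. */
static void page_array_read(const struct page_array_model *pa, char *dst)
{
	size_t left = pa->size;
	size_t i;

	for (i = 0; left > 0; i++) {
		size_t chunk = left < MODEL_PAGE_SIZE ? left : MODEL_PAGE_SIZE;

		memcpy(dst, pa->pages[i], chunk);
		dst += chunk;
		left -= chunk;
	}
}
```

The pages are only reached through the pointer array, so they can sit
anywhere in memory and in any order.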

>
>   - KS_END - special flag indicating the end of migration stream data.
>
> kstate_register() call accepts kstate_description along with an instance
> of an object and registers it in the global 'states' list.
>
> During kexec reboot phase we go through the list of 'kstate_description's
> and each instance of kstate_description forms the 'struct kstate_entry'
> which save into the kstate's data stream.
>
> The 'kstate_entry' contains information like ID of kstate_description, version
> of it, size of migration data and the data itself. The ->data is formed in
> accordance to the kstate_field's of the corresponding kstate_description.

The version for the kstate_description might not be enough. A
version works if there is a linear history. Here we are likely to have
different vendors adding their own extensions to the device state saving.
I suggest that we instead save the old kernel's kstate_description table
(once per description table, as a recursive object) alongside the
object's physical address. The new kernel has its own, new version
of the description table. It can compare the old and new
description tables and find out which fields need to be upgraded or
downgraded. The new kernel will use the old kstate_description table
to decode the previous kernel's saved object. I think that is
more flexible: it supports adding one or more features without being
tied to which version has what feature. It can also make sure the new
kernel can always dump the old KSTATE into YAML.

That way we might be able to simplify the subsection and the
deprecation flags. The new kernel doesn't need to carry the history
of changes made to the old description table.
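
A minimal userspace sketch of that idea, matching fields by name between
the saved (old) table and the current (new) table. All structures and
the restore_by_name() helper are hypothetical; a real design would also
have to handle size changes, nested descriptions and callbacks.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical field record saved by the old kernel with the data. */
struct saved_field_model {
	const char *name;
	size_t stream_off;	/* where the field's bytes sit in the stream */
	size_t size;
};

/* Hypothetical field record from the new kernel's description table. */
struct cur_field_model {
	const char *name;
	size_t offset;		/* offset of the member in the new struct */
	size_t size;
};

/*
 * For every field the new kernel knows, look it up by name in the old
 * table and copy it if the sizes agree. Fields that exist only in the
 * old table are ignored; fields that exist only in the new table keep
 * their current value. Returns the number of restored fields.
 */
static int restore_by_name(const struct saved_field_model *old, size_t n_old,
			   const struct cur_field_model *cur, size_t n_cur,
			   const char *stream, void *obj)
{
	int restored = 0;
	size_t i, j;

	for (i = 0; i < n_cur; i++) {
		for (j = 0; j < n_old; j++) {
			if (strcmp(cur[i].name, old[j].name) != 0)
				continue;
			if (cur[i].size == old[j].size) {
				memcpy((char *)obj + cur[i].offset,
				       stream + old[j].stream_off,
				       cur[i].size);
				restored++;
			}
			break;
		}
	}
	return restored;
}
```

With the old table available, the new kernel does not need a version
number to know which fields are present in the saved data.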

> After the reboot, when the kstate_register() called it parses migration
> stream, finds the appropriate 'kstate_entry' and restores the contents of
> the object in accordance with kstate_description and ->fields.

Again, this restoring can happen in a different order than the saving,
because of the PCI device scanning and probing order. The restoration
might not happen in one single call chain. Material for V3?

I am happy to work with you to get KSTATE working with the existing KHO effort.

Chris

Re: [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec

Posted by Andrey Ryabinin 9 months, 1 week ago


On 4/29/25 1:01 AM, Chris Li wrote:
> Hi Andrey,
> 
> I am working on the PCI portion of the live update and looking at
> using KSTATE as an alternative to the FDT. Here are some high level
> feedbacks.
> 

Hi, thanks a lot.

> On Mon, Mar 10, 2025 at 5:04 AM Andrey Ryabinin <arbn@yandex-team.com> wrote:
>>
>>  Main changes from v1 [1]:
>>   - Get rid of abusing crashkernel and implent proper way to pass memory to new kernel
>>   - Lots of misc cleanups/refactorings.
>>
>> kstate (kernel state) is a mechanism to describe internal some part of the
>> kernel state, save it into the memory and restore the state after kexec
>> in the new kernel.
>>
>> The end goal here and the main use case for this is to be able to
>> update host kernel under VMs with VFIO pass-through devices running
>> on that host. Since we are pretty far from that end goal yet, this
>> only establishes some basic infrastructure to describe and migrate complex
>> in-kernel states.
>>
>> The idea behind KSTATE resembles QEMU's migration framework [1], which
>> solves quite similar problem - migrate state of VM/emulated devices
>> across different versions of QEMU.
>>
>> This is an altenative to Kexec Hand Over (KHO [3]).
>>
>> So, why not KHO?
>>
> 
> KHO does more than just serializing/unserializing. It also has scratch
> areas etc to allow safely performing early allocation without stepping
> on the preserved memory. I see KSTATE as an alternative to libFDT as
> ways of serializing the preserved memory. Not a replacement for KHO.
> 
> With that, it would be great to see a KSTATE build on top of the
> current version of KHO. The V6 version of KHO uses a recursive FDT
> object. I see recursive FDT can map to the C struct description
> similar to the KSTATE field description nicely. However, that will
> require KSTATE to make some considerable changes to embrace the KHO
> v6. For example, the KSTATE uses one contiguous stream buffer and KHO
> V6 uses many recursive physical address object pointers for different
> objects.  Maybe a KSTATE V3?
> 

Yep, I'll take a look into combining KSTATE with KHO.

....

> 
>>      a_kho_restore()
>>      {
>>              ...
>>              a.i = fdt_getprop(fdt, offset, "i", &len);
>>              if (!a.i || len != sizeof(a.i))
>>                 goto err
>>              *a.p_ulong = fdt_getprop....
>>      }
>>
>>     Each driver/subsystem has to solve this problem in their own way.
>>     Also if we use fdt properties for individual fields, that might be wastefull
>>     in terms of used memory, as these properties use strings as keys.
> 
> Right, I need to write a lot of boilerplate code to do the per
> property save/restore. I am not worried too much about memory usage. A
> lot of string keys are not much longer than 8 bytes. The memory saving
> convert to binary index is not huge. I actually would suggest adding
> the string version of the field name to the description table, so that
> we can dump the state in KSTATE just like the YAML FDT output for
> debugging purposes. It is a very useful feature of FDT to dump the
> current saving state into a human readable form. KSTATE can have the
> same feature added.
> 

kstate_field already has a string with the name of the field:

#define KSTATE_BASE_TYPE(_f, _state, _type) {		\
	.name = (__stringify(_f)),			\

Currently it's not used in the code, but it's there for debug purposes.


>>    While with KSTATE solves the same problem in more elegant way, with this:
>>         struct kstate_description a_state = {
>>                 .name = "a_struct",
>>                 .version_id = 1,
>>                 .id = KSTATE_TEST_ID,
>>                 .state_list = LIST_HEAD_INIT(test_state.state_list),
>>                 .fields = (const struct kstate_field[]) {
>>                         KSTATE_BASE_TYPE(i, struct a, int),
>>                         KSTATE_BASE_TYPE(s, struct a, char [10]),
>>                         KSTATE_POINTER(p_ulong, struct a),
>>                         KSTATE_PAGE(page, struct a),
>>                         KSTATE_END_OF_LIST()
>>                 },
>>         };
>>
>>
>>         {
>>                 static unsigned long ulong
>>                 static struct a a_data = { .p_ulong = &ulong };
>>
>>                 kstate_register(&test_state, &a_data);
>>         }
>>
>>        The driver needs only to have a proper 'kstate_description' and call kstate_register()
>>        to save/restore a_data.
>>        Basically 'struct kstate_description' provides instructions how to save/restore 'struct a'.
>>        And kstate_register() does all this save/restore stuff under the hood.
> 
> It seems the KSTATE uses one contiguous stream and the object has to
> be loaded in the order it was saved. For the PCI code, the PCI device
> scanning and probing might cause the device load out of the order of
> saving. (The PCI probing is actually the reverse order of saving).
> This kstate_register() might pose restrictions on the restore order.
> PCI will need to look up and find the device state based on the PCI
> device ID. Other subsystems will likely have the requirement to look
> up their own saved state as well.
> I also see KSTATE can be extended to support that.
> 

Absolutely agreed. I think we need to decouple restore and register, i.e. remove
restore_misgrate_state() from kstate_register(), and add an instance_id argument
to kstate_register(), so the PCI code could do:
            kstate_register(&pci_state, pdev, PCI_DEVID(pdev->bus->number, pdev->devfn));

And at the probing stage (probably in pci_device_add()) call
        kstate_restore(&pci_state, dev, PCI_DEVID(bus->number, dev->devfn))

which would locate the state for the device, if any, and restore it.
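
Roughly, the lookup side could behave like the userspace model below.
This is only a sketch of the proposal: 'struct saved_model' and
model_restore_instance() are hypothetical, and kstate_restore() itself
does not exist yet.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record of one saved object instance. */
struct saved_model {
	uint32_t state_id;	/* which description the data belongs to */
	uint32_t instance_id;	/* e.g. a PCI_DEVID()-style value */
	const void *data;
	size_t size;
};

/*
 * Find the saved state for (state_id, instance_id) and copy it into obj.
 * Returns 0 on success, 1 if nothing was saved for this instance (the
 * caller initializes it from scratch), -1 on a size mismatch.
 */
static int model_restore_instance(const struct saved_model *saved, size_t n,
				  uint32_t state_id, uint32_t instance_id,
				  void *obj, size_t size)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (saved[i].state_id != state_id ||
		    saved[i].instance_id != instance_id)
			continue;
		if (saved[i].size != size)
			return -1;
		memcpy(obj, saved[i].data, size);
		return 0;
	}
	return 1;
}
```

Since every lookup is keyed by the instance, the order in which devices
are probed no longer has to match the order in which they were saved.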


>>  - Another bonus point - kstate can preserve migratable memory, which is required
>>     to preserve guest memory
>>
>>
>> So now to the part how this works.
>>
>> State of kernel data (usually it's some struct) is described by the
>> 'struct kstate_description' containing the array of individual
>> fields descpriptions - 'struct kstate_field'. Each field
>> has set of bits in ->flags which instructs how to save/restore
>> a certain field of the struct. E.g.:
>>   - KS_BASE_TYPE flag tells that field can be just copied by value,
>>
>>   - KS_POINTER means that the struct member is a pointer to the actual
>>      data, so it needs to be dereference before saving/restoring data
>>      to/from kstate data steam.
>>
>>   - KS_STRUCT - contains another struct,  field->ksd must point to
>>       another 'struct kstate_dscription'
> 
> The field can't have both bits set for KS_BASE_TYPE and KS_STRUCT
> type, right? Some of these flag combinations do not make sense. This
> part might need more careful planning to keep it simple. Maybe some of
> the flags bits should be enum.
> 

Yes, this needs more thought. Mutually exclusive flags could be moved into a separate enum field.
Some may not be needed at all, e.g. instead of KS_STRUCT we could just check if (field->ksd != NULL)
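
Something along these lines, as a userspace sketch only. The enum, the
modifier bits and 'struct field_model' are hypothetical names, not what
the next version will necessarily look like.

```c
#include <stdbool.h>
#include <stddef.h>

/* Mutually exclusive: a field has exactly one kind. */
enum ks_kind_model {
	KS_KIND_BASE,
	KS_KIND_STRUCT,
	KS_KIND_CUSTOM,
};

/* Modifiers that can be combined with any kind. */
enum {
	KS_MOD_POINTER = 1u << 0,
	KS_MOD_ARRAY   = 1u << 1,
};

struct field_model {
	enum ks_kind_model kind;
	unsigned int mods;
	const void *ksd;	/* nested description, KS_KIND_STRUCT only */
};

/* A nested description must be present for structs and absent otherwise. */
static bool field_is_consistent(const struct field_model *f)
{
	if (f->kind == KS_KIND_STRUCT)
		return f->ksd != NULL;
	return f->ksd == NULL;
}
```

With the kind in an enum, a combination such as "base type and struct"
simply cannot be expressed, so only the ksd check remains.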


>>
>>   - KS_CUSTOM - Some non-trivial field that requires custom kstate_field->save()
>>                ->restore() callbacks to save/restore data.
>>
>>   - KS_ARRAY_OF_POINTER - array of pointers, the size of array determined by the
>>                          field->count() callback
>>   - KS_ADDRESS - field is a pointer to either vmemmap area (struct page) or
>>      linear address. Store offset
> 
> I think we want to describe different stream types.
> For example the most simple stream container is just a contiguous
> buffer with start address and size.
> The more complex one might be and size then an array of page pointers,
> all those pointers add to up the new buffer which describe an saved
> KSTATE that is larger than 4K and spread into an array of pages. Those
> pages don't need to be contiguous. Such a page array buffer stores the
> KSTATE entry described by a separate description table.
> 

Agreed, I had similar thoughts. But this complicates the code, so I started with
something simple.

>>
>>   - KS_END - special flag indicating the end of migration stream data.
>>
>> kstate_register() call accepts kstate_description along with an instance
>> of an object and registers it in the global 'states' list.
>>
>> During kexec reboot phase we go through the list of 'kstate_description's
>> and each instance of kstate_description forms the 'struct kstate_entry'
>> which save into the kstate's data stream.
>>
>> The 'kstate_entry' contains information like ID of kstate_description, version
>> of it, size of migration data and the data itself. The ->data is formed in
>> accordance to the kstate_field's of the corresponding kstate_description.
> 
> The version for the kstate_description might not be enough. The
> version works if there is a linear history. Here we are likely to have
> different vendors add their own extension to the device state saving.

I think vendors can just declare a separate kstate_description with a different ID.
The only problem with this is that kstate_description.id is an integer, so it would be
hard to allocate those without conflicts.
Perhaps change the id to a string? Then vendors can just add a vendor prefix to the ID.


> I suggest instead we save the old kernel's kstate_description table
> (once per description table as a recursive object)  alongside the
> object physical address as well. The new kernel has their new version
> of the description table. It can compare between the old and new
> description tables and find out what fields need to be upgraded or
> downgraded. The new kernel will use the old kstate_description table
> to decode the previous kernel's saved object.

Hmm... I'm not sure, there is a lot to think about.
This might make changes to kstate_description painful,
e.g. if I want to rearrange some ->flags for whatever reason.
So how do we deal with changes in kstate_description itself?

How do we save links to the methods in kstate_field (->restore/->save/->count),
and what if we need to change the function prototypes of these methods?


> I think that way it is
> more flexible to support adding one or more features and not tight to
> which version has what feature. It can also make sure the new kernel
> can always dump the old KSTATE into YAML.
> 
> That way we might be able to simplify the subsection and the
> depreciation flags. The new kernel doesn't need to carry the history
> of changes made to the old description table.
> 
>> After the reboot, when the kstate_register() called it parses migration
>> stream, finds the appropriate 'kstate_entry' and restores the contents of
>> the object in accordance with kstate_description and ->fields.
> 
> Again this restoring can happen in a different order when the PCI
> device scanning and probing order. The restoration might not happen in
> one single call chain. Material for V3?

Agreed.

> I am happy to work with you to get KSTATE working with the existing KHO effort.
> 
Thanks for the useful feedback, much appreciated.

Re: [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec

Posted by Chris Li 9 months ago

On Mon, May 5, 2025 at 7:37 AM Andrey Ryabinin <ryabinin.a.a@gmail.com> wrote:
> >
> > With that, it would be great to see a KSTATE build on top of the
> > current version of KHO. The V6 version of KHO uses a recursive FDT
> > object. I see recursive FDT can map to the C struct description
> > similar to the KSTATE field description nicely. However, that will
> > require KSTATE to make some considerable changes to embrace the KHO
> > v6. For example, the KSTATE uses one contiguous stream buffer and KHO
> > V6 uses many recursive physical address object pointers for different
> > objects.  Maybe a KSTATE V3?
> >
>
> Yep, I'll take a look into combinig KSTATE with KHO.

Wonderful. KHO uses kho_preserve_folio() to mark a folio to be preserved.
After kexec, it uses kho_restore_folio/page() to restore a folio.
There is also kho_preserve_phys(); you should only use it for memory
that does not have a struct page, e.g. CMA.

You can take a look at the current version of LUO and replace the libFDT
usage with KSTATE; that would be a good starting point.

> > debugging purposes. It is a very useful feature of FDT to dump the
> > current saving state into a human readable form. KSTATE can have the
> > same feature added.
> >
>
> kstate_field already have string with name of the field:
>
> #define KSTATE_BASE_TYPE(_f, _state, _type) {           \
>         .name = (__stringify(_f)),                      \
>
> Currently it's not used in code, but it's there for debug purposes

That is good to know.
We need to have something like the FDT debugfs node for kstate.
>
>
> >>    While with KSTATE solves the same problem in more elegant way, with this:
> >>         struct kstate_description a_state = {
> >>                 .name = "a_struct",
> >>                 .version_id = 1,
> >>                 .id = KSTATE_TEST_ID,
> >>                 .state_list = LIST_HEAD_INIT(test_state.state_list),
> >>                 .fields = (const struct kstate_field[]) {
> >>                         KSTATE_BASE_TYPE(i, struct a, int),
> >>                         KSTATE_BASE_TYPE(s, struct a, char [10]),
> >>                         KSTATE_POINTER(p_ulong, struct a),
> >>                         KSTATE_PAGE(page, struct a),
> >>                         KSTATE_END_OF_LIST()
> >>                 },
> >>         };
> >>
> >>
> >>         {
> >>                 static unsigned long ulong
> >>                 static struct a a_data = { .p_ulong = &ulong };
> >>
> >>                 kstate_register(&test_state, &a_data);
> >>         }
> >>
> >>        The driver needs only to have a proper 'kstate_description' and call kstate_register()
> >>        to save/restore a_data.
> >>        Basically 'struct kstate_description' provides instructions how to save/restore 'struct a'.
> >>        And kstate_register() does all this save/restore stuff under the hood.
> >
> > It seems the KSTATE uses one contiguous stream and the object has to
> > be loaded in the order it was saved. For the PCI code, the PCI device
> > scanning and probing might cause the device load out of the order of
> > saving. (The PCI probing is actually the reverse order of saving).
> > This kstate_register() might pose restrictions on the restore order.
> > PCI will need to look up and find the device state based on the PCI
> > device ID. Other subsystems will likely have the requirement to look
> > up their own saved state as well.
> > I also see KSTATE can be extended to support that.
> >
>
> Absolutely agreed. I think we need to decouple restore and register, ie remove
>  restore_misgrate_state() from kstate_register(). Add instance_id argument to kstate_register(),
> so the PCI code could do:
>             kstate_register(&pci_state, pdev, PCI_DEVID(pdev->bus->number, pdev->devfn));
The instance ID needs to include the PCI domain number in addition to PCI_DEVID.
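One way to fold the domain into the instance ID, as a rough sketch. The two macros mirror the kernel's PCI_DEVFN()/PCI_DEVID() encoding and are redefined here only so the example builds in userspace; `kstate_pci_instance_id()` is a hypothetical helper, and the 64-bit result is an assumption made so that domain numbers above 0xffff still fit.

```c
#include <assert.h>
#include <stdint.h>

/* Same bit layout as the kernel macros, redefined for a userspace build. */
#define PCI_DEVFN(slot, func)  ((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define PCI_DEVID(bus, devfn)  ((((uint16_t)(bus)) << 8) | (devfn))

/* Hypothetical helper: instance ID = domain << 16 | bus << 8 | devfn. */
static uint64_t kstate_pci_instance_id(uint32_t domain, unsigned int bus,
                                       unsigned int devfn)
{
        assert(bus <= 0xff && devfn <= 0xff);
        return ((uint64_t)domain << 16) | PCI_DEVID(bus, devfn);
}
```

With this encoding two devices at the same bus/devfn in different domains get different IDs.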

I expect to call something like kstate_save() in the LUO prepare callback.
LUO has a few stages. "Prepare" is where you save most of the state while
the VM is still running.
"Reboot" is where the VM is already paused; it is the last chance to save
anything before kexec.

> And on probing stage (probably in pci_device_add()) call
>         kstate_restore(&pci_state, dev, PCI_DEVID(bus->number, dev->devfn))

It needs to happen before that. pci_setup_device() already needs to know
whether the device is keepalive or not. If the device is keepalive,
the PCI core will need to re-create the device state from an already
running device rather than initialize the device afresh. It needs to
avoid performing PCI config space writes to keepalive PCI devices.

>
> which would locate state for the device if any and restore it.
>
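A keyed lookup of this kind could look roughly like the following. It is only a sketch: `struct saved_entry` and `saved_entry_find()` are hypothetical and do not reflect the entry layout used by the patches. The point is that restore finds state by (description ID, instance ID) rather than by position in the stream, so probing order no longer matters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical index record for one saved object. */
struct saved_entry {
        uint32_t desc_id;       /* which description wrote it */
        uint64_t instance_id;   /* e.g. encoded PCI domain/bus/devfn */
        const void *data;
        size_t size;
};

/*
 * Find an entry by key. Returns NULL when the previous kernel saved
 * nothing for this instance, so the caller can initialize from scratch.
 */
static const struct saved_entry *
saved_entry_find(const struct saved_entry *tbl, size_t n,
                 uint32_t desc_id, uint64_t instance_id)
{
        assert(tbl || n == 0);
        for (size_t i = 0; i < n; i++)
                if (tbl[i].desc_id == desc_id &&
                    tbl[i].instance_id == instance_id)
                        return &tbl[i];
        return NULL;
}
```

A linear scan is used for brevity; a real index would more likely be a hash table or a sorted array.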
>
> >>   - KS_CUSTOM - Some non-trivial field that requires custom kstate_field->save()
> >>                ->restore() callbacks to save/restore data.
> >>
> >>   - KS_ARRAY_OF_POINTER - array of pointers, the size of array determined by the
> >>                          field->count() callback
> >>   - KS_ADDRESS - field is a pointer to either vmemmap area (struct page) or
> >>      linear address. Store offset
> >
> > I think we want to describe different stream types.
> > For example the most simple stream container is just a contiguous
> > buffer with start address and size.
> > The more complex one might be and size then an array of page pointers,
> > all those pointers add to up the new buffer which describe an saved
> > KSTATE that is larger than 4K and spread into an array of pages. Those
> > pages don't need to be contiguous. Such a page array buffer stores the
> > KSTATE entry described by a separate description table.
>
> Agreed, I had similar thoughts. But this complicates code, so I started with
> something simple.
>
Yes, we can start simple. Right now KSTATE only follows pointers in
the kernel C structs rather than in the saved state objects. We will need
to support pointer following in the saved state objects as well. That
is something very different from typical message serialization. In
the kernel it is easier to spread the state buffer into recursive C
structs that are stored in different pages. The kstate stream will be more
or less like a tree of objects.
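To make the page-array container concrete, here is a small userspace sketch. `struct page_array_buf` and `page_array_read()` are hypothetical names, and plain pointers stand in for whatever page references a real implementation would store. The reader sees one logical buffer even though the backing pages are not contiguous.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical container: a logical buffer spread over separate pages. */
struct page_array_buf {
        size_t size;            /* total bytes of saved state */
        void *const *pages;     /* one pointer per SKETCH_PAGE_SIZE chunk */
};

/*
 * Copy len bytes starting at logical offset off into dst.
 * Returns 0 on success, -1 if the range is outside the buffer.
 */
static int page_array_read(const struct page_array_buf *b, size_t off,
                           void *dst, size_t len)
{
        unsigned char *out = dst;

        assert(b && dst);
        if (off > b->size || len > b->size - off)
                return -1;

        while (len) {
                size_t in_page = off % SKETCH_PAGE_SIZE;
                size_t chunk = SKETCH_PAGE_SIZE - in_page;
                const unsigned char *page = b->pages[off / SKETCH_PAGE_SIZE];

                if (chunk > len)
                        chunk = len;
                memcpy(out, page + in_page, chunk);
                out += chunk;
                off += chunk;
                len -= chunk;
        }
        return 0;
}
```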

> >>
> >> The 'kstate_entry' contains information like ID of kstate_description, version
> >> of it, size of migration data and the data itself. The ->data is formed in
> >> accordance to the kstate_field's of the corresponding kstate_description.
> >
> > The version for the kstate_description might not be enough. The
> > version works if there is a linear history. Here we are likely to have
> > different vendors add their own extension to the device state saving.
>
> I think vendors can just declare separate kstate_description with different ID.
> The only problem with this, is that kstate_description.id is integer, so it would be
> a problem to allocate those without conflicts.
> Perhaps change the id to string? So vendors can just add vendor prefix to ID.

In my mind, the ID is a number that is unique within a struct type. Once
a number is assigned to a member field, it cannot be reused for any other
field in that struct; it stays with that field for life. If the field
gets deleted, that number still cannot be reused by another member of
that struct. Each field will also have a string name for debugging
purposes.

Different structs can use the same ID number, but it will have a different
meaning. Just like (struct a*)->foo has a different meaning than
(struct b*)->foo, they are in different namespaces.

If a vendor wants to make sure they never conflict with upstream, they
should allocate the ID in a vendor-specific struct. That way the vendor
gets their own struct namespace. The struct has a text name, so it will
not conflict with other vendors. The field ID remains a number.
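The "never reuse an ID" rule can be checked mechanically. The sketch below is an assumption about how such a check might look; `struct field_id`, `id_is_retired()` and `field_ids_validate()` are hypothetical names. Each struct carries its own list of retired IDs, which is what gives every struct a separate ID namespace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct field_id {
        uint16_t id;            /* permanent number, scoped to one struct */
        const char *name;       /* debug aid only */
};

/* Returns 1 if id belonged to a deleted field and must stay unused. */
static int id_is_retired(const uint16_t *retired, size_t nr, uint16_t id)
{
        for (size_t i = 0; i < nr; i++)
                if (retired[i] == id)
                        return 1;
        return 0;
}

/* Returns 0 if every ID is used once and no retired ID is reused, else -1. */
static int field_ids_validate(const struct field_id *f, size_t nf,
                              const uint16_t *retired, size_t nr)
{
        assert(f || nf == 0);
        for (size_t i = 0; i < nf; i++) {
                if (id_is_retired(retired, nr, f[i].id))
                        return -1;
                for (size_t j = i + 1; j < nf; j++)
                        if (f[j].id == f[i].id)
                                return -1;
        }
        return 0;
}
```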

> > I suggest instead we save the old kernel's kstate_description table
> > (once per description table as a recursive object)  alongside the
> > object physical address as well. The new kernel has their new version
> > of the description table. It can compare between the old and new
> > description tables and find out what fields need to be upgraded or
> > downgraded. The new kernel will use the old kstate_description table
> > to decode the previous kernel's saved object.
>
> Hmm.. I'm not sure, there is a lot to think about.
>  This might make changes in kstate_description painful,

Let me clarify. I don't mean to save the V2.1 kstate_description as it
is. The current kstate_description has a type system and a run-time
portion as well. The run-time portion is not saved, only the type
system and possible value descriptions (min, max, enum, etc.).
> e.g. if I want to rearrange some ->flags for whatever reason.
If possible, it would be best not to save those flags. However, if a flag
is used to describe how the wire-format object is laid out, it will
have to be saved and becomes part of the ABI as well. There is no way to
get around that even without saving the descriptor table.
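As an illustration of decoding with the old kernel's table, here is a userspace sketch. `struct wire_field`, `wire_find()` and `wire_convert()` are hypothetical; the record holds only layout information (ID, size, offset) and no callbacks, in line with the idea that the run-time portion is not saved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical saved field description: layout only, no run-time state. */
struct wire_field {
        uint16_t id;
        uint16_t size;
        uint32_t offset;
};

static const struct wire_field *
wire_find(const struct wire_field *t, size_t n, uint16_t id)
{
        for (size_t i = 0; i < n; i++)
                if (t[i].id == id)
                        return &t[i];
        return NULL;
}

/*
 * Copy every field that both tables describe with the same size from the
 * old object layout into the new one. Returns the number of fields carried
 * over; anything else is left untouched for a subsystem-specific step.
 */
static size_t wire_convert(const struct wire_field *old_t, size_t old_n,
                           const void *old_obj,
                           const struct wire_field *new_t, size_t new_n,
                           void *new_obj)
{
        size_t copied = 0;

        assert(old_obj && new_obj);
        for (size_t i = 0; i < new_n; i++) {
                const struct wire_field *o =
                        wire_find(old_t, old_n, new_t[i].id);

                if (!o || o->size != new_t[i].size)
                        continue;
                memcpy((unsigned char *)new_obj + new_t[i].offset,
                       (const unsigned char *)old_obj + o->offset, o->size);
                copied++;
        }
        return copied;
}
```

Because matching is by ID, the new kernel can reorder, add or drop members without keeping a history of older layouts.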

> So how to deal with changes in kstate_description itself?
The members of kstate_description itself will have a description
table to describe them as well. There will be a minimal set of
kstate_description features to describe other capabilities.
The kstate_description can have a capability array (inspired by PCI
capabilities) that declares what it supports.

For example, the basic version only supports two buffer container types:
1) pointer with size; 2) <array counter n> + linear array layout in
memory of n elements.
The new kernel adds container type 3): a singly linked list of page
pointer arrays. It points to a page containing {<next list page pfn> +
<array counter n> + 510<page pfn array>}. The pointer list array can
describe a kvmalloc buffer without recursively allocating page pointer
arrays. The new kernel can detect that the old kernel does not have this
capability. When rolling back to the old kernel, it does not use the
pointer linked list array to write out the saved state objects.
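The numbers work out to exactly one 4K page with 64-bit entries: 8 + 8 + 510 * 8 = 4096 bytes. A sketch of the list page and of the capability check follows. All names (`struct pfn_list_page`, the `SKETCH_CAP_*` bits, `pick_container()`, `pfn_list_pages_needed()`) are hypothetical, and a bitmask stands in for the capability array.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE   4096
#define PFNS_PER_LIST_PAGE 510

/* One page of a singly linked PFN list. */
struct pfn_list_page {
        uint64_t next_pfn;                  /* next list page; 0 ends the list (assumption) */
        uint64_t count;                     /* valid entries in pfns[] */
        uint64_t pfns[PFNS_PER_LIST_PAGE];
};
static_assert(sizeof(struct pfn_list_page) == SKETCH_PAGE_SIZE,
              "a list page must fill exactly one 4K page");

/* Hypothetical container capabilities, one bit each. */
enum {
        SKETCH_CAP_PTR_SIZE  = 1u << 0,     /* 1) pointer with size */
        SKETCH_CAP_LIN_ARRAY = 1u << 1,     /* 2) counter + linear array */
        SKETCH_CAP_PFN_LIST  = 1u << 2,     /* 3) linked list of PFN pages */
};

/* Choose a container the kernel that will read the stream understands. */
static uint32_t pick_container(uint32_t reader_caps)
{
        if (reader_caps & SKETCH_CAP_PFN_LIST)
                return SKETCH_CAP_PFN_LIST;
        return SKETCH_CAP_LIN_ARRAY;
}

/* Number of list pages needed to describe npages data pages. */
static uint64_t pfn_list_pages_needed(uint64_t npages)
{
        return (npages + PFNS_PER_LIST_PAGE - 1) / PFNS_PER_LIST_PAGE;
}
```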

The description table will have a version number as well, as a last
line of defense if we have to introduce a change that breaks description
table compatibility. We can bump that version. That is the only place
that needs a version number. Other capabilities should always be
described using the capability feature set.

>
> How do we save links to methods in kstate_field (->restore/->save/->count),
> and what if we'll need to change function prototypes of these methods ?

I consider those run-time behavior. I hope we don't need to save
the function pointers.

Chris

> > I think that way it is more flexible to support adding one or more features and not tied to
> > which version has what feature. It can also make sure the new kernel
> > can always dump the old KSTATE into YAML.
> >
> > That way we might be able to simplify the subsection and the
> > depreciation flags. The new kernel doesn't need to carry the history
> > of changes made to the old description table.
> >
> >> After the reboot, when the kstate_register() called it parses migration
> >> stream, finds the appropriate 'kstate_entry' and restores the contents of
> >> the object in accordance with kstate_description and ->fields.
> >
> > Again this restoring can happen in a different order when the PCI
> > device scanning and probing order. The restoration might not happen in
> > one single call chain. Material for V3?
>
> Agreed.
>
> > I am happy to work with you to get KSTATE working with the existing KHO effort.
> >
> Thanks for useful feedback, appreciated.

Re: [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec

Posted by Cong Wang 11 months ago

Hi Andrey,

On Mon, Mar 10, 2025 at 5:04 AM Andrey Ryabinin <arbn@yandex-team.com> wrote:
>     Each driver/subsystem has to solve this problem in their own way.
>     Also if we use fdt properties for individual fields, that might be wastefull
>     in terms of used memory, as these properties use strings as keys.
>
>    While with KSTATE solves the same problem in more elegant way, with this:
>         struct kstate_description a_state = {
>                 .name = "a_struct",
>                 .version_id = 1,
>                 .id = KSTATE_TEST_ID,
>                 .state_list = LIST_HEAD_INIT(test_state.state_list),
>                 .fields = (const struct kstate_field[]) {
>                         KSTATE_BASE_TYPE(i, struct a, int),
>                         KSTATE_BASE_TYPE(s, struct a, char [10]),
>                         KSTATE_POINTER(p_ulong, struct a),
>                         KSTATE_PAGE(page, struct a),
>                         KSTATE_END_OF_LIST()
>                 },
>         };

Hmm, this still requires manual effort to implement, so potentially
a lot of work given how many drivers we have in-tree.

And those KSTATE_* macros look a lot like BTF:
https://docs.kernel.org/bpf/btf.html

So, any possibility to reuse BTF here? Note, BTF is automatically
generated by pahole, no manual effort is required.

Regards,
Cong Wang

Re: [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec

Posted by Andrey Ryabinin 11 months ago

On Tue, Mar 11, 2025 at 3:28 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> Hi Andrey,
>
> On Mon, Mar 10, 2025 at 5:04 AM Andrey Ryabinin <arbn@yandex-team.com> wrote:
> >     Each driver/subsystem has to solve this problem in their own way.
> >     Also if we use fdt properties for individual fields, that might be wastefull
> >     in terms of used memory, as these properties use strings as keys.
> >
> >    While with KSTATE solves the same problem in more elegant way, with this:
> >         struct kstate_description a_state = {
> >                 .name = "a_struct",
> >                 .version_id = 1,
> >                 .id = KSTATE_TEST_ID,
> >                 .state_list = LIST_HEAD_INIT(test_state.state_list),
> >                 .fields = (const struct kstate_field[]) {
> >                         KSTATE_BASE_TYPE(i, struct a, int),
> >                         KSTATE_BASE_TYPE(s, struct a, char [10]),
> >                         KSTATE_POINTER(p_ulong, struct a),
> >                         KSTATE_PAGE(page, struct a),
> >                         KSTATE_END_OF_LIST()
> >                 },
> >         };
>
> Hmm, this still requires manual efforts to implement this, so potentially
> a lot of work given how many drivers we have in-tree.
>

We are not going to make every possible driver able to persist its state.
I think the main target is the VFIO driver, which also implies PCI/IOMMU.

Besides, we'll need to persist only some fields of the struct, not the
entire thing.
There is no way to automate such decisions, so there will be some
manual effort anyway.

> And those KSTATE_* stuffs look a lot similar to BTF:
> https://docs.kernel.org/bpf/btf.html
>
> So, any possibility to reuse BTF here?

Perhaps, but I don't see it right away. I'll think about it.

> Note, BTF is automatically generated by pahole, no manual effort is required.

Nothing will save us from the manual effort of choosing which parts of
the data we want to save, so there has to be some way to mark that data.
Also, the same C type may represent different kinds of data, e.g.
we may have an address of some persistent data (in the linear mapping)
stored as an 'unsigned long address'.
Because of KASLR we can't copy 'address' by value; we'll need to save
it as an offset from PAGE_OFFSET and add the PAGE_OFFSET of the new
kernel on restore.
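The offset arithmetic is small enough to show in full. This is a userspace sketch: the two helpers are hypothetical, and the base addresses in the usage note are made-up example values rather than real PAGE_OFFSET values.

```c
#include <assert.h>
#include <stdint.h>

/* Save side: strip the old kernel's linear-map base from the address. */
static uint64_t linear_addr_to_offset(uint64_t addr, uint64_t old_page_offset)
{
        assert(addr >= old_page_offset);
        return addr - old_page_offset;
}

/* Restore side: rebase the saved offset onto the new kernel's linear map. */
static uint64_t offset_to_linear_addr(uint64_t offset, uint64_t new_page_offset)
{
        return new_page_offset + offset;
}
```

For example, an address 0x1234000 bytes into a linear map based at 0xffff888000000000 is saved as 0x1234000 and restored as 0xffff9a4001234000 if the new kernel's base is 0xffff9a4000000000. Type information alone cannot tell this case apart from an ordinary 'unsigned long', which is why the field has to be marked by hand.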

Re: [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec

Posted by Chris Li 9 months, 2 weeks ago

On Tue, Mar 11, 2025 at 5:19 AM Andrey Ryabinin <ryabinin.a.a@gmail.com> wrote:
> > Hmm, this still requires manual efforts to implement this, so potentially
> > a lot of work given how many drivers we have in-tree.
> >
>
> We are not going to have every possible driver to be able to persist its state.
> I think the main target is VFIO driver which also implies PCI/IOMMU.
>
> Besides, we'll need to persist only some fields of the struct, not the
> entire thing.
> There is no way to automate such decisions, so there will be some
> manual effort anyway.
>
>
> > And those KSTATE_* stuffs look a lot similar to BTF:
> > https://docs.kernel.org/bpf/btf.html
> >
> > So, any possibility to reuse BTF here?
>
> Perhaps, but I don't see it right away. I'll think about it.

It may be possible to use tools to lighten the repetitive part of the
work.
For example, the sparse checker could be used to examine the struct fields.

>
> > Note, BTF is automatically generated by pahole, no manual effort is required.
>
> Nothing will save us from manual efforts of what parts of data we want to save,
> so there has to be some way to mark that data.
> Also same C types may represent different kind of data, e.g.
> we may have an address to some persistent data (in linear mapping)
> stored as an 'unsigned long address'.
> Because of KASLR we can't copy 'address' by value, we'll need to save
> it as an offset from PAGE_OFFSET
> and add PAGE_OFFSET of the new kernel on restore.

Agree, there will be cases requiring manual intervention. It is
unlikely to fully automate this process.

Chris

