From: Andrey Ryabinin
To: linux-kernel@vger.kernel.org
Cc: Alexander Graf, James Gowans, Mike Rapoport, Andrew Morton,
 linux-mm@kvack.org, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
 Dave Hansen, x86@kernel.org, "H. Peter Anvin", Eric Biederman,
 kexec@lists.infradead.org, Steven Rostedt, Masami Hiramatsu,
 Mathieu Desnoyers, linux-trace-kernel@vger.kernel.org,
 valesini@yandex-team.com, Andrey Ryabinin
Subject: [RFC PATCH 1/7] kstate: Add kstate - a mechanism to migrate some kernel state across kexec
Date: Wed, 2 Oct 2024 18:07:16 +0200
Message-ID: <20241002160722.20025-2-arbn@yandex-team.com>
In-Reply-To: <20241002160722.20025-1-arbn@yandex-team.com>
References: <20241002160722.20025-1-arbn@yandex-team.com>

kstate (kernel state) is a mechanism to describe (part of) the internal
kernel state, save it into memory, and restore it in the new kernel
after kexec.

The end goal and main use case is to be able to update the host kernel
under VMs with VFIO pass-through devices running on that host. We are
still pretty far from that goal. This and the following patches only try
to establish some basic infrastructure to describe and migrate complex
in-kernel states. As a demonstration, the state of the trace buffer is
migrated across kexec to the new kernel (in the follow-up patches).

States (usually some struct) are described by a
'struct kstate_description' containing an array of individual field
descriptions - 'struct kstate_field'. Fields have different types:

  KS_SIMPLE - trivial type that is just copied by value
  KS_POINTER - field contains a pointer; it is dereferenced to copy
               the value during the save/restore phases
  KS_STRUCT - contains another struct; field->ksd must point to
              another 'struct kstate_description'
  KS_CUSTOM - something that does not fit the trivial types above;
              for such fields the field->save()/->restore() callbacks
              must do all the work
  KS_ARRAY_OF_POINTER - array of pointers; the size of the array is
                        determined by the field->count() callback
  KS_END - special flag indicating the end of migration stream data.

kstate_register() accepts a kstate_description along with an instance
of an object and registers it in the global 'states' list.

During the kexec reboot phase this list is iterated, and for each
instance in the list a 'struct kstate_entry' is formed and saved in the
migration stream. 'kstate_entry' contains information such as the ID of
the kstate_description, its version, the size of the migration data and
the data itself.

After the reboot, when kstate_register() is called, it parses the
migration stream, finds the appropriate 'kstate_entry' and restores the
contents of the object.

This is an early RFC, so the code is somewhat hacky and some parts of
this feature aren't well thought through yet (like dealing with struct
changes between the old and new kernel, the fixed size of the migration
stream memory, and many more).
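
To make the table-driven idea above more concrete, here is a minimal
userspace sketch of it. It is not the kernel code from this patch: the
names demo_flags, demo_field, struct counters, demo_save() and
demo_restore() are hypothetical, and only the KS_SIMPLE-like and
KS_POINTER-like cases are modelled (no nested structs, callbacks,
versioning or entry headers).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace stand-ins for the kstate field flags. */
enum demo_flags {
	DEMO_SIMPLE  = 1 << 0,	/* copy the field bytes by value */
	DEMO_POINTER = 1 << 1,	/* follow the pointer, copy the pointee */
	DEMO_END     = 1 << 30,	/* terminates the field table */
};

struct demo_field {
	size_t offset;		/* offset of the field inside the object */
	size_t size;		/* number of bytes to copy */
	unsigned int flags;
};

/* Example object: one plain value and one pointer to a value. */
struct counters {
	uint32_t hits;
	uint64_t *total;
};

/* Description of 'struct counters', analogous to a kstate field array. */
static const struct demo_field counters_fields[] = {
	{ offsetof(struct counters, hits),  sizeof(uint32_t), DEMO_SIMPLE },
	{ offsetof(struct counters, total), sizeof(uint64_t), DEMO_POINTER },
	{ 0, 0, DEMO_END },
};

/* Find where the bytes described by a field live for a given object. */
static void *field_addr(void *obj, const struct demo_field *f)
{
	void *p = (char *)obj + f->offset;

	if (f->flags & DEMO_POINTER) {
		void *target;

		/* The field itself holds a pointer: load it and follow it. */
		memcpy(&target, p, sizeof(target));
		p = target;
	}
	return p;
}

/* Walk the table and append every field to the stream; returns the new end. */
static uint8_t *demo_save(uint8_t *stream, void *obj,
			  const struct demo_field *f)
{
	for (; f->flags != DEMO_END; f++) {
		memcpy(stream, field_addr(obj, f), f->size);
		stream += f->size;
	}
	return stream;
}

/* Walk the same table and copy the bytes back into an existing object. */
static const uint8_t *demo_restore(const uint8_t *stream, void *obj,
				   const struct demo_field *f)
{
	for (; f->flags != DEMO_END; f++) {
		memcpy(field_addr(obj, f), stream, f->size);
		stream += f->size;
	}
	return stream;
}
```

In this sketch, as in restore_kstate() below, the restoring side follows
the pointer already stored in the object, so the pointed-to memory must
exist before the restore runs.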

Signed-off-by: Andrey Ryabinin
---
 include/linux/kstate.h | 118 ++++++++++++++++++++++++
 kernel/Kconfig.kexec   |  12 +++
 kernel/Makefile        |   1 +
 kernel/kstate.c        | 198 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 329 insertions(+)
 create mode 100644 include/linux/kstate.h
 create mode 100644 kernel/kstate.c

diff --git a/include/linux/kstate.h b/include/linux/kstate.h
new file mode 100644
index 0000000000000..c97804d0243ea
--- /dev/null
+++ b/include/linux/kstate.h
@@ -0,0 +1,118 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _KSTATE_H
+#define _KSTATE_H
+
+#include
+#include
+#include
+
+struct kstate_description;
+enum kstate_flags {
+	KS_SIMPLE = (1 << 0),
+	KS_POINTER = (1 << 1),
+	KS_STRUCT = (1 << 2),
+	KS_CUSTOM = (1 << 3),
+	KS_ARRAY_OF_POINTER = (1 << 4),
+	KS_END = (1UL << 31),
+};
+
+struct kstate_field {
+	const char *name;
+	size_t offset;
+	size_t size;
+	enum kstate_flags flags;
+	const struct kstate_description *ksd;
+	int version_id;
+	int (*restore)(void *mig_stream, void *obj, const struct kstate_field *field);
+	int (*save)(void *mig_stream, void *obj, const struct kstate_field *field);
+	int (*count)(void);
+};
+
+enum kstate_ids {
+	KSTATE_LAST_ID = -1,
+};
+
+struct kstate_description {
+	const char *name;
+	enum kstate_ids id;
+	atomic_t instance_id;
+	int version_id;
+	struct list_head state_list;
+
+	const struct kstate_field *fields;
+};
+
+struct state_entry {
+	u64 id;
+	struct list_head list;
+	struct kstate_description *kstd;
+	void *obj;
+};
+
+static inline bool kstate_get_byte(void **mig_stream)
+{
+	bool ret = **(u8 **)mig_stream;
+	(*mig_stream)++;
+	return ret;
+}
+static inline void *kstate_save_byte(void *mig_stream, u8 val)
+{
+	*(u8 *)mig_stream = val;
+	return mig_stream + sizeof(val);
+}
+
+static inline void *kstate_save_ulong(void *mig_stream, unsigned long val)
+{
+	*(unsigned long *)mig_stream = val;
+	return mig_stream + sizeof(val);
+}
+static inline unsigned long kstate_get_ulong(void **mig_stream)
+{
+	unsigned long ret = **(unsigned long **)mig_stream;
+	(*mig_stream) += sizeof(unsigned long);
+	return ret;
+}
+
+#ifdef CONFIG_KSTATE
+bool is_migrate_kernel(void);
+
+void save_migrate_state(unsigned long mig_stream);
+
+void __kstate_register(struct kstate_description *state,
+		void *obj, struct state_entry *se);
+int kstate_register(struct kstate_description *state, void *obj);
+
+struct kstate_entry;
+void *save_kstate(void *stream, int id, const struct kstate_description *kstate,
+		void *obj);
+void *restore_kstate(struct kstate_entry *ke, int id,
+		const struct kstate_description *kstate, void *obj);
+#else
+
+#define __kstate_register(state, obj, se)
+#define kstate_register(state, obj)
+
+static inline void save_migrate_state(unsigned long mig_stream) { }
+
+#endif
+
+
+#define KSTATE_SIMPLE(_f, _state) { \
+	.name = (__stringify(_f)), \
+	.size = sizeof_field(_state, _f), \
+	.flags = KS_SIMPLE, \
+	.offset = offsetof(_state, _f), \
+	}
+
+#define KSTATE_POINTER(_f, _state) { \
+	.name = (__stringify(_f)), \
+	.size = sizeof(*(((_state *)0)->_f)), \
+	.flags = KS_POINTER, \
+	.offset = offsetof(_state, _f), \
+	}
+
+#define KSTATE_END_OF_LIST() { \
+	.flags = KS_END,\
+	}
+
+#endif
diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
index 6c34e63c88ff4..d8fecf29e384a 100644
--- a/kernel/Kconfig.kexec
+++ b/kernel/Kconfig.kexec
@@ -151,4 +151,16 @@ config CRASH_MAX_MEMORY_RANGES
 	  the computation behind the value provided through the
 	  /sys/kernel/crash_elfcorehdr_size attribute.
 
+config KSTATE
+	bool "Migrate certain internal kernel state across kexec"
+	default n
+	depends on CRASH_DUMP
+	help
+	  Enable functionality to migrate some internal kernel states to new
+	  kernel across kexec. Currently capable only migrating trace buffers
+	  as an example. Can be extended to other states like IOMMU page tables,
+	  VFIO state of the device...
+	  Description of the trace buffer saved into memory preserved across kexec.
+	  The new kernel reads description to restore the state of trace buffers.
+
 endmenu
diff --git a/kernel/Makefile b/kernel/Makefile
index 87866b037fbed..6bdf947fc84f5 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -75,6 +75,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_core.o
 obj-$(CONFIG_KEXEC) += kexec.o
 obj-$(CONFIG_KEXEC_FILE) += kexec_file.o
 obj-$(CONFIG_KEXEC_ELF) += kexec_elf.o
+obj-$(CONFIG_KSTATE) += kstate.o
 obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup/
diff --git a/kernel/kstate.c b/kernel/kstate.c
new file mode 100644
index 0000000000000..0ef228baef94e
--- /dev/null
+++ b/kernel/kstate.c
@@ -0,0 +1,198 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include
+#include
+#include
+#include
+#include
+
+static LIST_HEAD(states);
+
+struct kstate_entry {
+	int state_id;
+	int version_id;
+	int instance_id;
+	int size;
+	DECLARE_FLEX_ARRAY(u8, data);
+};
+
+void *save_kstate(void *stream, int id, const struct kstate_description *kstate,
+		void *obj)
+{
+	const struct kstate_field *field = kstate->fields;
+	struct kstate_entry *ke = stream;
+
+	stream = ke->data;
+
+	ke->state_id = kstate->id;
+	ke->version_id = kstate->version_id;
+	ke->instance_id = id;
+
+	while (field->flags != KS_END) {
+		void *first, *cur;
+		int n_elems = 1;
+		int size, i;
+
+		first = obj + field->offset;
+
+		if (field->flags & KS_POINTER)
+			first = *(void **)(obj + field->offset);
+		if (field->count)
+			n_elems = field->count();
+		size = field->size;
+		for (i = 0; i < n_elems; i++) {
+			cur = first + i * size;
+
+			if (field->flags & KS_ARRAY_OF_POINTER)
+				cur = *(void **)cur;
+
+			if (field->flags & KS_STRUCT)
+				stream = save_kstate(stream, 0, field->ksd, cur);
+			else if (field->flags & KS_CUSTOM) {
+				if (field->save)
+					stream += field->save(stream, cur, field);
+			} else if (field->flags & (KS_SIMPLE|KS_POINTER)) {
+				memcpy(stream, cur, size);
+				stream += size;
+			} else
+				WARN_ON_ONCE(1);
+
+		}
+		field++;
+
+	}
+
+	ke->size = (u8 *)stream - ke->data;
+	return stream;
+}
+
+void save_migrate_state(unsigned long mig_stream)
+{
+	struct state_entry *se;
+	struct kstate_entry *ke;
+	void *dest;
+	struct page *page;
+
+	page = boot_pfn_to_page(mig_stream >> PAGE_SHIFT);
+	arch_kexec_post_alloc_pages(page_address(page), 512, 0);
+	dest = page_address(page);
+	list_for_each_entry(se, &states, list)
+		dest = save_kstate(dest, se->id, se->kstd, se->obj);
+	ke = dest;
+	ke->state_id = KSTATE_LAST_ID;
+}
+
+void *restore_kstate(struct kstate_entry *ke, int id,
+		const struct kstate_description *kstate, void *obj)
+{
+	const struct kstate_field *field = kstate->fields;
+	u8 *stream = ke->data;
+
+	WARN_ONCE(ke->version_id != kstate->version_id, "version mismatch %d %d\n",
+		ke->version_id, kstate->version_id);
+
+	WARN_ONCE(ke->instance_id != id, "instance id mismatch %d %d\n",
+		ke->instance_id, id);
+
+	while (field->flags != KS_END) {
+		void *first, *cur;
+		int n_elems = 1;
+		int size, i;
+
+		first = obj + field->offset;
+		if (field->flags & KS_POINTER)
+			first = *(void **)(obj + field->offset);
+		if (field->count)
+			n_elems = field->count();
+		size = field->size;
+		for (i = 0; i < n_elems; i++) {
+			cur = first + i * size;
+
+			if (field->flags & KS_ARRAY_OF_POINTER)
+				cur = *(void **)cur;
+
+			if (field->flags & KS_STRUCT)
+				stream = restore_kstate((struct kstate_entry *)stream,
+						0, field->ksd, cur);
+			else if (field->flags & KS_CUSTOM) {
+				if (field->restore)
+					stream += field->restore(stream, cur, field);
+			} else if (field->flags & (KS_SIMPLE|KS_POINTER)) {
+				memcpy(cur, stream, size);
+				stream += size;
+			} else
+				WARN_ON_ONCE(1);
+
+		}
+		field++;
+	}
+
+	return stream;
+}
+
+static void restore_migrate_state(unsigned long mig_stream,
+		struct state_entry *se)
+{
+	char *dest;
+	struct kstate_entry *ke;
+
+	if (mig_stream == -1)
+		return;
+
+	dest = phys_to_virt(mig_stream);
+	ke = (struct kstate_entry *)dest;
+	while (ke->state_id != KSTATE_LAST_ID) {
+		if (ke->state_id != se->kstd->id ||
+		    ke->instance_id != se->id) {
+			ke = (struct kstate_entry *)(ke->data + ke->size);
+			continue;
+		}
+
+		restore_kstate(ke, se->id, se->kstd, se->obj);
+		ke = (struct kstate_entry *)(ke->data + ke->size);
+	}
+}
+
+unsigned long long migrate_stream_addr = -1;
+EXPORT_SYMBOL_GPL(migrate_stream_addr);
+unsigned long long migrate_stream_size;
+
+bool is_migrate_kernel(void)
+{
+	return migrate_stream_addr != -1;
+}
+
+void __kstate_register(struct kstate_description *state, void *obj, struct state_entry *se)
+{
+	se->kstd = state;
+	se->id = atomic_inc_return(&state->instance_id);
+	se->obj = obj;
+	list_add(&se->list, &states);
+	restore_migrate_state(migrate_stream_addr, se);
+}
+
+int kstate_register(struct kstate_description *state, void *obj)
+{
+	struct state_entry *se;
+
+	se = kmalloc(sizeof(*se), GFP_KERNEL);
+	if (!se)
+		return -ENOMEM;
+
+	__kstate_register(state, obj, se);
+	return 0;
+}
+
+static int __init setup_migrate(char *arg)
+{
+	char *end;
+
+	if (!arg)
+		return -EINVAL;
+	migrate_stream_addr = memparse(arg, &end);
+	if (*end == '@') {
+		migrate_stream_size = migrate_stream_addr;
+		migrate_stream_addr = memparse(end + 1, &end);
+	}
+	return end > arg ? 0 : -EINVAL;
+}
+early_param("migrate_stream", setup_migrate);
-- 
2.45.2

From: Andrey Ryabinin
To: linux-kernel@vger.kernel.org
Cc: Alexander Graf, James Gowans, Mike Rapoport, Andrew Morton,
 linux-mm@kvack.org, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
 Dave Hansen, x86@kernel.org, "H. Peter Anvin", Eric Biederman,
 kexec@lists.infradead.org, Steven Rostedt, Masami Hiramatsu,
 Mathieu Desnoyers, linux-trace-kernel@vger.kernel.org,
 valesini@yandex-team.com, Andrey Ryabinin
Subject: [RFC PATCH 2/7] kexec: Hack and abuse crashkernel for the kstate's migration stream
Date: Wed, 2 Oct 2024 18:07:17 +0200
Message-ID: <20241002160722.20025-3-arbn@yandex-team.com>
In-Reply-To: <20241002160722.20025-1-arbn@yandex-team.com>
References: <20241002160722.20025-1-arbn@yandex-team.com>

This is an early, ugly hack just for now. It will be completely redone
later. It abuses the crashkernel memory segment for kstate purposes, to
save and restore object descriptions.
The proper solution would probably be to use segments in the ordinary
kexec mechanism; however, since kstate requires such segments very late
(at the reboot stage, not the load stage), some thought and work will be
required to make that happen. The KEXEC_FILE_MIGRATE/KEXEC_TYPE_MIGRATE
flags also likely won't be required.

Signed-off-by: Andrey Ryabinin
---
 arch/x86/kernel/kexec-bzimage64.c  | 36 ++++++++++++++++++++++++++++++
 arch/x86/kernel/machine_kexec_64.c |  5 ++++-
 include/linux/kexec.h              |  6 +++--
 include/uapi/linux/kexec.h         |  2 ++
 kernel/crash_core.c                |  3 ++-
 kernel/kexec_core.c                | 10 ++++++++-
 kernel/kexec_file.c                | 15 +++++++++++--
 7 files changed, 70 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 68530fad05f74..71c82841e6b12 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -77,6 +78,11 @@ static int setup_cmdline(struct kimage *image, struct boot_params *params,
 		len = sprintf(cmdline_ptr,
 			"elfcorehdr=0x%lx ", image->elf_load_addr);
 	}
+	if (image->type == KEXEC_TYPE_MIGRATE) {
+		len = sprintf(cmdline_ptr,
+			"migrate_stream=0x0%llx ", crashk_res.start);
+	}
+
 	memcpy(cmdline_ptr + len, cmdline, cmdline_len);
 	cmdline_len += len;
 
@@ -389,6 +395,29 @@ static int bzImage64_probe(const char *buf, unsigned long len)
 	return ret;
 }
 
+static int load_migrate_segments(struct kimage *image)
+{
+	int ret;
+	struct kexec_buf kbuf = { .image = image, .buf_min = 0,
+				  .buf_max = ULONG_MAX, .top_down = false };
+
+	kbuf.bufsz = 4096;
+	kbuf.buffer = vzalloc(kbuf.bufsz);
+
+	kbuf.memsz = 8*1024*1024;
+
+	kbuf.buf_align = ELF_CORE_HEADER_ALIGN;
+	kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
+	ret = kexec_add_buffer(&kbuf);
+	if (ret)
+		return ret;
+	image->mig_stream = kbuf.mem;
+	kexec_dprintk("kstate: Loaded mig_stream at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+		      image->mig_stream, kbuf.bufsz, kbuf.memsz);
+
+	return ret;
+}
+
 static void *bzImage64_load(struct kimage *image, char *kernel,
 			    unsigned long kernel_len, char *initrd,
 			    unsigned long initrd_len, char *cmdline,
@@ -444,6 +473,13 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
 	}
 #endif
 
+	if (image->type == KEXEC_TYPE_MIGRATE) {
+		ret = load_migrate_segments(image);
+		if (ret)
+			return ERR_PTR(ret);
+
+	}
+
 	/*
 	 * Load purgatory. For 64bit entry point, purgatory code can be
 	 * anywhere.
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 9c9ac606893e9..edf6234b75baf 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -572,7 +572,10 @@ static void kexec_mark_crashkres(bool protect)
 	kexec_mark_range(crashk_low_res.start, crashk_low_res.end, protect);
 
 	/* Don't touch the control code page used in crash_kexec().*/
-	control = PFN_PHYS(page_to_pfn(kexec_crash_image->control_code_page));
+	if (kexec_image && kexec_image->type & KEXEC_TYPE_MIGRATE)
+		control = PFN_PHYS(page_to_pfn(kexec_image->control_code_page));
+	else if (kexec_crash_image)
+		control = PFN_PHYS(page_to_pfn(kexec_crash_image->control_code_page));
 	/* Control code page is located in the 2nd page. */
 	kexec_mark_range(crashk_res.start, control + PAGE_SIZE - 1, protect);
 	control += KEXEC_CONTROL_PAGE_SIZE;
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index f0e9f8eda7a3c..182ef76f21860 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -299,6 +299,7 @@ struct kimage {
 	unsigned long start;
 	struct page *control_code_page;
 	struct page *swap_page;
+	unsigned long mig_stream;
 	void *vmcoreinfo_data_copy; /* locates in the crash memory */
 
 	unsigned long nr_segments;
@@ -312,9 +313,10 @@ struct kimage {
 	unsigned long control_page;
 
 	/* Flags to indicate special processing */
-	unsigned int type : 1;
+	unsigned int type : 2;
 #define KEXEC_TYPE_DEFAULT 0
 #define KEXEC_TYPE_CRASH 1
+#define KEXEC_TYPE_MIGRATE 2
 	unsigned int preserve_context : 1;
 	/* If set, we are using file mode kexec syscall */
 	unsigned int file_mode:1;
@@ -401,7 +403,7 @@ bool kexec_load_permitted(int kexec_image_type);
 
 /* List of defined/legal kexec file flags */
 #define KEXEC_FILE_FLAGS (KEXEC_FILE_UNLOAD | KEXEC_FILE_ON_CRASH | \
-			 KEXEC_FILE_NO_INITRAMFS | KEXEC_FILE_DEBUG)
+			 KEXEC_FILE_NO_INITRAMFS | KEXEC_FILE_DEBUG | KEXEC_FILE_MIGRATE)
 
 /* flag to track if kexec reboot is in progress */
 extern bool kexec_in_progress;
diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
index 5ae1741ea8ea0..454dc7c8a7d86 100644
--- a/include/uapi/linux/kexec.h
+++ b/include/uapi/linux/kexec.h
@@ -27,6 +27,8 @@
 #define KEXEC_FILE_ON_CRASH 0x00000002
 #define KEXEC_FILE_NO_INITRAMFS 0x00000004
 #define KEXEC_FILE_DEBUG 0x00000008
+#define KEXEC_FILE_MIGRATE 0X00000010
+
 
 /* These values match the ELF architecture values.
  * Unless there is a good reason that should continue to be the case.
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index c1048893f4b68..87b9a52d60352 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -42,7 +42,8 @@ int kimage_crash_copy_vmcoreinfo(struct kimage *image)
 
 	if (!IS_ENABLED(CONFIG_CRASH_DUMP))
 		return 0;
-	if (image->type != KEXEC_TYPE_CRASH)
+	if (image->type != KEXEC_TYPE_CRASH &&
+	    image->type != KEXEC_TYPE_MIGRATE)
 		return 0;
 
 	/*
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index c0caa14880c3b..ca6283d21235e 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -13,6 +13,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -196,7 +197,8 @@ int sanity_check_segment_list(struct kimage *image)
 	 * kernel could corrupt things.
 	 */
 
-	if (image->type == KEXEC_TYPE_CRASH) {
+	if (image->type == KEXEC_TYPE_CRASH ||
+	    image->type == KEXEC_TYPE_MIGRATE) {
 		for (i = 0; i < nr_segments; i++) {
 			unsigned long mstart, mend;
 
@@ -461,6 +463,7 @@ struct page *kimage_alloc_control_pages(struct kimage *image,
 		break;
 #ifdef CONFIG_CRASH_DUMP
 	case KEXEC_TYPE_CRASH:
+	case KEXEC_TYPE_MIGRATE:
 		pages = kimage_alloc_crash_control_pages(image, order);
 		break;
 #endif
@@ -859,6 +862,7 @@ int kimage_load_segment(struct kimage *image,
 		break;
 #ifdef CONFIG_CRASH_DUMP
 	case KEXEC_TYPE_CRASH:
+	case KEXEC_TYPE_MIGRATE:
 		result = kimage_load_crash_segment(image, segment);
 		break;
 #endif
@@ -1044,9 +1048,13 @@ int kernel_kexec(void)
 		 */
 		cpu_hotplug_enable();
 		pr_notice("Starting new kernel\n");
+		arch_kexec_unprotect_crashkres();
 		machine_shutdown();
 	}
 
+	if (kexec_image->type & KEXEC_TYPE_MIGRATE)
+		save_migrate_state(kexec_image->mig_stream);
+
 	kmsg_dump(KMSG_DUMP_SHUTDOWN);
 	machine_kexec(kexec_image);
 
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 3eedb8c226ad8..4a576db4141cd 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -293,6 +293,11 @@ kimage_file_alloc_init(struct kimage **rimage, int kernel_fd,
 	}
 #endif
 
+	if (flags & KEXEC_FILE_MIGRATE) {
+		image->control_page = crashk_res.start;
+		image->type = KEXEC_TYPE_MIGRATE;
+	}
+
 	ret = kimage_file_prepare_segments(image, kernel_fd, initrd_fd,
 					   cmdline_ptr, cmdline_len, flags);
 	if (ret)
@@ -360,6 +365,10 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
 #endif
 		dest_image = &kexec_image;
 
+	if (image_type == KEXEC_TYPE_MIGRATE)
+		if (*dest_image)
+			arch_kexec_unprotect_crashkres();
+
 	if (flags & KEXEC_FILE_UNLOAD)
 		goto exchange;
 
@@ -428,7 +437,8 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
 	image = xchg(dest_image, image);
 out:
 #ifdef CONFIG_CRASH_DUMP
-	if ((flags & KEXEC_FILE_ON_CRASH) && kexec_crash_image)
+	if (((flags & KEXEC_FILE_ON_CRASH) && kexec_crash_image) ||
+	    ((flags & KEXEC_FILE_MIGRATE) && kexec_image))
 		arch_kexec_protect_crashkres();
 #endif
 
@@ -608,7 +618,8 @@ static int kexec_walk_resources(struct kexec_buf *kbuf,
 				int (*func)(struct resource *, void *))
 {
 #ifdef CONFIG_CRASH_DUMP
-	if (kbuf->image->type == KEXEC_TYPE_CRASH)
+	if (kbuf->image->type == KEXEC_TYPE_CRASH ||
+	    kbuf->image->type == KEXEC_TYPE_MIGRATE)
 		return walk_iomem_res_desc(crashk_res.desc,
 					   IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
 					   crashk_res.start, crashk_res.end,
-- 
2.45.2

From: Andrey Ryabinin
To: linux-kernel@vger.kernel.org
Cc: Alexander Graf, James Gowans, Mike Rapoport, Andrew Morton,
 linux-mm@kvack.org, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
 Dave Hansen, x86@kernel.org, "H. Peter Anvin", Eric Biederman,
 kexec@lists.infradead.org, Steven Rostedt, Masami Hiramatsu,
 Mathieu Desnoyers, linux-trace-kernel@vger.kernel.org,
 valesini@yandex-team.com, Andrey Ryabinin
Subject: [RFC PATCH 3/7] [hack] purgatory: disable purgatory verification.
Date: Wed, 2 Oct 2024 18:07:18 +0200
Message-ID: <20241002160722.20025-4-arbn@yandex-team.com>
In-Reply-To: <20241002160722.20025-1-arbn@yandex-team.com>
References: <20241002160722.20025-1-arbn@yandex-team.com>

Kstate changes data in kexec segments after the checksum has been
calculated, so we don't pass the purgatory verification stage. Disable
it for now. A proper solution will come later, in the next versions of
the patchset.

Signed-off-by: Andrey Ryabinin
---
 arch/x86/purgatory/purgatory.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/purgatory/purgatory.c b/arch/x86/purgatory/purgatory.c
index aea47e7939637..cdec5f21282a7 100644
--- a/arch/x86/purgatory/purgatory.c
+++ b/arch/x86/purgatory/purgatory.c
@@ -45,6 +45,8 @@ void purgatory(void)
 {
 	int ret;
 
+	if (IS_ENABLED(CONFIG_KSTATE))
+		return;
 	ret = verify_sha256_digest();
 	if (ret) {
 		/* loop forever */
-- 
2.45.2

Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=yandex-team.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=yandex-team.com header.i=@yandex-team.com header.b="r7jyfNdc" Received: from mail-nwsmtp-smtp-corp-main-56.klg.yp-c.yandex.net (mail-nwsmtp-smtp-corp-main-56.klg.yp-c.yandex.net [IPv6:2a02:6b8:c42:b1cb:0:640:2a1e:0]) by forwardcorp1d.mail.yandex.net (Yandex) with ESMTPS id CC29E60A0A; Wed, 2 Oct 2024 19:09:00 +0300 (MSK) Received: from dellarbn.yandex.net (unknown [10.214.35.248]) by mail-nwsmtp-smtp-corp-main-56.klg.yp-c.yandex.net (smtpcorp/Yandex) with ESMTPSA id Z8emWD2IhiE0-zn4PxY97; Wed, 02 Oct 2024 19:08:59 +0300 X-Yandex-Fwd: 1 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yandex-team.com; s=default; t=1727885340; bh=Buw6OtY4vaTEiLfE/rpef765HwG5RsoTmBzPC5wm+WI=; h=Message-ID:Date:In-Reply-To:Cc:Subject:References:To:From; b=r7jyfNdcu/K1YYJQ5x6knMx5Nl49ROZZsg0QUwcs8pAkjxACaAuyrBlPZogqCEB6V 9NxYNkSHpbUIL/TAjsx2WCSnmAbHSj6ctGbfnRoo8DUx7sJv137cKVrtUwnBzmFt0M LJnhETmwa6hxGje8tRERsspaQJi5H+2Gj2LROC2E= Authentication-Results: mail-nwsmtp-smtp-corp-main-56.klg.yp-c.yandex.net; dkim=pass header.i=@yandex-team.com From: Andrey Ryabinin To: linux-kernel@vger.kernel.org Cc: Alexander Graf , James Gowans , Mike Rapoport , Andrew Morton , linux-mm@kvack.org, Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. 
Peter Anvin" , Eric Biederman , kexec@lists.infradead.org, Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , linux-trace-kernel@vger.kernel.org, valesini@yandex-team.com, Andrey Ryabinin Subject: [RFC PATCH 4/7] mm/memblock: Add MEMBLOCK_PRSRV flag Date: Wed, 2 Oct 2024 18:07:19 +0200 Message-ID: <20241002160722.20025-5-arbn@yandex-team.com> X-Mailer: git-send-email 2.45.2 In-Reply-To: <20241002160722.20025-1-arbn@yandex-team.com> References: <20241002160722.20025-1-arbn@yandex-team.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Yandex-Filter: 1 Content-Type: text/plain; charset="utf-8" Add MEMBLOCK_PRSRV flag indicating that we don't need to initialize 'struct page' at all. The flag will be used in the following patches to mark memory intended to be kept intact across kexec. The 'struct page' for such a region is assumed to have been initialized by the old kernel, so the new one shouldn't touch it. This is only an initial RFC sketch, in which we assume that the 'struct page' layout doesn't change between the old and new kernels. A proper solution would require some form of migration from the old 'struct page' to the new one if the layout did change. 
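For illustration only, the skip logic that this flag enables can be modelled in userspace. This is a minimal sketch under stated assumptions, not the kernel code: struct region and count_initialized() are hypothetical stand-ins for a reserved memblock region and for the page-initialization loop, and only the range check follows the shape of pfn_preserved() in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a reserved memblock region, in pfn units. */
struct region {
	unsigned long start_pfn;
	unsigned long end_pfn;	/* exclusive */
	bool preserved;		/* models the MEMBLOCK_PRSRV flag */
};

/*
 * If *pfn falls inside a preserved region, advance it to the end of
 * that region and return true; otherwise leave it untouched.
 */
static bool pfn_preserved(const struct region *r, size_t n, unsigned long *pfn)
{
	for (size_t i = 0; i < n; i++) {
		if (r[i].preserved &&
		    *pfn >= r[i].start_pfn && *pfn < r[i].end_pfn) {
			*pfn = r[i].end_pfn;
			return true;
		}
	}
	return false;
}

/*
 * Count the pfns in [start, end) whose page metadata would be
 * (re)initialized, i.e. those outside every preserved region.
 */
static unsigned long count_initialized(const struct region *r, size_t n,
				       unsigned long start, unsigned long end)
{
	unsigned long count = 0;
	unsigned long pfn = start;

	while (pfn < end) {
		if (pfn_preserved(r, n, &pfn))
			continue;	/* the whole preserved range was skipped */
		count++;		/* stands in for initializing one page */
		pfn++;
	}
	return count;
}
```

In this model a non-preserved reserved region is still initialized as usual; only regions carrying the preserved marker are jumped over in one step.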
Signed-off-by: Andrey Ryabinin --- include/linux/memblock.h | 7 +++++++ mm/memblock.c | 9 ++++++++- mm/mm_init.c | 19 +++++++++++++++++++ 3 files changed, 34 insertions(+), 1 deletion(-) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index 673d5cae7c813..b3c6029b03624 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -50,6 +50,7 @@ enum memblock_flags { MEMBLOCK_NOMAP =3D 0x4, /* don't add to kernel direct mapping */ MEMBLOCK_DRIVER_MANAGED =3D 0x8, /* always detected via a driver */ MEMBLOCK_RSRV_NOINIT =3D 0x10, /* don't initialize struct pages */ + MEMBLOCK_PRSRV =3D 0x20, /* struct page presreved during kexec, don't in= itialize */ }; =20 /** @@ -132,6 +133,7 @@ int memblock_mark_mirror(phys_addr_t base, phys_addr_t = size); int memblock_mark_nomap(phys_addr_t base, phys_addr_t size); int memblock_clear_nomap(phys_addr_t base, phys_addr_t size); int memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t size); +int memblock_reserved_mark_preserved(phys_addr_t base, phys_addr_t size); =20 void memblock_free_all(void); void memblock_free(void *ptr, size_t size); @@ -271,6 +273,11 @@ static inline bool memblock_is_reserved_noinit(struct = memblock_region *m) return m->flags & MEMBLOCK_RSRV_NOINIT; } =20 +static inline bool memblock_is_preserved(struct memblock_region *m) +{ + return m->flags & MEMBLOCK_PRSRV; +} + static inline bool memblock_is_driver_managed(struct memblock_region *m) { return m->flags & MEMBLOCK_DRIVER_MANAGED; diff --git a/mm/memblock.c b/mm/memblock.c index 0389ce5cd281e..20ab3272cc166 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1048,6 +1048,12 @@ int __init_memblock memblock_reserved_mark_noinit(ph= ys_addr_t base, phys_addr_t MEMBLOCK_RSRV_NOINIT); } =20 +int __init_memblock memblock_reserved_mark_preserved(phys_addr_t base, phy= s_addr_t size) +{ + return memblock_setclr_flag(&memblock.reserved, base, size, 1, + MEMBLOCK_PRSRV); +} + static bool should_skip_region(struct memblock_type 
*type, struct memblock_region *m, int nid, int flags) @@ -2181,7 +2187,8 @@ static void __init memmap_init_reserved_pages(void) * the MEMBLOCK_RSRV_NOINIT flag set */ for_each_reserved_mem_region(region) { - if (!memblock_is_reserved_noinit(region)) { + if (!memblock_is_reserved_noinit(region) && + !memblock_is_preserved(region)) { nid =3D memblock_get_region_node(region); start =3D region->base; end =3D start + region->size; diff --git a/mm/mm_init.c b/mm/mm_init.c index 4ba5607aaf194..b82c13077928f 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -837,6 +837,22 @@ static void __init init_unavailable_range(unsigned lon= g spfn, node, zone_names[zone], pgcnt); } =20 +static bool pfn_preserved(unsigned long *pfn) +{ + struct memblock_region *r; + + for_each_reserved_mem_region(r) { + if (memblock_is_preserved(r)) { + if (*pfn >=3D memblock_region_memory_base_pfn(r) && + *pfn < memblock_region_memory_end_pfn(r)) { + *pfn =3D memblock_region_memory_end_pfn(r); + return true; + } + } + } + return false; +} + /* * Initially all pages are reserved - free ones are freed * up by memblock_free_all() once the early boot process is @@ -889,6 +905,9 @@ void __meminit memmap_init_range(unsigned long size, in= t nid, unsigned long zone } } =20 + if (pfn_preserved(&pfn)) + continue; + page =3D pfn_to_page(pfn); __init_single_page(page, pfn, zone, nid); if (context =3D=3D MEMINIT_HOTPLUG) { --=20 2.45.2 From nobody Thu Nov 28 08:29:23 2024 Received: from forwardcorp1d.mail.yandex.net (forwardcorp1d.mail.yandex.net [178.154.239.200]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 293551D14E8; Wed, 2 Oct 2024 16:09:04 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=178.154.239.200 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1727885347; cv=none; 
b=l8nornXXjiu8hZouxwP2kQBqJtvySR2au7nYKX+CXalW98w5YrxGhhff/9hIP71h4tcZVrpYgprA9mTF1Na0xR8IJNXFTOIsLKAHnS5YdSIPFpKbCl7mca7ruowPkuy4mgi/wBN1YphQwtBOMlc9pyZmH4C76LbBzS1pu0JfoYM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1727885347; c=relaxed/simple; bh=lWXlRd4W012qkR+2KrtnHm7j963tppJFb5Bms4EKxPY=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=ZBJ/kSe5WeDy5UmKNA30BGeIhNodH1alo22gcUvSStCeP+meeE0bak152ROSSY/xE2X4akd1locsf85qhQhUnUowjD6RdHLVQ8FzPJ0CnVIXcI5hkuyS6nLbOAbC43tDtEx1DkVO8AIhNeKASAT8H/EgncbJgJn6j/iY4jOw8qc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=yandex-team.com; spf=pass smtp.mailfrom=yandex-team.com; dkim=pass (1024-bit key) header.d=yandex-team.com header.i=@yandex-team.com header.b=Q0TLXURt; arc=none smtp.client-ip=178.154.239.200 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=yandex-team.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=yandex-team.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=yandex-team.com header.i=@yandex-team.com header.b="Q0TLXURt" Received: from mail-nwsmtp-smtp-corp-main-56.klg.yp-c.yandex.net (mail-nwsmtp-smtp-corp-main-56.klg.yp-c.yandex.net [IPv6:2a02:6b8:c42:b1cb:0:640:2a1e:0]) by forwardcorp1d.mail.yandex.net (Yandex) with ESMTPS id 62E2A609C4; Wed, 2 Oct 2024 19:09:03 +0300 (MSK) Received: from dellarbn.yandex.net (unknown [10.214.35.248]) by mail-nwsmtp-smtp-corp-main-56.klg.yp-c.yandex.net (smtpcorp/Yandex) with ESMTPSA id Z8emWD2IhiE0-vYLh3gEa; Wed, 02 Oct 2024 19:09:02 +0300 X-Yandex-Fwd: 1 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yandex-team.com; s=default; t=1727885342; bh=vDCJtHtK3uU2sdBk8J/tnuJVXLHuR9McbyPPpAX4VPs=; h=Message-ID:Date:In-Reply-To:Cc:Subject:References:To:From; b=Q0TLXURt2M11h05HQcTCotDXqV17/pTO/eYHxyvF7CpIRwLEfbBfbhK5guCrZhW7i 
FTOE8WK8gOdDKUCtCqsclUk1GbjcM0KngfLxMtHpSJRo9cn3UXM9NGQ39FLI8zLwTl y0g4IX5ZfCyGA9n7LHUYQ9btKkyR6EMQ9Tfr2MTA= Authentication-Results: mail-nwsmtp-smtp-corp-main-56.klg.yp-c.yandex.net; dkim=pass header.i=@yandex-team.com From: Andrey Ryabinin To: linux-kernel@vger.kernel.org Cc: Alexander Graf , James Gowans , Mike Rapoport , Andrew Morton , linux-mm@kvack.org, Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. Peter Anvin" , Eric Biederman , kexec@lists.infradead.org, Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , linux-trace-kernel@vger.kernel.org, valesini@yandex-team.com, Andrey Ryabinin Subject: [RFC PATCH 5/7] kstate: Add mechanism to preserved specified memory pages across kexec. Date: Wed, 2 Oct 2024 18:07:20 +0200 Message-ID: <20241002160722.20025-6-arbn@yandex-team.com> X-Mailer: git-send-email 2.45.2 In-Reply-To: <20241002160722.20025-1-arbn@yandex-team.com> References: <20241002160722.20025-1-arbn@yandex-team.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Yandex-Filter: 1 Content-Type: text/plain; charset="utf-8" This adds functionality to preserve memory pages across kexec. kstate_register_page() stores the struct page in a special list of 'struct page_state's. At the kexec reboot stage this list is iterated and the pfns are saved into kstate's migrate stream. After kexec, the new kernel reads the pfns from the stream and marks the memory as reserved to keep it intact. The memory is also marked with the MEMBLOCK_PRSRV flag, indicating that the 'struct page' itself shouldn't be reinitialized. 
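The save/restore round trip described above can be sketched in userspace as follows. This is only an illustration under stated assumptions: struct page_rec, save_byte(), save_ulong(), get_byte(), get_ulong(), pages_save() and pages_restore() are hypothetical names that imitate the shape of the kstate helpers used in this patch, and the record layout (one order byte followed by one unsigned long address) is taken from kstate_pages_save().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record: one preserved block of (1 << order) pages. */
struct page_rec {
	uint8_t order;
	unsigned long phys;
};

/* Write one byte and return the advanced stream cursor. */
static unsigned char *save_byte(unsigned char *s, uint8_t v)
{
	*s = v;
	return s + 1;
}

/* Write one unsigned long (native layout) and return the advanced cursor. */
static unsigned char *save_ulong(unsigned char *s, unsigned long v)
{
	memcpy(s, &v, sizeof(v));
	return s + sizeof(v);
}

/* Read one byte and advance the caller's cursor. */
static uint8_t get_byte(const unsigned char **s)
{
	uint8_t v = **s;

	*s += 1;
	return v;
}

/* Read one unsigned long and advance the caller's cursor. */
static unsigned long get_ulong(const unsigned char **s)
{
	unsigned long v;

	memcpy(&v, *s, sizeof(v));
	*s += sizeof(v);
	return v;
}

/* "Old kernel" side: serialize n records, return the bytes written. */
static size_t pages_save(unsigned char *stream,
			 const struct page_rec *recs, size_t n)
{
	unsigned char *cur = stream;

	for (size_t i = 0; i < n; i++) {
		cur = save_byte(cur, recs[i].order);
		cur = save_ulong(cur, recs[i].phys);
	}
	return (size_t)(cur - stream);
}

/* "New kernel" side: read n records back, return the bytes consumed. */
static size_t pages_restore(const unsigned char *stream,
			    struct page_rec *out, size_t n)
{
	const unsigned char *cur = stream;

	for (size_t i = 0; i < n; i++) {
		out[i].order = get_byte(&cur);
		out[i].phys = get_ulong(&cur);
	}
	return (size_t)(cur - stream);
}
```

The reader has to know the record count before it walks the stream; in the patch that role is played by the nr_pages field, which is listed before the custom "pages" field in the description.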
Signed-off-by: Andrey Ryabinin --- arch/x86/kernel/kexec-bzimage64.c | 2 +- arch/x86/kernel/setup.c | 81 +++++++++++++++++++++++++++++++ include/linux/kstate.h | 6 +++ kernel/kstate.c | 7 +++ 4 files changed, 95 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzim= age64.c index 71c82841e6b12..d769d08cf9a8a 100644 --- a/arch/x86/kernel/kexec-bzimage64.c +++ b/arch/x86/kernel/kexec-bzimage64.c @@ -406,7 +406,7 @@ static int load_migrate_segments(struct kimage *image) =20 kbuf.memsz =3D 8*1024*1024; =20 - kbuf.buf_align =3D ELF_CORE_HEADER_ALIGN; + kbuf.buf_align =3D PAGE_SIZE; kbuf.mem =3D KEXEC_BUF_MEM_UNKNOWN; ret =3D kexec_add_buffer(&kbuf); if (ret) diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index f1fea506e20f4..cfddc902e266b 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include @@ -638,6 +639,85 @@ static void __init e820_add_kernel_range(void) e820__range_add(start, size, E820_TYPE_RAM); } =20 +#ifdef CONFIG_KSTATE +struct state_entry mem_kstate; + +struct mem_state { + unsigned int nr_pages; + struct list_head list; +}; +struct page_state { + struct list_head list; + int order; + struct page *page; +}; + +struct mem_state m_state =3D { .list =3D LIST_HEAD_INIT(m_state.list) }; + +int kstate_register_page(struct page *page, int order) +{ + struct page_state *state; + + state =3D kmalloc(sizeof(*state), GFP_KERNEL); + if (!state) + return -ENOMEM; + + state->page =3D page; + state->order =3D order; + list_add(&state->list, &m_state.list); + m_state.nr_pages++; + return 0; +} + +static int kstate_pages_save(void *mig_stream, void *obj, + const struct kstate_field *field) +{ + struct page_state *p_state; + void *start =3D mig_stream; + + list_for_each_entry(p_state, &m_state.list, list) { + mig_stream =3D kstate_save_byte(mig_stream, p_state->order); + mig_stream =3D kstate_save_ulong(mig_stream, 
page_to_phys(p_state->page)= ); + } + return mig_stream - start; +} + +static int __init kstate_pages_restore(void *mig_stream, void *obj, + const struct kstate_field *field) +{ + struct mem_state *m_state =3D obj; + int nr_pages, i; + + nr_pages =3D m_state->nr_pages; + for (i =3D 0; i < nr_pages; i++) { + int order =3D kstate_get_byte(&mig_stream); + unsigned long phys =3D kstate_get_ulong(&mig_stream); + + memblock_reserve(phys, PAGE_SIZE << order); + memblock_reserved_mark_preserved(phys, PAGE_SIZE << order); + } + return 0; +} + +struct kstate_description kstate_reserved =3D { + .name =3D "reserved_mem", + .id =3D KSTATE_RSVD_MEM_ID, + .state_list =3D LIST_HEAD_INIT(kstate_reserved.state_list), + .fields =3D (const struct kstate_field[]) { + KSTATE_SIMPLE(nr_pages, struct mem_state), + { + .name =3D "pages", + .flags =3D KS_CUSTOM, + .size =3D sizeof(struct mem_state), + .save =3D kstate_pages_save, + .restore =3D kstate_pages_restore, + }, + + KSTATE_END_OF_LIST() + }, +}; +#endif + static void __init early_reserve_memory(void) { /* @@ -989,6 +1069,7 @@ void __init setup_arch(char **cmdline_p) =20 memblock_set_current_limit(ISA_END_ADDRESS); e820__memblock_setup(); + __kstate_register(&kstate_reserved, &m_state, &mem_kstate); =20 /* * Needs to run after memblock setup because it needs the physical diff --git a/include/linux/kstate.h b/include/linux/kstate.h index c97804d0243ea..855acb339d5d7 100644 --- a/include/linux/kstate.h +++ b/include/linux/kstate.h @@ -29,6 +29,8 @@ struct kstate_field { }; =20 enum kstate_ids { + KSTATE_PAGE_ID, + KSTATE_RSVD_MEM_ID, KSTATE_LAST_ID =3D -1, }; =20 @@ -87,6 +89,10 @@ void *save_kstate(void *stream, int id, const struct kst= ate_description *kstate, void *obj); void *restore_kstate(struct kstate_entry *ke, int id, const struct kstate_description *kstate, void *obj); + +int kstate_page_save(void *mig_stream, void *obj, + const struct kstate_field *field); +int kstate_register_page(struct page *page, int order); #else =20 
#define __kstate_register(state, obj, se) diff --git a/kernel/kstate.c b/kernel/kstate.c index 0ef228baef94e..7f7e135bafd81 100644 --- a/kernel/kstate.c +++ b/kernel/kstate.c @@ -182,6 +182,13 @@ int kstate_register(struct kstate_description *state, = void *obj) return 0; } =20 +int kstate_page_save(void *mig_stream, void *obj, + const struct kstate_field *field) +{ + kstate_register_page(*(struct page **)obj, 0); + return 0; +} + static int __init setup_migrate(char *arg) { char *end; --=20 2.45.2 From nobody Thu Nov 28 08:29:23 2024 Received: from forwardcorp1d.mail.yandex.net (forwardcorp1d.mail.yandex.net [178.154.239.200]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 626281D0E1B; Wed, 2 Oct 2024 16:09:07 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=178.154.239.200 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1727885349; cv=none; b=FUP4f20hXt90ZDRVBmJtjTVtBBbDlZ2TM2/zf9AiQ7KcaYYuhoVxPBa5JJxEbFA/D4wouUDCxvykXXeFUFUZDXLziXnwavSx2jruLiWPA5cmFmjaXAH+SR1VZWrY2yIUYloKDVgCvHVpPB3Jo2ajC5tqRg/QdBqNuQ94OJilV70= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1727885349; c=relaxed/simple; bh=vnnWpNAcaXOZPTAhlNQ8w9M/4YoI/vFJdPQ9pACtY0A=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=NQOk2w5JubUX1lU3kQMOZWq8gOSGA7rfuzEx0ZmeGLZLIffqgugMu61oZjhC/su0glqsjjFjtj6Q7BpQUb9928mzL6kplSEt2fdVIKt/uz31Eicxqt9U9KYNeSfDSNEKakANwauDzwFpfuPCHtg5uXfIYR5F9oJ+bSzFFWXBqPQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=yandex-team.com; spf=pass smtp.mailfrom=yandex-team.com; dkim=pass (1024-bit key) header.d=yandex-team.com header.i=@yandex-team.com header.b=QJ6MOGmC; arc=none smtp.client-ip=178.154.239.200 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) 
header.from=yandex-team.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=yandex-team.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=yandex-team.com header.i=@yandex-team.com header.b="QJ6MOGmC" Received: from mail-nwsmtp-smtp-corp-main-56.klg.yp-c.yandex.net (mail-nwsmtp-smtp-corp-main-56.klg.yp-c.yandex.net [IPv6:2a02:6b8:c42:b1cb:0:640:2a1e:0]) by forwardcorp1d.mail.yandex.net (Yandex) with ESMTPS id E548260A23; Wed, 2 Oct 2024 19:09:05 +0300 (MSK) Received: from dellarbn.yandex.net (unknown [10.214.35.248]) by mail-nwsmtp-smtp-corp-main-56.klg.yp-c.yandex.net (smtpcorp/Yandex) with ESMTPSA id Z8emWD2IhiE0-nRmmw9QO; Wed, 02 Oct 2024 19:09:05 +0300 X-Yandex-Fwd: 1 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yandex-team.com; s=default; t=1727885345; bh=0gXnv/qrKYuE/cS8ZBx3ikqYeKs1eCUWKuL7cMB0Reg=; h=Message-ID:Date:In-Reply-To:Cc:Subject:References:To:From; b=QJ6MOGmC2WXQfZHzJZ1VaR1B8rKzNr7QnMVdzqdp+QQv3P0S3FHMiPewEj2rscP08 wpInBKldjvUn90+GV+UmlmIIYMJDqh6JpsBqKYkROXaqaEMgJ2zuIgRq06YpaV+5qL OvdxqE3T7r9VwN+6lIu4k91/bcBG6keVqTB4vvQk= Authentication-Results: mail-nwsmtp-smtp-corp-main-56.klg.yp-c.yandex.net; dkim=pass header.i=@yandex-team.com From: Andrey Ryabinin To: linux-kernel@vger.kernel.org Cc: Alexander Graf , James Gowans , Mike Rapoport , Andrew Morton , linux-mm@kvack.org, Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. Peter Anvin" , Eric Biederman , kexec@lists.infradead.org, Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , linux-trace-kernel@vger.kernel.org, valesini@yandex-team.com, Andrey Ryabinin Subject: [RFC PATCH 6/7] kstate, test: add test module for testing kstate subsystem. 
Date: Wed, 2 Oct 2024 18:07:21 +0200 Message-ID: <20241002160722.20025-7-arbn@yandex-team.com> X-Mailer: git-send-email 2.45.2 In-Reply-To: <20241002160722.20025-1-arbn@yandex-team.com> References: <20241002160722.20025-1-arbn@yandex-team.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Yandex-Filter: 1 Content-Type: text/plain; charset="utf-8" This is a simple test and playground, useful for kstate subsystem development. It contains a structure with different kinds of data, which is migrated across kexec to the new kernel using kstate. Signed-off-by: Andrey Ryabinin --- include/linux/kstate.h | 1 + lib/Makefile | 2 + lib/test_kstate.c | 89 ++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 92 insertions(+) create mode 100644 lib/test_kstate.c diff --git a/include/linux/kstate.h b/include/linux/kstate.h index 855acb339d5d7..2ddbe41a1f171 100644 --- a/include/linux/kstate.h +++ b/include/linux/kstate.h @@ -31,6 +31,7 @@ struct kstate_field { enum kstate_ids { KSTATE_PAGE_ID, KSTATE_RSVD_MEM_ID, + KSTATE_TEST_ID, KSTATE_LAST_ID =3D -1, }; =20 diff --git a/lib/Makefile b/lib/Makefile index 773adf88af416..2432e47664c35 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -354,6 +354,8 @@ obj-$(CONFIG_PARMAN) +=3D parman.o =20 obj-y +=3D group_cpus.o =20 +obj-$(CONFIG_KSTATE) +=3D test_kstate.o + # GCC library routines obj-$(CONFIG_GENERIC_LIB_ASHLDI3) +=3D ashldi3.o obj-$(CONFIG_GENERIC_LIB_ASHRDI3) +=3D ashrdi3.o diff --git a/lib/test_kstate.c b/lib/test_kstate.c new file mode 100644 index 0000000000000..e95e3110f8949 --- /dev/null +++ b/lib/test_kstate.c @@ -0,0 +1,89 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include + +unsigned long ulong_val; +struct kstate_test_data { + int i; + unsigned long *p_ulong; + char s[10]; + struct page *page; +}; + +struct kstate_description test_state =3D { + .name =3D "test", + 
.version_id =3D 1, + .id =3D KSTATE_TEST_ID, + .state_list =3D LIST_HEAD_INIT(test_state.state_list), + .fields =3D (const struct kstate_field[]) { + KSTATE_SIMPLE(i, struct kstate_test_data), + KSTATE_SIMPLE(s, struct kstate_test_data), + KSTATE_POINTER(p_ulong, struct kstate_test_data), + { + .name =3D "page", + .flags =3D KS_CUSTOM, + .offset =3D offsetof(struct kstate_test_data, page), + .save =3D kstate_page_save, + }, + KSTATE_SIMPLE(page, struct kstate_test_data), + KSTATE_END_OF_LIST() + }, +}; + +static struct kstate_test_data test_data; + +static int init_test_data(void) +{ + struct page *page; + int i; + + test_data.i =3D 10; + ulong_val =3D 20; + memcpy(test_data.s, "abcdefghk", sizeof(test_data.s)); + page =3D alloc_page(GFP_KERNEL); + if (!page) + return -ENOMEM; + + for (i =3D 0; i < PAGE_SIZE/4; i +=3D 4) + *((u32 *)page_address(page) + i) =3D 0xdeadbeef; + test_data.page =3D page; + return 0; +} + +static void validate_test_data(void) +{ + int i; + + WARN_ON(test_data.i !=3D 10); + WARN_ON(*test_data.p_ulong !=3D 20); + WARN_ON(strcmp(test_data.s, "abcdefghk") !=3D 0); + + for (i =3D 0; i < PAGE_SIZE/4; i +=3D 4) { + u32 val =3D *((u32 *)page_address(test_data.page) + i); + + WARN_ON(val !=3D 0xdeadbeef); + } +} + +static int __init test_kstate_init(void) +{ + int ret =3D 0; + + test_data.p_ulong =3D &ulong_val; + + if (!is_migrate_kernel()) { + ret =3D init_test_data(); + if (ret) + goto out; + } + + kstate_register(&test_state, &test_data); + + validate_test_data(); + +out: + return ret; +} +__initcall(test_kstate_init); --=20 2.45.2 From nobody Thu Nov 28 08:29:23 2024 Received: from forwardcorp1d.mail.yandex.net (forwardcorp1d.mail.yandex.net [178.154.239.200]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1878F1D172E; Wed, 2 Oct 2024 16:09:10 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none 
smtp.client-ip=178.154.239.200 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1727885352; cv=none; b=gI12AL+/ufnAumDb9g3q38HUCsQ0PNlgo57ahcWOMUjjKUk4BA5/Pp1gRzT1wobW1La6l177S8B2394xoqt4X/e9hJYu1VlXgJgJFwYxRnGOxv7Qp5hyhn/DAt0KU61SYx1SwZq86iZyQkTKcrmBhLjcc/SkHkuC//mWP0W7FvI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1727885352; c=relaxed/simple; bh=aiEkC1/T4zE2ihaFK8waklG/0upY4UyvJwxXxR+UToQ=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=JXgDlZcUmJAtzu1aPAYNfWYWz/cz7zAO2Z5hjOVfSr4U+KD7KFhD0D2HTvWlmjDkvi5OU7V+moYzNpyM7w+v9aHKUSW7NBOjjvyuQ0GTUCvKFU/V4NMPAYIEldSjCUmUnOUhXHO8eJlP1/0Wx/APNhST8HO49Mo0/ZgVi3efEgo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=yandex-team.com; spf=pass smtp.mailfrom=yandex-team.com; dkim=pass (1024-bit key) header.d=yandex-team.com header.i=@yandex-team.com header.b=F5+i2ZBr; arc=none smtp.client-ip=178.154.239.200 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=yandex-team.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=yandex-team.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=yandex-team.com header.i=@yandex-team.com header.b="F5+i2ZBr" Received: from mail-nwsmtp-smtp-corp-main-56.klg.yp-c.yandex.net (mail-nwsmtp-smtp-corp-main-56.klg.yp-c.yandex.net [IPv6:2a02:6b8:c42:b1cb:0:640:2a1e:0]) by forwardcorp1d.mail.yandex.net (Yandex) with ESMTPS id 9F16460949; Wed, 2 Oct 2024 19:09:08 +0300 (MSK) Received: from dellarbn.yandex.net (unknown [10.214.35.248]) by mail-nwsmtp-smtp-corp-main-56.klg.yp-c.yandex.net (smtpcorp/Yandex) with ESMTPSA id Z8emWD2IhiE0-ueku0Pjg; Wed, 02 Oct 2024 19:09:07 +0300 X-Yandex-Fwd: 1 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yandex-team.com; s=default; t=1727885347; bh=r6KOHiXGwu9wJr+jMWOsPrgnVfkjQk3UoBVxWjKVRQc=; 
h=Message-ID:Date:In-Reply-To:Cc:Subject:References:To:From; b=F5+i2ZBr2Uq7UPOSwZbgC7ZA0tYzgqL47KGhDszAsbQa6F6J/7vbR7Ko3bOmn6riv 8JQEDe8b3xruR/nSABrpSw17jrHYUuNZVIz8HuRdoAPk5CYiB7gV3oWoy60iWQ6ke4 u3C0/g9G/pAFLeQBOSS9YnP/FYBQuMXxxe4R9D2w= Authentication-Results: mail-nwsmtp-smtp-corp-main-56.klg.yp-c.yandex.net; dkim=pass header.i=@yandex-team.com From: Andrey Ryabinin To: linux-kernel@vger.kernel.org Cc: Alexander Graf , James Gowans , Mike Rapoport , Andrew Morton , linux-mm@kvack.org, Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. Peter Anvin" , Eric Biederman , kexec@lists.infradead.org, Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , linux-trace-kernel@vger.kernel.org, valesini@yandex-team.com, Andrey Ryabinin Subject: [RFC PATCH 7/7] trace: migrate trace buffers across kexec Date: Wed, 2 Oct 2024 18:07:22 +0200 Message-ID: <20241002160722.20025-8-arbn@yandex-team.com> X-Mailer: git-send-email 2.45.2 In-Reply-To: <20241002160722.20025-1-arbn@yandex-team.com> References: <20241002160722.20025-1-arbn@yandex-team.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Yandex-Filter: 1 Content-Type: text/plain; charset="utf-8" This is a demonstration of kstate's ability to migrate something more complex across kexec than a simple structure like the one in the lib/test_kstate.c module. Here we migrate tracing_on/current_trace and the contents of the trace buffers to the new kernel. The 'global_trace_state' describes the 'tracing_on' and 'current_trace' states. The 'trace_buffer' kstate field in 'global_trace_state' points to 'kstate_trace_buffer', which describes the state of the ring buffers. The code in kstate_rb_[save/restore]() saves and restores the list of buffer pages. It turned out to be somewhat hacky and ugly, partially because kstate currently can't migrate slab data. 
So because of that we have to save/restore positions of commit_page/reader_page/etc in the list of pages. We could probably teach kstate to migrate slab pages, preserving contents at the same address, which would make easier to migrate lists like the ring buffer list in the trace, as we would need to save/restore only pointer. Signed-off-by: Andrey Ryabinin --- include/linux/kstate.h | 4 + kernel/trace/ring_buffer.c | 189 +++++++++++++++++++++++++++++++++++++ kernel/trace/trace.c | 81 ++++++++++++++++ 3 files changed, 274 insertions(+) diff --git a/include/linux/kstate.h b/include/linux/kstate.h index 2ddbe41a1f171..ae807a75a02f8 100644 --- a/include/linux/kstate.h +++ b/include/linux/kstate.h @@ -32,6 +32,10 @@ enum kstate_ids { KSTATE_PAGE_ID, KSTATE_RSVD_MEM_ID, KSTATE_TEST_ID, + KSTATE_TRACE_ID, + KSTATE_TRACE_BUFFER_ID, + KSTATE_TRACE_RING_BUFFER_ID, + KSTATE_TRACE_BUFFER_PAGE_ID, KSTATE_LAST_ID =3D -1, }; =20 diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 77dc0b25140e6..9a8692d7d960c 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -16,6 +16,7 @@ #include #include #include +#include #include /* for self test */ #include #include @@ -1467,6 +1468,194 @@ static void rb_tail_page_update(struct ring_buffer_= per_cpu *cpu_buffer, } } =20 +#ifdef CONFIG_KSTATE +static int kstate_bpage_save(void *mig_stream, void *obj, const struct kst= ate_field *field) +{ + struct buffer_page *bpage =3D obj; + + kstate_register_page(virt_to_page(bpage->page), bpage->order); + return 0; + +} +struct kstate_description kstate_buffer_page =3D { + .name =3D "buffer_page", + .id =3D KSTATE_TRACE_BUFFER_PAGE_ID, + .fields =3D (const struct kstate_field[]) { + KSTATE_SIMPLE(write, struct buffer_page), + KSTATE_SIMPLE(read, struct buffer_page), + KSTATE_SIMPLE(entries, struct buffer_page), + KSTATE_SIMPLE(real_end, struct buffer_page), + KSTATE_SIMPLE(order, struct buffer_page), + KSTATE_SIMPLE(page, struct buffer_page), + { + .name 
=3D "buffer_page", + .flags =3D KS_CUSTOM, + .save =3D kstate_bpage_save, + .size =3D (sizeof(struct buffer_page)), + }, + KSTATE_END_OF_LIST(), + }, +}; + +static void restore_pages_positions(void **mig_stream, + struct ring_buffer_per_cpu *cpu_buffer) +{ + struct list_head *tmp; + struct list_head *head =3D rb_list_head(cpu_buffer->pages); + unsigned long commit_page_nr, reader_page_nr, + head_page_nr, tail_page_nr; + int i =3D 0; + + commit_page_nr =3D kstate_get_ulong(mig_stream); + reader_page_nr =3D kstate_get_ulong(mig_stream); + head_page_nr =3D kstate_get_ulong(mig_stream); + tail_page_nr =3D kstate_get_ulong(mig_stream); + + for (tmp =3D head;;) { + struct buffer_page *page =3D (struct buffer_page *)tmp; + + if (commit_page_nr =3D=3D i) + cpu_buffer->commit_page =3D page; + if (reader_page_nr =3D=3D i) + cpu_buffer->reader_page =3D page; + if (head_page_nr =3D=3D i) + cpu_buffer->head_page =3D page; + if (tail_page_nr =3D=3D i) + cpu_buffer->tail_page =3D page; + i++; + tmp =3D rb_list_head(tmp->next); + if (tmp =3D=3D head) + break; + } +} + +static int kstate_rb_restore(void *mig_stream, void *obj, + const struct kstate_field *field) +{ + struct ring_buffer_per_cpu *cpu_buffer =3D obj; + LIST_HEAD(pages); + void *stream_start =3D mig_stream; + struct buffer_page *page; + struct list_head *tmp; + struct list_head *head =3D rb_list_head(cpu_buffer->pages); + int i =3D 0; + + while (kstate_get_byte(&mig_stream)) { + int j =3D 0; + bool page_exists =3D false; + + for (tmp =3D rb_list_head(head->next); tmp !=3D head; + tmp =3D rb_list_head(tmp->next)) { + if (j =3D=3D i) { + page_exists =3D true; + page =3D (struct buffer_page *)tmp; + break; + } + j++; + } + if (!page_exists) { + struct buffer_page *bpage; + + bpage =3D kzalloc_node(ALIGN(sizeof(*bpage), + cache_line_size()), GFP_KERNEL, + cpu_to_node(cpu_buffer->cpu)); + list_add(&bpage->list, &pages); + page =3D bpage; + } + mig_stream =3D restore_kstate((struct kstate_entry *)mig_stream, + i++, 
field->ksd, page); + } + + restore_pages_positions(&mig_stream, cpu_buffer); + + return mig_stream - stream_start; +} + +static int kstate_rb_save(void *mig_stream, void *obj, + const struct kstate_field *field) +{ + struct ring_buffer_per_cpu *cpu_buffer =3D obj; + struct list_head *tmp; + struct list_head *head =3D rb_list_head(cpu_buffer->pages); + void *stream_start =3D mig_stream; + unsigned long commit_page_nr, reader_page_nr, + head_page_nr, tail_page_nr; + int i =3D 0; + + + for (tmp =3D head;;) { + struct buffer_page *page =3D (struct buffer_page *)tmp; + + mig_stream =3D kstate_save_byte(mig_stream, 1); + mig_stream =3D save_kstate(mig_stream, i, field->ksd, page); + + if (cpu_buffer->commit_page =3D=3D page) + commit_page_nr =3D i; + if (cpu_buffer->reader_page =3D=3D page) + reader_page_nr =3D i; + if (cpu_buffer->head_page =3D=3D page) + head_page_nr =3D i; + if (cpu_buffer->tail_page =3D=3D page) + tail_page_nr =3D i; + i++; + tmp =3D rb_list_head(tmp->next); + if (tmp =3D=3D head) + break; + } + + mig_stream =3D kstate_save_byte(mig_stream, 0); + + /* save pages positions */ + mig_stream =3D kstate_save_ulong(mig_stream, commit_page_nr); + mig_stream =3D kstate_save_ulong(mig_stream, reader_page_nr); + mig_stream =3D kstate_save_ulong(mig_stream, head_page_nr); + mig_stream =3D kstate_save_ulong(mig_stream, tail_page_nr); + + return mig_stream - stream_start; +} + +struct kstate_description kstate_ring_buffer_per_cpu =3D { + .name =3D "ring_buffer_per_cpu", + .id =3D KSTATE_TRACE_RING_BUFFER_ID, + .state_list =3D LIST_HEAD_INIT(kstate_ring_buffer_per_cpu.state_list), + .fields =3D (const struct kstate_field[]) { + KSTATE_SIMPLE(entries, struct ring_buffer_per_cpu), + KSTATE_SIMPLE(entries_bytes, struct ring_buffer_per_cpu), + { + .name =3D "buffer_pages", + .flags =3D KS_CUSTOM, + .size =3D (sizeof(struct ring_buffer_per_cpu)), + .ksd =3D &kstate_buffer_page, + .save =3D kstate_rb_save, + .restore =3D kstate_rb_restore, + }, + KSTATE_END_OF_LIST(), + 
}, +}; + +static int nr_ring_buffers(void) +{ + return nr_cpu_ids; +} + +struct kstate_description kstate_trace_buffer =3D { + .name =3D "trace_buffer", + .id =3D KSTATE_TRACE_BUFFER_ID, + .state_list =3D LIST_HEAD_INIT(kstate_trace_buffer.state_list), + .fields =3D (const struct kstate_field[]) { + { + .name =3D "ring_buffers", + .flags =3D KS_STRUCT|KS_POINTER|KS_ARRAY_OF_POINTER, + .size =3D (sizeof(struct ring_buffer_per_cpu *)), + .offset =3D offsetof(struct trace_buffer, buffers), + .count =3D nr_ring_buffers, + .ksd =3D &kstate_ring_buffer_per_cpu, + }, + KSTATE_END_OF_LIST(), + } +}; +#endif + static void rb_check_bpage(struct ring_buffer_per_cpu *cpu_buffer, struct buffer_page *bpage) { diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index c01375adc4714..bb07d716beab4 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include #include @@ -10621,6 +10622,84 @@ __init static void enable_instances(void) } } =20 +#ifdef CONFIG_KSTATE +static int cur_trace_save(void *mig_stream, void *obj, + const struct kstate_field *field) +{ + struct trace_array *tr =3D obj; + + return strscpy(mig_stream, tr->current_trace->name, 100) + 1; +} + +static int cur_trace_restore(void *mig_stream, void *obj, + const struct kstate_field *field) +{ + struct trace_array *tr =3D obj; + + tracing_set_tracer(tr, mig_stream); + return strlen(mig_stream) + 1; +} + +static int tracing_on_save(void *mig_stream, void *obj, + const struct kstate_field *field) +{ + struct trace_array *tr =3D obj; + + *(u8 *)mig_stream =3D (u8)tracer_tracing_is_on(tr); + return sizeof(u8); + +} + +static int tracing_on_restore(void *mig_stream, void *obj, + const struct kstate_field *field) +{ + struct trace_array *tr =3D obj; + u8 on =3D *(u8 *)mig_stream; + + if (on) + tracer_tracing_on(tr); + else + tracer_tracing_off(tr); + + return sizeof(on); +} + +extern struct kstate_description kstate_trace_buffer; + +struct 
kstate_description global_trace_state =3D { + .name =3D "trace_state", + .id =3D KSTATE_TRACE_ID, + .version_id =3D 1, + .state_list =3D LIST_HEAD_INIT(global_trace_state.state_list), + .fields =3D (const struct kstate_field[]) { + { + .name =3D "tracing_on", + .flags =3D KS_CUSTOM, + .version_id =3D 0, + .size =3D sizeof(struct trace_array), + .save =3D tracing_on_save, + .restore =3D tracing_on_restore, + }, + { + .name =3D "current_trace", + .flags =3D KS_CUSTOM, + .version_id =3D 0, + .size =3D sizeof(struct trace_array), + .save =3D cur_trace_save, + .restore =3D cur_trace_restore, + + }, + { + .name =3D "trace_buffer", + .flags =3D KS_STRUCT|KS_POINTER, + .offset =3D offsetof(struct trace_array, array_buffer.buffer), + .ksd =3D &kstate_trace_buffer, + }, + KSTATE_END_OF_LIST() + }, +}; +#endif + __init static int tracer_alloc_buffers(void) { int ring_buf_size; @@ -10848,6 +10927,8 @@ __init static int late_trace_init(void) =20 tracing_set_default_clock(); clear_boot_tracer(); + kstate_register(&global_trace_state, &global_trace); + return 0; } =20 --=20 2.45.2