From nobody Wed Dec 17 17:39:08 2025
From: Evangelos Petrongonas
To: Bjorn Helgaas , Alex Williamson , "Rafael J .
Wysocki" , "Len Brown"
CC: Evangelos Petrongonas , Pasha Tatashin , David Matlack , "Vipin Sharma" , Chris Li , Jason Miu , Pratyush Yadav , "Stanislav Spassov" , , , ,
Subject: [RFC PATCH 01/13] pci: pcsc: Add plumbing for the PCI Configuration Space Cache (PCSC)
Date: Fri, 3 Oct 2025 09:00:37 +0000
Message-ID:
X-Mailer: git-send-email 2.47.3
In-Reply-To:
References:
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
X-ClientProxiedBy: EX19D045UWA001.ant.amazon.com (10.13.139.83) To EX19D001UWA001.ant.amazon.com (10.13.138.214)
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Introduce the basic infrastructure for the PCI Configuration Space Cache
(PCSC), a mechanism that caches PCI configuration space accesses to reduce
latency and bus traffic.

PCSC implements a transparent interception layer for PCI config space
operations by dynamically injecting its own ops into the PCI bus hierarchy.
The design preserves the existing PCI ops while allowing PCSC to intercept
and cache accesses: `struct pci_bus` is extended to hold the original
`pci_ops`, while the cache ops are injected via `pcsc_inject_bus_ops()`.

The cache ops are injected when new buses are added, by registering a bus
notifier and by hooking into:
* `pci_register_host_bridge()` - for root buses
* `pci_alloc_child_bus()` - for child buses
* `pci_bus_set_ops()` - when ops are dynamically changed

The implementation includes weak `pcsc_hw_config_read()` and
`pcsc_hw_config_write()` functions that call the original op when access to
the actual hardware is required.

This approach ensures complete transparency - existing drivers and
subsystems continue to use the standard PCI config access functions while
PCSC intercepts and caches accesses as needed. The weak functions also allow
architecture-specific implementations to override the default behavior.
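The injection scheme described above can be illustrated with a small user-space sketch. It is not kernel code: `sketch_bus`, `sketch_ops`, `sketch_inject()` and the other `sketch_*` names are hypothetical stand-ins for `struct pci_bus`, `struct pci_ops` and `pcsc_inject_bus_ops()`, and only the read path is modelled. The sketch assumes that the interposed read simply forwards to the saved ops, which is what this first patch does.

```c
/* User-space sketch of pci_ops interposition; all names are illustrative. */
#include <assert.h>	/* only needed for the self-checks in the usage note */
#include <stddef.h>
#include <stdint.h>

struct sketch_bus;

/* Simplified stand-in for struct pci_ops (read path only). */
struct sketch_ops {
	int (*read)(struct sketch_bus *bus, unsigned int devfn, int where,
		    int size, uint32_t *val);
};

/* Simplified stand-in for struct pci_bus with the added orig_ops field. */
struct sketch_bus {
	const struct sketch_ops *ops;      /* table that callers invoke */
	const struct sketch_ops *orig_ops; /* saved hardware ops, NULL if not injected */
	unsigned int hw_reads;             /* accesses that reached the "hardware" */
};

/* Hypothetical host-controller accessor: returns a fixed pattern. */
static int sketch_hw_read(struct sketch_bus *bus, unsigned int devfn, int where,
			  int size, uint32_t *val)
{
	(void)devfn;
	(void)where;
	(void)size;
	bus->hw_reads++;
	*val = 0x12345678u;
	return 0;
}

/* Interposed read: a pure pass-through to the saved ops. */
static int sketch_cached_read(struct sketch_bus *bus, unsigned int devfn,
			      int where, int size, uint32_t *val)
{
	if (bus->orig_ops && bus->orig_ops->read)
		return bus->orig_ops->read(bus, devfn, where, size, val);

	*val = 0xffffffffu;	/* all-ones, as for a failed config read */
	return -1;
}

static const struct sketch_ops sketch_hw_ops = { .read = sketch_hw_read };
static const struct sketch_ops sketch_pcsc_ops = { .read = sketch_cached_read };

/* Save the current ops and swap in the interposing table; idempotent. */
static int sketch_inject(struct sketch_bus *bus)
{
	if (!bus || !bus->ops)
		return -1;
	if (bus->orig_ops)	/* already injected, nothing to do */
		return 0;

	bus->orig_ops = bus->ops;
	bus->ops = &sketch_pcsc_ops;
	return 0;
}
```

A caller would initialise a `struct sketch_bus` with `.ops = &sketch_hw_ops`, call `sketch_inject()`, and keep using `bus.ops->read()` as before. The call then goes through `sketch_cached_read()` and still reaches `sketch_hw_read()`, which is the transparency property the commit message describes.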
The `core` initcall level is chosen so that the cache is initialised before
the PCI driver, ensuring that all config space accesses go through the cache.

Kconfig options are added for both PCSC and PCIe PCSC support, with the
latter extending the cache to handle the 4KB PCIe configuration space.

In this initial patch, the cache simply passes all accesses through to the
hardware via the original ops; actual caching functionality will be added in
subsequent patches.

There is one caveat in this patch. Accesses made through an address returned
by map_bus can alter the configuration space without invalidating or
updating the cache. This is not an issue for the current upstream usages, as
map_bus is only used for root complexes, which PCSC does not cache, and by
`pci_generic_config_{read,write}{,32}`, which are already handled.

Signed-off-by: Evangelos Petrongonas
Signed-off-by: Stanislav Spassov
---
 drivers/pci/Kconfig      |  10 ++
 drivers/pci/Makefile     |   1 +
 drivers/pci/access.c     |  81 ++++++++++++++-
 drivers/pci/pcie/Kconfig |   9 ++
 drivers/pci/pcsc.c       | 208 +++++++++++++++++++++++++++++++++++++++
 drivers/pci/probe.c      |  24 ++++-
 include/linux/pci.h      |   3 +
 include/linux/pcsc.h     |  86 ++++++++++++++++
 8 files changed, 419 insertions(+), 3 deletions(-)
 create mode 100644 drivers/pci/pcsc.c
 create mode 100644 include/linux/pcsc.h

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 9a249c65aedc..c26162b58365 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -40,6 +40,16 @@ config PCI_DOMAINS_GENERIC
 config PCI_SYSCALL
 	bool
 
+config PCSC
+	bool "PCI Configuration Space Cache"
+	depends on PCI
+	default n
+	help
+	  This option enables support for the PCI Configuration Space Cache
+	  (PCSC).
PCSC is a transparent caching layer that + intercepts configuration space operations and maintains cached + copies of register values + source "drivers/pci/pcie/Kconfig" =20 config PCI_MSI diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile index 67647f1880fb..012561b97e32 100644 --- a/drivers/pci/Makefile +++ b/drivers/pci/Makefile @@ -37,6 +37,7 @@ obj-$(CONFIG_PCI_DOE) +=3D doe.o obj-$(CONFIG_PCI_DYNAMIC_OF_NODES) +=3D of_property.o obj-$(CONFIG_PCI_NPEM) +=3D npem.o obj-$(CONFIG_PCIE_TPH) +=3D tph.o +obj-$(CONFIG_PCSC) +=3D pcsc.o =20 # Endpoint library must be initialized before its users obj-$(CONFIG_PCI_ENDPOINT) +=3D endpoint/ diff --git a/drivers/pci/access.c b/drivers/pci/access.c index b123da16b63b..b89e9210d330 100644 --- a/drivers/pci/access.c +++ b/drivers/pci/access.c @@ -1,5 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 #include +#include #include #include #include @@ -189,15 +190,93 @@ EXPORT_SYMBOL_GPL(pci_generic_config_write32); * @ops: new raw operations * * Return previous raw operations + * + * When PCSC is enabled, this function maintains transparency by: + * - Returning the original non-PCSC ops to the caller + * - Properly handling the case where PCSC ops are already injected + * - Re-injecting PCSC ops after setting new ops when appropriate */ struct pci_ops *pci_bus_set_ops(struct pci_bus *bus, struct pci_ops *ops) { struct pci_ops *old_ops; unsigned long flags; +#ifdef CONFIG_PCSC + bool pcsc_was_injected =3D false; + struct pci_ops *pcsc_ops_ptr =3D NULL; +#endif =20 raw_spin_lock_irqsave(&pci_lock, flags); - old_ops =3D bus->ops; + +#ifdef CONFIG_PCSC + /* + * Check if PCSC ops are currently injected. If so, we need to: + * 1. Return the original (non-PCSC) ops to maintain transparency + * 2. Update orig_ops to point to the new ops + * 3. 
Re-inject PCSC ops if the new ops are different from PCSC ops + */ + if (bus->orig_ops) { + pcsc_was_injected =3D true; + pcsc_ops_ptr =3D bus->ops; /* Save current PCSC ops */ + old_ops =3D bus->orig_ops; /* Return the real original ops */ + + /* + * If the caller is trying to restore the PCSC ops themselves, + * just keep the current setup and return the original ops + */ + if (ops =3D=3D pcsc_ops_ptr) + goto out_unlock; + + /* Clear orig_ops temporarily to allow re-injection */ + bus->orig_ops =3D NULL; + } else +#endif + { + old_ops =3D bus->ops; + } + bus->ops =3D ops; + +#ifdef CONFIG_PCSC + /* + * Re-inject PCSC ops if they were previously injected and the new ops + * are not the PCSC ops themselves. This maintains caching transparency. + */ + if (pcsc_was_injected && ops !=3D pcsc_ops_ptr) { + /* + * IMPORTANT: Dynamic ops changes after PCSC injection can lead to + * cache consistency issues if operations were performed that should + * have invalidated the cache. We re-inject PCSC ops here, but the + * caller is responsible for ensuring cache consistency if needed. + * This will be fixed in a future commit, when PCSC resets are + * introduced. + */ + + pr_warn("PCSC: Dynamic ops change detected on bus %04x:%02x, resetting c= ache\n", + pci_domain_nr(bus), bus->number); + + if (pcsc_inject_bus_ops(bus)) { + pr_err("PCSC: Failed to re-inject ops after ops change on bus %04x:%02x= \n", + pci_domain_nr(bus), bus->number); + /* + * If re-injection fails, we've lost caching but at least + * the caller's requested ops are in place. 
Log it + */ + pr_warn("PCSC: Cache disabled for bus %04x:%02x after ops change\n", + pci_domain_nr(bus), bus->number); + } else { + pr_debug("PCSC: Successfully re-injected ops after ops change on bus %0= 4x:%02x\n", + pci_domain_nr(bus), bus->number); + } + } else if (!pcsc_was_injected) { + /* First-time injection for this bus */ + if (pcsc_inject_bus_ops(bus)) { + pr_err("PCSC: Failed to inject ops on bus %04x:%02x\n", + pci_domain_nr(bus), bus->number); + } + } + +out_unlock: +#endif raw_spin_unlock_irqrestore(&pci_lock, flags); return old_ops; } diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig index 17919b99fa66..2f1efc41afcc 100644 --- a/drivers/pci/pcie/Kconfig +++ b/drivers/pci/pcie/Kconfig @@ -155,3 +155,12 @@ config PCIE_EDR the PCI Firmware Specification r3.2. Enable this if you want to support hybrid DPC model which uses both firmware and OS to implement DPC. + +config PCIE_PCSC + bool "PCI Configuration Space Cache PCIE Support" + depends on PCSC + default y + help + This option adds PCIe support to the PCSC, by expanding the + configuration space to 4K and adding support for PCIe Capabilities. + For more information check PCSC and `/drivers/pci/pcsc.c` diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c new file mode 100644 index 000000000000..dec7c51b5cfd --- /dev/null +++ b/drivers/pci/pcsc.c @@ -0,0 +1,208 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright 2025 Amazon.com, Inc. or its affiliates. All Rights Reserved. + * Author: Evangelos Petrongonas + * + * Implementation of the PCI Configuration Space Cache (PCSC) + * PCSC is a module which caches the PCI Configuration Space Accesses + * It implements a write-invalidate policy, meaning that writes are + * propagated to the device and invalidating the cache. The registers that + * we are caching are based on the values that are safe to cache and we + * are not expecting them to change without OS actions. 
+ * + */ + + #define pr_fmt(fmt) "PCSC: " fmt + +#include + +static bool pcsc_initialised; + +static int pcsc_add_bus(struct pci_bus *bus) +{ + if (!bus->orig_ops || !bus->orig_ops->add_bus) + return 0; + return bus->orig_ops->add_bus(bus); +} + +static void pcsc_remove_bus(struct pci_bus *bus) +{ + if (bus->orig_ops && bus->orig_ops->remove_bus) + bus->orig_ops->remove_bus(bus); +} + +/** + * pcsc_map_bus - Map PCI configuration space for memory-mapped access + * @bus: PCI bus structure + * @devfn: Device and function number + * @where: Offset in configuration space + * + * WARNING: Cache Bypass Issue + * This function returns a memory-mapped I/O address that provides direct + * access to PCI configuration space, completely bypassing the PCSC cache. + * + * Any reads or writes performed through the returned MMIO address will NO= T: + * - Use cached values for reads + * - Update cached values on reads + * - Invalidate cached values on writes + * + * This can lead to cache inconsistency where: + * 1. PCSC cache contains stale data after MMIO writes + * 2. Subsequent cached reads return outdated values + * 3. Cache coherency is lost until the next cache invalidation + * + * Current users include: + * - (pci_generic_config_{read,write}{,32}) which are already handled + * - operations on RCs that are not supported by PCSC. + * Therefore, there is no risk of cache inconsistency here. + * However, any future use of map_bus after cache population poses risks. + * + * IMPORTANT: Callers using the returned MMIO address are responsible for + * maintaining cache consistency. Consider invalidating relevant cache ent= ries + * after MMIO operations if the device's cache may be active. 
+ * + * Return: Virtual address for memory-mapped config space access, or NULL + */ +static void __iomem *pcsc_map_bus(struct pci_bus *bus, unsigned int devfn, + int where) +{ + if (!bus->orig_ops || !bus->orig_ops->map_bus) + return NULL; + return bus->orig_ops->map_bus(bus, devfn, where); +} + +/* Weak references to allow architecture-specific overrides */ +int __weak pcsc_hw_config_read(struct pci_bus *bus, unsigned int devfn, + int where, int size, u32 *val) +{ + /* + * This function is only called from pcsc_cached_config_read, + * which means PCSC ops have already been injected and orig_ops + * should be valid. + */ + if (bus->orig_ops && bus->orig_ops->read) + return bus->orig_ops->read(bus, devfn, where, size, val); + + *val =3D 0xffffffff; + return PCIBIOS_FUNC_NOT_SUPPORTED; +} +EXPORT_SYMBOL_GPL(pcsc_hw_config_read); + +int __weak pcsc_hw_config_write(struct pci_bus *bus, unsigned int devfn, + int where, int size, u32 val) +{ + /* + * This function is only called from pcsc_cached_config_write, + * which means PCSC ops have already been injected and orig_ops + * should be valid. 
+ */ + if (bus->orig_ops && bus->orig_ops->write) + return bus->orig_ops->write(bus, devfn, where, size, val); + + return PCIBIOS_FUNC_NOT_SUPPORTED; +} +EXPORT_SYMBOL_GPL(pcsc_hw_config_write); + +int pcsc_cached_config_read(struct pci_bus *bus, unsigned int devfn, int w= here, + int size, u32 *val) +{ + if (!pcsc_initialised) + goto read_from_dev; + +read_from_dev: + return pcsc_hw_config_read(bus, devfn, where, size, val); +} +EXPORT_SYMBOL_GPL(pcsc_cached_config_read); + +int pcsc_cached_config_write(struct pci_bus *bus, unsigned int devfn, int = where, + int size, u32 val) +{ + if (!pcsc_initialised) + goto write_to_dev; + +write_to_dev: + return pcsc_hw_config_write(bus, devfn, where, size, val); +} +EXPORT_SYMBOL_GPL(pcsc_cached_config_write); + +static struct pci_ops pcsc_ops =3D { + .add_bus =3D pcsc_add_bus, + .remove_bus =3D pcsc_remove_bus, + .map_bus =3D pcsc_map_bus, + .read =3D pcsc_cached_config_read, + .write =3D pcsc_cached_config_write, +}; + +int pcsc_inject_bus_ops(struct pci_bus *bus) +{ + if (!bus) + return -EINVAL; + + if (!bus->ops) { + WARN_ONCE( + 1, + "PCSC: Cannot inject ops - bus %04x:%02x ops not defined\n", + pci_domain_nr(bus), bus->number); + return -EINVAL; + } + + if (bus->ops->read =3D=3D pcsc_cached_config_read || bus->orig_ops) + return 0; + + bus->orig_ops =3D bus->ops; + bus->ops =3D &pcsc_ops; + + pci_dbg(bus, "PCSC: Injected ops for bus"); + return 0; +} +EXPORT_SYMBOL_GPL(pcsc_inject_bus_ops); + +static void pcsc_remove_bus_ops(struct pci_bus *bus) +{ + if (bus->orig_ops && bus->ops =3D=3D &pcsc_ops) { + bus->ops =3D bus->orig_ops; + bus->orig_ops =3D NULL; + } +} + +static int pcsc_bus_notify(struct notifier_block *nb, unsigned long action, + void *data) +{ + struct device *dev =3D data; + struct pci_bus *bus; + + bus =3D to_pci_bus(dev); + if (!bus) + return NOTIFY_OK; + + switch (action) { + case BUS_NOTIFY_ADD_DEVICE: + pcsc_inject_bus_ops(bus); + break; + case BUS_NOTIFY_DEL_DEVICE: + /* + * Remove on DEL_DEVICE to 
unhook before device_del() completes. + * This ensures caching is disabled before the final cleanup. + */ + pcsc_remove_bus_ops(bus); + break; + } + + return NOTIFY_OK; +} + +static struct notifier_block pcsc_bus_nb =3D { + .notifier_call =3D pcsc_bus_notify, +}; + +static int __init pcsc_init(void) +{ + bus_register_notifier(&pci_bus_type, &pcsc_bus_nb); + + pcsc_initialised =3D true; + pr_info("initialised\n"); + + return 0; +} + +core_initcall(pcsc_init); diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 37f5bd476f39..33a186e4bf1e 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include #include @@ -1039,6 +1040,11 @@ static int pci_register_host_bridge(struct pci_host_= bridge *bridge) } #endif =20 +#ifdef CONFIG_PCSC + if (pcsc_inject_bus_ops(bus)) + pci_err(bus, "PCSC: Failed to inject ops\n"); +#endif + b =3D pci_find_bus(pci_domain_nr(bus), bridge->busnr); if (b) { /* Ignore it if we already got here via a different bridge */ @@ -1236,10 +1242,24 @@ static struct pci_bus *pci_alloc_child_bus(struct p= ci_bus *parent, child->bus_flags =3D parent->bus_flags; =20 host =3D pci_find_host_bridge(parent); - if (host->child_ops) + if (host->child_ops) { child->ops =3D host->child_ops; - else +#ifdef CONFIG_PCSC + child->orig_ops =3D host->child_ops; +#endif + } else { child->ops =3D parent->ops; +#ifdef CONFIG_PCSC + child->orig_ops =3D parent->orig_ops; +#endif + } + +#ifdef CONFIG_PCSC + if (child->ops) { + if (pcsc_inject_bus_ops(child)) + pci_err(child, "PCSC: Failed to inject ops\n"); + } +#endif =20 /* * Initialize some portions of the bus device, but don't register diff --git a/include/linux/pci.h b/include/linux/pci.h index d1fdf81fbe1e..b6cbf93db644 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -669,6 +669,9 @@ struct pci_bus { struct resource busn_res; /* Bus numbers routed to this bus */ =20 struct pci_ops *ops; /* Configuration access functions */ +#ifdef 
CONFIG_PCSC + struct pci_ops *orig_ops; /* Original ops before PCSC injection */ +#endif void *sysdata; /* Hook for sys-specific extension */ struct proc_dir_entry *procdir; /* Directory entry in /proc/bus/pci */ =20 diff --git a/include/linux/pcsc.h b/include/linux/pcsc.h new file mode 100644 index 000000000000..45816eb2b2c8 --- /dev/null +++ b/include/linux/pcsc.h @@ -0,0 +1,86 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright 2025 Amazon.com, Inc. or its affiliates. All Rights Reserved. + * Author: Evangelos Petrongonas + * + */ + +#ifndef _LINUX_PCSC_H +#define _LINUX_PCSC_H + +#include + +/** + * pcsc_hw_config_read - Direct hardware PCI config space read + * @bus: PCI bus + * @devfn: PCI device function + * @where: offset in PCI config space + * @size: size of data to read + * @val: pointer to store read data + * + * This function performs a direct hardware read from PCI configuration sp= ace, + * bypassing the PCSC cache. It is a weak function that can be overridden = by + * architecture-specific implementations. + * + * Return: 0 on success, non-zero error code on failure + */ +int pcsc_hw_config_read(struct pci_bus *bus, unsigned int devfn, int where, + int size, u32 *val); + +/** + * pcsc_hw_config_write - Direct hardware PCI config space write + * @bus: PCI bus + * @devfn: PCI device function + * @where: offset in PCI config space + * @size: size of data to write + * @val: value to write + * + * This function performs a direct hardware write to PCI configuration spa= ce, + * bypassing the PCSC cache. It is a weak function that can be overridden = by + * architecture-specific implementations. 
+ * + * Return: 0 on success, non-zero error code on failure + */ +int pcsc_hw_config_write(struct pci_bus *bus, unsigned int devfn, int wher= e, + int size, u32 val); + +/** + * pcsc_cached_config_read - Read PCI config space register via PCSC + * @bus: PCI bus + * @devfn: PCI device function + * @where: offset in PCI config space + * @size: size of data to read + * @val: pointer to store read data + * + * Reads a register from the PCI configuration space of a device using the + * PCSC infrastructure. + * + * Return: 0 on success, non-zero error code on failure + */ +int pcsc_cached_config_read(struct pci_bus *bus, unsigned int devfn, int w= here, + int size, u32 *val); + +/** + * pcsc_cached_config_write - Write PCI config space register via PCSC + * @bus: PCI bus + * @devfn: PCI device function + * @where: offset in PCI config space + * @size: size of data to write + * @val: value to write + * + * Writes a value to a register in the PCI configuration space of a device= using + * the PCSC infrastructure. + * + * Return: 0 on success, non-zero error code on failure + */ +int pcsc_cached_config_write(struct pci_bus *bus, unsigned int devfn, int = where, + int size, u32 val); + +/** + * pcsc_inject_bus_ops Inject the pcsc ops into bus pci_ops + * @bus: the bus in which to inject the ops + * + * Return: 0 on success, negative error code on failure + */ +int pcsc_inject_bus_ops(struct pci_bus *bus); +#endif /* _LINUX_PCSC_H */ --=20 2.47.3 Amazon Web Services Development Center Germany GmbH Tamara-Danz-Str. 
13 10243 Berlin Geschaeftsfuehrung: Christian Schlaeger Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B Sitz: Berlin Ust-ID: DE 365 538 597
From nobody Wed Dec 17 17:39:08 2025
From: Evangelos Petrongonas
To: Bjorn Helgaas , Alex Williamson , "Rafael J .
Wysocki" , "Len Brown"
CC: Evangelos Petrongonas , Pasha Tatashin , David Matlack , "Vipin Sharma" , Chris Li , Jason Miu , Pratyush Yadav , "Stanislav Spassov" , , , ,
Subject: [RFC PATCH 02/13] pci: pcsc: implement basic functionality
Date: Fri, 3 Oct 2025 09:00:38 +0000
Message-ID:
X-Mailer: git-send-email 2.47.3
In-Reply-To:
References:
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
X-ClientProxiedBy: EX19D040UWB001.ant.amazon.com (10.13.138.82) To EX19D001UWA001.ant.amazon.com (10.13.138.214)
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Implement the core functionality of the PCI Configuration Space Cache using
per-device cache nodes attached to struct pci_dev. Each cache node stores:
- A 256-byte array (4KB for PCIe) representing the configuration space
- A cacheable bitmask indicating which registers can be cached
- A cached bitmask tracking which bytes are currently cached

The implementation attaches cache nodes directly to pci_dev structures
during `pci_device_add()` and removes them during `pci_device_remove()`.

The cache implements a write-invalidate policy where writes are propagated
to the device while invalidating the cache. This design choice improves
robustness and increases the number of cacheable registers, particularly for
operations like BAR sizing which use write-read sequences to detect
read-only bits.

Currently, the cacheable bitmask is zero-initialized, effectively disabling
the cache. This will be changed in the next commits.

This implementation only supports endpoint devices; bridges and root
complexes are not cached.
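The cache node layout and the write-invalidate policy described above can be sketched in user space as follows. This is only an illustration under stated assumptions: `sketch_node`, `sketch_read()`, `sketch_write()` and the bit helpers are hypothetical names, the `hw` array stands in for the real device, and callers are assumed to pass an in-range `where`/`size` (the sketch does no bounds checking).

```c
/* User-space sketch of a byte-granular, write-invalidate config cache. */
#include <assert.h>	/* only needed for the self-checks in the usage note */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_CFG_SIZE 256	/* conventional PCI config space, in bytes */

struct sketch_node {
	uint8_t cfg_space[SKETCH_CFG_SIZE];     /* cached register bytes */
	uint8_t cacheable[SKETCH_CFG_SIZE / 8]; /* 1 bit per byte: may be cached */
	uint8_t cached[SKETCH_CFG_SIZE / 8];    /* 1 bit per byte: currently valid */
	uint8_t hw[SKETCH_CFG_SIZE];            /* stand-in for the real device */
	unsigned int hw_reads;                  /* reads that reached the device */
};

static bool bit_get(const uint8_t *map, int i)
{
	return (map[i / 8] >> (i % 8)) & 1u;
}

static void bit_set(uint8_t *map, int i)
{
	map[i / 8] |= (uint8_t)(1u << (i % 8));
}

static void bit_clear(uint8_t *map, int i)
{
	map[i / 8] &= (uint8_t)~(1u << (i % 8));
}

/* True only if every byte of [where, where + size) has its bit set. */
static bool all_set(const uint8_t *map, int where, int size)
{
	for (int i = 0; i < size; i++)
		if (!bit_get(map, where + i))
			return false;
	return true;
}

/* Read from the fake device, assembling bytes in little-endian order. */
static uint32_t hw_read(struct sketch_node *n, int where, int size)
{
	uint32_t v = 0;

	n->hw_reads++;
	for (int i = 0; i < size; i++)
		v |= (uint32_t)n->hw[where + i] << (i * 8);
	return v;
}

/* Read path: serve from the cache if valid, otherwise fetch and fill. */
static uint32_t sketch_read(struct sketch_node *n, int where, int size)
{
	uint32_t v = 0;

	if (!all_set(n->cacheable, where, size))
		return hw_read(n, where, size);	/* never cached */

	if (all_set(n->cached, where, size)) {
		for (int i = 0; i < size; i++)
			v |= (uint32_t)n->cfg_space[where + i] << (i * 8);
		return v;
	}

	v = hw_read(n, where, size);
	for (int i = 0; i < size; i++) {
		n->cfg_space[where + i] = (uint8_t)(v >> (i * 8));
		bit_set(n->cached, where + i);
	}
	return v;
}

/* Write path: invalidate the touched bytes, then write to the device. */
static void sketch_write(struct sketch_node *n, int where, int size, uint32_t v)
{
	for (int i = 0; i < size; i++) {
		bit_clear(n->cached, where + i);
		n->hw[where + i] = (uint8_t)(v >> (i * 8));
	}
}
```

With a zeroed node whose bytes 0 and 1 are marked cacheable, two consecutive `sketch_read(&n, 0, 2)` calls reach the device only once. A `sketch_write()` to the same bytes clears their `cached` bits, so the next read goes back to the device. That is why a write-then-read sequence such as BAR sizing observes what the device actually latched rather than the value that was written.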
Signed-off-by: Evangelos Petrongonas --- drivers/pci/pci-driver.c | 5 + drivers/pci/pcsc.c | 244 ++++++++++++++++++++++++++++++++++++++- drivers/pci/probe.c | 9 ++ include/linux/pci.h | 5 + include/linux/pcsc.h | 38 ++++++ 5 files changed, 299 insertions(+), 2 deletions(-) diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index 302d61783f6c..7c0cbbd50b32 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -21,6 +21,7 @@ #include #include #include +#include #include "pci.h" #include "pcie/portdrv.h" =20 @@ -497,7 +498,11 @@ static void pci_device_remove(struct device *dev) * horrible the crap we have to deal with is when we are awake... */ =20 + #ifdef CONFIG_PCSC + pcsc_remove_device(pci_dev); +#endif pci_dev_put(pci_dev); + } =20 static void pci_device_shutdown(struct device *dev) diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c index dec7c51b5cfd..7531217925e8 100644 --- a/drivers/pci/pcsc.c +++ b/drivers/pci/pcsc.c @@ -14,9 +14,16 @@ =20 #define pr_fmt(fmt) "PCSC: " fmt =20 +#include #include =20 static bool pcsc_initialised; +static atomic_t num_nodes =3D ATOMIC_INIT(0); + +inline bool pcsc_is_initialised(void) +{ + return pcsc_initialised; +} =20 static int pcsc_add_bus(struct pci_bus *bus) { @@ -103,13 +110,225 @@ int __weak pcsc_hw_config_write(struct pci_bus *bus,= unsigned int devfn, } EXPORT_SYMBOL_GPL(pcsc_hw_config_write); =20 +static inline int _test_bits(int where, int size, const void *addr) +{ + int i; + int res =3D 1; + + for (i =3D 0; i < size; i++) + res &=3D test_bit(where + i, addr); + return res; +} + +static int pcsc_is_access_cacheable(struct pci_dev *dev, int where, int si= ze) +{ + if (unlikely(!dev || (where + size > PCSC_CFG_SPC_SIZE))) + return 0; + + return _test_bits(where, size, dev->pcsc->cachable_bitmask); +} + +static inline bool pcsc_is_cached(struct pci_dev *dev, int where, int size) +{ + if (unlikely(!dev || !dev->pcsc || !dev->pcsc->cfg_space || + (where + size > PCSC_CFG_SPC_SIZE))) + 
return 0; + + return _test_bits(where, size, dev->pcsc->cached_bitmask); +} + +static inline void pcsc_set_cached(struct pci_dev *dev, int where, bool ca= ched) +{ + if (WARN_ON(!dev)) + return; + + if (WARN_ON(where >=3D PCSC_CFG_SPC_SIZE)) + return; + + if (cached) + set_bit(where, dev->pcsc->cached_bitmask); + else + clear_bit(where, dev->pcsc->cached_bitmask); +} + +static int pcsc_get_byte(struct pci_dev *dev, int where, u8 *val) +{ + if (WARN_ON(!dev || !dev->pcsc || !dev->pcsc->cfg_space)) + return -EINVAL; + + if (WARN_ON(where >=3D PCSC_CFG_SPC_SIZE)) + return -EPERM; + *val =3D dev->pcsc->cfg_space[where]; + return 0; +} + +static int pcsc_update_byte(struct pci_dev *dev, int where, u8 val) +{ + if (WARN_ON(!dev || !dev->pcsc || !dev->pcsc->cfg_space)) + return -EINVAL; + + if (WARN_ON(where >=3D PCSC_CFG_SPC_SIZE)) + return -EPERM; + dev->pcsc->cfg_space[where] =3D val; + pcsc_set_cached(dev, where, true); + + return 0; +} + +int pcsc_add_device(struct pci_dev *dev) +{ + struct pcsc_node *node; + struct pci_bus *bus; + + if (WARN_ON(!dev)) + return -EINVAL; + + bus =3D dev->bus; + + node =3D kzalloc(sizeof(*node), GFP_KERNEL); + if (!node) + return -ENOMEM; + + dev->pcsc =3D node; + /* The current version of the PCSC supports only endpoint devices. 
+	 * Bridges and RCs are not supported, but we are still creating
+	 * nodes for these devices, as it simplifies the code flow
+	 */
+	if (dev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
+		dev->pcsc->cfg_space = kzalloc(PCSC_CFG_SPC_SIZE, GFP_KERNEL);
+		if (!dev->pcsc->cfg_space)
+			goto err_free_node;
+
+	} else {
+		dev->pcsc->cfg_space = NULL;
+	}
+
+	atomic_inc(&num_nodes);
+	pci_dbg(dev, "PCSC: Created cache node\n");
+
+	return 0;
+
+err_free_node:
+	dev->pcsc = NULL;
+	kfree(node);
+	return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(pcsc_add_device);
+
+int pcsc_remove_device(struct pci_dev *dev)
+{
+	if (WARN_ON(!dev))
+		return -EINVAL;
+
+	pci_dbg(dev, "PCSC: Removing cache node\n");
+
+	atomic_dec(&num_nodes);
+
+	if (dev->pcsc) {
+		kfree(dev->pcsc->cfg_space); /* NULL for bridges and RCs */
+		kfree(dev->pcsc);
+	}
+	dev->pcsc = NULL;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(pcsc_remove_device);
+
+/**
+ * pcsc_get_and_insert_multiple - Read multiple bytes from PCI cache or HW
+ * @dev: PCI device to read from
+ * @bus: PCI bus to read from
+ * @devfn: device and function number
+ * @where: offset in config space
+ * @word: pointer to store read value
+ * @size: number of bytes to read (1, 2 or 4)
+ *
+ * Reads consecutive bytes from PCI cache or hardware. If values are not cached,
+ * reads from hardware and inserts into cache.
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+static int pcsc_get_and_insert_multiple(struct pci_dev *dev,
+					struct pci_bus *bus, unsigned int devfn,
+					int where, u32 *word, int size)
+{
+	u32 word_cached = 0;
+	u8 byte_val;
+	int rc, i;
+
+	if (WARN_ON(!dev || !bus || !word))
+		return -EINVAL;
+
+	if (WARN_ON(size != 1 && size != 2 && size != 4))
+		return -EINVAL;
+
+	/* Check bounds */
+	if (where + size > PCSC_CFG_SPC_SIZE)
+		return -EINVAL;
+
+	if (pcsc_is_cached(dev, where, size)) {
+		/* Read bytes from cache and assemble them into word_cached
+		 * in little-endian order (as per PCI spec)
+		 */
+		for (i = 0; i < size; i++) {
+			pcsc_get_byte(dev, where + i, &byte_val);
+			word_cached |= ((u32)byte_val << (i * 8));
+		}
+	} else {
+		rc = pcsc_hw_config_read(bus, devfn, where, size, &word_cached);
+		if (rc) {
+			pci_err(dev,
+				"%s: Failed to read CFG Space where=%d size=%d\n",
+				__func__, where, size);
+			return rc;
+		}
+
+		/* Extract bytes from word_cached in little-endian order
+		 * and store them in cache.
+		 */
+		for (i = 0; i < size; i++) {
+			byte_val = (word_cached >> (i * 8)) & 0xFF;
+			pcsc_update_byte(dev, where + i, byte_val);
+		}
+	}
+
+	*word = word_cached;
+	return 0;
+}
+
 int pcsc_cached_config_read(struct pci_bus *bus, unsigned int devfn, int where,
 			    int size, u32 *val)
 {
-	if (!pcsc_initialised)
+	int rc;
+	struct pci_dev *dev = NULL;
+
+	if (unlikely(!pcsc_is_initialised()))
 		goto read_from_dev;
 
+	if (WARN_ON(!bus || !val || (size != 1 && size != 2 && size != 4) ||
+		    where + size > PCSC_CFG_SPC_SIZE))
+		return -EINVAL;
+
+	dev = pci_get_slot(bus, devfn);
+
+	if (unlikely(!dev || !dev->pcsc))
+		goto read_from_dev;
+
+	if (dev->pcsc->cfg_space &&
+	    pcsc_is_access_cacheable(dev, where, size)) {
+		rc = pcsc_get_and_insert_multiple(dev, bus, devfn, where, val,
+						  size);
+		if (likely(!rc)) {
+			pci_dev_put(dev);
+			return 0;
+		}
+		/* if reading from the cache failed continue and try reading
+		 * from the actual device
+		 */
+	}
 read_from_dev:
+	if (dev)
+		pci_dev_put(dev);
 	return pcsc_hw_config_read(bus, devfn, where, size, val);
 }
 EXPORT_SYMBOL_GPL(pcsc_cached_config_read);
@@ -117,10 +336,31 @@ EXPORT_SYMBOL_GPL(pcsc_cached_config_read);
 int pcsc_cached_config_write(struct pci_bus *bus, unsigned int devfn, int where,
 			     int size, u32 val)
 {
-	if (!pcsc_initialised)
+	int i;
+	struct pci_dev *dev = NULL;
+
+	if (unlikely(!pcsc_is_initialised()))
 		goto write_to_dev;
 
+	if (WARN_ON(!bus || (size != 1 && size != 2 && size != 4) ||
+		    where + size > PCSC_CFG_SPC_SIZE))
+		return -EINVAL;
+
+	dev = pci_get_slot(bus, devfn);
+
+	if (unlikely(!dev || !dev->pcsc || !dev->pcsc->cfg_space)) {
+		/* Do not add nodes on arbitrary writes */
+		goto write_to_dev;
+	} else {
+		/* Mark the cache as dirty */
+		if (pcsc_is_access_cacheable(dev, where, size)) {
+			for (i = 0; i < size; i++)
+				pcsc_set_cached(dev, where + i, false);
+		}
+	}
 write_to_dev:
+	if (dev)
+		pci_dev_put(dev);
 	return pcsc_hw_config_write(bus, devfn, where, size, val);
 }
 EXPORT_SYMBOL_GPL(pcsc_cached_config_write);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 33a186e4bf1e..c231e09e5a6e 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -23,6 +23,7 @@
 #include
 #include
 #include
+#include
 #include "pci.h"
 
 #define CARDBUS_LATENCY_TIMER	176	/* secondary latency timer */
@@ -2801,6 +2802,14 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
 
 	dev->state_saved = false;
 
+#ifdef CONFIG_PCSC
+	if (likely(pcsc_is_initialised()))
+		if (!dev->pcsc)
+			if (pcsc_add_device(dev))
+				pci_warn(dev,
+					 "Failed to add PCI device to PCSC\n");
+#endif
+
 	pci_init_capabilities(dev);
 
 	/*
diff --git a/include/linux/pci.h b/include/linux/pci.h
index b6cbf93db644..e59b585f96bb 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -42,6 +42,7 @@
 #include
 
 #include
+#include
 
 #define PCI_STATUS_ERROR_BITS (PCI_STATUS_DETECTED_PARITY | \
 			       PCI_STATUS_SIG_SYSTEM_ERROR | \
@@ -560,6 +561,10 @@ struct pci_dev {
 	u8		tph_mode;	/* TPH mode */
 	u8		tph_req_type;	/* TPH requester type */
 #endif
+
+#ifdef CONFIG_PCSC
+	struct pcsc_node *pcsc;
+#endif
 };
 
 static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
diff --git a/include/linux/pcsc.h b/include/linux/pcsc.h
index 45816eb2b2c8..516d73931608 100644
--- a/include/linux/pcsc.h
+++ b/include/linux/pcsc.h
@@ -9,6 +9,20 @@
 #define _LINUX_PCSC_H
 
 #include
+#include
+#include
+
+#ifdef CONFIG_PCIE_PCSC
+#define PCSC_CFG_SPC_SIZE (4 * SZ_1K)
+#else
+#define PCSC_CFG_SPC_SIZE 256
+#endif
+
+struct pcsc_node {
+	u8 *cfg_space;
+	DECLARE_BITMAP(cachable_bitmask, PCSC_CFG_SPC_SIZE);
+	DECLARE_BITMAP(cached_bitmask, PCSC_CFG_SPC_SIZE);
+};
 
 /**
  * pcsc_hw_config_read - Direct hardware PCI config space read
@@ -83,4 +97,28 @@ int pcsc_cached_config_write(struct pci_bus *bus, unsigned int devfn, int where,
  * Return: 0 on success, negative error code on failure
  */
 int pcsc_inject_bus_ops(struct pci_bus *bus);
+
+/**
+ * pcsc_add_device - Allocate and initialize a new PCSC node
+ * This should only be called once for each device
+ * @dev: PCI device to initialise the cache for
+ *
+ * Returns: 0 on success, negative error code on failure
+ */
+int pcsc_add_device(struct pci_dev *dev);
+
+/**
+ * pcsc_remove_device - Clear up any PCSC data
+ * @dev: PCI device to remove
+ *
+ * Returns: 0 on success, -EINVAL if dev is NULL
+ */
+int pcsc_remove_device(struct pci_dev *dev);
+
+/**
+ * pcsc_is_initialised - Report whether the PCSC infrastructure is initialised
+ *
+ */
+bool pcsc_is_initialised(void);
+
 #endif /* _LINUX_PCSC_H */
-- 
2.47.3

From nobody Wed Dec 17 17:39:08 2025
From: Evangelos Petrongonas
To: Bjorn Helgaas, Alex Williamson, "Rafael J . Wysocki", "Len Brown"
CC: Evangelos Petrongonas, Pasha Tatashin, David Matlack, "Vipin Sharma", Chris Li, Jason Miu, Pratyush Yadav, "Stanislav Spassov"
Subject: [RFC PATCH 03/13] pci: pcsc: infer cacheability of PCI capabilities
Date: Fri, 3 Oct 2025 09:00:39 +0000
X-Mailer: git-send-email 2.47.3
X-Mailing-List: linux-kernel@vger.kernel.org

Implement cacheability inference for PCI capabilities to determine which
configuration space registers can be safely cached. The first 64 bytes
of PCI configuration space follow a standardized format, allowing
straightforward cacheability determination. For capability-specific
registers, the implementation traverses the PCI capability list to
identify supported capabilities.
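
As a rough illustration of the traversal described above, here is a
standalone userspace sketch that walks the capability list in a 256-byte
config space snapshot and marks each 2-byte capability header (ID and
next pointer) as cacheable. It is not the code in this patch: the names
walk_capabilities, struct cap_walk_result and CAP_WALK_TTL are
hypothetical, and the loop guard is an assumption added so that a
malformed, cyclic list cannot spin forever.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CFG_SIZE        256   /* conventional config space, in bytes */
#define CAP_LIST_PTR    0x34  /* offset of the capabilities pointer */
#define CAP_FIRST_VALID 0x40  /* capabilities live past the 64-byte header */
#define CAP_WALK_TTL    48    /* assumed guard against cyclic lists */

/* Hypothetical stand-in for the cacheable bitmask: one flag per offset. */
struct cap_walk_result {
	uint8_t cacheable[CFG_SIZE];
	int ncaps;
};

/*
 * Walk the capability list of a config space snapshot. Each entry starts
 * with a capability ID byte followed by a next-pointer byte; the two low
 * bits of every pointer are reserved and masked off.
 * Returns the number of capability headers visited.
 */
static int walk_capabilities(const uint8_t cfg[CFG_SIZE],
			     struct cap_walk_result *res)
{
	uint8_t pos;
	int ttl = CAP_WALK_TTL;

	assert(cfg && res);
	memset(res, 0, sizeof(*res));

	pos = cfg[CAP_LIST_PTR] & 0xFC;
	while (pos >= CAP_FIRST_VALID && ttl-- > 0) {
		res->cacheable[pos] = 1;     /* capability ID */
		res->cacheable[pos + 1] = 1; /* next pointer */
		res->ncaps++;
		pos = cfg[pos + 1] & 0xFC;
	}

	return res->ncaps;
}
```

A caller would fill cfg from a device dump, call walk_capabilities(), and
then inspect res.cacheable per offset; per-capability rules such as the
MSI handling in this patch would hang off the loop body.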
Cacheable registers are identified for the following capabilities:
- Power Management (PM)
- Message Signaled Interrupts (MSI)
- Message Signaled Interrupts Extensions (MSI-X)
- PCI Express
- PCI Advanced Features (AF)
- Enhanced Allocation (EA)
- Vital Product Data (VPD)
- Vendor Specific

The implementation pre-populates the cache with known values including
device/vendor IDs and header type to avoid unnecessary configuration
space reads during initialization.

We are currently not caching the Command/Status registers.

The cacheability of all capabilities apart from MSI is straightforward
and can be deduced from the spec. For MSI, the MSI flags are read and
the cacheability is inferred from them.

Signed-off-by: Evangelos Petrongonas
Signed-off-by: Stanislav Spassov
---
 drivers/pci/pcsc.c | 261 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 261 insertions(+)

diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
index 7531217925e8..29945eac4190 100644
--- a/drivers/pci/pcsc.c
+++ b/drivers/pci/pcsc.c
@@ -175,6 +175,266 @@ static int pcsc_update_byte(struct pci_dev *dev, int where, u8 val)
 	return 0;
 }
 
+static const u8 PCSC_SUPPORTED_CAPABILITIES[] = {
+	PCI_CAP_ID_PM, PCI_CAP_ID_VPD, PCI_CAP_ID_MSI, PCI_CAP_ID_VNDR,
+	PCI_CAP_ID_MSIX, PCI_CAP_ID_EXP, PCI_CAP_ID_AF, PCI_CAP_ID_EA
+};
+
+/**
+ * pcsc_handle_msi_cacheability - Set cacheability for MSI capability registers
+ * @dev: PCI device
+ * @cap_pos: Capability position in config space
+ *
+ * The MSI capability has four different shapes (12-24 bytes) depending on:
+ * - 64-bit addressing capability (PCI_MSI_FLAGS_64BIT)
+ * - Per-vector masking capability (PCI_MSI_FLAGS_MASKBIT)
+ *
+ * Cacheable registers:
+ * - PCI_MSI_FLAGS: Control register
+ * - PCI_MSI_ADDRESS_LO: Lower 32 bits of message address
+ * - PCI_MSI_ADDRESS_HI: Upper 32 bits (if 64-bit capable)
+ * - PCI_MSI_DATA_32/64: Message data register
+ * - PCI_MSI_MASK_32/64: Mask bits register (if masking capable)
+ *
+ * Non-cacheable registers:
+ * - PCI_MSI_PENDING_32/64: Pending bits (modified by device)
+ */
+static void pcsc_handle_msi_cacheability(struct pci_dev *dev, int cap_pos)
+{
+	u32 val;
+	u16 msi_flags;
+	bool is_64bit_capable;
+	bool is_mask_capable;
+	int data_offset;
+	int mask_offset;
+
+	if (WARN_ON(!dev || !dev->pcsc || !dev->pcsc->cfg_space))
+		return;
+
+	/* Read MSI flags to determine capability shape */
+	if (pcsc_hw_config_read(dev->bus, dev->devfn, cap_pos + PCI_MSI_FLAGS,
+				2, &val) != PCIBIOS_SUCCESSFUL) {
+		pci_warn(dev, "PCSC: Failed to read MSI flags at %#x\n",
+			 cap_pos + PCI_MSI_FLAGS);
+		return;
+	}
+
+	msi_flags = val & 0xFFFF;
+	pcsc_update_byte(dev, cap_pos + PCI_MSI_FLAGS, msi_flags & 0xFF);
+	pcsc_update_byte(dev, cap_pos + PCI_MSI_FLAGS + 1, (msi_flags >> 8) & 0xFF);
+
+	/* Mark MSI flags as cacheable */
+	bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_MSI_FLAGS, 2);
+	is_64bit_capable = !!(msi_flags & PCI_MSI_FLAGS_64BIT);
+	is_mask_capable = !!(msi_flags & PCI_MSI_FLAGS_MASKBIT);
+
+	bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_MSI_ADDRESS_LO,
+		   4);
+
+	if (is_64bit_capable) {
+		/* PCI_MSI_ADDRESS_HI is cacheable for 64-bit capable devices */
+		bitmap_set(dev->pcsc->cachable_bitmask,
+			   cap_pos + PCI_MSI_ADDRESS_HI, 4);
+
+		data_offset = PCI_MSI_DATA_64;
+		mask_offset = PCI_MSI_MASK_64;
+	} else {
+		/* Message Data register is at different offset for 32-bit */
+		data_offset = PCI_MSI_DATA_32;
+		mask_offset = PCI_MSI_MASK_32;
+	}
+
+	/*
+	 * Message Data register is always cacheable
+	 * Note: PCI spec defines Extended Message Data Capable (bit 9, 0x0200)
+	 * which allows 4-byte message data instead of 2-byte. However, Linux
+	 * doesn't currently define or use this capability, so we conservatively
+	 * mark only 2 bytes as cacheable for compatibility.
+	 */
+	bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + data_offset, 2);
+
+	if (is_mask_capable) {
+		/* Mask bits register is cacheable if masking is supported */
+		bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + mask_offset,
+			   4);
+	}
+}
+
+static void infer_capability_cacheability(struct pci_dev *dev, int cap_pos,
+					  u8 cap_id)
+{
+	if (WARN_ON(!dev || !dev->pcsc || !dev->pcsc->cfg_space))
+		return;
+
+	switch (cap_id) {
+	case PCI_CAP_ID_PM:
+		/* Power Management Capability */
+		bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_PM_PMC,
+			   2); /* PCI_PM_PMC */
+		break;
+	case PCI_CAP_ID_MSI:
+		/* Message Signaled Interrupts */
+		pcsc_handle_msi_cacheability(dev, cap_pos);
+		break;
+	case PCI_CAP_ID_VNDR:
+		/* Vendor Specific */
+		bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_CAP_FLAGS,
+			   1);
+		/* Only the flag can be cached as the body is opaque */
+		break;
+	case PCI_CAP_ID_MSIX:
+		/* MSI-X - the entire capability is cacheable */
+		bitmap_set(dev->pcsc->cachable_bitmask,
+			   cap_pos + PCI_MSIX_FLAGS, 10);
+		break;
+	case PCI_CAP_ID_EXP:
+		/* PCI Express capability - All except Status registers */
+		bitmap_set(
+			dev->pcsc->cachable_bitmask, cap_pos + PCI_EXP_FLAGS,
+			8); /* PCI_EXP_FLAGS, PCI_EXP_DEVCAP, PCI_EXP_DEVCTL */
+		bitmap_set(dev->pcsc->cachable_bitmask,
+			   cap_pos + PCI_EXP_LNKCAP,
+			   6); /* PCI_EXP_LNKCAP, PCI_EXP_LNKCTL */
+		bitmap_set(dev->pcsc->cachable_bitmask,
+			   cap_pos + PCI_EXP_SLTCAP,
+			   6); /* PCI_EXP_SLTCAP, PCI_EXP_SLTCTL */
+		bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_EXP_RTCTL,
+			   4); /* PCI_EXP_RTCTL, PCI_EXP_RTCAP */
+		bitmap_set(dev->pcsc->cachable_bitmask,
+			   cap_pos + PCI_EXP_DEVCAP2,
+			   6); /* PCI_EXP_DEVCAP2, PCI_EXP_DEVCTL2 */
+		bitmap_set(dev->pcsc->cachable_bitmask,
+			   cap_pos + PCI_EXP_LNKCAP2,
+			   6); /* PCI_EXP_LNKCAP2, PCI_EXP_LNKCTL2 */
+		bitmap_set(dev->pcsc->cachable_bitmask,
+			   cap_pos + PCI_EXP_SLTCAP2,
+			   6); /* PCI_EXP_SLTCAP2, PCI_EXP_SLTCTL2 */
+		break;
+	case PCI_CAP_ID_AF:
+		/* PCI Advanced Features */
+		bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_AF_LENGTH,
+			   2); /* PCI_AF_LENGTH, PCI_AF_CAP */
+		break;
+	case PCI_CAP_ID_EA:
+		/* Enhanced Allocation. Theoretically the entire capability could
+		 * be cached, but it is not trivial to deduce its size.
+		 */
+		bitmap_set(dev->pcsc->cachable_bitmask,
+			   cap_pos + PCI_EA_NUM_ENT, 2);
+		break;
+	case PCI_CAP_ID_VPD:
+		/* Vital Product Data */
+		bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_VPD_ADDR,
+			   2); /* PCI_VPD_ADDR */
+		break;
+	default:
+		/* Unsupported capability - We shouldn't reach this point */
+		pr_warn("Something is off when iterating through the supported capabilities.\n");
+		break;
+	}
+}
+
+static void infer_capabilities_pointers(struct pci_dev *dev)
+{
+	u8 pos, cap_id, next_cap;
+	u32 val;
+	int i;
+
+	if (pcsc_hw_config_read(dev->bus, dev->devfn, PCI_CAPABILITY_LIST, 1,
+				&val) != PCIBIOS_SUCCESSFUL)
+		return;
+
+	pos = (val & 0xFF) & ~0x3;
+
+	while (pos) {
+		if (pos < 0x40 || pos > 0xFE)
+			break;
+
+		pos &= ~0x3;
+		if (pcsc_hw_config_read(dev->bus, dev->devfn, pos, 2, &val) !=
+		    PCIBIOS_SUCCESSFUL)
+			break;
+
+		cap_id = val & 0xFF; /* PCI_CAP_LIST_ID */
+		next_cap = (val >> 8) & 0xFF; /* PCI_CAP_LIST_NEXT */
+
+		bitmap_set(dev->pcsc->cachable_bitmask, pos, 2);
+		pcsc_update_byte(dev, pos, cap_id); /* PCI_CAP_LIST_ID */
+		pcsc_update_byte(dev, pos + 1,
+				 next_cap); /* PCI_CAP_LIST_NEXT */
+
+		pci_dbg(dev, "Capability ID %#x found at %#x\n", cap_id, pos);
+
+		/* Check if this is a supported capability and infer cacheability */
+		for (i = 0; i < ARRAY_SIZE(PCSC_SUPPORTED_CAPABILITIES); i++) {
+			if (cap_id == PCSC_SUPPORTED_CAPABILITIES[i]) {
+				infer_capability_cacheability(dev, pos, cap_id);
+				break;
+			}
+		}
+
+		/* Move to next capability */
+		pos = next_cap;
+	}
+}
+
+static void infer_cacheability(struct pci_dev *dev)
+{
+	if (WARN_ON(!dev || !dev->pcsc || !dev->pcsc->cfg_space))
+		return;
+
+	bitmap_zero(dev->pcsc->cachable_bitmask, PCSC_CFG_SPC_SIZE);
+
+	/* Type 0 Configuration Space Header */
+	if (dev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
+		/*
+		 * Mark cacheable registers in the PCI configuration space header.
+		 * We cache read-only and rarely changing registers:
+		 * - PCI_VENDOR_ID, PCI_DEVICE_ID (0x00-0x03)
+		 * - PCI_CLASS_REVISION through PCI_CAPABILITY_LIST (0x08-0x34)
+		 *   Includes: CLASS_REVISION, CACHE_LINE_SIZE, LATENCY_TIMER,
+		 *   HEADER_TYPE, BIST, BASE_ADDRESS_0-5, CARDBUS_CIS,
+		 *   SUBSYSTEM_VENDOR_ID, SUBSYSTEM_ID, ROM_ADDRESS, CAPABILITY_LIST
+		 * - PCI_INTERRUPT_LINE through PCI_MAX_LAT (0x3c-0x3f)
+		 *   Includes: INTERRUPT_LINE, INTERRUPT_PIN, MIN_GNT, MAX_LAT
+		 */
+		bitmap_set(dev->pcsc->cachable_bitmask, PCI_VENDOR_ID, 4);
+		bitmap_set(dev->pcsc->cachable_bitmask, PCI_CLASS_REVISION, 45);
+		bitmap_set(dev->pcsc->cachable_bitmask, PCI_INTERRUPT_LINE, 4);
+
+		/* Pre populate the cache with the values that we already know */
+		pcsc_update_byte(dev, PCI_HEADER_TYPE,
+				 dev->hdr_type |
+					 (dev->multifunction ? 0x80 : 0));
+
+		/*
+		 * SR-IOV VFs must return 0xFFFF (PCI_ANY_ID) for vendor/device ID
+		 * registers per PCIe spec.
+		 */
+		if (dev->is_virtfn) {
+			pcsc_update_byte(dev, PCI_VENDOR_ID, 0xFF);
+			pcsc_update_byte(dev, PCI_VENDOR_ID + 1, 0xFF);
+			pcsc_update_byte(dev, PCI_DEVICE_ID, 0xFF);
+			pcsc_update_byte(dev, PCI_DEVICE_ID + 1, 0xFF);
+		} else {
+			if (dev->vendor != PCI_ANY_ID) {
+				pcsc_update_byte(dev, PCI_VENDOR_ID,
+						 dev->vendor & 0xFF);
+				pcsc_update_byte(dev, PCI_VENDOR_ID + 1,
+						 (dev->vendor >> 8) & 0xFF);
+			}
+			if (dev->device != PCI_ANY_ID) {
+				pcsc_update_byte(dev, PCI_DEVICE_ID,
+						 dev->device & 0xFF);
+				pcsc_update_byte(dev, PCI_DEVICE_ID + 1,
+						 (dev->device >> 8) & 0xFF);
+			}
+		}
+
+		infer_capabilities_pointers(dev);
+	}
+}
+
 int pcsc_add_device(struct pci_dev *dev)
 {
 	struct pcsc_node *node;
@@ -199,6 +459,7 @@ int pcsc_add_device(struct pci_dev *dev)
 		if (!dev->pcsc->cfg_space)
 			goto err_free_node;
 
+		infer_cacheability(dev);
 	} else {
 		dev->pcsc->cfg_space = NULL;
 	}
-- 
2.47.3

From nobody Wed Dec 17 17:39:08 2025
From: Evangelos Petrongonas
To: Bjorn Helgaas, Alex Williamson, "Rafael J . Wysocki", "Len Brown"
CC: Evangelos Petrongonas, Pasha Tatashin, David Matlack, "Vipin Sharma", Chris Li, Jason Miu, Pratyush Yadav, "Stanislav Spassov"
Subject: [RFC PATCH 04/13] pci: pcsc: infer PCIe extended capabilities
Date: Fri, 3 Oct 2025 09:00:40 +0000
Message-ID: <026b1d3e3fcb2a554511de3f23d6a7640b5377b6.1759312886.git.epetron@amazon.de>
X-Mailer: git-send-email 2.47.3
X-Mailing-List: linux-kernel@vger.kernel.org

Extend PCSC to support cacheability inference for PCIe extended
capabilities located in the 4KB extended configuration space. As with
the standard capabilities, PCIe extended capabilities require traversal
of the capability list to determine cacheability.
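
To make the extended traversal concrete, the following standalone
userspace sketch decodes the 4-byte extended capability header (ID in
bits 15:0, version in bits 19:16, next offset in bits 31:20) and follows
the chain through a 4KB config space snapshot. It is only an
illustration under stated assumptions, not the code in this patch: the
helper names (cfg_read32, cfg_write32, ext_cap_header,
walk_ext_capabilities) are hypothetical, and the loop guard of one
iteration per 8 bytes of extended space is an assumption borrowed for
the example.

```c
#include <assert.h>
#include <stdint.h>

#define EXT_CFG_SIZE  4096   /* PCIe extended config space, in bytes */
#define EXT_CAP_START 0x100  /* first extended capability header */
#define EXT_WALK_TTL  ((EXT_CFG_SIZE - EXT_CAP_START) / 8) /* assumed guard */

/* Header field accessors, mirroring the layout described above. */
static uint16_t ext_cap_id(uint32_t header)   { return header & 0xffff; }
static uint8_t  ext_cap_ver(uint32_t header)  { return (header >> 16) & 0xf; }
static uint16_t ext_cap_next(uint32_t header) { return (header >> 20) & 0xffc; }

/* Build a header from its fields (used to construct test snapshots). */
static uint32_t ext_cap_header(uint16_t id, uint8_t ver, uint16_t next)
{
	return (uint32_t)id | ((uint32_t)(ver & 0xf) << 16) |
	       ((uint32_t)(next & 0xffc) << 20);
}

/* Config space is little-endian; assemble and store dwords byte by byte. */
static uint32_t cfg_read32(const uint8_t *cfg, unsigned int pos)
{
	return (uint32_t)cfg[pos] | ((uint32_t)cfg[pos + 1] << 8) |
	       ((uint32_t)cfg[pos + 2] << 16) | ((uint32_t)cfg[pos + 3] << 24);
}

static void cfg_write32(uint8_t *cfg, unsigned int pos, uint32_t val)
{
	int i;

	for (i = 0; i < 4; i++)
		cfg[pos + i] = (val >> (i * 8)) & 0xff;
}

/*
 * Follow the extended capability chain and record up to max_ids IDs.
 * Stops on an empty header, an out-of-range offset, or an exhausted guard.
 * Returns the number of IDs recorded.
 */
static int walk_ext_capabilities(const uint8_t *cfg, uint16_t *ids, int max_ids)
{
	unsigned int pos = EXT_CAP_START;
	int ttl = EXT_WALK_TTL;
	int n = 0;

	assert(cfg && ids);

	while (pos >= EXT_CAP_START && pos <= EXT_CFG_SIZE - 4 &&
	       n < max_ids && ttl-- > 0) {
		uint32_t header = cfg_read32(cfg, pos);

		if (header == 0 || header == 0xffffffff)
			break;

		ids[n++] = ext_cap_id(header);
		pos = ext_cap_next(header);
	}

	return n;
}
```

A caller would populate cfg from a device dump and compare the returned
IDs against the list of supported extended capabilities to decide which
register ranges to mark cacheable.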
The implementation identifies cacheable registers for capabilities used
by the generic PCIe driver:
- Advanced Error Reporting (AER)
- Access Control Services (ACS)
- Alternative Routing-ID (ARI)
- SR-IOV
- Address Translation Services (ATS)
- Page Request Interface (PRI)
- Process Address Space ID (PASID)
- Downstream Port Containment (DPC)
- Precision Time Measurement (PTM)

The extended capability header (4 bytes) is always cached to enable
efficient capability list traversal.

All the extended capabilities apart from DPC are static. For DPC, the
DPC Capability register is read and, based on its value, the
cacheability of the RP* registers is inferred.

Signed-off-by: Evangelos Petrongonas
Signed-off-by: Stanislav Spassov
---
 drivers/pci/pcsc.c | 203 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 203 insertions(+)

diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
index 29945eac4190..343f8b03831a 100644
--- a/drivers/pci/pcsc.c
+++ b/drivers/pci/pcsc.c
@@ -180,6 +180,65 @@ static const u8 PCSC_SUPPORTED_CAPABILITIES[] = {
 	PCI_CAP_ID_MSIX, PCI_CAP_ID_EXP, PCI_CAP_ID_AF, PCI_CAP_ID_EA
 };
 
+#ifdef CONFIG_PCIE_PCSC
+static const u16 PCSCS_SUPPORTED_EXT_CAPABILITIES[] = {
+	PCI_EXT_CAP_ID_ERR, PCI_EXT_CAP_ID_ACS, PCI_EXT_CAP_ID_ARI,
+	PCI_EXT_CAP_ID_SRIOV, PCI_EXT_CAP_ID_ATS, PCI_EXT_CAP_ID_PRI,
+	PCI_EXT_CAP_ID_PASID, PCI_EXT_CAP_ID_DPC, PCI_EXT_CAP_ID_PTM
+};
+
+/**
+ * pcsc_handle_dpc_cacheability - Set cacheability for DPC capability registers
+ * @dev: PCI device
+ * @cap_pos: Capability position in config space
+ *
+ * The DPC capability cacheability depends on whether RP extensions are supported:
+ * - PCI_EXP_DPC_CAP_RP_EXT bit indicates RP extension register presence
+ */
+static void pcsc_handle_dpc_cacheability(struct pci_dev *dev, int cap_pos)
+{
+	u32 val;
+	u16 dpc_cap;
+	bool has_rp_extensions;
+
+	if (WARN_ON(!dev || !dev->pcsc || !dev->pcsc->cfg_space))
+		return;
+
+	if (pcsc_hw_config_read(dev->bus, dev->devfn, cap_pos + PCI_EXP_DPC_CAP,
+				2, &val) != PCIBIOS_SUCCESSFUL) {
+		pci_warn(dev, "PCSC: Failed to read DPC capability at %#x\n",
+			 cap_pos + PCI_EXP_DPC_CAP);
+		return;
+	}
+
+	dpc_cap = val & 0xFFFF;
+	has_rp_extensions = !!(dpc_cap & PCI_EXP_DPC_CAP_RP_EXT);
+
+	/* Cache the DPC capability register */
+	pcsc_update_byte(dev, cap_pos + PCI_EXP_DPC_CAP, dpc_cap & 0xFF);
+	pcsc_update_byte(dev, cap_pos + PCI_EXP_DPC_CAP + 1,
+			 (dpc_cap >> 8) & 0xFF);
+
+	/* Always cacheable: main DPC registers */
+	bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_EXP_DPC_CAP, 2);
+	bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_EXP_DPC_CTL, 2);
+
+	/* Conditionally cacheable: RP extension registers PCI_EXP_DPC_RP_PIO_MASK,
+	 * PCI_EXP_DPC_RP_PIO_SEVERITY, PCI_EXP_DPC_RP_PIO_SYSERROR, PCI_EXP_DPC_RP_PIO_EXCEPTION
+	 */
+	if (has_rp_extensions) {
+		bitmap_set(dev->pcsc->cachable_bitmask,
+			   cap_pos + PCI_EXP_DPC_RP_PIO_MASK, 16);
+		bitmap_set(dev->pcsc->cachable_bitmask,
+			   cap_pos + PCI_EXP_DPC_RP_PIO_SEVERITY, 4);
+		bitmap_set(dev->pcsc->cachable_bitmask,
+			   cap_pos + PCI_EXP_DPC_RP_PIO_SYSERROR, 4);
+		bitmap_set(dev->pcsc->cachable_bitmask,
+			   cap_pos + PCI_EXP_DPC_RP_PIO_EXCEPTION, 4);
+	}
+}
+#endif
+
 /**
  * pcsc_handle_msi_cacheability - Set cacheability for MSI capability registers
  * @dev: PCI device
@@ -378,6 +437,146 @@ static void infer_capabilities_pointers(struct pci_dev *dev)
 	}
 }
 
+#ifdef CONFIG_PCIE_PCSC
+
+static void infer_extended_capability_cacheability(struct pci_dev *dev,
+						   int cap_pos, u16 cap_id)
+{
+	if (WARN_ON(!dev || !dev->pcsc || !dev->pcsc->cfg_space))
+		return;
+
+	switch (cap_id) {
+	case PCI_EXT_CAP_ID_ERR:
+		/* Advanced Error Reporting */
+		bitmap_set(dev->pcsc->cachable_bitmask,
+			   cap_pos + PCI_ERR_UNCOR_MASK,
+			   8); /* PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER */
+		bitmap_set(dev->pcsc->cachable_bitmask,
+			   cap_pos + PCI_ERR_COR_MASK,
+			   4); /* PCI_ERR_COR_MASK only */
+		bitmap_set(dev->pcsc->cachable_bitmask,
+			   cap_pos + PCI_ERR_ROOT_COMMAND,
+			   4); /* PCI_ERR_ROOT_COMMAND */
+		break;
+	case PCI_EXT_CAP_ID_ACS:
+		/* Access Control Services
+		 * We only cache PCI_ACS_CAP and PCI_ACS_CTRL (first 4 bytes).
+		 * The Egress Control Vector that follows (if present) is not
+		 * cached because:
+		 * - Determining its size would require reading PCI_ACS_CAP
+		 * - These registers are typically only written by the OS during
+		 *   setup and not read frequently during runtime
+		 * - Caching them would provide no performance benefit
+		 */
+		bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_ACS_CAP,
+			   4); /* PCI_ACS_CAP, PCI_ACS_CTRL */
+		break;
+	case PCI_EXT_CAP_ID_ARI:
+		/* Alternative Routing-ID */
+		bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_ARI_CAP,
+			   4); /* PCI_ARI_CAP, PCI_ARI_CTRL */
+		break;
+	case PCI_EXT_CAP_ID_SRIOV:
+		/* SR-IOV */
+		bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_SRIOV_CAP,
+			   6); /* PCI_SRIOV_CAP, PCI_SRIOV_CTRL */
+		/* PCI_SRIOV_INITIAL_VF, PCI_SRIOV_TOTAL_VF,
+		 * PCI_SRIOV_NUM_VF, PCI_SRIOV_FUNC_LINK
+		 */
+		bitmap_set(dev->pcsc->cachable_bitmask,
+			   cap_pos + PCI_SRIOV_INITIAL_VF, 7);
+		bitmap_set(dev->pcsc->cachable_bitmask,
+			   cap_pos + PCI_SRIOV_VF_OFFSET,
+			   4); /* PCI_SRIOV_VF_OFFSET, PCI_SRIOV_VF_STRIDE */
+		/* PCI_SRIOV_VF_DID, PCI_SRIOV_SUPPORTED_PAGE_SIZES, PCI_SRIOV_PAGE_SIZE */
+		bitmap_set(
+			dev->pcsc->cachable_bitmask, cap_pos + PCI_SRIOV_VF_DID,
+			10);
+		bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_SRIOV_BAR,
+			   24); /* PCI_SRIOV_BAR0-5 */
+		bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_SRIOV_VFM,
+			   4); /* PCI_SRIOV_VFM */
+		break;
+	case PCI_EXT_CAP_ID_ATS:
+		/* Address Translation Services */
+		bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_ATS_CAP,
+			   4); /* PCI_ATS_CAP, PCI_ATS_CTRL */
+		break;
+	case PCI_EXT_CAP_ID_PRI:
+		/* Page Request Interface */
+		bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_PRI_CTRL,
+			   2); /* PCI_PRI_CTRL */
+		bitmap_set(dev->pcsc->cachable_bitmask,
+			   cap_pos + PCI_PRI_MAX_REQ,
+			   8); /* PCI_PRI_MAX_REQ, PCI_PRI_ALLOC_REQ */
+		break;
+	case PCI_EXT_CAP_ID_PASID:
+		/* Process Address Space ID */
+		bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_PASID_CAP,
+			   4); /* PCI_PASID_CAP, PCI_PASID_CTRL */
+		break;
+	case PCI_EXT_CAP_ID_DPC:
+		/* Downstream Port Containment */
+		pcsc_handle_dpc_cacheability(dev, cap_pos);
+		break;
+	case PCI_EXT_CAP_ID_PTM:
+		/* Precision Time Measurement */
+		bitmap_set(dev->pcsc->cachable_bitmask, cap_pos + PCI_PTM_CAP,
+			   8); /* PCI_PTM_CAP, PCI_PTM_CTRL */
+		break;
+	default:
+		/* Unknown extended capability - only cache header */
+		break;
+	}
+}
+
+static void infer_extended_capabilities_pointers(struct pci_dev *dev)
+{
+	int pos = 0x100;
+	u32 header;
+	int cap_ver, cap_id;
+	int i;
+
+	while (pos) {
+		if (pos > 0xFFC || pos < 0x100)
+			break;
+
+		pos &= ~0x3;
+
+		if (pcsc_hw_config_read(dev->bus, dev->devfn, pos, 4,
+					&header) != PCIBIOS_SUCCESSFUL)
+			break;
+
+		if (!header)
+			break;
+
+		bitmap_set(dev->pcsc->cachable_bitmask, pos, 4);
+		for (i = 0; i < 4; i++)
+			pcsc_update_byte(dev, pos + i,
+					 (header >> (i * 8)) & 0xFF);
+
+		cap_id = PCI_EXT_CAP_ID(header);
+		cap_ver = PCI_EXT_CAP_VER(header);
+
+		pci_dbg(dev,
+			"Extended capability ID %#x (ver %d) found at %#x, next cap at %#x\n",
+			cap_id, cap_ver, pos, PCI_EXT_CAP_NEXT(header));
+
+		/* Check if this is a supported extended capability and infer cacheability */
+		for (i = 0; i < ARRAY_SIZE(PCSCS_SUPPORTED_EXT_CAPABILITIES);
+		     i++) {
+			if (cap_id == PCSCS_SUPPORTED_EXT_CAPABILITIES[i]) {
+				infer_extended_capability_cacheability(dev, pos,
+								       cap_id);
+				break;
+			}
+		}
+
+		pos = PCI_EXT_CAP_NEXT(header);
+	}
+}
+#endif
+
 static void infer_cacheability(struct pci_dev *dev)
 {
 	if (WARN_ON(!dev || !dev->pcsc || !dev->pcsc->cfg_space))
@@ -432,6 +631,10 @@ static void infer_cacheability(struct pci_dev *dev)
 		}
 
 		infer_capabilities_pointers(dev);
+#ifdef CONFIG_PCIE_PCSC
+		if (pci_is_pcie(dev))
+			infer_extended_capabilities_pointers(dev);
+#endif
 	}
 }
 
-- 
2.47.3 Amazon Web Services Development Center Germany GmbH Tamara-Danz-Str. 13 10243 Berlin Geschaeftsfuehrung: Christian Schlaeger Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B Sitz: Berlin Ust-ID: DE 365 538 597 From nobody Wed Dec 17 17:39:08 2025 From: Evangelos Petrongonas To: Bjorn Helgaas , Alex Williamson , "Rafael J . 
Wysocki" , "Len Brown"
CC: Evangelos Petrongonas , Pasha Tatashin , David Matlack , "Vipin Sharma" , Chris Li , Jason Miu , Pratyush Yadav , "Stanislav Spassov"
Subject: [RFC PATCH 05/13] pci: pcsc: control the cache via sysfs and kernel params
Date: Fri, 3 Oct 2025 09:00:41 +0000
Message-ID: <2a0e6b85b06fef2d77ddd6879dea4335aeb3021f.1759312886.git.epetron@amazon.de>
X-Mailing-List: linux-kernel@vger.kernel.org

Add kernel parameters and runtime control mechanisms for the PCSC.

A new kernel parameter 'pcsc_enabled' allows enabling or disabling the cache at boot time. The parameter defaults to disabled.

A sysfs interface at /sys/bus/pci/pcsc/enabled provides:
- Read access to query current cache status (1=enabled, 0=disabled)
- Write access to dynamically enable/disable the cache at runtime

Signed-off-by: Evangelos Petrongonas
---
Documentation/ABI/testing/sysfs-bus-pci-pcsc | 20 ++++
.../admin-guide/kernel-parameters.txt | 3 +
drivers/pci/pcsc.c | 93 ++++++++++++++++++-
3 files changed, 114 insertions(+), 2 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-bus-pci-pcsc

diff --git a/Documentation/ABI/testing/sysfs-bus-pci-pcsc b/Documentation/ABI/testing/sysfs-bus-pci-pcsc new file mode 100644 index 000000000000..ee92bf087816 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-bus-pci-pcsc @@ -0,0 +1,20 @@ +PCI Configuration Space Cache (PCSC) +------------------------------------- + +The PCI Configuration Space Cache (PCSC) is a transparent caching layer +that intercepts configuration space operations to reduce hardware access +overhead. 
This subsystem addresses performance bottlenecks in PCI +configuration space accesses, particularly in virtualization +environments with high-density SR-IOV deployments where repeated +enumeration of Virtual Functions creates substantial delays. + +What: /sys/bus/pci/pcsc/enabled +Date: September 2025 +Contact: Linux PCI developers +Description: + PCI Configuration Space Cache (PCSC) is a subsystem that + caches accesses to the PCI configuration space of PCI + functions. When this file contains the "1", the kernel + is utilizing the cache, while when on "0" the + system bypasses it. This setting can also be controlled +parameter. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 747a55abf494..08c7a13f107c 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5036,6 +5036,9 @@ pcmv= [HW,PCMCIA] BadgePAD 4 + pcsc_enabled= [PCSC] enable the use of the PCI Configuration Space + Cache (PCSC). 
+ pd_ignore_unused [PM] Keep all power-domains already enabled by bootloader on, diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c index 343f8b03831a..44d842733230 100644 --- a/drivers/pci/pcsc.c +++ b/drivers/pci/pcsc.c @@ -16,13 +16,21 @@ #include #include +#include + +static bool pcsc_enabled; +static int __init pcsc_enabled_setup(char *str) +{ + return kstrtobool(str, &pcsc_enabled) == 0; +} +__setup("pcsc_enabled=", pcsc_enabled_setup); static bool pcsc_initialised; static atomic_t num_nodes = ATOMIC_INIT(0); inline bool pcsc_is_initialised(void) { - return pcsc_initialised; + return pcsc_initialised && pcsc_enabled; } static int pcsc_add_bus(struct pci_bus *bus) @@ -899,14 +907,95 @@ static struct notifier_block pcsc_bus_nb = { .notifier_call = pcsc_bus_notify, }; +static ssize_t pcsc_enabled_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%d\n", pcsc_enabled); +} + +static ssize_t pcsc_enabled_store(struct kobject *kobj, + struct kobj_attribute *attr, const char *buf, + size_t count) +{ + bool new_value; + int ret; + + ret = kstrtobool(buf, &new_value); + if (ret < 0) + return ret; + + pcsc_enabled = new_value; + return count; +} + +static struct kobj_attribute pcsc_enabled_attribute = + __ATTR(enabled, 0644, pcsc_enabled_show, pcsc_enabled_store); + +static struct attribute *pcsc_attrs[] = { + &pcsc_enabled_attribute.attr, + NULL, +}; + +static struct attribute_group pcsc_attr_group = { + .attrs = pcsc_attrs, +}; + +static struct kobject *pcsc_kobj; + +static void pcsc_create_sysfs(void) +{ + struct kset *pci_bus_kset; + int ret; + + if (pcsc_kobj) + return; /* Already created */ + + pci_bus_kset = bus_get_kset(&pci_bus_type); + if (!pci_bus_kset) { + /* PCI bus kset not ready yet, will be retried later */ + return; + } + + pcsc_kobj = kobject_create_and_add("pcsc", &pci_bus_kset->kobj); + if (!pcsc_kobj) { + pr_err("Failed to create sysfs kobject\n"); + 
return; + } + + ret = sysfs_create_group(pcsc_kobj, &pcsc_attr_group); + if (ret) { + pr_err("Failed to create sysfs group\n"); + kobject_put(pcsc_kobj); + pcsc_kobj = NULL; + return; + } +} + static int __init pcsc_init(void) { bus_register_notifier(&pci_bus_type, &pcsc_bus_nb); + /* Try to create sysfs entry, but don't fail if PCI bus isn't ready yet */ + pcsc_create_sysfs(); + pcsc_initialised = true; - pr_info("initialised\n"); + pr_info("initialised (enabled=%d)\n", pcsc_enabled); return 0; } +/* Late initcall to retry sysfs creation if it failed during core_initcall */ +static int __init pcsc_sysfs_init(void) +{ + pcsc_create_sysfs(); + return 0; +} + core_initcall(pcsc_init); + +/* + * The PCI subsystem is initialised later, therefore we need to add + * our sysfs entries later. This is done to avoid modifying the sysfs + * creation of the core pci driver. + */ +late_initcall(pcsc_sysfs_init); -- 2.47.3 Amazon Web Services Development Center Germany GmbH Tamara-Danz-Str. 
13 10243 Berlin Geschaeftsfuehrung: Christian Schlaeger Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B Sitz: Berlin Ust-ID: DE 365 538 597 From nobody Wed Dec 17 17:39:08 2025 From: Evangelos Petrongonas To: Bjorn Helgaas , Alex Williamson , "Rafael J . 
Wysocki" , "Len Brown"
CC: Evangelos Petrongonas , Pasha Tatashin , David Matlack , "Vipin Sharma" , Chris Li , Jason Miu , Pratyush Yadav , "Stanislav Spassov"
Subject: [RFC PATCH 06/13] pci: pcsc: handle device resets
Date: Fri, 3 Oct 2025 09:00:42 +0000
Message-ID: <0fa6f46439b535eedaa82c360e1ea19e7f052fca.1759312886.git.epetron@amazon.de>
X-Mailing-List: linux-kernel@vger.kernel.org

The PCI Configuration Space Cache (PCSC) maintains cached values of configuration space registers for performance optimization. When a PCI device is reset or bus operations are dynamically changed, cached values become stale and can cause incorrect behavior. This patch ensures cache coherency by invalidating the PCSC cache in all scenarios where the underlying configuration space values may have changed.

Device Reset Handling:
----------------------
When PCI devices are reset, their configuration space registers return to default values.
Add pcsc_device_reset() calls after all device reset operations to invalidate stale cached values:

- Function Level Resets (FLR) in `pcie_flr()`
- Advanced Features FLR in `pci_af_flr()`
- Power Management resets (D3hot->D0 transition) in `pci_pm_reset()`
- Device-specific resets in `pci_dev_specific_reset()`
- D3cold power state transitions in `__pci_set_power_state()`
- ACPI-based resets in `pci_dev_acpi_reset()`
- Bus restore operations in `pci_bus_restore_locked()`
- Slot restore operations in `pci_slot_restore_locked()`
- Secondary bus resets in `pci_bridge_secondary_bus_reset()`

For secondary bus resets, `pcsc_reset_bus_recursively()` invalidates the cache for all devices on the secondary bus and subordinate buses. This also covers hotplug slot reset operations since `pciehp_reset_slot()` calls `pci_bridge_secondary_bus_reset()`.

In addition, functions like `pci_dev_wait()` bypass the cache and read the actual HW values.

Dynamic Ops Changes:
--------------------
The patch also addresses cache consistency issues when bus operations are dynamically changed via `pci_bus_set_ops()`. Different ops implementations may return different values for the same registers, and hardware state may have changed while using the different ops. This commit resets the cache for all devices on the affected bus.

Implementation Details:
-----------------------
The cache invalidation clears the cached_bitmask while preserving the cachable_bitmask, as the configuration space layout remains unchanged after a reset. This allows the cache to be repopulated with fresh values on subsequent configuration space accesses.

Known Limitations:
------------------
- There is currently a gap in handling PowerPC secondary bus resets, as the architecture-specific `pcibios_reset_secondary_bus()` can bypass the generic `pci_reset_secondary_bus()` where our cache invalidation occurs.
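The invalidation scheme described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: the `toy_*` names, the 256-byte `TOY_CFG_SIZE`, the singly linked device list, and the use of one `bool` per byte instead of real bitmaps are all hypothetical simplifications. It shows that a reset clears only the "cached" state, keeps the "cachable" layout, and walks subordinate buses recursively.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_CFG_SIZE 256 /* illustrative size, not PCSC_CFG_SPC_SIZE */

struct toy_bus;

/* Hypothetical stand-in for a PCI device with a config space cache. */
struct toy_dev {
	uint8_t hw[TOY_CFG_SIZE];        /* simulated hardware registers */
	uint8_t cfg_space[TOY_CFG_SIZE]; /* cached copies of the registers */
	bool cachable[TOY_CFG_SIZE];     /* layout: bytes that may be cached */
	bool cached[TOY_CFG_SIZE];       /* state: bytes holding valid data */
	unsigned int hw_reads;           /* number of simulated HW accesses */
	struct toy_bus *subordinate;     /* non-NULL if the device is a bridge */
	struct toy_dev *next;            /* next device on the same bus */
};

struct toy_bus {
	struct toy_dev *devices;
};

/* Serve a byte from the cache when valid, otherwise read "hardware". */
static uint8_t toy_read8(struct toy_dev *dev, size_t where)
{
	if (dev->cachable[where] && dev->cached[where])
		return dev->cfg_space[where];

	dev->hw_reads++;
	if (dev->cachable[where]) {
		dev->cfg_space[where] = dev->hw[where];
		dev->cached[where] = true;
	}
	return dev->hw[where];
}

/* Reset handling: drop cached values, keep the cachability layout. */
static void toy_device_reset(struct toy_dev *dev)
{
	memset(dev->cached, 0, sizeof(dev->cached));
}

/* Invalidate every device on the bus and on all subordinate buses. */
static void toy_reset_bus_recursively(struct toy_bus *bus)
{
	struct toy_dev *dev;

	if (!bus)
		return;

	for (dev = bus->devices; dev; dev = dev->next) {
		toy_device_reset(dev);
		toy_reset_bus_recursively(dev->subordinate);
	}
}
```

In this model a read after `toy_device_reset()` goes back to the simulated hardware and repopulates the cache, because the `cachable` flags survive the reset. Without the reset call, a changed hardware value would keep being served from the stale cached copy.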
Signed-off-by: Evangelos Petrongonas Signed-off-by: Stanislav Spassov --- drivers/pci/access.c | 13 ++++++--- drivers/pci/pci-acpi.c | 4 +++ drivers/pci/pci.c | 60 +++++++++++++++++++++++++++++++++++++++++- drivers/pci/pcsc.c | 17 ++++++++++++ drivers/pci/quirks.c | 7 ++++- include/linux/pcsc.h | 11 ++++++++ 6 files changed, 106 insertions(+), 6 deletions(-) diff --git a/drivers/pci/access.c b/drivers/pci/access.c index b89e9210d330..0a5de8d76bfe 100644 --- a/drivers/pci/access.c +++ b/drivers/pci/access.c @@ -245,11 +245,16 @@ struct pci_ops *pci_bus_set_ops(struct pci_bus *bus, struct pci_ops *ops) /* * IMPORTANT: Dynamic ops changes after PCSC injection can lead to * cache consistency issues if operations were performed that should - * have invalidated the cache. We re-inject PCSC ops here, but the - * caller is responsible for ensuring cache consistency if needed. - * This will be fixed in a future commit, when PCSC resets are - * introduced. + * have invalidated the cache. We must reset the cache for all + * devices on this bus to ensure consistency. 
(No need for recursive + * reset on subordinate buses) */ + struct pci_dev *dev; + + list_for_each_entry(dev, &bus->devices, bus_list) { + if (dev->pcsc && dev->pcsc->cached_bitmask) + bitmap_zero(dev->pcsc->cached_bitmask, PCSC_CFG_SPC_SIZE); + } pr_warn("PCSC: Dynamic ops change detected on bus %04x:%02x, resetting cache\n", pci_domain_nr(bus), bus->number); diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c index 9369377725fa..0b638115c7c7 100644 --- a/drivers/pci/pci-acpi.c +++ b/drivers/pci/pci-acpi.c @@ -983,6 +983,10 @@ int pci_dev_acpi_reset(struct pci_dev *dev, bool probe) return -ENOTTY; } +#ifdef CONFIG_PCSC + pcsc_device_reset(dev); +#endif + return 0; } diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index f518cfa266b5..db940f8fd408 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -26,6 +26,7 @@ #include #include #include +#include #include #include #include @@ -1248,11 +1249,19 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout) } if (root && root->config_rrs_sv) { +#ifdef CONFIG_PCSC + pcsc_hw_config_read(dev->bus, dev->devfn, PCI_VENDOR_ID, 4, &id); +#else pci_read_config_dword(dev, PCI_VENDOR_ID, &id); +#endif if (!pci_bus_rrs_vendor_id(id)) break; } else { +#ifdef CONFIG_PCSC + pcsc_hw_config_read(dev->bus, dev->devfn, PCI_COMMAND, 4, &id); +#else pci_read_config_dword(dev, PCI_COMMAND, &id); +#endif if (!PCI_POSSIBLE_ERROR(id)) break; } @@ -1564,7 +1573,9 @@ static int __pci_set_power_state(struct pci_dev *dev, pci_power_t state, bool lo if (pci_platform_power_transition(dev, PCI_D3cold)) return error; - + #ifdef CONFIG_PCSC + pcsc_device_reset(dev); + #endif /* Powering off a bridge may power off the whole hierarchy */ if (dev->current_state == PCI_D3cold) __pci_bus_set_current_state(dev->subordinate, PCI_D3cold, locked); @@ -4493,6 +4504,10 @@ int pcie_flr(struct pci_dev *dev) */ msleep(100); +#ifdef CONFIG_PCSC + pcsc_device_reset(dev); +#endif + return 
pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS); } EXPORT_SYMBOL_GPL(pcie_flr); @@ -4560,6 +4575,10 @@ static int pci_af_flr(struct pci_dev *dev, bool probe) */ msleep(100); +#ifdef CONFIG_PCSC + pcsc_device_reset(dev); +#endif + return pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS); } @@ -4605,6 +4624,10 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe) pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, csr); pci_dev_d3_sleep(dev); +#ifdef CONFIG_PCSC + pcsc_device_reset(dev); +#endif + return pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS); } @@ -4904,6 +4927,31 @@ int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type) PCIE_RESET_READY_POLL_MS - delay); } +#ifdef CONFIG_PCSC +/** + * pcsc_reset_bus_recursively - Recursively reset PCSC cache for all devices + * in bus hierarchy + * @bus: PCI bus to process + * + * Recursively invalidate PCSC cache for all devices on the given bus + * and all subordinate buses. 
+ */ +static void pcsc_reset_bus_recursively(struct pci_bus *bus) +{ + struct pci_dev *dev; + + if (!bus) + return; + + list_for_each_entry(dev, &bus->devices, bus_list) { + pcsc_device_reset(dev); + /* If this device is a bridge, recursively process its subordinate bus */ + if (dev->subordinate) + pcsc_reset_bus_recursively(dev->subordinate); + } +} +#endif + void pci_reset_secondary_bus(struct pci_dev *dev) { u16 ctrl; @@ -4920,6 +4968,10 @@ void pci_reset_secondary_bus(struct pci_dev *dev) ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET; pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl); + +#ifdef CONFIG_PCSC + pcsc_reset_bus_recursively(dev->subordinate); +#endif } void __weak pcibios_reset_secondary_bus(struct pci_dev *dev) @@ -5542,6 +5594,9 @@ static void pci_bus_restore_locked(struct pci_bus *bus) list_for_each_entry(dev, &bus->devices, bus_list) { pci_dev_restore(dev); +#ifdef CONFIG_PCSC + pcsc_device_reset(dev); +#endif if (dev->subordinate) { pci_bridge_wait_for_secondary_bus(dev, "bus reset"); pci_bus_restore_locked(dev->subordinate); @@ -5579,6 +5634,9 @@ static void pci_slot_restore_locked(struct pci_slot *slot) if (!dev->slot || dev->slot != slot) continue; pci_dev_restore(dev); +#ifdef CONFIG_PCSC + pcsc_device_reset(dev); +#endif if (dev->subordinate) { pci_bridge_wait_for_secondary_bus(dev, "slot reset"); pci_bus_restore_locked(dev->subordinate); diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c index 44d842733230..5412dea23446 100644 --- a/drivers/pci/pcsc.c +++ b/drivers/pci/pcsc.c @@ -837,6 +837,23 @@ int pcsc_cached_config_write(struct pci_bus *bus, unsigned int devfn, int where, } EXPORT_SYMBOL_GPL(pcsc_cached_config_write); +int pcsc_device_reset(struct pci_dev *dev) +{ + if (unlikely((!dev))) + return -EINVAL; + + if (unlikely(!pcsc_is_initialised())) + return 0; + + /* The layout of the CFG Space is not going to change after a device + * reset, whether the reset is FLR or conventional. 
Only the values + * are going to change. We could further optimise the cache to maintain + * some of the HWInt values that are going to remain constant after a reset. + */ + bitmap_zero(dev->pcsc->cached_bitmask, PCSC_CFG_SPC_SIZE); + return 0; +} + static struct pci_ops pcsc_ops = { .add_bus = pcsc_add_bus, .remove_bus = pcsc_remove_bus, diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index 6eb3d20386e9..97555fbba938 100644 --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -4239,8 +4239,13 @@ int pci_dev_specific_reset(struct pci_dev *dev, bool probe) if ((i->vendor == dev->vendor || i->vendor == (u16)PCI_ANY_ID) && (i->device == dev->device || - i->device == (u16)PCI_ANY_ID)) + i->device == (u16)PCI_ANY_ID)) { +#ifdef CONFIG_PCSC + if (!probe) + pcsc_device_reset(dev); +#endif return i->reset(dev, probe); + } } return -ENOTTY; diff --git a/include/linux/pcsc.h b/include/linux/pcsc.h index 516d73931608..85471273c0a9 100644 --- a/include/linux/pcsc.h +++ b/include/linux/pcsc.h @@ -121,4 +121,15 @@ int pcsc_remove_device(struct pci_dev *dev); */ bool pcsc_is_initialised(void); +/** + * pcsc_device_reset - Handle PCI device reset + * @dev: PCI device being reset + * + * This function should be called when a PCI device is being reset + * to ensure the cache is properly invalidated. + * + * Returns: 0 on success, negative error code on failure + */ +int pcsc_device_reset(struct pci_dev *dev); + #endif /* _LINUX_PCSC_H */ -- 2.47.3 Amazon Web Services Development Center Germany GmbH Tamara-Danz-Str. 
13 10243 Berlin Geschaeftsfuehrung: Christian Schlaeger Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B Sitz: Berlin Ust-ID: DE 365 538 597 From nobody Wed Dec 17 17:39:08 2025 From: Evangelos Petrongonas To: Bjorn Helgaas , Alex Williamson , "Rafael J . 
Wysocki" , "Len Brown"
CC: Evangelos Petrongonas , Pasha Tatashin , David Matlack , "Vipin Sharma" , Chris Li , Jason Miu , Pratyush Yadav , "Stanislav Spassov"
Subject: [RFC PATCH 07/13] pci: pcsc: introduce statistic gathering tools
Date: Fri, 3 Oct 2025 09:00:43 +0000
Message-ID: <762a6242ba9688aeb432c738e297cc8d039d0273.1759312886.git.epetron@amazon.de>
X-Mailing-List: linux-kernel@vger.kernel.org

Introduce optional statistics gathering for the PCI Configuration Space Cache to measure cache effectiveness and performance impact.

When CONFIG_PCSC_STATS is enabled, the implementation tracks:
- Cache hits and misses
- Uncacheable reads
- Write operations and cache invalidations
- Total reads and hardware reads
- Time spent in cache vs hardware accesses
- Number of Device Resets

Statistics are exposed via /sys/bus/pci/pcsc/stats in a human-readable format including calculated hit rates and access times in microseconds.

Signed-off-by: Evangelos Petrongonas
---
Documentation/ABI/testing/sysfs-bus-pci-pcsc | 9 +
drivers/pci/Kconfig | 7 +
drivers/pci/pcsc.c | 183 ++++++++++++++++++-
3 files changed, 196 insertions(+), 3 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci-pcsc b/Documentation/ABI/testing/sysfs-bus-pci-pcsc index ee92bf087816..daf0d06c89c8 100644 --- a/Documentation/ABI/testing/sysfs-bus-pci-pcsc +++ b/Documentation/ABI/testing/sysfs-bus-pci-pcsc @@ -18,3 +18,12 @@ Description: is utilizing the cache, while when on "0" the system bypasses it. This setting can also be controlled parameter. 
+ +What: /sys/bus/pci/pcsc/stats +Date: March 2025 +Contact: Evangelos Petrongonas +Description: + PCI Configuration Space Cache (PCSC) if the PCSC + Statistics are enabled via the PCSC_STATS + configuration option, the statistics can be recovered + via reading this sysfs. diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index c26162b58365..9b5275ef2d16 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -50,6 +50,13 @@ config PCSC intercepts configuration space operations and maintains cached copies of register values +config PCSC_STATS + bool "PCI Configuration Space Cache Statistics" + depends on PCSC + default n + help + This option allows the collection of statistics for the PCSC. + source "drivers/pci/pcie/Kconfig" config PCI_MSI diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c index 5412dea23446..304239b7ff8a 100644 --- a/drivers/pci/pcsc.c +++ b/drivers/pci/pcsc.c @@ -25,9 +25,84 @@ static int __init pcsc_enabled_setup(char *str) } __setup("pcsc_enabled=", pcsc_enabled_setup); +#ifdef CONFIG_PCSC_STATS +struct pcsc_stats { + /* Operation Counters */ + unsigned long cache_hits; + unsigned long cache_misses; + unsigned long uncachable_reads; + unsigned long writes; + unsigned long cache_invalidations; + unsigned long total_reads; + unsigned long hw_reads; + unsigned long device_resets; + u64 total_cache_access_time; /* in nanoseconds */ + u64 total_hw_access_time; /* in nanoseconds */ + u64 hw_access_time_due_to_misses; /* in nanoseconds */ +}; +#endif + static bool pcsc_initialised; static atomic_t num_nodes = ATOMIC_INIT(0); +#ifdef CONFIG_PCSC_STATS +struct pcsc_stats pcsc_stats; + +static inline void pcsc_count_cache_hit(void) +{ + pcsc_stats.cache_hits++; + pcsc_stats.total_reads++; +} + +static inline void pcsc_count_cache_miss(void) +{ + pcsc_stats.cache_misses++; + pcsc_stats.total_reads++; + pcsc_stats.hw_reads++; +} + +static inline void pcsc_count_uncachable_read(void) +{ + pcsc_stats.uncachable_reads++; + 
pcsc_stats.total_reads++; + pcsc_stats.hw_reads++; +} + +static inline void pcsc_count_write(void) +{ + pcsc_stats.writes++; +} + +static inline void pcsc_count_cache_invalidation(void) +{ + pcsc_stats.cache_invalidations++; +} + +static inline void pcsc_count_device_reset(void) +{ + pcsc_stats.device_resets++; +} +#else +static inline void pcsc_count_cache_hit(void) +{ +} +static inline void pcsc_count_cache_miss(void) +{ +} +static inline void pcsc_count_uncachable_read(void) +{ +} +static inline void pcsc_count_write(void) +{ +} +static inline void pcsc_count_cache_invalidation(void) +{ +} +static inline void pcsc_count_device_reset(void) +{ +} +#endif + inline bool pcsc_is_initialised(void) { return pcsc_initialised && pcsc_enabled; @@ -727,6 +802,10 @@ static int pcsc_get_and_insert_multiple(struct pci_dev= *dev, u32 word_cached =3D 0; u8 byte_val; int rc, i; +#ifdef CONFIG_PCSC_STATS + ktime_t start_time; + u64 duration; +#endif =20 if (WARN_ON(!dev || !bus || !word)) return -EINVAL; @@ -734,7 +813,6 @@ static int pcsc_get_and_insert_multiple(struct pci_dev = *dev, if (WARN_ON(size !=3D 1 && size !=3D 2 && size !=3D 4)) return -EINVAL; =20 - /* Check bounds */ if (where + size > PCSC_CFG_SPC_SIZE) return -EINVAL; =20 @@ -746,8 +824,17 @@ static int pcsc_get_and_insert_multiple(struct pci_dev= *dev, pcsc_get_byte(dev, where + i, &byte_val); word_cached |=3D ((u32)byte_val << (i * 8)); } + pcsc_count_cache_hit(); } else { +#ifdef CONFIG_PCSC_STATS + start_time =3D ktime_get(); +#endif rc =3D pcsc_hw_config_read(bus, devfn, where, size, &word_cached); +#ifdef CONFIG_PCSC_STATS + duration =3D ktime_to_ns(ktime_sub(ktime_get(), start_time)); + pcsc_stats.hw_access_time_due_to_misses +=3D duration; + pcsc_stats.total_hw_access_time +=3D duration; +#endif if (rc) { pci_err(dev, "%s: Failed to read CFG Space where=3D%d size=3D%d", @@ -762,6 +849,7 @@ static int pcsc_get_and_insert_multiple(struct pci_dev = *dev, byte_val =3D (word_cached >> (i * 8)) & 0xFF; 
 			pcsc_update_byte(dev, where + i, byte_val);
 		}
+		pcsc_count_cache_miss();
 	}
 
 	*word = word_cached;
@@ -773,6 +861,17 @@ int pcsc_cached_config_read(struct pci_bus *bus, unsigned int devfn, int where,
 {
 	int rc;
 	struct pci_dev *dev;
+#ifdef CONFIG_PCSC_STATS
+	ktime_t hw_start_time;
+	u64 hw_duration;
+#endif
+
+#ifdef CONFIG_PCSC_STATS
+	u64 duration;
+	ktime_t start_time;
+
+	start_time = ktime_get();
+#endif
 
 	if (unlikely(!pcsc_is_initialised()))
 		goto read_from_dev;
@@ -790,6 +889,10 @@ int pcsc_cached_config_read(struct pci_bus *bus, unsigned int devfn, int where,
 	    pcsc_is_access_cacheable(dev, where, size)) {
 		rc = pcsc_get_and_insert_multiple(dev, bus, devfn, where, val,
 						  size);
+#ifdef CONFIG_PCSC_STATS
+		duration = ktime_to_ns(ktime_sub(ktime_get(), start_time));
+		pcsc_stats.total_cache_access_time += duration;
+#endif
 		if (likely(!rc)) {
 			pci_dev_put(dev);
 			return 0;
@@ -797,11 +900,23 @@ int pcsc_cached_config_read(struct pci_bus *bus, unsigned int devfn, int where,
 		/* if reading from the cache failed continue and try reading
 		 * from the actual device
 		 */
+	} else {
+		if (dev->hdr_type == PCI_HEADER_TYPE_NORMAL)
+			pcsc_count_uncachable_read();
 	}
 read_from_dev:
+#ifdef CONFIG_PCSC_STATS
+	hw_start_time = ktime_get();
+#endif
 	if (dev)
 		pci_dev_put(dev);
-	return pcsc_hw_config_read(bus, devfn, where, size, val);
+	rc = pcsc_hw_config_read(bus, devfn, where, size, val);
+#ifdef CONFIG_PCSC_STATS
+	hw_duration = ktime_to_ns(ktime_sub(ktime_get(), hw_start_time));
+	/* Add timing for uncacheable reads */
+	pcsc_stats.total_hw_access_time += hw_duration;
+#endif
+	return rc;
 }
 EXPORT_SYMBOL_GPL(pcsc_cached_config_read);
 
@@ -810,6 +925,11 @@ int pcsc_cached_config_write(struct pci_bus *bus, unsigned int devfn, int where,
 {
 	int i;
 	struct pci_dev *dev;
+	int rc;
+#ifdef CONFIG_PCSC_STATS
+	ktime_t hw_start_time;
+	u64 hw_duration;
+#endif
 
 	if (unlikely(!pcsc_is_initialised()))
 		goto write_to_dev;
@@ -828,12 +948,22 @@ int pcsc_cached_config_write(struct pci_bus *bus, unsigned int devfn, int where,
 		if (pcsc_is_access_cacheable(dev, where, size)) {
 			for (i = 0; i < size; i++)
 				pcsc_set_cached(dev, where + i, false);
+			pcsc_count_cache_invalidation();
 		}
 	}
 write_to_dev:
+	pcsc_count_write();
 	if (dev)
 		pci_dev_put(dev);
-	return pcsc_hw_config_write(bus, devfn, where, size, val);
+#ifdef CONFIG_PCSC_STATS
+	hw_start_time = ktime_get();
+#endif
+	rc = pcsc_hw_config_write(bus, devfn, where, size, val);
+#ifdef CONFIG_PCSC_STATS
+	hw_duration = ktime_to_ns(ktime_sub(ktime_get(), hw_start_time));
+	pcsc_stats.total_hw_access_time += hw_duration;
+#endif
+	return rc;
 }
 EXPORT_SYMBOL_GPL(pcsc_cached_config_write);
 
@@ -851,6 +981,7 @@ int pcsc_device_reset(struct pci_dev *dev)
 	 * some of the HWInt values that are going to remain constant after a reset.
 	 */
 	bitmap_zero(dev->pcsc->cached_bitmask, PCSC_CFG_SPC_SIZE);
+	pcsc_count_device_reset();
 	return 0;
 }
 
@@ -948,8 +1079,50 @@ static ssize_t pcsc_enabled_store(struct kobject *kobj,
 static struct kobj_attribute pcsc_enabled_attribute =
 	__ATTR(enabled, 0644, pcsc_enabled_show, pcsc_enabled_store);
 
+#ifdef CONFIG_PCSC_STATS
+static ssize_t pcsc_stats_show(struct kobject *kobj,
+			       struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(
+		buf,
+		"Cache Hits: %lu\n"
+		"Cache Misses: %lu\n"
+		"Uncachable Reads: %lu\n"
+		"Writes: %lu\n"
+		"Cache Invalidations: %lu\n"
+		"Device Resets: %lu\n"
+		"Total Reads: %lu\n"
+		"Hardware Reads: %lu\n"
+		"Hit Rate: %lu%%\n"
+		"Total Cache Access Time: %llu us\n"
+		"Cache Access Time (without HW reads due to Misses): %llu us\n"
+		"HW Access Time due to misses: %llu us\n"
+		"Total Hardware Access Time: %llu us\n",
+		pcsc_stats.cache_hits, pcsc_stats.cache_misses,
+		pcsc_stats.uncachable_reads, pcsc_stats.writes,
+		pcsc_stats.cache_invalidations, pcsc_stats.device_resets,
+		pcsc_stats.total_reads,
+		pcsc_stats.hw_reads,
+		pcsc_stats.total_reads ?
+			(pcsc_stats.cache_hits * 100) / pcsc_stats.total_reads :
+			0,
+		pcsc_stats.total_cache_access_time / 1000,
+		(pcsc_stats.total_cache_access_time -
+		 pcsc_stats.hw_access_time_due_to_misses) /
+			1000,
+		pcsc_stats.hw_access_time_due_to_misses / 1000,
+		pcsc_stats.total_hw_access_time / 1000);
+}
+
+static struct kobj_attribute pcsc_stats_attribute =
+	__ATTR(stats, 0444, pcsc_stats_show, NULL);
+#endif
+
 static struct attribute *pcsc_attrs[] = {
 	&pcsc_enabled_attribute.attr,
+#ifdef CONFIG_PCSC_STATS
+	&pcsc_stats_attribute.attr,
+#endif
 	NULL,
 };
 
@@ -995,6 +1168,10 @@ static int __init pcsc_init(void)
 	/* Try to create sysfs entry, but don't fail if PCI bus isn't ready yet */
 	pcsc_create_sysfs();
 
+#ifdef CONFIG_PCSC_STATS
+	memset(&pcsc_stats, 0, sizeof(pcsc_stats));
+#endif
+
 	pcsc_initialised = true;
 	pr_info("initialised (enabled=%d)\n", pcsc_enabled);
 
-- 
2.47.3

Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Managing Director: Christian Schlaeger
Registered at the Charlottenburg Local Court under HRB 257764 B
Registered office: Berlin
VAT ID: DE 365 538 597

From nobody Wed Dec 17 17:39:08 2025
From: Evangelos Petrongonas
To: Bjorn Helgaas, Alex Williamson, "Rafael J . Wysocki", "Len Brown"
CC: Evangelos Petrongonas, Pasha Tatashin, David Matlack, "Vipin Sharma",
 Chris Li, Jason Miu, Pratyush Yadav, "Stanislav Spassov"
Subject: [RFC PATCH 08/13] pci: Save only spec-defined configuration space
Date: Fri, 3 Oct 2025 09:00:44 +0000
Message-ID: <93623324232f4ec4dcda830d497ac2890b19215f.1759312886.git.epetron@amazon.de>
X-Mailing-List: linux-kernel@vger.kernel.org

Change the PCI configuration space save/restore operations to save only
the regions defined by the PCI specification, avoiding any potential
side effects of undefined behaviour.

The current implementation saves the entire configuration space for
device restore operations, including reserved and undefined regions.
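
As a rough illustration of what "spec-defined regions" means for a type 0
header, the following userspace sketch lists the accesses that cover only
defined fields. struct cfg_region, TYPE0_MAX_REGIONS and both helper
functions are hypothetical names made up for this example, not kernel
APIs. The offsets follow the standard type 0 header layout: dwords at
0x00-0x30, the capabilities pointer byte at 0x34 and the interrupt dword
at 0x3c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor for one config space access (illustration only). */
struct cfg_region {
	unsigned int offset;	/* byte offset into configuration space */
	unsigned int size;	/* access width in bytes: 1, 2 or 4 */
};

#define TYPE0_MAX_REGIONS 15

/* Fill `out` with the accesses that touch only defined type 0 fields. */
static size_t type0_defined_regions(struct cfg_region *out, size_t max)
{
	size_t n = 0;
	unsigned int i;

	assert(max >= TYPE0_MAX_REGIONS);

	/* Dwords 0..12 cover offsets 0x00-0x33. */
	for (i = 0; i < 13; i++) {
		out[n].offset = i * 4;
		out[n].size = 4;
		n++;
	}

	/* Capabilities pointer: one byte at 0x34; 0x35-0x3b are reserved. */
	out[n].offset = 0x34;
	out[n].size = 1;
	n++;

	/* Interrupt line and pin, Min_Gnt and Max_Lat share the dword at 0x3c. */
	out[n].offset = 0x3c;
	out[n].size = 4;
	n++;

	return n;
}

/* Sum the number of bytes the listed accesses cover. */
static unsigned int region_bytes(const struct cfg_region *r, size_t n)
{
	unsigned int total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += r[i].size;
	return total;
}
```

Under these assumptions the table holds 15 accesses covering 57 of the 64
header bytes; the remaining 7 bytes (0x35-0x3b) are reserved and are never
read.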

This change modifies the save logic to save only architecturally defined
configuration space regions and to skip the undefined areas.

This benefits the PCSC hit rate: a 4-byte access to a region where only
2 bytes are cacheable and the other 2 are undefined, and therefore
uncached, leads to a HW access instead of a cache hit.

Signed-off-by: Evangelos Petrongonas
---
 drivers/pci/pci.c | 61 +++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 56 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index db940f8fd408..3e99baaaf8cd 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1752,11 +1752,62 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
 int pci_save_state(struct pci_dev *dev)
 {
 	int i;
-	/* XXX: 100% dword access ok here? */
-	for (i = 0; i < 16; i++) {
-		pci_read_config_dword(dev, i * 4, &dev->saved_config_space[i]);
-		pci_dbg(dev, "save config %#04x: %#010x\n",
-			i * 4, dev->saved_config_space[i]);
+
+	if (dev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
+		for (i = 0; i < 13; i++) {
+			pci_read_config_dword(dev, i * 4,
+					      &dev->saved_config_space[i]);
+			pci_dbg(dev,
+				"saving config space at offset %#x (reading %#x)\n",
+				i * 4, dev->saved_config_space[i]);
+		}
+		pci_read_config_byte(
+			dev, PCI_CAPABILITY_LIST,
+			(u8 *)(&dev->saved_config_space[PCI_CAPABILITY_LIST /
+							4]) +
+				(PCI_CAPABILITY_LIST % 4));
+		pci_dbg(dev,
+			"saving config space at offset %#x (reading %#x)\n",
+			PCI_CAPABILITY_LIST,
+			dev->saved_config_space[PCI_CAPABILITY_LIST / 4]);
+		pci_read_config_dword(
+			dev, PCI_INTERRUPT_LINE,
+			&dev->saved_config_space[PCI_INTERRUPT_LINE / 4]);
+		pci_dbg(dev,
+			"saving config space at offset %#x (reading %#x)\n",
+			PCI_INTERRUPT_LINE,
+			dev->saved_config_space[PCI_INTERRUPT_LINE / 4]);
+	} else if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+		for (i = 0; i < 13; i++) {
+			pci_read_config_dword(dev, i * 4,
+					      &dev->saved_config_space[i]);
+			pci_dbg(dev,
+				"saving config space at offset %#x (reading %#x)\n",
+				i * 4, dev->saved_config_space[i]);
+		}
+		pci_read_config_byte(
+			dev, PCI_CAPABILITY_LIST,
+			(u8 *)(&dev->saved_config_space[PCI_CAPABILITY_LIST /
+							4]) +
+				(PCI_CAPABILITY_LIST % 4));
+		pci_dbg(dev,
+			"saving config space at offset %#x (reading %#x)\n",
+			PCI_CAPABILITY_LIST,
+			dev->saved_config_space[PCI_CAPABILITY_LIST / 4]);
+		for (i = 14; i < 16; i++) {
+			pci_read_config_dword(dev, i * 4,
+					      &dev->saved_config_space[i]);
+			pci_dbg(dev,
+				"saving config space at offset %#x (reading %#x)\n",
+				i * 4, dev->saved_config_space[i]);
+		}
+	} else {
+		for (i = 0; i < 16; i++) {
+			pci_read_config_dword(dev, i * 4,
+					      &dev->saved_config_space[i]);
+			pci_dbg(dev, "save config %#04x: %#010x\n", i * 4,
+				dev->saved_config_space[i]);
+		}
 	}
 	dev->state_saved = true;
 
-- 
2.47.3

From nobody Wed Dec 17 17:39:08 2025
From: Evangelos Petrongonas
To: Bjorn Helgaas, Alex Williamson, "Rafael J . Wysocki", "Len Brown"
CC: Evangelos Petrongonas, Pasha Tatashin, David Matlack, "Vipin Sharma",
 Chris Li, Jason Miu, Pratyush Yadav, "Stanislav Spassov"
Subject: [RFC PATCH 09/13] vfio: pci: Fill only spec-defined configuration space regions
Date: Fri, 3 Oct 2025 09:00:45 +0000
X-Mailing-List: linux-kernel@vger.kernel.org

Amend VFIO PCI configuration space initialization by filling only the
regions defined by the PCI specification, avoiding unnecessary reads
from undefined or reserved areas.

The current implementation reads the entire configuration space during
initialization, including reserved regions that may not be implemented
by the device or may have side effects when accessed. This change
modifies vfio_fill_vconfig_bytes() to skip reserved regions in the
standard configuration space header.
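
The stepping logic this implies can be sketched in userspace as follows.
This is only a model of the loop, not the VFIO code: enum fill_access,
struct fill_step, next_fill_step() and count_hw_reads() are hypothetical
names, the word and byte fallbacks are assumed to mirror the existing
function, and the final clamp to the remaining size is an assumption of
this sketch rather than something taken from the patch.

```c
#include <assert.h>

#define CAP_PTR_OFFSET	0x34	/* PCI_CAPABILITY_LIST */
#define RESERVED_OFFSET	0x38	/* reserved dword of a type 0 header */

/* Hypothetical names for the kind of access one loop step performs. */
enum fill_access { FILL_SKIP, FILL_BYTE, FILL_WORD, FILL_DWORD };

struct fill_step {
	enum fill_access access;	/* hardware access to issue, if any */
	int filled;			/* bytes of vconfig consumed by the step */
};

/* Decide the next step for the remaining range [offset, offset + size). */
static struct fill_step next_fill_step(int offset, int size)
{
	struct fill_step s;

	assert(size > 0);

	if (offset == CAP_PTR_OFFSET) {
		/* Read only the pointer byte, then step over 0x35-0x37. */
		s.access = FILL_BYTE;
		s.filled = 4;
	} else if (offset == RESERVED_OFFSET) {
		/* Nothing defined here: no hardware access at all. */
		s.access = FILL_SKIP;
		s.filled = 4;
	} else if (size >= 4 && !(offset % 4)) {
		s.access = FILL_DWORD;
		s.filled = 4;
	} else if (size >= 2 && !(offset % 2)) {
		s.access = FILL_WORD;
		s.filled = 2;
	} else {
		s.access = FILL_BYTE;
		s.filled = 1;
	}

	/* Assumption of this sketch: never step past the requested range. */
	if (s.filled > size)
		s.filled = size;
	return s;
}

/* Count the hardware reads needed to fill [offset, offset + size). */
static int count_hw_reads(int offset, int size)
{
	int reads = 0;

	while (size > 0) {
		struct fill_step s = next_fill_step(offset, size);

		if (s.access != FILL_SKIP)
			reads++;
		offset += s.filled;
		size -= s.filled;
	}
	return reads;
}
```

In this model, filling the 64-byte standard header takes 15 hardware
reads (13 dwords, one byte at 0x34 and one dword at 0x3c) instead of 16
unconditional dword reads, and none of them touches a reserved byte.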

This benefits the PCSC hit rate: a 4-byte access to a region where only
2 bytes are cacheable and the other 2 are undefined, and therefore
uncached, leads to a HW access instead of a cache hit.

Signed-off-by: Evangelos Petrongonas
---
 drivers/vfio/pci/vfio_pci_config.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 8f02f236b5b4..4fc7156a77d1 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1485,7 +1485,18 @@ static int vfio_fill_vconfig_bytes(struct vfio_pci_core_device *vdev,
 	while (size) {
 		int filled;
 
-		if (size >= 4 && !(offset % 4)) {
+		if (offset == PCI_CAPABILITY_LIST) {
+			u8 *byte = &vdev->vconfig[offset];
+
+			ret = pci_read_config_byte(pdev, offset, byte);
+			if (ret)
+				return ret;
+			/* Skip the reserved area */
+			filled = 4;
+		} else if (offset == 0x38) {
+			/* Skip the reserved area */
+			filled = 4;
+		} else if (size >= 4 && !(offset % 4)) {
 			__le32 *dwordp = (__le32 *)&vdev->vconfig[offset];
 			u32 dword;
 
-- 
2.47.3

From nobody Wed Dec 17 17:39:08 2025
From: Evangelos Petrongonas
To: Bjorn Helgaas, Alex Williamson, "Rafael J . Wysocki", "Len Brown"
CC: Evangelos Petrongonas, Pasha Tatashin, David Matlack, "Vipin Sharma",
 Chris Li, Jason Miu, Pratyush Yadav, "Stanislav Spassov"
Subject: [RFC PATCH 10/13] pci: pcsc: Use contiguous pages for the cache data
Date: Fri, 3 Oct 2025 09:00:46 +0000
Message-ID: <914285f708739992f14c16fb3be336c6e8afed52.1759312886.git.epetron@amazon.de>
X-Mailing-List: linux-kernel@vger.kernel.org

This patch refactors PCSC to use a single contiguous memory block for
all per-device data. This improves memory allocation efficiency.

This is a preparatory step for KHO persistence support, as physically
contiguous pages are easier to manipulate.

Signed-off-by: Evangelos Petrongonas
---
 drivers/pci/pcsc.c   | 28 ++++++++++++++++++++++++----
 include/linux/pcsc.h | 32 ++++++++++++++++++++++++++++++--
 2 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
index 304239b7ff8a..18d508f76649 100644
--- a/drivers/pci/pcsc.c
+++ b/drivers/pci/pcsc.c
@@ -725,6 +725,7 @@ int pcsc_add_device(struct pci_dev *dev)
 {
 	struct pcsc_node *node;
 	struct pci_bus *bus;
+	size_t data_size;
 
 	if (WARN_ON(!dev))
 		return -EINVAL;
@@ -741,12 +742,27 @@ int pcsc_add_device(struct pci_dev *dev)
 	 * nodes for these devices, as it simplifies the code flow
 	 */
 	if (dev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
-		dev->pcsc->cfg_space = kzalloc(PCSC_CFG_SPC_SIZE, GFP_KERNEL);
-		if (!dev->pcsc->cfg_space)
+		/* Allocate contiguous, page aligned data block. This will be
+		 * needed for persisting the data with KHO.
+		 */
+		data_size = sizeof(struct pcsc_data);
+
+		dev->pcsc->data =
+			(struct pcsc_data *)__get_free_pages(
+				GFP_KERNEL | __GFP_ZERO, get_order(data_size));
+		if (!dev->pcsc->data)
+
 			goto err_free_node;
 
+		dev->pcsc->cachable_bitmask = dev->pcsc->data->cachable_bitmask;
+		dev->pcsc->cached_bitmask = dev->pcsc->data->cached_bitmask;
+		dev->pcsc->cfg_space = dev->pcsc->data->cfg_space;
+
 		infer_cacheability(dev);
 	} else {
+		dev->pcsc->data = NULL;
+		dev->pcsc->cachable_bitmask = NULL;
+		dev->pcsc->cached_bitmask = NULL;
 		dev->pcsc->cfg_space = NULL;
 	}
 
@@ -771,8 +787,12 @@ int pcsc_remove_device(struct pci_dev *dev)
 
 	atomic_dec(&num_nodes);
 
-	if (dev->pcsc && dev->pcsc->cfg_space) {
-		kfree(dev->pcsc->cfg_space);
+	if (dev->pcsc && dev->pcsc->data) {
+		size_t data_size = sizeof(struct pcsc_data);
+		size_t total_size = PAGE_ALIGN(data_size);
+
+		free_pages((unsigned long)dev->pcsc->data,
+			   get_order(total_size));
 		kfree(dev->pcsc);
 	}
 	dev->pcsc = NULL;
diff --git a/include/linux/pcsc.h b/include/linux/pcsc.h
index 85471273c0a9..88894f641257 100644
--- a/include/linux/pcsc.h
+++ b/include/linux/pcsc.h
@@ -18,12 +18,40 @@
 #define PCSC_CFG_SPC_SIZE 256
 #endif
 
-struct pcsc_node {
-	u8 *cfg_space;
+/*
+ * struct pcsc_data - Contiguous data block for PCSC
+ *
+ * This structure contains all the PCSC data in a single contiguous
+ * memory block.
+ *
+ * @cfg_space: Configuration space cache
+ * @cachable_bitmask: Bitmap of cacheable configuration space offsets
+ * @cached_bitmask: Bitmap of cached configuration space offsets
+ */
+struct pcsc_data {
+	u8 cfg_space[PCSC_CFG_SPC_SIZE];
 	DECLARE_BITMAP(cachable_bitmask, PCSC_CFG_SPC_SIZE);
 	DECLARE_BITMAP(cached_bitmask, PCSC_CFG_SPC_SIZE);
 };
 
+/*
+ * struct pcsc_node - PCSC node structure
+ *
+ * This structure represents a PCSC node for a PCI device.
+ * It contains pointers into the data block for convenient access.
+ *
+ * @data: Pointer to the contiguous data block
+ * @cachable_bitmask: Pointer to cachable_bitmask in data
+ * @cached_bitmask: Pointer to cached_bitmask in data
+ * @cfg_space: Pointer to cfg_space in data
+ */
+struct pcsc_node {
+	struct pcsc_data *data; /* Pointer to contiguous data block */
+	unsigned long *cachable_bitmask; /* Convenience pointer into data */
+	unsigned long *cached_bitmask; /* Convenience pointer into data */
+	u8 *cfg_space; /* Convenience pointer into data */
+};
+
 /**
  * pcsc_hw_config_read - Direct hardware PCI config space read
  * @bus: PCI bus
-- 
2.47.3

Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Managing Director: Christian Schlaeger
Registered at the Charlottenburg Local Court under HRB 257764 B
Registered office: Berlin
VAT ID: DE 365 538 597

From nobody Wed Dec 17 17:39:08 2025
From: Evangelos Petrongonas
To: Bjorn Helgaas, Alex Williamson, "Rafael J. Wysocki", "Len Brown"
CC: Evangelos Petrongonas, Pasha Tatashin, David Matlack, "Vipin Sharma", Chris Li, Jason Miu, Pratyush Yadav, "Stanislav Spassov"
Subject: [RFC PATCH 11/13] pci: pcsc: Add kexec persistence support via KHO
Date: Fri, 3 Oct 2025 09:00:47 +0000
X-Mailer: git-send-email 2.47.3
X-Mailing-List: linux-kernel@vger.kernel.org

Add support for preserving PCI Configuration Space Cache (PCSC) data
across kexec operations using the Kexec Handover (KHO) framework. This
allows the cached PCI configuration data to survive kexec, eliminating
the need to re-probe PCI configuration space after kexec, which can
significantly reduce boot time on systems with many PCI devices.

To enable PCSC persistence, the kernel must be built with
CONFIG_PCSC_KHO enabled, which depends on both `CONFIG_PCSC` and
`CONFIG_KEXEC_HANDOVER`. When enabled, persistence can be controlled at
boot time using the 'pcsc_persistence_enabled' kernel parameter. By
default, persistence is disabled, but it can be enabled by passing
'pcsc_persistence_enabled=1' on the kernel command line.

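As a rough illustration of how the parameter value is interpreted, the
userspace sketch below mimics the shape of a `__setup` handler that
parses a boolean. `parse_bool_param()` and `persistence_setup()` are
hypothetical stand-ins, not kernel functions, and the parser covers only
a subset of the spellings that the kernel's kstrtobool() accepts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, simplified stand-in for boolean parameter parsing.
 * Accepts 1/y/Y/on as true and 0/n/N/off as false.
 * Returns 0 on success and -1 if the string is not recognised.
 */
static int parse_bool_param(const char *s, bool *res)
{
	assert(res != NULL);

	if (!s)
		return -1;

	switch (s[0]) {
	case 'y': case 'Y': case '1':
		*res = true;
		return 0;
	case 'n': case 'N': case '0':
		*res = false;
		return 0;
	case 'o': case 'O':
		if (s[1] == 'n' || s[1] == 'N') {
			*res = true;
			return 0;
		}
		if (s[1] == 'f' || s[1] == 'F') {
			*res = false;
			return 0;
		}
		return -1;
	default:
		return -1;
	}
}

/* Persistence stays disabled unless the parameter enables it. */
static bool persistence_enabled;

/* Returns 1 when the value was understood, 0 otherwise. */
static int persistence_setup(const char *str)
{
	return parse_bool_param(str, &persistence_enabled) == 0;
}
```

With this sketch, `persistence_setup("1")` turns the flag on, while
`persistence_setup("0")` leaves it off. An unrecognised value is
rejected without changing the flag.
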
During kexec preparation, the implementation iterates through all PCI
devices and saves the PCSC data for endpoint devices (header type 0).
It creates a Flattened Device Tree (FDT) structure containing device
information and physical addresses of the preserved data. The physical
memory pages containing PCSC data are preserved through KHO, and the
FDT is added to the KHO tree for the new kernel to discover.

After kexec, during PCI device initialization, the implementation
checks if KHO data is available for each device being initialized. If
found, it restores the cached configuration space data, avoiding the
need to re-probe the device.

Timing statistics are collected and reported, showing both the time
taken to save devices during kexec and the time spent restoring them in
the new kernel. This helps quantify the boot time improvements.

The implementation handles error cases gracefully, falling back to
normal PCSC initialization if KHO data is missing or fails validation.
This ensures that the system remains functional even if persistence
cannot be achieved.

The time complexity of this implementation is O(n^2), where n is the
number of restored devices, as for every device the FDT needs to be
traversed again. This will be improved in a future patch.

Signed-off-by: Evangelos Petrongonas
---
 .../admin-guide/kernel-parameters.txt |   4 +
 drivers/pci/Kconfig                   |  10 +
 drivers/pci/pcsc.c                    | 389 +++++++++++++++++-
 3 files changed, 386 insertions(+), 17 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 08c7a13f107c..39f71e27df2d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5039,6 +5039,10 @@
 	pcsc_enabled=	[PCSC] enable the use of the PCI Configuration Space
 			Cache (PCSC).
 
+	pcsc_persistence_enabled= [PCSC] enable the persistence over kexec
+			using KHO of the PCI Configuration Space Cache Data. For more
+			information see drivers/pci/pcsc.c
+
 	pd_ignore_unused
 			[PM]
 			Keep all power-domains already enabled by bootloader on,
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 9b5275ef2d16..0eb189ad526b 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -57,6 +57,16 @@ config PCSC_STATS
 	help
 	  This option allows the collection of statistics for the PCSC.
 
+config PCSC_KHO
+	bool "PCI Configuration Space Cache persist data over kexec"
+	depends on PCSC
+	depends on KEXEC_HANDOVER
+	default n
+	help
+	  This option enables the persistence of the cache data over kexec
+	  using Kexec Handover (KHO). For more information, check
+	  `drivers/pci/pcsc.c'
+
 source "drivers/pci/pcie/Kconfig"
 
 config PCI_MSI
diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
index 18d508f76649..0c4ae73744d6 100644
--- a/drivers/pci/pcsc.c
+++ b/drivers/pci/pcsc.c
@@ -17,6 +17,11 @@
 #include
 #include
 #include
+#include
+#include
+#include
+#include
+#include
 
 static bool pcsc_enabled;
 static int __init pcsc_enabled_setup(char *str)
@@ -25,6 +30,16 @@ static int __init pcsc_enabled_setup(char *str)
 }
 __setup("pcsc_enabled=", pcsc_enabled_setup);
 
+static bool pcsc_persistence_enabled;
+static int __init pcsc_persistence_enabled_setup(char *str)
+{
+	return kstrtobool(str, &pcsc_persistence_enabled) == 0;
+}
+__setup("pcsc_persistence_enabled=", pcsc_persistence_enabled_setup);
+
+#define PCSC_KHO_FDT "pcsc"
+#define PCSC_KHO_NODE_COMPATIBLE "pcsc-v1"
+
 #ifdef CONFIG_PCSC_STATS
 struct pcsc_stats {
 	/* Operation Counters */
@@ -39,6 +54,10 @@ struct pcsc_stats {
 	u64 total_cache_access_time; /* in milliseconds */
 	u64 total_hw_access_time; /* in milliseconds */
 	u64 hw_access_time_due_to_misses; /* in milliseconds */
+#ifdef CONFIG_PCSC_KHO
+	u64 pcsc_kho_total_restore_time_ns;
+	u32 pcsc_kho_restored_device_count;
+#endif
 };
 #endif
 
@@ -82,6 +101,12 @@ static inline void pcsc_count_device_reset(void)
 {
 	pcsc_stats.device_resets++;
 }
+#ifdef CONFIG_PCSC_KHO
+static inline void pcsc_count_restored_devices(void)
+{
+	pcsc_stats.pcsc_kho_restored_device_count++;
+}
+#endif
 #else
 static inline void pcsc_count_cache_hit(void)
 {
@@ -101,6 +126,11 @@ static inline void pcsc_count_cache_invalidation(void)
 static inline void pcsc_count_device_reset(void)
 {
 }
+#ifdef CONFIG_PCSC_KHO
+static inline void pcsc_count_restored_devices(void)
+{
+}
+#endif
 #endif
 
 inline bool pcsc_is_initialised(void)
@@ -721,6 +751,288 @@ static void infer_cacheability(struct pci_dev *dev)
 	}
 }
 
+#ifdef CONFIG_PCSC_KHO
+static struct page *pcsc_kho_fdt;
+static int pcsc_kho_fdt_order;
+
+static int pcsc_kho_save_device(struct pci_dev *dev, void *fdt)
+{
+	char node_name[32];
+	size_t data_size, total_size;
+	u64 data_addr;
+	int err = 0;
+
+	if (!dev->pcsc || !dev->pcsc->data)
+		return 1;
+
+	if (dev->hdr_type != PCI_HEADER_TYPE_NORMAL)
+		return 1;
+
+	/* Create FDT node for this device - node name contains device identifier */
+	snprintf(node_name, sizeof(node_name), "dev_%04x_%02x_%02x_%x",
+		 pci_domain_nr(dev->bus), dev->bus->number,
+		 PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn));
+
+	err = fdt_begin_node(fdt, node_name);
+	if (err) {
+		pci_err(dev, "PCSC: Failed to begin FDT node '%s': %d\n",
+			node_name, err);
+		return err;
+	}
+
+	data_size = sizeof(struct pcsc_data);
+	total_size = PAGE_ALIGN(data_size);
+
+	data_addr = virt_to_phys(dev->pcsc->data);
+	err = kho_preserve_phys(data_addr, total_size);
+	if (err) {
+		pci_err(dev, "PCSC: Failed to preserve data buffer: %d\n", err);
+		return err;
+	}
+
+	err = fdt_property(fdt, "da", &data_addr, sizeof(data_addr));
+	if (err) {
+		pci_err(dev, "PCSC: Failed to set da property: %d\n",
+			err);
+		return err;
+	}
+
+	err = fdt_end_node(fdt);
+	if (err) {
+		pci_err(dev, "PCSC: Failed to end FDT node: %d\n", err);
+		return err;
+	}
+
+	return 0;
+}
+
+static int pcsc_kho_notifier(struct notifier_block *self, unsigned long cmd,
+			     void *v)
+{
+	struct kho_serialization *ser = v;
+	struct pci_dev *dev = NULL;
+	void *fdt;
+	int err = 0;
+	size_t fdt_size;
+	u32 dev_count = 0;
+	u32 eligible_count = 0;
+	u32 saved_count = 0;
+	u32 skipped_count = 0;
+
+	switch (cmd) {
+	case KEXEC_KHO_ABORT:
+		if (pcsc_kho_fdt) {
+			__free_pages(pcsc_kho_fdt, pcsc_kho_fdt_order);
+			pcsc_kho_fdt = NULL;
+		}
+		return NOTIFY_DONE;
+	case KEXEC_KHO_FINALIZE:
+		/* Handled below */
+		break;
+	default:
+		return NOTIFY_BAD;
+	}
+
+#ifdef CONFIG_PCSC_STATS
+	ktime_t start_time = ktime_get();
+#endif
+
+	for_each_pci_dev(dev) {
+		dev_count++;
+		if (dev->pcsc && dev->pcsc->cfg_space &&
+		    dev->hdr_type == PCI_HEADER_TYPE_NORMAL)
+			eligible_count++;
+	}
+
+	pr_info("Total PCI devices: %u, eligible for save: %u\n",
+		dev_count, eligible_count);
+
+	if (eligible_count == 0)
+		return NOTIFY_DONE;
+
+	/* Allocate FDT with size calculation (conservative estimates):
+	 * - Per device: node_name(~20) + node_overhead(~12) + da_property(~20)
+	 *   = ~52 bytes, round up to 64 for alignment/margin
+	 * - Fixed overhead: header(40) + root_node(~40) + strings_table(~30)
+	 *   + misc(~32) = ~144 bytes, round up to 256
+	 */
+	fdt_size = PAGE_ALIGN((eligible_count * 64 + 256));
+	pcsc_kho_fdt_order = get_order(fdt_size);
+	pcsc_kho_fdt = alloc_pages(GFP_KERNEL, pcsc_kho_fdt_order);
+	if (!pcsc_kho_fdt) {
+		pr_err("PCSC: Failed to allocate FDT pages (size=%zu, order=%d)\n",
+		       fdt_size, pcsc_kho_fdt_order);
+		return NOTIFY_BAD;
+	}
+
+	fdt = page_to_virt(pcsc_kho_fdt);
+
+	/* Create FDT */
+	err = fdt_create(fdt, fdt_size);
+	if (err) {
+		pr_err("PCSC: Failed to create FDT: %d\n", err);
+		goto error_cleanup;
+	}
+
+	err = fdt_finish_reservemap(fdt);
+	if (err) {
+		pr_err("PCSC: Failed to finish FDT reservemap: %d\n", err);
+		goto error_cleanup;
+	}
+
+	err = fdt_begin_node(fdt, "");
+	if (err) {
+		pr_err("PCSC: Failed to begin root FDT node: %d\n", err);
+		goto error_cleanup;
+	}
+
+	err = fdt_property_string(fdt, "compatible", PCSC_KHO_NODE_COMPATIBLE);
+	if (err) {
+		pr_err("PCSC: Failed to set compatible property: %d\n", err);
+		goto error_cleanup;
+	}
+
+	for_each_pci_dev(dev) {
+		int save_err = pcsc_kho_save_device(dev, fdt);
+
+		if (save_err == 0) {
+			saved_count++;
+		} else if (save_err == 1) {
+			/* Skipped (not eligible) */
+			skipped_count++;
+		} else {
+			pr_err("Failed to save device %04x:%02x:%02x.%d: %d\n",
+			       pci_domain_nr(dev->bus), dev->bus->number,
+			       PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn),
+			       save_err);
+			break;
+		}
+	}
+
+	err = fdt_end_node(fdt);
+	if (err) {
+		pr_err("Failed to end root FDT node: %d\n", err);
+		goto error_cleanup;
+	}
+
+	err = fdt_finish(fdt);
+	if (err) {
+		pr_err("Failed to finish FDT: %d\n", err);
+		goto error_cleanup;
+	}
+
+	int fdt_final_size = fdt_totalsize(fdt);
+	int num_pages = PAGE_ALIGN(fdt_final_size) / PAGE_SIZE;
+
+	err = kho_preserve_phys(page_to_phys(pcsc_kho_fdt),
+				num_pages * PAGE_SIZE);
+	if (err) {
+		pr_err("Failed to preserve FDT pages: %d\n", err);
+		goto error_cleanup;
+	}
+
+	err = kho_add_subtree(ser, PCSC_KHO_FDT, fdt);
+	if (err) {
+		pr_err("Failed to add FDT to KHO tree: %d\n", err);
+		goto error_cleanup;
+	}
+
+#ifdef CONFIG_PCSC_STATS
+	ktime_t end_time = ktime_get();
+	u64 duration_ns = ktime_to_ns(ktime_sub(end_time, start_time));
+	u64 duration_us = duration_ns / 1000;
+
+	pr_info("Saved %u devices to KHO in %llu us (%llu.%03llu ms)\n",
+		saved_count, duration_us, duration_us / 1000,
+		duration_us % 1000);
+#endif
+	return NOTIFY_DONE;
+
+error_cleanup:
+	pr_err("KHO save failed with error %d\n", err);
+	__free_pages(pcsc_kho_fdt, pcsc_kho_fdt_order);
+	pcsc_kho_fdt = NULL;
+	return NOTIFY_BAD;
+}
+
+static struct notifier_block pcsc_kho_nb = {
+	.notifier_call = pcsc_kho_notifier,
+};
+
+static bool pcsc_kho_restore_device(struct pci_dev *dev, const void *fdt,
+				    int node)
+{
+	const struct pcsc_data *preserved_data;
+	const u64 *data_addr;
+	int len;
+
+	data_addr = fdt_getprop(fdt, node, "da", &len);
+	if (!data_addr || len != sizeof(*data_addr))
+		return false;
+
+	preserved_data = phys_to_virt(*data_addr);
+	if (!preserved_data)
+		return false;
+
+
+	dev->pcsc->data = (struct pcsc_data *)preserved_data;
+	dev->pcsc->cachable_bitmask = dev->pcsc->data->cachable_bitmask;
+	dev->pcsc->cached_bitmask = dev->pcsc->data->cached_bitmask;
+	dev->pcsc->cfg_space = dev->pcsc->data->cfg_space;
+
+	return true;
+}
+
+static bool pcsc_kho_check_restore(struct pci_dev *dev)
+{
+	phys_addr_t fdt_phys;
+	const void *fdt;
+	int node, err;
+	bool restored = false;
+	char node_name[32];
+#ifdef CONFIG_PCSC_STATS
+	ktime_t start_time, end_time;
+	u64 duration_ns;
+#endif
+
+	err = kho_retrieve_subtree(PCSC_KHO_FDT, &fdt_phys);
+	if (err) {
+		pci_dbg(dev, "PCSC: kho_retrieve_subtree failed: %d\n", err);
+		return false;
+	}
+
+	fdt = phys_to_virt(fdt_phys);
+	if (fdt_node_check_compatible(fdt, 0, PCSC_KHO_NODE_COMPATIBLE)) {
+		pci_dbg(dev, "PCSC: FDT node not compatible\n");
+		return false;
+	}
+
+#ifdef CONFIG_PCSC_STATS
+	start_time = ktime_get();
+#endif
+
+	snprintf(node_name, sizeof(node_name), "dev_%04x_%02x_%02x_%x",
+		 pci_domain_nr(dev->bus), dev->bus->number,
+		 PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn));
+
+	node = fdt_subnode_offset(fdt, 0, node_name);
+	if (node >= 0)
+		restored = pcsc_kho_restore_device(dev, fdt, node);
+
+#ifdef CONFIG_PCSC_STATS
+	if (restored) {
+		end_time = ktime_get();
+		duration_ns = ktime_to_ns(ktime_sub(end_time, start_time));
+
+		pcsc_stats.pcsc_kho_total_restore_time_ns += duration_ns;
+		pcsc_count_restored_devices();
+	}
+#endif
+
+	return restored;
+}
+#endif
+
 int pcsc_add_device(struct pci_dev *dev)
 {
 	struct pcsc_node *node;
@@ -742,23 +1054,34 @@ int pcsc_add_device(struct pci_dev *dev)
 	 * nodes for these devices, as it simplifies the code flow
 	 */
 	if (dev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
-		/* Allocate contiguous, page aligned data block. This will be
-		 * needed for persisting the data with KHO.
-		 */
-		data_size = sizeof(struct pcsc_data);
-
-		dev->pcsc->data =
-			(struct pcsc_data *)__get_free_pages(
-				GFP_KERNEL | __GFP_ZERO, get_order(data_size));
-		if (!dev->pcsc->data)
+#ifdef CONFIG_PCSC_KHO
+		bool restored = false;
 
-			goto err_free_node;
+		/* Try to restore from KHO first, before any allocation */
+		if (pcsc_persistence_enabled && kho_is_enabled())
+			restored = pcsc_kho_check_restore(dev);
 
-		dev->pcsc->cachable_bitmask = dev->pcsc->data->cachable_bitmask;
-		dev->pcsc->cached_bitmask = dev->pcsc->data->cached_bitmask;
-		dev->pcsc->cfg_space = dev->pcsc->data->cfg_space;
-
-		infer_cacheability(dev);
+		if (!restored) {
+#endif
+			/* Allocate contiguous, page aligned data block. This is
+			 * needed for persisting the data with KHO.
+			 */
+			data_size = sizeof(struct pcsc_data);
+
+			dev->pcsc->data =
+				(struct pcsc_data *)__get_free_pages(
+					GFP_KERNEL | __GFP_ZERO, get_order(data_size));
+			if (!dev->pcsc->data)
+				goto err_free_node;
+
+			dev->pcsc->cachable_bitmask = dev->pcsc->data->cachable_bitmask;
+			dev->pcsc->cached_bitmask = dev->pcsc->data->cached_bitmask;
+			dev->pcsc->cfg_space = dev->pcsc->data->cfg_space;
+
+			infer_cacheability(dev);
+#ifdef CONFIG_PCSC_KHO
+		}
+#endif
 	} else {
 		dev->pcsc->data = NULL;
 		dev->pcsc->cachable_bitmask = NULL;
@@ -1103,7 +1426,9 @@ static struct kobj_attribute pcsc_enabled_attribute =
 static ssize_t pcsc_stats_show(struct kobject *kobj,
 			       struct kobj_attribute *attr, char *buf)
 {
-	return sysfs_emit(
+	ssize_t ret;
+
+	ret = sysfs_emit(
 		buf,
 		"Cache Hits: %lu\n"
 		"Cache Misses: %lu\n"
@@ -1132,6 +1457,20 @@ static ssize_t pcsc_stats_show(struct kobject *kobj,
 		1000,
 		pcsc_stats.hw_access_time_due_to_misses / 1000,
 		pcsc_stats.total_hw_access_time / 1000);
+
+#ifdef CONFIG_PCSC_KHO
+	u64 total_restore_time_us = pcsc_stats.pcsc_kho_total_restore_time_ns / 1000;
+
+	ret += sysfs_emit_at(buf, ret,
+			     "KHO Restore Statistics:\n"
+			     " Restored Devices: %u\n"
+			     " Total Restore Time: %llu us\n",
+			     pcsc_stats.pcsc_kho_restored_device_count,
+			     total_restore_time_us);
+
+#endif
+
+	return ret;
 }
 
 static struct kobj_attribute pcsc_stats_attribute =
@@ -1183,6 +1522,10 @@ static void pcsc_create_sysfs(void)
 
 static int __init pcsc_init(void)
 {
+#ifdef CONFIG_PCSC_KHO
+	int ret;
+#endif
+
 	bus_register_notifier(&pci_bus_type, &pcsc_bus_nb);
 
 	/* Try to create sysfs entry, but don't fail if PCI bus isn't ready yet */
@@ -1192,8 +1535,20 @@ static int __init pcsc_init(void)
 	memset(&pcsc_stats, 0, sizeof(pcsc_stats));
 #endif
 
+#ifdef CONFIG_PCSC_KHO
+	/* Register KHO notifier if persistence is enabled */
+	if (pcsc_persistence_enabled && kho_is_enabled()) {
+		ret = register_kho_notifier(&pcsc_kho_nb);
+		if (ret == 0)
+			pr_info("KHO notifier registered successfully\n");
+		else
+			pr_err("Failed to register KHO notifier: %d\n", ret);
+	}
+#endif /* CONFIG_PCSC_KHO */
+
 	pcsc_initialised = true;
-	pr_info("initialised (enabled=%d)\n", pcsc_enabled);
+	pr_info("initialised (enabled=%d, persistence=%d)\n",
+		pcsc_enabled, pcsc_persistence_enabled);
 
 	return 0;
 }
-- 
2.47.3

From nobody Wed Dec 17 17:39:08 2025
From: Evangelos Petrongonas
To: Bjorn Helgaas, Alex Williamson, "Rafael J. Wysocki", "Len Brown"
CC: Evangelos Petrongonas, Pasha Tatashin, David Matlack, "Vipin Sharma", Chris Li, Jason Miu, Pratyush Yadav, "Stanislav Spassov"
Subject: [RFC PATCH 12/13] pci: pcsc: introduce persistence versioning
Date: Fri, 3 Oct 2025 09:00:48 +0000
Message-ID: <630bdfc9fc9591178329983a308642dde68136e5.1759312886.git.epetron@amazon.de>
X-Mailer: git-send-email 2.47.3
X-Mailing-List: linux-kernel@vger.kernel.org

The cacheability of a device's registers can change over time:
registers may have to be removed from the cacheable set because of a
bug or a non-compliant device, or more registers may become cacheable.
The current version does not support changing the `is_cacheable`
bitmask across kexec. To change it, the whole persistence mechanism has
to be disabled, the bitmask changed, and persistence re-enabled. This
adds maintenance difficulty and negatively impacts the cache hit rate.

This commit adds a mechanism to handle such changes more efficiently. A
version number is defined and stored in the FDT during the save process
in the outgoing kernel. The incoming kernel, if compiled with a
different persistence version, re-infers the cacheability of all the
saved devices without touching the `is_cached` bitmask or the saved
configuration space data. This is safe because all accesses to the
cache are guarded by the `is_cacheable` bitmask. As a result, changing
the cacheability only affects the registers whose cacheability differs,
while the rest of the cache remains valid.

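The safety argument above can be sketched in plain userspace C. This is
an illustration under stated assumptions, not the driver code:
`struct cache_data`, `restore()`, `cache_read()`, the value of
`CURRENT_VERSION` and the "first 16 bytes are cacheable" policy are all
hypothetical, and `bool` arrays stand in for the real bitmasks to keep
the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CFG_SIZE 256        /* illustrative config space size in bytes */
#define CURRENT_VERSION 2u  /* hypothetical version of the incoming kernel */

/* Simplified stand-in for the preserved per-device data. */
struct cache_data {
	bool cacheable[CFG_SIZE];    /* policy: may this offset be cached? */
	bool cached[CFG_SIZE];       /* state: is a value stored for it? */
	uint8_t cfg_space[CFG_SIZE]; /* the preserved register values */
};

/* Hypothetical policy of the incoming kernel. */
static void infer_cacheability(struct cache_data *d)
{
	for (int i = 0; i < CFG_SIZE; i++)
		d->cacheable[i] = i < 16;
}

/* A version mismatch rewrites only the policy, never the saved data. */
static void restore(struct cache_data *d, uint32_t saved_version)
{
	if (saved_version != CURRENT_VERSION)
		infer_cacheability(d);
}

/* A hit requires the offset to be both cacheable and cached. */
static bool cache_read(const struct cache_data *d, int off, uint8_t *val)
{
	assert(val != 0);

	if (off < 0 || off >= CFG_SIZE)
		return false;
	if (!d->cacheable[off] || !d->cached[off])
		return false;
	*val = d->cfg_space[off];
	return true;
}
```

A register that the new policy no longer considers cacheable simply
stops producing hits, even though its stale value is still present in
the preserved buffer. Registers that stay cacheable keep their cached
values.
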
Signed-off-by: Evangelos Petrongonas
---
 drivers/pci/pcsc.c | 40 +++++++++++++++++++++++++++++++++++-----
 1 file changed, 35 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
index 0c4ae73744d6..8ff91ed24a37 100644
--- a/drivers/pci/pcsc.c
+++ b/drivers/pci/pcsc.c
@@ -23,6 +23,9 @@
 #include
 #include
 
+/* PCSC persistent data version - increment when the cacheability is changed */
+#define PCSC_PERSISTENT_VERSION 1U
+
 static bool pcsc_enabled;
 static int __init pcsc_enabled_setup(char *str)
 {
@@ -818,6 +821,7 @@ static int pcsc_kho_notifier(struct notifier_block *self, unsigned long cmd,
 	u32 eligible_count = 0;
 	u32 saved_count = 0;
 	u32 skipped_count = 0;
+	const u32 version = PCSC_PERSISTENT_VERSION;
 
 	switch (cmd) {
 	case KEXEC_KHO_ABORT:
@@ -853,8 +857,8 @@ static int pcsc_kho_notifier(struct notifier_block *self, unsigned long cmd,
 	/* Allocate FDT with size calculation (conservative estimates):
 	 * - Per device: node_name(~20) + node_overhead(~12) + da_property(~20)
 	 *   = ~52 bytes, round up to 64 for alignment/margin
-	 * - Fixed overhead: header(40) + root_node(~40) + strings_table(~30)
-	 *   + misc(~32) = ~144 bytes, round up to 256
+	 * - Fixed overhead: header(40) + root_node(~48) + strings_table(~30)
+	 *   + misc(~32) = ~152 bytes, round up to 256
 	 */
 	fdt_size = PAGE_ALIGN((eligible_count * 64 + 256));
 	pcsc_kho_fdt_order = get_order(fdt_size);
@@ -892,6 +896,12 @@ static int pcsc_kho_notifier(struct notifier_block *self, unsigned long cmd,
 		goto error_cleanup;
 	}
 
+	err = fdt_property(fdt, "pv", &version, sizeof(version));
+	if (err) {
+		pr_err("PCSC: Failed to set version property: %d\n", err);
+		goto error_cleanup;
+	}
+
 	for_each_pci_dev(dev) {
 		int save_err = pcsc_kho_save_device(dev, fdt);
 
@@ -960,7 +970,7 @@ static struct notifier_block pcsc_kho_nb = {
 };
 
 static bool pcsc_kho_restore_device(struct pci_dev *dev, const void *fdt,
-				    int node)
+				    int node, bool version_mismatch)
 {
 	const struct pcsc_data *preserved_data;
 	const u64 *data_addr;
@@ -980,6 +990,9 @@ static bool pcsc_kho_restore_device(struct pci_dev *dev, const void *fdt,
 	dev->pcsc->cached_bitmask = dev->pcsc->data->cached_bitmask;
 	dev->pcsc->cfg_space = dev->pcsc->data->cfg_space;
 
+	if (version_mismatch)
+		infer_cacheability(dev);
+
 	return true;
 }
 
@@ -987,9 +1000,12 @@ static bool pcsc_kho_check_restore(struct pci_dev *dev)
 {
 	phys_addr_t fdt_phys;
 	const void *fdt;
-	int node, err;
+	int node, err, len;
 	bool restored = false;
+	bool version_mismatch = false;
 	char node_name[32];
+	const u32 *version_ptr;
+	u32 saved_version;
 #ifdef CONFIG_PCSC_STATS
 	ktime_t start_time, end_time;
 	u64 duration_ns;
@@ -1007,6 +1023,20 @@ static bool pcsc_kho_check_restore(struct pci_dev *dev)
 		return false;
 	}
 
+	version_ptr = fdt_getprop(fdt, 0, "pv", &len);
+	if (version_ptr && len == sizeof(*version_ptr)) {
+		saved_version = *version_ptr;
+		if (saved_version != PCSC_PERSISTENT_VERSION)
+			version_mismatch = true;
+
+	} else {
+		/* No version found, assume version 0 */
+		pci_info(
+			dev,
+			"PCSC: No version found in restored data. Re-infer Cacheability.\n");
+		version_mismatch = true;
+	}
+
 #ifdef CONFIG_PCSC_STATS
 	start_time = ktime_get();
 #endif
@@ -1017,7 +1047,7 @@ static bool pcsc_kho_check_restore(struct pci_dev *dev)
 
 	node = fdt_subnode_offset(fdt, 0, node_name);
 	if (node >= 0)
-		restored = pcsc_kho_restore_device(dev, fdt, node);
+		restored = pcsc_kho_restore_device(dev, fdt, node, version_mismatch);
 
 #ifdef CONFIG_PCSC_STATS
 	if (restored) {
-- 
2.47.3

From nobody Wed Dec 17 17:39:08 2025
From: Evangelos Petrongonas
To: Bjorn Helgaas, Alex Williamson, "Rafael J. Wysocki", "Len Brown"
CC: Evangelos Petrongonas, Pasha Tatashin, David Matlack, "Vipin Sharma", Chris Li, Jason Miu, Pratyush Yadav, "Stanislav Spassov"
Subject: [RFC PATCH 13/13] pci: pcsc: introduce hashtable lookup to speed up restoration
Date: Fri, 3 Oct 2025 09:00:49 +0000
Message-ID: <101c11154d1cdbb0910bb8468f2da150eac15600.1759312886.git.epetron@amazon.de>
X-Mailer: git-send-email 2.47.3
X-Mailing-List: linux-kernel@vger.kernel.org

During PCSC KHO (Kexec Handover) restoration, the current
implementation searches for each device's saved data by performing a
linear traversal through the FDT (Flattened Device Tree) nodes. This
approach requires examining every node until finding a match, resulting
in O(n) complexity per device lookup. On systems with thousands of PCI
devices, this becomes a significant bottleneck, as restoring all
devices requires O(n^2) operations in total.

This patch replaces the linear search with a hashtable that provides
O(1) average-case lookup. During initialization, when KHO restore data
is detected, we build a hashtable by traversing the FDT once and
indexing all device nodes. Each device is keyed by a 32-bit value
combining its PCI domain (16 bits), bus number (8 bits), and devfn
(device and function, 8 bits), ensuring unique identification across
the entire PCI topology.

The hashtable uses 1024 buckets (PCSC_KHO_HASH_BITS=10). As devices are
successfully restored, their entries are removed from the hashtable,
progressively freeing memory. Once all devices have been restored, the
entire hashtable structure is cleaned up.

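The keying and lookup scheme described above can be sketched in
userspace C as follows. This is an illustration under assumptions, not
the patch's code: the chained table, `bucket_of()`, `table_add()` and
`table_take()` are hypothetical stand-ins for the kernel hashtable
helpers, and the multiplicative hash is only an example of a bucket
function.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define HASH_BITS 10
#define NBUCKETS (1u << HASH_BITS) /* 1024 buckets */

struct entry {
	struct entry *next;
	uint32_t key;   /* packed domain:bus:devfn */
	int fdt_offset; /* offset of the device node in the FDT */
};

static struct entry *buckets[NBUCKETS];

/* Pack domain (16 bits), bus (8 bits) and devfn (8 bits) into one key. */
static uint32_t device_key(uint32_t domain, uint32_t bus, uint32_t devfn)
{
	return ((domain & 0xffffu) << 16) | ((bus & 0xffu) << 8) |
	       (devfn & 0xffu);
}

/* Parse "dev_<domain>_<bus>_<slot>_<func>" (hex fields) into a key. */
static int key_from_node_name(const char *name, uint32_t *key)
{
	unsigned int domain, bus, slot, func;

	assert(name != NULL && key != NULL);

	if (sscanf(name, "dev_%x_%x_%x_%x", &domain, &bus, &slot, &func) != 4)
		return -1;

	/* devfn: slot in bits 7..3, function in bits 2..0 */
	*key = device_key(domain, bus, ((slot & 0x1fu) << 3) | (func & 0x07u));
	return 0;
}

/* Example bucket function; the result is always below NBUCKETS. */
static unsigned int bucket_of(uint32_t key)
{
	return (uint32_t)(key * 2654435761u) >> (32 - HASH_BITS);
}

/* Insert one entry, built while walking the FDT once. */
static int table_add(uint32_t key, int fdt_offset)
{
	struct entry *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	e->key = key;
	e->fdt_offset = fdt_offset;
	e->next = buckets[bucket_of(key)];
	buckets[bucket_of(key)] = e;
	return 0;
}

/* On a hit, unlink and free the entry and return its offset, else -1. */
static int table_take(uint32_t key)
{
	struct entry **pp = &buckets[bucket_of(key)];

	for (; *pp; pp = &(*pp)->next) {
		if ((*pp)->key == key) {
			struct entry *e = *pp;
			int off = e->fdt_offset;

			*pp = e->next;
			free(e);
			return off;
		}
	}
	return -1;
}
```

Removing an entry as soon as its device has been restored keeps the
memory footprint shrinking during boot, at the cost of making a second
lookup for the same device miss.
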

An additional optimization moves the version compatibility check from
per-device to a single global check during hashtable initialization.
This eliminates redundant version comparisons and simplifies the
restoration logic.

Since the hashtable is built once during module initialization and never
modified during lookups, no locking is required.

Signed-off-by: Evangelos Petrongonas
---
 drivers/pci/pcsc.c | 244 +++++++++++++++++++++++++++++++++++++--------
 1 file changed, 205 insertions(+), 39 deletions(-)

diff --git a/drivers/pci/pcsc.c b/drivers/pci/pcsc.c
index 8ff91ed24a37..43880b85b3f9 100644
--- a/drivers/pci/pcsc.c
+++ b/drivers/pci/pcsc.c
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include
 
 /* PCSC persistent data version - increment when the cacheability is changed */
 #define PCSC_PERSISTENT_VERSION 1U
@@ -755,8 +756,182 @@ static void infer_cacheability(struct pci_dev *dev)
 }
 
 #ifdef CONFIG_PCSC_KHO
+
+/*
+ * Hash table for O(1) device lookup during restore
+ * Size is chosen to handle systems with up to 4096 devices
+ */
+#define PCSC_KHO_HASH_BITS 10 /* 1024 buckets */
+static DEFINE_HASHTABLE(pcsc_kho_lookup_table, PCSC_KHO_HASH_BITS);
+
+struct pcsc_kho_lookup_entry {
+	struct hlist_node node;
+	u32 key; /* Hash of domain:bus:devfn */
+	int fdt_offset; /* Offset of this device's node in FDT */
+};
+
 static struct page *pcsc_kho_fdt;
 static int pcsc_kho_fdt_order;
+static bool pcsc_kho_lookup_initialized;
+static phys_addr_t pcsc_kho_fdt_phys;
+static const void *pcsc_kho_fdt_virt;
+static bool version_mismatch;
+static atomic_t pcsc_kho_hashtable_entries = ATOMIC_INIT(0);
+static u32 pcsc_kho_initial_hashtable_entries;
+#ifdef CONFIG_PCSC_STATS
+static u64 pcsc_kho_hashtable_build_time_ns;
+#endif
+
+static inline u32 pcsc_kho_device_key(u32 domain, u32 bus, u32 devfn)
+{
+	return (domain << 16) | (bus << 8) | devfn;
+}
+
+static int pcsc_kho_build_lookup_table(const void *fdt)
+{
+	int node;
+	u32 domain, bus, slot, func, devfn;
+	struct pcsc_kho_lookup_entry *entry;
+	u32 key;
+	int count = 0;
+
+	fdt_for_each_subnode(node, fdt, 0) {
+		const char *name = fdt_get_name(fdt, node, NULL);
+
+		if (!name)
+			continue;
+
+		if (sscanf(name, "dev_%x_%x_%x_%x", &domain, &bus, &slot,
+			   &func) != 4)
+			continue;
+
+		devfn = PCI_DEVFN(slot, func);
+		key = pcsc_kho_device_key(domain, bus, devfn);
+
+		entry = kmalloc(sizeof(*entry), GFP_KERNEL);
+		if (!entry)
+			return -ENOMEM;
+
+		entry->key = key;
+		entry->fdt_offset = node;
+
+		hash_add(pcsc_kho_lookup_table, &entry->node, key);
+		count++;
+	}
+
+	atomic_set(&pcsc_kho_hashtable_entries, count);
+	pcsc_kho_initial_hashtable_entries = count;
+	pr_info("Built KHO lookup table with %d entries\n", count);
+	return 0;
+}
+
+static int pcsc_kho_init_lookup(void)
+{
+	int err, len;
+	const u32 *version_ptr;
+	u32 saved_version;
+#ifdef CONFIG_PCSC_STATS
+	ktime_t start_time, end_time;
+#endif
+
+	if (pcsc_kho_lookup_initialized)
+		return 0;
+
+#ifdef CONFIG_PCSC_STATS
+	start_time = ktime_get();
+#endif
+
+	err = kho_retrieve_subtree(PCSC_KHO_FDT, &pcsc_kho_fdt_phys);
+	if (err)
+		return err;
+
+	pcsc_kho_fdt_virt = phys_to_virt(pcsc_kho_fdt_phys);
+	if (fdt_node_check_compatible(pcsc_kho_fdt_virt, 0,
+				      PCSC_KHO_NODE_COMPATIBLE))
+		return -EINVAL;
+
+	/* Check version for all devices */
+	version_ptr = fdt_getprop(pcsc_kho_fdt_virt, 0, "pv", &len);
+	if (version_ptr && len == sizeof(*version_ptr)) {
+		saved_version = *version_ptr;
+		if (saved_version != PCSC_PERSISTENT_VERSION) {
+			version_mismatch = true;
+			pr_info("Version mismatch (saved=%u, current=%u), will re-infer cacheability\n",
+				saved_version, PCSC_PERSISTENT_VERSION);
+		}
+	} else {
+		/* No version found, assume version 0 */
+		version_mismatch = true;
+		pr_info("No version found in restored data, will re-infer cacheability\n");
+	}
+
+	err = pcsc_kho_build_lookup_table(pcsc_kho_fdt_virt);
+	if (err)
+		return err;
+
+#ifdef CONFIG_PCSC_STATS
+	end_time = ktime_get();
+	pcsc_kho_hashtable_build_time_ns =
+		ktime_to_ns(ktime_sub(end_time, start_time));
+#endif
+
+	pcsc_kho_lookup_initialized = true;
+	pr_info("KHO lookup table initialized in %llu us\n",
+#ifdef CONFIG_PCSC_STATS
+		pcsc_kho_hashtable_build_time_ns / 1000
+#else
+		0ULL
+#endif
+	);
+	return 0;
+}
+
+static int pcsc_kho_find_device_node(u32 domain, u32 bus, u32 devfn)
+{
+	u32 key = pcsc_kho_device_key(domain, bus, devfn);
+	struct pcsc_kho_lookup_entry *entry;
+	int offset = -1;
+
+	hash_for_each_possible(pcsc_kho_lookup_table, entry, node, key) {
+		if (entry->key == key) {
+			offset = entry->fdt_offset;
+			break;
+		}
+	}
+
+	return offset;
+}
+
+
+static void pcsc_kho_cleanup_hashtable(void)
+{
+	if (!pcsc_kho_lookup_initialized)
+		return;
+
+	pcsc_kho_lookup_initialized = false;
+	pr_info("KHO hashtable cleaned up - all devices restored\n");
+}
+
+static void pcsc_kho_remove_device_entry(u32 domain, u32 bus, u32 devfn)
+{
+	u32 key = pcsc_kho_device_key(domain, bus, devfn);
+	struct pcsc_kho_lookup_entry *entry;
+	struct hlist_node *tmp;
+
+	hash_for_each_possible_safe(pcsc_kho_lookup_table, entry, tmp, node,
+				    key) {
+		if (entry->key == key) {
+			hash_del(&entry->node);
+			kfree(entry);
+
+			if (atomic_dec_and_test(&pcsc_kho_hashtable_entries))
+				pcsc_kho_cleanup_hashtable();
+
+			break;
+		}
+	}
+}
+
 
 static int pcsc_kho_save_device(struct pci_dev *dev, void *fdt)
 {
@@ -998,56 +1173,31 @@ static bool pcsc_kho_restore_device(struct pci_dev *dev, const void *fdt,
 
 static bool pcsc_kho_check_restore(struct pci_dev *dev)
 {
-	phys_addr_t fdt_phys;
-	const void *fdt;
-	int node, err, len;
+	int node;
 	bool restored = false;
-	bool version_mismatch = false;
-	char node_name[32];
-	const u32 *version_ptr;
-	u32 saved_version;
+	u32 domain, bus_num;
 #ifdef CONFIG_PCSC_STATS
 	ktime_t start_time, end_time;
 	u64 duration_ns;
-#endif
 
-	err = kho_retrieve_subtree(PCSC_KHO_FDT, &fdt_phys);
-	if (err) {
-		pci_dbg(dev, "PCSC: kho_retrieve_subtree failed: %d\n", err);
-		return false;
-	}
+	start_time = ktime_get();
+#endif
 
-	fdt = phys_to_virt(fdt_phys);
-	if (fdt_node_check_compatible(fdt, 0, PCSC_KHO_NODE_COMPATIBLE)) {
-		pci_dbg(dev, "PCSC: FDT node not compatible\n");
+	if (!pcsc_kho_lookup_initialized)
 		return false;
-	}
 
-	version_ptr = fdt_getprop(fdt, 0, "pv", &len);
-	if (version_ptr && len == sizeof(*version_ptr)) {
-		saved_version = *version_ptr;
-		if (saved_version != PCSC_PERSISTENT_VERSION)
-			version_mismatch = true;
-
-	} else {
-		/* No version found, assume version 0 */
-		pci_info(
-			dev,
-			"PCSC: No version found in restored data. Re-infer Cacheability.\n");
-		version_mismatch = true;
-	}
+	domain = pci_domain_nr(dev->bus);
+	bus_num = dev->bus->number;
 
-#ifdef CONFIG_PCSC_STATS
-	start_time = ktime_get();
-#endif
+	node = pcsc_kho_find_device_node(domain, bus_num, dev->devfn);
 
-	snprintf(node_name, sizeof(node_name), "dev_%04x_%02x_%02x_%x",
-		 pci_domain_nr(dev->bus), dev->bus->number,
-		 PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn));
+	if (node >= 0) {
+		restored = pcsc_kho_restore_device(dev, pcsc_kho_fdt_virt, node,
+						   version_mismatch);
 
-	node = fdt_subnode_offset(fdt, 0, node_name);
-	if (node >= 0)
-		restored = pcsc_kho_restore_device(dev, fdt, node, version_mismatch);
+		if (restored)
+			pcsc_kho_remove_device_entry(domain, bus_num, dev->devfn);
+	}
 
 #ifdef CONFIG_PCSC_STATS
 	if (restored) {
@@ -1056,6 +1206,9 @@ static bool pcsc_kho_check_restore(struct pci_dev *dev)
 
 		pcsc_stats.pcsc_kho_total_restore_time_ns += duration_ns;
 		pcsc_count_restored_devices();
+
+		pci_dbg(dev, "PCSC: Restored from KHO in %llu ns\n",
+			duration_ns);
 	}
 #endif
 
@@ -1498,6 +1651,11 @@ static ssize_t pcsc_stats_show(struct kobject *kobj,
 			     pcsc_stats.pcsc_kho_restored_device_count,
 			     total_restore_time_us);
 
+	ret += sysfs_emit_at(buf, ret,
+			     " Hashtable Initial Entries: %u\n"
+			     " Hashtable Build Time: %llu us\n",
+			     pcsc_kho_initial_hashtable_entries,
+			     pcsc_kho_hashtable_build_time_ns / 1000);
 #endif
 
 	return ret;
@@ -1573,6 +1731,14 @@ static int __init pcsc_init(void)
 			pr_info("KHO notifier registered successfully\n");
 		else
 			pr_err("Failed to register KHO notifier: %d\n", ret);
+
+		ret = pcsc_kho_init_lookup();
+		if (ret == 0)
+			pr_info("KHO lookup table initialized during init\n");
+		else if (ret == -ENOENT)
+			pr_info("No KHO saved data found - fresh boot\n");
+		else
+			pr_err("Failed to initialize KHO lookup table: %d\n", ret);
 	}
 #endif /* CONFIG_PCSC_KHO */
 
-- 
2.47.3

Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597