This series introduces the ability to manage SOFT RESERVED iomem
resources, enabling the CXL driver to remove any portions that
intersect with created CXL regions.

The current approach of leaving SOFT RESERVED entries as is can result
in failures during hotplug of devices such as CXL, because the address
range remains reserved and unavailable for reuse even after region
teardown.

To address this, the CXL driver now uses a background worker that waits
for cxl_mem driver probe to complete before scanning for intersecting
resources. Then the driver walks through created CXL regions to trim any
intersections with SOFT RESERVED resources in the iomem tree.

The following scenarios have been tested:

Example 1: Exact alignment, soft reserved is a child of the region

|---------- "Soft Reserved" -----------|
|-------------- "Region #" ------------|

Before:
  1050000000-304fffffff : CXL Window 0
    1050000000-304fffffff : region0
      1050000000-304fffffff : Soft Reserved
        1080000000-2fffffffff : dax0.0
          1080000000-2fffffffff : System RAM (kmem)

After:
  1050000000-304fffffff : CXL Window 0
    1050000000-304fffffff : region0
      1080000000-2fffffffff : dax0.0
        1080000000-2fffffffff : System RAM (kmem)

Example 2: Start and/or end aligned and soft reserved spans multiple
regions

|----------- "Soft Reserved" -----------|
|-------- "Region #" -------|
or
|----------- "Soft Reserved" -----------|
            |-------- "Region #" -------|

Before:
  850000000-684fffffff : Soft Reserved
    850000000-284fffffff : CXL Window 0
      850000000-284fffffff : region3
        850000000-284fffffff : dax0.0
          850000000-284fffffff : System RAM (kmem)
    2850000000-484fffffff : CXL Window 1
      2850000000-484fffffff : region4
        2850000000-484fffffff : dax1.0
          2850000000-484fffffff : System RAM (kmem)
    4850000000-684fffffff : CXL Window 2
      4850000000-684fffffff : region5
        4850000000-684fffffff : dax2.0
          4850000000-684fffffff : System RAM (kmem)

After:
  850000000-284fffffff : CXL Window 0
    850000000-284fffffff : region3
      850000000-284fffffff : dax0.0
        850000000-284fffffff : System RAM (kmem)
  2850000000-484fffffff : CXL Window 1
    2850000000-484fffffff : region4
      2850000000-484fffffff : dax1.0
        2850000000-484fffffff : System RAM (kmem)
  4850000000-684fffffff : CXL Window 2
    4850000000-684fffffff : region5
      4850000000-684fffffff : dax2.0
        4850000000-684fffffff : System RAM (kmem)

Example 3: No alignment

|---------- "Soft Reserved" ----------|
        |---- "Region #" ----|

Before:
  00000000-3050000ffd : Soft Reserved
    ..
    ..
    1050000000-304fffffff : CXL Window 0
      1050000000-304fffffff : region1
        1080000000-2fffffffff : dax0.0
          1080000000-2fffffffff : System RAM (kmem)

After:
  00000000-104fffffff : Soft Reserved
    ..
    ..
  1050000000-304fffffff : CXL Window 0
    1050000000-304fffffff : region1
      1080000000-2fffffffff : dax0.0
        1080000000-2fffffffff : System RAM (kmem)
  3050000000-3050000ffd : Soft Reserved

Link to v4:
https://lore.kernel.org/linux-cxl/20250603221949.53272-1-Smita.KoralahalliChannabasappa@amd.com

v5 updates:
 - Handled cases where CXL driver loads early even before HMEM driver is
   initialized.
 - Introduced callback functions to resolve dependencies.
 - Rename suspend.c to probe_state.c.
 - Refactor cxl_acpi_probe() to use a single exit path.
 - Commit description update to justify cxl_mem_active() usage.
 - Change from kmalloc -> kzalloc in add_soft_reserved().
 - Change from goto to if else blocks inside remove_soft_reserved().
 - DEFINE_RES_MEM_NAMED -> DEFINE_RES_NAMED_DESC.
 - Comments for flags inside remove_soft_reserved().
 - Add resource_lock inside normalize_resource().
 - bus_find_next_device -> bus_find_device.
 - Skip DAX consumption of soft reserves inside hmat with
   CONFIG_CXL_ACPI checks.

v4 updates:
 - Split first patch into 4 smaller patches.
 - Correct the logic for cxl_pci_loaded() and cxl_mem_active() to return
   false by default instead of true.
 - Cleanup cxl_wait_for_pci_mem() to remove config checks for cxl_pci
   and cxl_mem.
 - Fixed multiple bugs and build issues, including correcting
   walk_iomem_res_desc() and calculations of alignments.

v3 updates:
 - Remove srmem resource tree from kernel/resource.c, this is no longer
   needed in the current implementation. All SOFT RESERVE resources are
   now put on the iomem resource tree.
 - Remove the no longer needed SOFT_RESERVED_MANAGED kernel config option.
 - Add the 'nid' parameter back to hmem_register_resource().
 - Remove the no longer used soft reserve notification chain (introduced
   in v2). The dax driver is now notified of SOFT RESERVED resources by
   the CXL driver.

v2 updates:
 - Add config option SOFT_RESERVE_MANAGED to control use of the
   separate srmem resource tree at boot.
 - Only add SOFT RESERVE resources to the soft reserve tree during
   boot, they go to the iomem resource tree after boot.
 - Remove the resource trimming code in the previous patch to re-use
   the existing code in kernel/resource.c
 - Add functionality for the cxl acpi driver to wait for the cxl PCI
   and mem drivers to load.

Smita Koralahalli (7):
  cxl/acpi: Refactor cxl_acpi_probe() to always schedule fallback DAX
    registration
  cxl/core: Rename suspend.c to probe_state.c and remove
    CONFIG_CXL_SUSPEND
  cxl/acpi: Add background worker to coordinate with cxl_mem probe
    completion
  cxl/region: Introduce SOFT RESERVED resource removal on region
    teardown
  dax/hmem: Save the DAX HMEM platform device pointer
  dax/hmem, cxl: Defer DAX consumption of SOFT RESERVED resources until
    after CXL region creation
  dax/hmem: Preserve fallback SOFT RESERVED regions if DAX HMEM loads
    late

 drivers/acpi/numa/hmat.c                      |   4 +
 drivers/cxl/Kconfig                           |   4 -
 drivers/cxl/acpi.c                            |  50 +++++--
 drivers/cxl/core/Makefile                     |   2 +-
 drivers/cxl/core/{suspend.c => probe_state.c} |  10 +-
 drivers/cxl/core/region.c                     | 135 ++++++++++++++++++
 drivers/cxl/cxl.h                             |   4 +
 drivers/cxl/cxlmem.h                          |   9 --
 drivers/dax/hmem/Makefile                     |   1 +
 drivers/dax/hmem/device.c                     |  62 ++++----
 drivers/dax/hmem/hmem.c                       |  14 +-
 drivers/dax/hmem/hmem_notify.c                |  29 ++++
 include/linux/dax.h                           |   7 +-
 include/linux/ioport.h                        |   1 +
 include/linux/pm.h                            |   7 -
 kernel/resource.c                             |  34 +++++
 16 files changed, 307 insertions(+), 66 deletions(-)
 rename drivers/cxl/core/{suspend.c => probe_state.c} (62%)
 create mode 100644 drivers/dax/hmem/hmem_notify.c

--
2.17.1
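
The three tested scenarios in the cover letter differ only in how a region overlaps a Soft Reserved range. As a rough userspace illustration of that interval arithmetic (this is not the code from the series; struct range_sketch and trim_soft_reserved() are hypothetical names chosen for the example), the trimming can be sketched as:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive address range, in the spirit of struct resource start/end. */
struct range_sketch {
	uint64_t start;
	uint64_t end;
};

/*
 * Hypothetical helper: remove the part of soft-reserved range 'sr' that
 * overlaps 'region'. Writes up to two leftover ranges into out[] and
 * returns how many were written:
 *   0 -> region covers sr entirely (Example 1)
 *   1 -> one edge is aligned, a single remainder survives (Example 2)
 *   2 -> region sits strictly inside sr, so sr is split (Example 3)
 * If the ranges do not intersect, sr is returned unchanged (count 1).
 */
static size_t trim_soft_reserved(struct range_sketch sr,
				 struct range_sketch region,
				 struct range_sketch out[2])
{
	size_t n = 0;

	/* No intersection: nothing to trim. */
	if (region.end < sr.start || region.start > sr.end) {
		out[n++] = sr;
		return n;
	}

	/* Leftover below the region (region.start > sr.start >= 0 here). */
	if (sr.start < region.start)
		out[n++] = (struct range_sketch){ sr.start, region.start - 1 };

	/* Leftover above the region (region.end < sr.end, so no overflow). */
	if (sr.end > region.end)
		out[n++] = (struct range_sketch){ region.end + 1, sr.end };

	return n;
}
```

With the Example 3 ranges (Soft Reserved 00000000-3050000ffd, region 1050000000-304fffffff) the helper returns two leftovers, 00000000-104fffffff and 3050000000-3050000ffd, which matches the "After" listing. The real series operates on the iomem resource tree under the resource lock, which this sketch does not model.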
Smita,

I have not yet completed all of my local test patterns. Nonetheless, in
addition to the issues highlighted by Alison, I have also encountered
some regressions. Based on your conversation with Alison, it appears you
have decided to refactor, so I intend to stop testing this version until
the updated iteration is available.

Here is what I have verified thus far (kernel built upon the cxl/next
20250718):

A) No Soft reserved (BIOS did not expose EFI_SPECIAL_PURPOSE)

- A.1 Decoder not committed (default QEMU emulation)

Before:
```
fffc0000-ffffffff : Reserved
100000000-27fffffff : System RAM
5c0001128-5c00011b7 : port1
5d0000000-6cfffffff : CXL Window 0
6d0000000-7cfffffff : CXL Window 1
7000000000-700000ffff : PCI Bus 0000:0c
  7000000000-700000ffff : 0000:0c:00.0
7000010000-700001ffff : PCI Bus 0000:0e
  7000010000-700001ffff : 0000:0e:00.0
    7000011080-70000110d7 : mem0
```

After (CXL window is absent):
```
fed00000-fed003ff : PNP0103:00
fed1c000-fed1ffff : Reserved
feffc000-feffffff : Reserved
fffc0000-ffffffff : Reserved
100000000-27fffffff : System RAM
7000000000-700000ffff : PCI Bus 0000:0c
  7000000000-700000ffff : 0000:0c:00.0
7000010000-700001ffff : PCI Bus 0000:0e
  7000010000-700001ffff : 0000:0e:00.0
7000020000-703fffffff : PCI Bus 0000:00
```

- A.2 Decoder is committed

Before:
```
100000000-27fffffff : System RAM
5c0001128-5c00011b7 : port1
5d0000000-6cfffffff : CXL Window 0
  5d0000000-6cfffffff : region0
    5d0000000-6cfffffff : dax0.0
      5d0000000-6cfffffff : System RAM (kmem)
7000000000-700000ffff : PCI Bus 0000:0c
  7000000000-700000ffff : 0000:0c:00.0
```

After (CXL window is absent):
```
feffc000-feffffff : Reserved
fffc0000-ffffffff : Reserved
100000000-27fffffff : System RAM
7000000000-700000ffff : PCI Bus 0000:0c
  7000000000-700000ffff : 0000:0c:00.0
7000010000-700001ffff : PCI Bus 0000:0e
  7000010000-700001ffff : 0000:0e:00.0
7000020000-703fffffff : PCI Bus 0000:00
```

B) EFI_SPECIAL_PURPOSE is set

- B.1 Decoder not committed

Before:
```
5d0000000-7cfffffff : Soft Reserved
  5d0000000-6cfffffff : CXL Window 0
  6d0000000-7cfffffff : CXL Window 1
```

After (fallback to hmem):
```
5d0000000-7cfffffff : Soft Reserved
  5d0000000-7cfffffff : dax0.0
    5d0000000-7cfffffff : System RAM (kmem)
```

- B.2 Decoder is committed

Before:
```
5d0000000-6cfffffff : CXL Window 0
  5d0000000-6cfffffff : region0
    5d0000000-6cfffffff : Soft Reserved
      5d0000000-6cfffffff : dax0.0
        5d0000000-6cfffffff : System RAM (kmem)
```

After (fallback to hmem):
```
5d0000000-6cfffffff : Soft Reserved
  5d0000000-6cfffffff : dax0.0
    5d0000000-6cfffffff : System RAM (kmem)
```

Thanks
Zhijian

On 16/07/2025 02:04, Smita Koralahalli wrote:
> This series introduces the ability to manage SOFT RESERVED iomem
> resources, enabling the CXL driver to remove any portions that
> intersect with created CXL regions.
>
> The current approach of leaving SOFT RESERVED entries as is can result
> in failures during device hotplug such as CXL because the address range
> remains reserved and unavailable for reuse even after region teardown.
>
> To address this, the CXL driver now uses a background worker that waits
> for cxl_mem driver probe to complete before scanning for intersecting
> resources. Then the driver walks through created CXL regions to trim any
> intersections with SOFT RESERVED resources in the iomem tree.
On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
> This series introduces the ability to manage SOFT RESERVED iomem
> resources, enabling the CXL driver to remove any portions that
> intersect with created CXL regions.

Hi Smita,

This set applied cleanly to today's cxl-next but fails, as shown in the
appended log, before region probe.

BTW - there were sparse warnings in the build that look related:

CHECK drivers/dax/hmem/hmem_notify.c
drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit

This isn't all the logs, I trimmed. Let me know if you need more or
other info to reproduce.

[ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing
[ 53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
[ 53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1
[ 53.653540] preempt_count: 1, expected: 0
[ 53.653554] RCU nest depth: 0, expected: 0
[ 53.653568] 3 locks held by kworker/46:1/1875:
[ 53.653569]  #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
[ 53.653583]  #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
[ 53.653589]  #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
[ 53.653598] Preemption disabled at:
[ 53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
[ 53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
[ 53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
[ 53.653648] Call Trace:
[ 53.653649]  <TASK>
[ 53.653652]  dump_stack_lvl+0xa8/0xd0
[ 53.653658]  dump_stack+0x14/0x20
[ 53.653659]  __might_resched+0x1ae/0x2d0
[ 53.653666]  __might_sleep+0x48/0x70
[ 53.653668]  __kmalloc_node_track_caller_noprof+0x349/0x510
[ 53.653674]  ? __devm_add_action+0x3d/0x160
[ 53.653685]  ? __pfx_devm_action_release+0x10/0x10
[ 53.653688]  __devres_alloc_node+0x4a/0x90
[ 53.653689]  ? __devres_alloc_node+0x4a/0x90
[ 53.653691]  ? __pfx_release_memregion+0x10/0x10 [dax_hmem]
[ 53.653693]  __devm_add_action+0x3d/0x160
[ 53.653696]  hmem_register_device+0xea/0x230 [dax_hmem]
[ 53.653700]  hmem_fallback_register_device+0x37/0x60
[ 53.653703]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
[ 53.653739]  walk_iomem_res_desc+0x55/0xb0
[ 53.653744]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
[ 53.653755]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
[ 53.653761]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
[ 53.653763]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 53.653768]  process_one_work+0x1fa/0x630
[ 53.653774]  worker_thread+0x1b2/0x360
[ 53.653777]  kthread+0x128/0x250
[ 53.653781]  ? __pfx_worker_thread+0x10/0x10
[ 53.653784]  ? __pfx_kthread+0x10/0x10
[ 53.653786]  ret_from_fork+0x139/0x1e0
[ 53.653790]  ? __pfx_kthread+0x10/0x10
[ 53.653792]  ret_from_fork_asm+0x1a/0x30
[ 53.653801]  </TASK>

[ 53.654193] =============================
[ 53.654203] [ BUG: Invalid wait context ]
[ 53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G W
[ 53.654623] -----------------------------
[ 53.654785] kworker/46:1/1875 is trying to lock:
[ 53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390
[ 53.655115] other info that might help us debug this:
[ 53.655273] context-{5:5}
[ 53.655428] 3 locks held by kworker/46:1/1875:
[ 53.655579]  #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
[ 53.655739]  #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
[ 53.655900]  #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
[ 53.656062] stack backtrace:
[ 53.656224] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
[ 53.656227] Tainted: [W]=WARN
[ 53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
[ 53.656232] Call Trace:
[ 53.656232]  <TASK>
[ 53.656234]  dump_stack_lvl+0x85/0xd0
[ 53.656238]  dump_stack+0x14/0x20
[ 53.656239]  __lock_acquire+0xaf4/0x2200
[ 53.656246]  lock_acquire+0xd8/0x300
[ 53.656248]  ? kernfs_add_one+0x34/0x390
[ 53.656252]  ? __might_resched+0x208/0x2d0
[ 53.656257]  down_write+0x44/0xe0
[ 53.656262]  ? kernfs_add_one+0x34/0x390
[ 53.656263]  kernfs_add_one+0x34/0x390
[ 53.656265]  kernfs_create_dir_ns+0x5a/0xa0
[ 53.656268]  sysfs_create_dir_ns+0x74/0xd0
[ 53.656270]  kobject_add_internal+0xb1/0x2f0
[ 53.656273]  kobject_add+0x7d/0xf0
[ 53.656275]  ? get_device_parent+0x28/0x1e0
[ 53.656280]  ? __pfx_klist_children_get+0x10/0x10
[ 53.656282]  device_add+0x124/0x8b0
[ 53.656285]  ? dev_set_name+0x56/0x70
[ 53.656287]  platform_device_add+0x102/0x260
[ 53.656289]  hmem_register_device+0x160/0x230 [dax_hmem]
[ 53.656291]  hmem_fallback_register_device+0x37/0x60
[ 53.656294]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
[ 53.656323]  walk_iomem_res_desc+0x55/0xb0
[ 53.656326]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
[ 53.656335]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
[ 53.656342]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
[ 53.656343]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 53.656346]  process_one_work+0x1fa/0x630
[ 53.656350]  worker_thread+0x1b2/0x360
[ 53.656352]  kthread+0x128/0x250
[ 53.656354]  ? __pfx_worker_thread+0x10/0x10
[ 53.656356]  ? __pfx_kthread+0x10/0x10
[ 53.656357]  ret_from_fork+0x139/0x1e0
[ 53.656360]  ? __pfx_kthread+0x10/0x10
[ 53.656361]  ret_from_fork_asm+0x1a/0x30
[ 53.656366]  </TASK>
[ 53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
[ 53.663552]  schedule+0x4a/0x160
[ 53.663553]  schedule_timeout+0x10a/0x120
[ 53.663555]  ? debug_smp_processor_id+0x1b/0x30
[ 53.663556]  ? trace_hardirqs_on+0x5f/0xd0
[ 53.663558]  __wait_for_common+0xb9/0x1c0
[ 53.663559]  ? __pfx_schedule_timeout+0x10/0x10
[ 53.663561]  wait_for_completion+0x28/0x30
[ 53.663562]  __synchronize_srcu+0xbf/0x180
[ 53.663566]  ? __pfx_wakeme_after_rcu+0x10/0x10
[ 53.663571]  ? i2c_repstart+0x30/0x80
[ 53.663576]  synchronize_srcu+0x46/0x120
[ 53.663577]  kill_dax+0x47/0x70
[ 53.663580]  __devm_create_dev_dax+0x112/0x470
[ 53.663582]  devm_create_dev_dax+0x26/0x50
[ 53.663584]  dax_hmem_probe+0x87/0xd0 [dax_hmem]
[ 53.663585]  platform_probe+0x61/0xd0
[ 53.663589]  really_probe+0xe2/0x390
[ 53.663591]  ? __pfx___device_attach_driver+0x10/0x10
[ 53.663593]  __driver_probe_device+0x7e/0x160
[ 53.663594]  driver_probe_device+0x23/0xa0
[ 53.663596]  __device_attach_driver+0x92/0x120
[ 53.663597]  bus_for_each_drv+0x8c/0xf0
[ 53.663599]  __device_attach+0xc2/0x1f0
[ 53.663601]  device_initial_probe+0x17/0x20
[ 53.663603]  bus_probe_device+0xa8/0xb0
[ 53.663604]  device_add+0x687/0x8b0
[ 53.663607]  ? dev_set_name+0x56/0x70
[ 53.663609]  platform_device_add+0x102/0x260
[ 53.663610]  hmem_register_device+0x160/0x230 [dax_hmem]
[ 53.663612]  hmem_fallback_register_device+0x37/0x60
[ 53.663614]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
[ 53.663637]  walk_iomem_res_desc+0x55/0xb0
[ 53.663640]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
[ 53.663647]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
[ 53.663654]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
[ 53.663655]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 53.663658]  process_one_work+0x1fa/0x630
[ 53.663662]  worker_thread+0x1b2/0x360
[ 53.663664]  kthread+0x128/0x250
[ 53.663666]  ? __pfx_worker_thread+0x10/0x10
[ 53.663668]  ? __pfx_kthread+0x10/0x10
[ 53.663670]  ret_from_fork+0x139/0x1e0
[ 53.663672]  ? __pfx_kthread+0x10/0x10
[ 53.663673]  ret_from_fork_asm+0x1a/0x30
[ 53.663677]  </TASK>
[ 53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
[ 53.700264] INFO: lockdep is turned off.
[ 53.701315] Preemption disabled at:
[ 53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
[ 53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
[ 53.701633] Tainted: [W]=WARN
[ 53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
[ 53.701638] Call Trace:
[ 53.701638]  <TASK>
[ 53.701640]  dump_stack_lvl+0xa8/0xd0
[ 53.701644]  dump_stack+0x14/0x20
[ 53.701645]  __schedule_bug+0xa2/0xd0
[ 53.701649]  __schedule+0xe6f/0x10d0
[ 53.701652]  ? debug_smp_processor_id+0x1b/0x30
[ 53.701655]  ? lock_release+0x1e6/0x2b0
[ 53.701658]  ? trace_hardirqs_on+0x5f/0xd0
[ 53.701661]  schedule+0x4a/0x160
[ 53.701662]  schedule_timeout+0x10a/0x120
[ 53.701664]  ? debug_smp_processor_id+0x1b/0x30
[ 53.701666]  ? trace_hardirqs_on+0x5f/0xd0
[ 53.701667]  __wait_for_common+0xb9/0x1c0
[ 53.701668]  ? __pfx_schedule_timeout+0x10/0x10
[ 53.701670]  wait_for_completion+0x28/0x30
[ 53.701671]  __synchronize_srcu+0xbf/0x180
[ 53.701677]  ? __pfx_wakeme_after_rcu+0x10/0x10
[ 53.701682]  ? i2c_repstart+0x30/0x80
[ 53.701685]  synchronize_srcu+0x46/0x120
[ 53.701687]  kill_dax+0x47/0x70
[ 53.701689]  __devm_create_dev_dax+0x112/0x470
[ 53.701691]  devm_create_dev_dax+0x26/0x50
[ 53.701693]  dax_hmem_probe+0x87/0xd0 [dax_hmem]
[ 53.701695]  platform_probe+0x61/0xd0
[ 53.701698]  really_probe+0xe2/0x390
[ 53.701700]  ? __pfx___device_attach_driver+0x10/0x10
[ 53.701701]  __driver_probe_device+0x7e/0x160
[ 53.701703]  driver_probe_device+0x23/0xa0
[ 53.701704]  __device_attach_driver+0x92/0x120
[ 53.701706]  bus_for_each_drv+0x8c/0xf0
[ 53.701708]  __device_attach+0xc2/0x1f0
[ 53.701710]  device_initial_probe+0x17/0x20
[ 53.701711]  bus_probe_device+0xa8/0xb0
[ 53.701712]  device_add+0x687/0x8b0
[ 53.701715]  ? dev_set_name+0x56/0x70
[ 53.701717]  platform_device_add+0x102/0x260
[ 53.701718]  hmem_register_device+0x160/0x230 [dax_hmem]
[ 53.701720]  hmem_fallback_register_device+0x37/0x60
[ 53.701722]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
[ 53.701734]  walk_iomem_res_desc+0x55/0xb0
[ 53.701738]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
[ 53.701745]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
[ 53.701751]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
[ 53.701752]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 53.701756]  process_one_work+0x1fa/0x630
[ 53.701760]  worker_thread+0x1b2/0x360
[ 53.701762]  kthread+0x128/0x250
[ 53.701765]  ? __pfx_worker_thread+0x10/0x10
[ 53.701766]  ? __pfx_kthread+0x10/0x10
[ 53.701768]  ret_from_fork+0x139/0x1e0
[ 53.701771]  ? __pfx_kthread+0x10/0x10
[ 53.701772]  ret_from_fork_asm+0x1a/0x30
[ 53.701777]  </TASK>
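
Every splat in this log is reached through hmem_fallback_register_device() while hmem_notify_lock is held, and the callee goes on to allocate memory, add a device and wait on SRCU. One common way to avoid calling a handler that may sleep under a spinlock is to copy the handler pointer while holding the lock and invoke it only after the lock is dropped. A minimal userspace sketch of that pattern follows, assuming a C11 atomic_flag as a stand-in for the kernel spinlock; every name in it is hypothetical and none of it is kernel code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical handler type, loosely modelled on a (nid, resource) callback. */
typedef int (*fallback_fn)(int target_nid, const void *res);

static atomic_flag notify_lock = ATOMIC_FLAG_INIT;
static fallback_fn registered_fn;	/* protected by notify_lock */
static int lock_held;			/* demo-only marker, single-threaded use */

static void notify_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&notify_lock,
						 memory_order_acquire))
		;	/* spin until the flag is clear */
	lock_held = 1;
}

static void notify_lock_release(void)
{
	lock_held = 0;
	atomic_flag_clear_explicit(&notify_lock, memory_order_release);
}

static void register_fallback(fallback_fn fn)
{
	notify_lock_acquire();
	registered_fn = fn;
	notify_lock_release();
}

/* Snapshot the pointer under the lock, call the handler outside of it. */
static int call_fallback(int target_nid, const void *res)
{
	fallback_fn fn;

	notify_lock_acquire();
	fn = registered_fn;
	notify_lock_release();

	/* The handler may block here without holding the spinlock. */
	return fn ? fn(target_nid, res) : -1;
}

/* Demo handler: returns -2 if it was (wrongly) entered with the lock held. */
static int demo_handler(int target_nid, const void *res)
{
	(void)res;
	return lock_held ? -2 : target_nid;
}
```

After register_fallback(demo_handler), call_fallback(3, NULL) returns 3, showing the handler ran with the lock released. The snapshot alone does not protect against the handler being unregistered (or its module unloaded) between the unlock and the call; a real implementation would need its own answer to that.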
Hi Alison,

On 7/15/2025 2:07 PM, Alison Schofield wrote:
> On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
>> This series introduces the ability to manage SOFT RESERVED iomem
>> resources, enabling the CXL driver to remove any portions that
>> intersect with created CXL regions.
>
> Hi Smita,
>
> This set applied cleanly to today's cxl-next but fails, as shown in the
> appended log, before region probe.
>
> BTW - there were sparse warnings in the build that look related:
>
> CHECK drivers/dax/hmem/hmem_notify.c
> drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
> drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit

Thanks for pointing out this bug. I failed to release the spinlock
before calling hmem_register_device(), which internally calls
platform_device_add() and can sleep. The following fix addresses that
bug. I'll incorporate this into v6:

diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
index 6c276c5bd51d..8f411f3fe7bd 100644
--- a/drivers/dax/hmem/hmem_notify.c
+++ b/drivers/dax/hmem/hmem_notify.c
@@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const struct resource *res)
 {
 	walk_hmem_fn hmem_fn;

-	guard(spinlock)(&hmem_notify_lock);
+	spin_lock(&hmem_notify_lock);
 	hmem_fn = hmem_fallback_fn;
+	spin_unlock(&hmem_notify_lock);

 	if (hmem_fn)
 		hmem_fn(target_nid, res);
--

As for the log:
[ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting
for cxl_mem probing

I'm still analyzing that. Here is my thought process so far.

- This occurs when cxl_acpi_probe() runs significantly earlier than
cxl_mem_probe(), so CXL region creation (which happens in
cxl_port_endpoint_probe()) may or may not have completed by the time
trimming is attempted.

- Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does
guarantee load order when all components are built as modules. So even
if the timeout occurs and cxl_mem_probe() hasn't run within the wait
window, MODULE_SOFTDEP ensures that cxl_port is loaded before both
cxl_acpi and cxl_mem in modular configurations. As a result, region
creation is eventually guaranteed, and wait_for_device_probe() will
succeed once the relevant probes complete.

- However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no
guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish
before cxl_port_probe() even begins, which can cause
wait_for_device_probe() to return prematurely and trigger the timeout.

- In my local setup, I observed that a 30-second timeout was generally
sufficient to catch this race, allowing cxl_port_probe() to load while
cxl_acpi_probe() is still active. Since we cannot mix built-in and
modular components (i.e., have cxl_acpi=y and cxl_port=m), the timeout
serves as a best-effort mechanism. After the timeout,
wait_for_device_probe() ensures cxl_port_probe() has completed before
trimming proceeds, making the logic good enough for most boot-time
races.

One possible improvement I'm considering is to schedule delayed work
from cxl_acpi_probe(). This deferred work could wait slightly longer for
cxl_mem_probe() to complete (which itself softdeps on cxl_port) before
initiating the soft reserve trimming.

That said, I'm still evaluating better options to more robustly
coordinate probe ordering between cxl_acpi, cxl_port, cxl_mem and
cxl_region, and I'm looking for suggestions here.

Thanks
Smita

>
>
> This isn't all the logs, I trimmed. Let me know if you need more or
> other info to reproduce.
i2c_repstart+0x30/0x80 > [ 53.663576] synchronize_srcu+0x46/0x120 > [ 53.663577] kill_dax+0x47/0x70 > [ 53.663580] __devm_create_dev_dax+0x112/0x470 > [ 53.663582] devm_create_dev_dax+0x26/0x50 > [ 53.663584] dax_hmem_probe+0x87/0xd0 [dax_hmem] > [ 53.663585] platform_probe+0x61/0xd0 > [ 53.663589] really_probe+0xe2/0x390 > [ 53.663591] ? __pfx___device_attach_driver+0x10/0x10 > [ 53.663593] __driver_probe_device+0x7e/0x160 > [ 53.663594] driver_probe_device+0x23/0xa0 > [ 53.663596] __device_attach_driver+0x92/0x120 > [ 53.663597] bus_for_each_drv+0x8c/0xf0 > [ 53.663599] __device_attach+0xc2/0x1f0 > [ 53.663601] device_initial_probe+0x17/0x20 > [ 53.663603] bus_probe_device+0xa8/0xb0 > [ 53.663604] device_add+0x687/0x8b0 > [ 53.663607] ? dev_set_name+0x56/0x70 > [ 53.663609] platform_device_add+0x102/0x260 > [ 53.663610] hmem_register_device+0x160/0x230 [dax_hmem] > [ 53.663612] hmem_fallback_register_device+0x37/0x60 > [ 53.663614] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] > [ 53.663637] walk_iomem_res_desc+0x55/0xb0 > [ 53.663640] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] > [ 53.663647] cxl_region_softreserv_update+0x46/0x50 [cxl_core] > [ 53.663654] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] > [ 53.663655] ? __pfx_autoremove_wake_function+0x10/0x10 > [ 53.663658] process_one_work+0x1fa/0x630 > [ 53.663662] worker_thread+0x1b2/0x360 > [ 53.663664] kthread+0x128/0x250 > [ 53.663666] ? __pfx_worker_thread+0x10/0x10 > [ 53.663668] ? __pfx_kthread+0x10/0x10 > [ 53.663670] ret_from_fork+0x139/0x1e0 > [ 53.663672] ? __pfx_kthread+0x10/0x10 > [ 53.663673] ret_from_fork_asm+0x1a/0x30 > [ 53.663677] </TASK> > [ 53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002 > [ 53.700264] INFO: lockdep is turned off. 
> [ 53.701315] Preemption disabled at: > [ 53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60 > [ 53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary) > [ 53.701633] Tainted: [W]=WARN > [ 53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi] > [ 53.701638] Call Trace: > [ 53.701638] <TASK> > [ 53.701640] dump_stack_lvl+0xa8/0xd0 > [ 53.701644] dump_stack+0x14/0x20 > [ 53.701645] __schedule_bug+0xa2/0xd0 > [ 53.701649] __schedule+0xe6f/0x10d0 > [ 53.701652] ? debug_smp_processor_id+0x1b/0x30 > [ 53.701655] ? lock_release+0x1e6/0x2b0 > [ 53.701658] ? trace_hardirqs_on+0x5f/0xd0 > [ 53.701661] schedule+0x4a/0x160 > [ 53.701662] schedule_timeout+0x10a/0x120 > [ 53.701664] ? debug_smp_processor_id+0x1b/0x30 > [ 53.701666] ? trace_hardirqs_on+0x5f/0xd0 > [ 53.701667] __wait_for_common+0xb9/0x1c0 > [ 53.701668] ? __pfx_schedule_timeout+0x10/0x10 > [ 53.701670] wait_for_completion+0x28/0x30 > [ 53.701671] __synchronize_srcu+0xbf/0x180 > [ 53.701677] ? __pfx_wakeme_after_rcu+0x10/0x10 > [ 53.701682] ? i2c_repstart+0x30/0x80 > [ 53.701685] synchronize_srcu+0x46/0x120 > [ 53.701687] kill_dax+0x47/0x70 > [ 53.701689] __devm_create_dev_dax+0x112/0x470 > [ 53.701691] devm_create_dev_dax+0x26/0x50 > [ 53.701693] dax_hmem_probe+0x87/0xd0 [dax_hmem] > [ 53.701695] platform_probe+0x61/0xd0 > [ 53.701698] really_probe+0xe2/0x390 > [ 53.701700] ? __pfx___device_attach_driver+0x10/0x10 > [ 53.701701] __driver_probe_device+0x7e/0x160 > [ 53.701703] driver_probe_device+0x23/0xa0 > [ 53.701704] __device_attach_driver+0x92/0x120 > [ 53.701706] bus_for_each_drv+0x8c/0xf0 > [ 53.701708] __device_attach+0xc2/0x1f0 > [ 53.701710] device_initial_probe+0x17/0x20 > [ 53.701711] bus_probe_device+0xa8/0xb0 > [ 53.701712] device_add+0x687/0x8b0 > [ 53.701715] ? 
dev_set_name+0x56/0x70 > [ 53.701717] platform_device_add+0x102/0x260 > [ 53.701718] hmem_register_device+0x160/0x230 [dax_hmem] > [ 53.701720] hmem_fallback_register_device+0x37/0x60 > [ 53.701722] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] > [ 53.701734] walk_iomem_res_desc+0x55/0xb0 > [ 53.701738] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] > [ 53.701745] cxl_region_softreserv_update+0x46/0x50 [cxl_core] > [ 53.701751] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] > [ 53.701752] ? __pfx_autoremove_wake_function+0x10/0x10 > [ 53.701756] process_one_work+0x1fa/0x630 > [ 53.701760] worker_thread+0x1b2/0x360 > [ 53.701762] kthread+0x128/0x250 > [ 53.701765] ? __pfx_worker_thread+0x10/0x10 > [ 53.701766] ? __pfx_kthread+0x10/0x10 > [ 53.701768] ret_from_fork+0x139/0x1e0 > [ 53.701771] ? __pfx_kthread+0x10/0x10 > [ 53.701772] ret_from_fork_asm+0x1a/0x30 > [ 53.701777] </TASK> >
Koralahalli Channabasappa, Smita wrote:
[..]
> That said, I'm still evaluating better options to more robustly
> coordinate probe ordering between cxl_acpi, cxl_port, cxl_mem and
> cxl_region and looking for suggestions here.

I never quite understood the arguments around why wait_for_device_probe()
does not work, but I did find a bug in my prior thinking on the way towards
this RFC [1].

What I had missed is that MODULE_SOFTDEP() only guarantees that the module
gets loaded eventually; it does not guarantee that the softdep has
completed init before the caller performs its own init. It works sometimes,
and that is probably what misled me about that contract. request_module()
is synchronous.

With that in place I now see that wait_for_device_probe() does the right
thing. It flushes cxl_pci attach for devices present at boot, and all
follow-on probe work gets flushed as well. With that in hand the RFC now
has a stable quiesce point to walk the CXL topology and make decisions.

The RFC is effectively a fix for platforms where CXL loses the
MODULE_SOFTDEP() race.

[1]: http://lore.kernel.org/68808fb4e4cbf_137e6b100cc@dwillia2-xfh.jf.intel.com.notmuch
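The distinction drawn above between a load that only happens eventually and one that has completed before the caller continues can be illustrated with a small userspace toy model. This is only a sketch: it is not kernel code, and every name in it (request_load_deferred(), request_load_sync(), flush_pending(), and so on) is hypothetical and chosen for illustration. The deferred path stands in for the MODULE_SOFTDEP() behaviour described above, the synchronous path for request_module(), and flush_pending() for a quiesce point in the spirit of wait_for_device_probe(). The model always shows the losing side of the race, whereas the real race is only lost sometimes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; none of these names exist in the kernel. */

#define MAX_PENDING 8

typedef void (*init_fn)(void);

static init_fn pending[MAX_PENDING];	/* loads requested but not yet run */
static size_t nr_pending;
static bool provider_ready;		/* set once the provider's init ran */

/* Stand-in for the init routine of the module being depended on. */
void provider_init(void)
{
	provider_ready = true;
}

/* Soft dependency analogue: the load is only queued, init runs later. */
bool request_load_deferred(init_fn fn)
{
	if (nr_pending == MAX_PENDING)
		return false;
	pending[nr_pending++] = fn;
	return true;
}

/* Synchronous analogue: init has finished by the time this returns. */
void request_load_sync(init_fn fn)
{
	fn();
}

/* Quiesce point analogue: run every init that is still outstanding. */
void flush_pending(void)
{
	for (size_t i = 0; i < nr_pending; i++)
		pending[i]();
	nr_pending = 0;
}

/* What a dependent init routine would observe if it ran right now. */
bool consumer_sees_provider(void)
{
	return provider_ready;
}

/* Return the model to its initial state. */
void reset_model(void)
{
	provider_ready = false;
	nr_pending = 0;
}
```

In this model a consumer that looks right after request_load_deferred() finds the provider not ready, while one that looks after request_load_sync() or after flush_pending() finds it ready, which is the property a topology walk needs before it makes decisions.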
On Tue, Jul 15, 2025 at 11:01:23PM -0700, Koralahalli Channabasappa, Smita wrote:
> Hi Alison,
>
> On 7/15/2025 2:07 PM, Alison Schofield wrote:
> > On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
> > > This series introduces the ability to manage SOFT RESERVED iomem
> > > resources, enabling the CXL driver to remove any portions that
> > > intersect with created CXL regions.
> >
> > Hi Smita,
> >
> > This set applied cleanly to todays cxl-next but fails like appended
> > before region probe.
> >
> > BTW - there were sparse warnings in the build that look related:
> > CHECK drivers/dax/hmem/hmem_notify.c
> > drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
> > drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit
>
> Thanks for pointing this bug. I failed to release the spinlock before
> calling hmem_register_device(), which internally calls platform_device_add()
> and can sleep. The following fix addresses that bug. I’ll incorporate this
> into v6:
>
> diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
> index 6c276c5bd51d..8f411f3fe7bd 100644
> --- a/drivers/dax/hmem/hmem_notify.c
> +++ b/drivers/dax/hmem/hmem_notify.c
> @@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const
> struct resource *res)
>  {
> 	walk_hmem_fn hmem_fn;
>
> -	guard(spinlock)(&hmem_notify_lock);
> +	spin_lock(&hmem_notify_lock);
> 	hmem_fn = hmem_fallback_fn;
> +	spin_unlock(&hmem_notify_lock);
>
> 	if (hmem_fn)
> 		hmem_fn(target_nid, res);
> --

Hi Smita, Adding the above got me past that, and doubling the timeout
below stopped that from happening. After that, I haven't had time to
trace, so I'll just dump on you for now:

In /proc/iomem
Here, we see a region resource, no CXL Window, and no dax, and no
actual region, not even disabled, is available.
c080000000-c47fffffff : region0

And, here no CXL Window, no region, and a soft reserved.
68e80000000-70e7fffffff : Soft Reserved
  68e80000000-70e7fffffff : dax1.0
    68e80000000-70e7fffffff : System RAM (kmem)

I haven't yet walked through the v4 to v5 changes so I'll do that next.

>
> As for the log:
> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for
> cxl_mem probing
>
> I’m still analyzing that. Here's what was my thought process so far.
>
> - This occurs when cxl_acpi_probe() runs significantly earlier than
> cxl_mem_probe(), so CXL region creation (which happens in
> cxl_port_endpoint_probe()) may or may not have completed by the time
> trimming is attempted.
>
> - Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does
> guarantee load order when all components are built as modules. So even if
> the timeout occurs and cxl_mem_probe() hasn’t run within the wait window,
> MODULE_SOFTDEP ensures that cxl_port is loaded before both cxl_acpi and
> cxl_mem in modular configurations. As a result, region creation is
> eventually guaranteed, and wait_for_device_probe() will succeed once the
> relevant probes complete.
>
> - However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no
> guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish
> before cxl_port_probe() even begins, which can cause wait_for_device_probe()
> to return prematurely and trigger the timeout.
>
> - In my local setup, I observed that a 30-second timeout was generally
> sufficient to catch this race, allowing cxl_port_probe() to load while
> cxl_acpi_probe() is still active. Since we cannot mix built-in and modular
> components (i.e., have cxl_acpi=y and cxl_port=m), the timeout serves as a
> best-effort mechanism. After the timeout, wait_for_device_probe() ensures
> cxl_port_probe() has completed before trimming proceeds, making the logic
> good enough to most boot-time races.
>
> One possible improvement I’m considering is to schedule a
> delayed_workqueue() from cxl_acpi_probe(). This deferred work could wait
> slightly longer for cxl_mem_probe() to complete (which itself softdeps on
> cxl_port) before initiating the soft reserve trimming.
>
> That said, I'm still evaluating better options to more robustly coordinate
> probe ordering between cxl_acpi, cxl_port, cxl_mem and cxl_region and
> looking for suggestions here.
>
> Thanks
> Smita
>
> >
> >
> > This isn't all the logs, I trimmed. Let me know if you need more or
> > other info to reproduce.
> >
> > [..]
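The hmem_notify.c fix quoted in the message above follows a common pattern: take a snapshot of the callback pointer while holding the spinlock, drop the lock, and only then invoke the callback, so that a callback which may sleep never runs in atomic context. The sketch below models that pattern in userspace C with a C11 atomic_flag standing in for the kernel spinlock. It is an illustration under stated assumptions rather than the driver code: the function names are simplified stand-ins, and checking_handler() is a hypothetical callback that exists only to report whether the lock was still held when it ran.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Userspace model; names echo the patch but this is not kernel code. */
typedef int (*walk_fn)(int target_nid, const void *res);

static atomic_flag notify_lock = ATOMIC_FLAG_INIT;
static walk_fn fallback_fn;		/* protected by notify_lock */

static void lock_notify(void)
{
	while (atomic_flag_test_and_set_explicit(&notify_lock,
						 memory_order_acquire))
		;	/* spin until the flag is released */
}

static void unlock_notify(void)
{
	atomic_flag_clear_explicit(&notify_lock, memory_order_release);
}

/* Publish the handler under the lock. */
void register_fallback_handler(walk_fn fn)
{
	lock_notify();
	fallback_fn = fn;
	unlock_notify();
}

/*
 * Snapshot the handler under the lock, then call it with the lock
 * dropped. Returns the handler's result, or -1 if none is registered.
 */
int fallback_register_device(int target_nid, const void *res)
{
	walk_fn fn;

	lock_notify();
	fn = fallback_fn;
	unlock_notify();

	return fn ? fn(target_nid, res) : -1;
}

/*
 * Hypothetical handler for illustration: returns -2 if the lock is
 * still held by the caller, otherwise echoes target_nid.
 */
int checking_handler(int target_nid, const void *res)
{
	(void)res;
	if (atomic_flag_test_and_set(&notify_lock))
		return -2;	/* lock was held across the call */
	atomic_flag_clear(&notify_lock);
	return target_nid;
}
```

One consequence of this pattern is that the snapshot can be stale by the time it is called if a handler is replaced concurrently, so whether it is sufficient depends on how handler registration is serialized elsewhere.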
On 7/16/2025 1:20 PM, Alison Schofield wrote:
> On Tue, Jul 15, 2025 at 11:01:23PM -0700, Koralahalli Channabasappa, Smita wrote:
>> Hi Alison,
>>
>> On 7/15/2025 2:07 PM, Alison Schofield wrote:
>>> On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
>>>> This series introduces the ability to manage SOFT RESERVED iomem
>>>> resources, enabling the CXL driver to remove any portions that
>>>> intersect with created CXL regions.
>>>
>>> Hi Smita,
>>>
>>> This set applied cleanly to todays cxl-next but fails like appended
>>> before region probe.
>>>
>>> BTW - there were sparse warnings in the build that look related:
>>> CHECK drivers/dax/hmem/hmem_notify.c
>>> drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
>>> drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit
>>
>> Thanks for pointing this bug. I failed to release the spinlock before
>> calling hmem_register_device(), which internally calls platform_device_add()
>> and can sleep. The following fix addresses that bug. I’ll incorporate this
>> into v6:
>>
>> diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
>> index 6c276c5bd51d..8f411f3fe7bd 100644
>> --- a/drivers/dax/hmem/hmem_notify.c
>> +++ b/drivers/dax/hmem/hmem_notify.c
>> @@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const
>> struct resource *res)
>>  {
>> 	walk_hmem_fn hmem_fn;
>>
>> -	guard(spinlock)(&hmem_notify_lock);
>> +	spin_lock(&hmem_notify_lock);
>> 	hmem_fn = hmem_fallback_fn;
>> +	spin_unlock(&hmem_notify_lock);
>>
>> 	if (hmem_fn)
>> 		hmem_fn(target_nid, res);
>> --
>
> Hi Smita, Adding the above got me past that, and doubling the timeout
> below stopped that from happening. After that, I haven't had time to
> trace so, I'll just dump on you for now:
>
> In /proc/iomem
> Here, we see a regions resource, no CXL Window, and no dax, and no
> actual region, not even disabled, is available.
> c080000000-c47fffffff : region0
>
> And, here no CXL Window, no region, and a soft reserved.
> 68e80000000-70e7fffffff : Soft Reserved
>   68e80000000-70e7fffffff : dax1.0
>     68e80000000-70e7fffffff : System RAM (kmem)
>
> I haven't yet walked through the v4 to v5 changes so I'll do that next.

Hi Alison,

To help better understand the current behavior, could you share more about
your platform configuration? Specifically, are there two memory cards
involved? One at c080000000 (which appears as region0) and another at
68e80000000 (which is falling back to kmem via dax1.0)? Additionally, how
are the Soft Reserved ranges laid out on your system for these cards? I'm
trying to understand the "before" state of the resources, i.e., prior to
the trimming applied by my patches.

Also, do you think it's feasible to change the direction of the soft
reserve trimming, that is, defer it until after CXL region or memdev
creation is complete? In this case it would be trimmed afterwards, but
inline with the existing region or memdev creation. This might simplify the
flow by removing the need for wait_event_timeout(), wait_for_device_probe()
and the workqueue logic inside cxl_acpi_probe().

(As a side note, I experimented with changing cxl_acpi_init() to a
late_initcall() and observed that it consistently avoided probe ordering
issues in my setup.

Additional note: I realized that even when cxl_acpi_probe() fails, the
fallback DAX registration path (via cxl_softreserv_mem_update()) still
waits on cxl_mem_active() and wait_for_device_probe(). I plan to address
this in v6 by immediately triggering fallback DAX registration
(hmem_register_device()) when the ACPI probe fails, instead of waiting.)

Thanks
Smita

>
>>
>> As for the log:
>> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for
>> cxl_mem probing
>>
>> I’m still analyzing that. Here's what was my thought process so far.
>>
>> - This occurs when cxl_acpi_probe() runs significantly earlier than
>> cxl_mem_probe(), so CXL region creation (which happens in
>> cxl_port_endpoint_probe()) may or may not have completed by the time
>> trimming is attempted.
>>
>> - Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does
>> guarantee load order when all components are built as modules. So even if
>> the timeout occurs and cxl_mem_probe() hasn’t run within the wait window,
>> MODULE_SOFTDEP ensures that cxl_port is loaded before both cxl_acpi and
>> cxl_mem in modular configurations. As a result, region creation is
>> eventually guaranteed, and wait_for_device_probe() will succeed once the
>> relevant probes complete.
>>
>> - However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no
>> guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish
>> before cxl_port_probe() even begins, which can cause wait_for_device_probe()
>> to return prematurely and trigger the timeout.
>>
>> - In my local setup, I observed that a 30-second timeout was generally
>> sufficient to catch this race, allowing cxl_port_probe() to load while
>> cxl_acpi_probe() is still active. Since we cannot mix built-in and modular
>> components (i.e., have cxl_acpi=y and cxl_port=m), the timeout serves as a
>> best-effort mechanism. After the timeout, wait_for_device_probe() ensures
>> cxl_port_probe() has completed before trimming proceeds, making the logic
>> good enough to most boot-time races.
>>
>> One possible improvement I’m considering is to schedule a
>> delayed_workqueue() from cxl_acpi_probe(). This deferred work could wait
>> slightly longer for cxl_mem_probe() to complete (which itself softdeps on
>> cxl_port) before initiating the soft reserve trimming.
>> >> That said, I'm still evaluating better options to more robustly coordinate >> probe ordering between cxl_acpi, cxl_port, cxl_mem and cxl_region and >> looking for suggestions here. >> >> Thanks >> Smita >> >>> >>> >>> This isn't all the logs, I trimmed. Let me know if you need more or >>> other info to reproduce. >>> >>> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing >>> [ 53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321 >>> [ 53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1 >>> [ 53.653540] preempt_count: 1, expected: 0 >>> [ 53.653554] RCU nest depth: 0, expected: 0 >>> [ 53.653568] 3 locks held by kworker/46:1/1875: >>> [ 53.653569] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630 >>> [ 53.653583] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630 >>> [ 53.653589] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60 >>> [ 53.653598] Preemption disabled at: >>> [ 53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60 >>> [ 53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary) >>> [ 53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi] >>> [ 53.653648] Call Trace: >>> [ 53.653649] <TASK> >>> [ 53.653652] dump_stack_lvl+0xa8/0xd0 >>> [ 53.653658] dump_stack+0x14/0x20 >>> [ 53.653659] __might_resched+0x1ae/0x2d0 >>> [ 53.653666] __might_sleep+0x48/0x70 >>> [ 53.653668] __kmalloc_node_track_caller_noprof+0x349/0x510 >>> [ 53.653674] ? __devm_add_action+0x3d/0x160 >>> [ 53.653685] ? __pfx_devm_action_release+0x10/0x10 >>> [ 53.653688] __devres_alloc_node+0x4a/0x90 >>> [ 53.653689] ? __devres_alloc_node+0x4a/0x90 >>> [ 53.653691] ? 
__pfx_release_memregion+0x10/0x10 [dax_hmem] >>> [ 53.653693] __devm_add_action+0x3d/0x160 >>> [ 53.653696] hmem_register_device+0xea/0x230 [dax_hmem] >>> [ 53.653700] hmem_fallback_register_device+0x37/0x60 >>> [ 53.653703] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] >>> [ 53.653739] walk_iomem_res_desc+0x55/0xb0 >>> [ 53.653744] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] >>> [ 53.653755] cxl_region_softreserv_update+0x46/0x50 [cxl_core] >>> [ 53.653761] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] >>> [ 53.653763] ? __pfx_autoremove_wake_function+0x10/0x10 >>> [ 53.653768] process_one_work+0x1fa/0x630 >>> [ 53.653774] worker_thread+0x1b2/0x360 >>> [ 53.653777] kthread+0x128/0x250 >>> [ 53.653781] ? __pfx_worker_thread+0x10/0x10 >>> [ 53.653784] ? __pfx_kthread+0x10/0x10 >>> [ 53.653786] ret_from_fork+0x139/0x1e0 >>> [ 53.653790] ? __pfx_kthread+0x10/0x10 >>> [ 53.653792] ret_from_fork_asm+0x1a/0x30 >>> [ 53.653801] </TASK> >>> >>> [ 53.654193] ============================= >>> [ 53.654203] [ BUG: Invalid wait context ] >>> [ 53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G W >>> [ 53.654623] ----------------------------- >>> [ 53.654785] kworker/46:1/1875 is trying to lock: >>> [ 53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390 >>> [ 53.655115] other info that might help us debug this: >>> [ 53.655273] context-{5:5} >>> [ 53.655428] 3 locks held by kworker/46:1/1875: >>> [ 53.655579] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630 >>> [ 53.655739] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630 >>> [ 53.655900] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60 >>> [ 53.656062] stack backtrace: >>> [ 53.656224] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary) >>> [ 53.656227] Tainted: [W]=WARN >>> [ 53.656228] 
Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi] >>> [ 53.656232] Call Trace: >>> [ 53.656232] <TASK> >>> [ 53.656234] dump_stack_lvl+0x85/0xd0 >>> [ 53.656238] dump_stack+0x14/0x20 >>> [ 53.656239] __lock_acquire+0xaf4/0x2200 >>> [ 53.656246] lock_acquire+0xd8/0x300 >>> [ 53.656248] ? kernfs_add_one+0x34/0x390 >>> [ 53.656252] ? __might_resched+0x208/0x2d0 >>> [ 53.656257] down_write+0x44/0xe0 >>> [ 53.656262] ? kernfs_add_one+0x34/0x390 >>> [ 53.656263] kernfs_add_one+0x34/0x390 >>> [ 53.656265] kernfs_create_dir_ns+0x5a/0xa0 >>> [ 53.656268] sysfs_create_dir_ns+0x74/0xd0 >>> [ 53.656270] kobject_add_internal+0xb1/0x2f0 >>> [ 53.656273] kobject_add+0x7d/0xf0 >>> [ 53.656275] ? get_device_parent+0x28/0x1e0 >>> [ 53.656280] ? __pfx_klist_children_get+0x10/0x10 >>> [ 53.656282] device_add+0x124/0x8b0 >>> [ 53.656285] ? dev_set_name+0x56/0x70 >>> [ 53.656287] platform_device_add+0x102/0x260 >>> [ 53.656289] hmem_register_device+0x160/0x230 [dax_hmem] >>> [ 53.656291] hmem_fallback_register_device+0x37/0x60 >>> [ 53.656294] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] >>> [ 53.656323] walk_iomem_res_desc+0x55/0xb0 >>> [ 53.656326] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] >>> [ 53.656335] cxl_region_softreserv_update+0x46/0x50 [cxl_core] >>> [ 53.656342] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] >>> [ 53.656343] ? __pfx_autoremove_wake_function+0x10/0x10 >>> [ 53.656346] process_one_work+0x1fa/0x630 >>> [ 53.656350] worker_thread+0x1b2/0x360 >>> [ 53.656352] kthread+0x128/0x250 >>> [ 53.656354] ? __pfx_worker_thread+0x10/0x10 >>> [ 53.656356] ? __pfx_kthread+0x10/0x10 >>> [ 53.656357] ret_from_fork+0x139/0x1e0 >>> [ 53.656360] ? __pfx_kthread+0x10/0x10 >>> [ 53.656361] ret_from_fork_asm+0x1a/0x30 >>> [ 53.656366] </TASK> >>> [ 53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002 >>> [ 53.663552] schedule+0x4a/0x160 >>> [ 53.663553] schedule_timeout+0x10a/0x120 >>> [ 53.663555] ? 
debug_smp_processor_id+0x1b/0x30 >>> [ 53.663556] ? trace_hardirqs_on+0x5f/0xd0 >>> [ 53.663558] __wait_for_common+0xb9/0x1c0 >>> [ 53.663559] ? __pfx_schedule_timeout+0x10/0x10 >>> [ 53.663561] wait_for_completion+0x28/0x30 >>> [ 53.663562] __synchronize_srcu+0xbf/0x180 >>> [ 53.663566] ? __pfx_wakeme_after_rcu+0x10/0x10 >>> [ 53.663571] ? i2c_repstart+0x30/0x80 >>> [ 53.663576] synchronize_srcu+0x46/0x120 >>> [ 53.663577] kill_dax+0x47/0x70 >>> [ 53.663580] __devm_create_dev_dax+0x112/0x470 >>> [ 53.663582] devm_create_dev_dax+0x26/0x50 >>> [ 53.663584] dax_hmem_probe+0x87/0xd0 [dax_hmem] >>> [ 53.663585] platform_probe+0x61/0xd0 >>> [ 53.663589] really_probe+0xe2/0x390 >>> [ 53.663591] ? __pfx___device_attach_driver+0x10/0x10 >>> [ 53.663593] __driver_probe_device+0x7e/0x160 >>> [ 53.663594] driver_probe_device+0x23/0xa0 >>> [ 53.663596] __device_attach_driver+0x92/0x120 >>> [ 53.663597] bus_for_each_drv+0x8c/0xf0 >>> [ 53.663599] __device_attach+0xc2/0x1f0 >>> [ 53.663601] device_initial_probe+0x17/0x20 >>> [ 53.663603] bus_probe_device+0xa8/0xb0 >>> [ 53.663604] device_add+0x687/0x8b0 >>> [ 53.663607] ? dev_set_name+0x56/0x70 >>> [ 53.663609] platform_device_add+0x102/0x260 >>> [ 53.663610] hmem_register_device+0x160/0x230 [dax_hmem] >>> [ 53.663612] hmem_fallback_register_device+0x37/0x60 >>> [ 53.663614] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] >>> [ 53.663637] walk_iomem_res_desc+0x55/0xb0 >>> [ 53.663640] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] >>> [ 53.663647] cxl_region_softreserv_update+0x46/0x50 [cxl_core] >>> [ 53.663654] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] >>> [ 53.663655] ? __pfx_autoremove_wake_function+0x10/0x10 >>> [ 53.663658] process_one_work+0x1fa/0x630 >>> [ 53.663662] worker_thread+0x1b2/0x360 >>> [ 53.663664] kthread+0x128/0x250 >>> [ 53.663666] ? __pfx_worker_thread+0x10/0x10 >>> [ 53.663668] ? __pfx_kthread+0x10/0x10 >>> [ 53.663670] ret_from_fork+0x139/0x1e0 >>> [ 53.663672] ? 
__pfx_kthread+0x10/0x10 >>> [ 53.663673] ret_from_fork_asm+0x1a/0x30 >>> [ 53.663677] </TASK> >>> [ 53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002 >>> [ 53.700264] INFO: lockdep is turned off. >>> [ 53.701315] Preemption disabled at: >>> [ 53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60 >>> [ 53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary) >>> [ 53.701633] Tainted: [W]=WARN >>> [ 53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi] >>> [ 53.701638] Call Trace: >>> [ 53.701638] <TASK> >>> [ 53.701640] dump_stack_lvl+0xa8/0xd0 >>> [ 53.701644] dump_stack+0x14/0x20 >>> [ 53.701645] __schedule_bug+0xa2/0xd0 >>> [ 53.701649] __schedule+0xe6f/0x10d0 >>> [ 53.701652] ? debug_smp_processor_id+0x1b/0x30 >>> [ 53.701655] ? lock_release+0x1e6/0x2b0 >>> [ 53.701658] ? trace_hardirqs_on+0x5f/0xd0 >>> [ 53.701661] schedule+0x4a/0x160 >>> [ 53.701662] schedule_timeout+0x10a/0x120 >>> [ 53.701664] ? debug_smp_processor_id+0x1b/0x30 >>> [ 53.701666] ? trace_hardirqs_on+0x5f/0xd0 >>> [ 53.701667] __wait_for_common+0xb9/0x1c0 >>> [ 53.701668] ? __pfx_schedule_timeout+0x10/0x10 >>> [ 53.701670] wait_for_completion+0x28/0x30 >>> [ 53.701671] __synchronize_srcu+0xbf/0x180 >>> [ 53.701677] ? __pfx_wakeme_after_rcu+0x10/0x10 >>> [ 53.701682] ? i2c_repstart+0x30/0x80 >>> [ 53.701685] synchronize_srcu+0x46/0x120 >>> [ 53.701687] kill_dax+0x47/0x70 >>> [ 53.701689] __devm_create_dev_dax+0x112/0x470 >>> [ 53.701691] devm_create_dev_dax+0x26/0x50 >>> [ 53.701693] dax_hmem_probe+0x87/0xd0 [dax_hmem] >>> [ 53.701695] platform_probe+0x61/0xd0 >>> [ 53.701698] really_probe+0xe2/0x390 >>> [ 53.701700] ? 
__pfx___device_attach_driver+0x10/0x10 >>> [ 53.701701] __driver_probe_device+0x7e/0x160 >>> [ 53.701703] driver_probe_device+0x23/0xa0 >>> [ 53.701704] __device_attach_driver+0x92/0x120 >>> [ 53.701706] bus_for_each_drv+0x8c/0xf0 >>> [ 53.701708] __device_attach+0xc2/0x1f0 >>> [ 53.701710] device_initial_probe+0x17/0x20 >>> [ 53.701711] bus_probe_device+0xa8/0xb0 >>> [ 53.701712] device_add+0x687/0x8b0 >>> [ 53.701715] ? dev_set_name+0x56/0x70 >>> [ 53.701717] platform_device_add+0x102/0x260 >>> [ 53.701718] hmem_register_device+0x160/0x230 [dax_hmem] >>> [ 53.701720] hmem_fallback_register_device+0x37/0x60 >>> [ 53.701722] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] >>> [ 53.701734] walk_iomem_res_desc+0x55/0xb0 >>> [ 53.701738] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] >>> [ 53.701745] cxl_region_softreserv_update+0x46/0x50 [cxl_core] >>> [ 53.701751] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] >>> [ 53.701752] ? __pfx_autoremove_wake_function+0x10/0x10 >>> [ 53.701756] process_one_work+0x1fa/0x630 >>> [ 53.701760] worker_thread+0x1b2/0x360 >>> [ 53.701762] kthread+0x128/0x250 >>> [ 53.701765] ? __pfx_worker_thread+0x10/0x10 >>> [ 53.701766] ? __pfx_kthread+0x10/0x10 >>> [ 53.701768] ret_from_fork+0x139/0x1e0 >>> [ 53.701771] ? __pfx_kthread+0x10/0x10 >>> [ 53.701772] ret_from_fork_asm+0x1a/0x30 >>> [ 53.701777] </TASK> >>> >>
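The "sleeping function called from invalid context" and "scheduling while atomic" reports above all list hmem_notify_lock as held while the callback runs. A minimal userspace sketch of the usual remedy (copy the callback pointer under the lock, drop the lock, then call) is shown here. Every name in it is a hypothetical stand-in, the lock is a toy C11 atomic_flag spinlock, and it is not the kernel's hmem_notify.c code.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, not the kernel's hmem_notify.c. */
typedef int (*fallback_fn)(int target_nid);

static atomic_flag notify_lock = ATOMIC_FLAG_INIT; /* toy spinlock */
static fallback_fn registered_fn;                   /* guarded by notify_lock */

static void notify_lock_acquire(void)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&notify_lock,
						 memory_order_acquire))
		;
}

static void notify_lock_release(void)
{
	atomic_flag_clear_explicit(&notify_lock, memory_order_release);
}

/* Publish a handler, or clear it by passing NULL. */
static void register_fallback_handler(fallback_fn fn)
{
	notify_lock_acquire();
	registered_fn = fn;
	notify_lock_release();
}

/*
 * Snapshot the handler under the lock, drop the lock, then call it,
 * so the handler is free to block. Returns -1 if none is registered.
 */
static int fallback_register_device(int target_nid)
{
	fallback_fn fn;

	notify_lock_acquire();
	fn = registered_fn;
	notify_lock_release();

	if (!fn)
		return -1;
	return fn(target_nid); /* lock is no longer held here */
}

/*
 * Demo handler: returns target_nid if the lock is free while it runs,
 * or -2 if the caller still holds it.
 */
static int demo_handler(int target_nid)
{
	if (atomic_flag_test_and_set_explicit(&notify_lock,
					      memory_order_acquire))
		return -2;
	notify_lock_release();
	return target_nid;
}
```

One trade-off of this pattern: once the lock is dropped, the snapshot can go stale if the handler is unregistered concurrently, so the handler's lifetime has to be guaranteed some other way.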
On Wed, Jul 16, 2025 at 02:29:52PM -0700, Koralahalli Channabasappa, Smita wrote:
> On 7/16/2025 1:20 PM, Alison Schofield wrote:
> > On Tue, Jul 15, 2025 at 11:01:23PM -0700, Koralahalli Channabasappa, Smita wrote:
> > > Hi Alison,
> > >
> > > On 7/15/2025 2:07 PM, Alison Schofield wrote:
> > > > On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
> > > > > This series introduces the ability to manage SOFT RESERVED iomem
> > > > > resources, enabling the CXL driver to remove any portions that
> > > > > intersect with created CXL regions.
> > > >
> > > > Hi Smita,
> > > >
> > > > This set applied cleanly to todays cxl-next but fails like appended
> > > > before region probe.
> > > >
> > > > BTW - there were sparse warnings in the build that look related:
> > > > CHECK drivers/dax/hmem/hmem_notify.c
> > > > drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
> > > > drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit
> > >
> > > Thanks for pointing this bug. I failed to release the spinlock before
> > > calling hmem_register_device(), which internally calls platform_device_add()
> > > and can sleep. The following fix addresses that bug. I’ll incorporate this
> > > into v6:
> > >
> > > diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
> > > index 6c276c5bd51d..8f411f3fe7bd 100644
> > > --- a/drivers/dax/hmem/hmem_notify.c
> > > +++ b/drivers/dax/hmem/hmem_notify.c
> > > @@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const
> > > struct resource *res)
> > >  {
> > >  	walk_hmem_fn hmem_fn;
> > >
> > > -	guard(spinlock)(&hmem_notify_lock);
> > > +	spin_lock(&hmem_notify_lock);
> > >  	hmem_fn = hmem_fallback_fn;
> > > +	spin_unlock(&hmem_notify_lock);
> > >
> > >  	if (hmem_fn)
> > >  		hmem_fn(target_nid, res);
> > > --
> >
> > Hi Smita, Adding the above got me past that, and doubling the timeout
> > below stopped that from happening. After that, I haven't had time to
> > trace so, I'll just dump on you for now:
> >
> > In /proc/iomem
> > Here, we see a regions resource, no CXL Window, and no dax, and no
> > actual region, not even disabled, is available.
> > c080000000-c47fffffff : region0
> >
> > And, here no CXL Window, no region, and a soft reserved.
> > 68e80000000-70e7fffffff : Soft Reserved
> >   68e80000000-70e7fffffff : dax1.0
> >     68e80000000-70e7fffffff : System RAM (kmem)
> >
> > I haven't yet walked through the v4 to v5 changes so I'll do that next.
>
> Hi Alison,
>
> To help better understand the current behavior, could you share more about
> your platform configuration? specifically, are there two memory cards
> involved? One at c080000000 (which appears as region0) and another at
> 68e80000000 (which is falling back to kmem via dax1.0)? Additionally, how
> are the Soft Reserved ranges laid out on your system for these cards? I'm
> trying to understand the "before" state of the resources i.e, prior to
> trimming applied by my patches.

Here are the soft reserveds -
[] BIOS-e820: [mem 0x000000c080000000-0x000000c47fffffff] soft reserved
[] BIOS-e820: [mem 0x0000068e80000000-0x0000070e7fffffff] soft reserved

And this is what we expect -

c080000000-17dbfffffff : CXL Window 0
  c080000000-c47fffffff : region2
    c080000000-c47fffffff : dax0.0
      c080000000-c47fffffff : System RAM (kmem)

68e80000000-8d37fffffff : CXL Window 1
  68e80000000-70e7fffffff : region5
    68e80000000-70e7fffffff : dax1.0
      68e80000000-70e7fffffff : System RAM (kmem)

And, like in prev message, in v5 we get -

c080000000-c47fffffff : region0

68e80000000-70e7fffffff : Soft Reserved
  68e80000000-70e7fffffff : dax1.0
    68e80000000-70e7fffffff : System RAM (kmem)

In v4, we 'almost' had what we expect, except that the HMEM driver
created those dax devices out of Soft Reserveds before region driver
could do same.

>
> Also, do you think it's feasible to change the direction of the soft reserve
> trimming, that is, defer it until after CXL region or memdev creation is
> complete? In this case it would be trimmed after but inline the existing
> region or memdev creation. This might simplify the flow by removing the need
> for wait_event_timeout(), wait_for_device_probe() and the workqueue logic
> inside cxl_acpi_probe().

Yes that aligns with my simple thinking. There's the trimming after a region
is successfully created, and it seems that could simply be called at the end
of *that* region creation.

Then, there's the round up of all the unused Soft Reserveds, and that has
to wait until after all regions are created, ie. all endpoints have arrived
and we've given up all hope of creating another region in that space.
That's the timing challenge.

-- Alison

>
> (As a side note I experimented changing cxl_acpi_init() to a late_initcall()
> and observed that it consistently avoided probe ordering issues in my setup.
>
> Additional note: I realized that even when cxl_acpi_probe() fails, the
> fallback DAX registration path (via cxl_softreserv_mem_update()) still waits
> on cxl_mem_active() and wait_for_device_probe(). I plan to address this in
> v6 by immediately triggering fallback DAX registration
> (hmem_register_device()) when the ACPI probe fails, instead of waiting.)
>
> Thanks
> Smita
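To make the expected /proc/iomem layouts above easier to check by hand, here is a minimal sketch of the range arithmetic behind trimming a Soft Reserved range against one region. It is a userspace illustration under stated assumptions: struct sr_range and trim_soft_reserved() are hypothetical names, ranges are inclusive as printed in /proc/iomem, and the series' actual code in kernel/resource.c may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive address range, as printed in /proc/iomem (hypothetical type). */
struct sr_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Remove the part of `sr` that is covered by `region`.
 * Writes the leftover pieces to out[] and returns how many there are:
 *   0: the region covers the whole soft reserved range
 *   1: one side is left over, or there was no overlap at all
 *   2: the region sits in the middle, leaving a piece on each side
 */
static int trim_soft_reserved(struct sr_range sr, struct sr_range region,
			      struct sr_range out[2])
{
	int n = 0;

	assert(sr.start <= sr.end && region.start <= region.end);

	/* No overlap: the soft reserved range survives unchanged. */
	if (region.end < sr.start || region.start > sr.end) {
		out[n++] = sr;
		return n;
	}

	/* Leftover below the region. */
	if (sr.start < region.start)
		out[n++] = (struct sr_range){ sr.start, region.start - 1 };

	/* Leftover above the region. */
	if (sr.end > region.end)
		out[n++] = (struct sr_range){ region.end + 1, sr.end };

	return n;
}
```

With the e820 range c080000000-c47fffffff and a region of the same size, nothing is left over, which matches the expected layout where no Soft Reserved entry remains. With the cover letter's 850000000-684fffffff range and a region at 850000000-284fffffff, one piece starting at 2850000000 remains.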
On 7/16/2025 4:48 PM, Alison Schofield wrote:
>
> Here are the soft reserveds -
> [] BIOS-e820: [mem 0x000000c080000000-0x000000c47fffffff] soft reserved
> [] BIOS-e820: [mem 0x0000068e80000000-0x0000070e7fffffff] soft reserved
>
> And this is what we expect -
>
> c080000000-17dbfffffff : CXL Window 0
>   c080000000-c47fffffff : region2
>     c080000000-c47fffffff : dax0.0
>       c080000000-c47fffffff : System RAM (kmem)
>
>
> 68e80000000-8d37fffffff : CXL Window 1
>   68e80000000-70e7fffffff : region5
>     68e80000000-70e7fffffff : dax1.0
>       68e80000000-70e7fffffff : System RAM (kmem)
>
> And, like in prev message, in v5 we get -
>
> c080000000-c47fffffff : region0
>
> 68e80000000-70e7fffffff : Soft Reserved
>   68e80000000-70e7fffffff : dax1.0
>     68e80000000-70e7fffffff : System RAM (kmem)
>
>
> In v4, we 'almost' had what we expect, except that the HMEM driver
> created those dax devices out of Soft Reserveds before region driver
> could do same.
>

Yeah, the only part I’m uncertain about in v5 is scheduling the fallback
work from the failure path of cxl_acpi_probe(). That doesn’t feel like the
right place to do it, and I suspect it might be contributing to the
unexpected behavior.

v4 had most of the necessary pieces in place, but it didn’t handle
situations well when the driver load order didn’t go as expected.

Even if we modify v4 to avoid triggering hmem_register_device() directly
from cxl_acpi_probe(), which helps avoid unresolved symbol errors when
cxl_acpi_probe() loads too early, and instead only rely on dax_hmem to pick
up Soft Reserved regions after cxl_acpi creates regions, we still run into
timing issues.

Specifically, there's no guarantee that hmem_register_device() will
correctly skip the following check if the region state isn't fully ready,
even with MODULE_SOFTDEP("pre: cxl_acpi") or using late_initcall() (which I
tried):

	if (IS_ENABLED(CONFIG_CXL_REGION) &&
	    region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
			      IORES_DESC_CXL) != REGION_DISJOINT) {..
At this point, I’m running out of ideas on how to reliably coordinate this.. :( Thanks Smita >> >> Also, do you think it's feasible to change the direction of the soft reserve >> trimming, that is, defer it until after CXL region or memdev creation is >> complete? In this case it would be trimmed after but inline the existing >> region or memdev creation. This might simplify the flow by removing the need >> for wait_event_timeout(), wait_for_device_probe() and the workqueue logic >> inside cxl_acpi_probe(). > > Yes that aligns with my simple thinking. There's the trimming after a region > is successfully created, and it seems that could simply be called at the end > of *that* region creation. > > Then, there's the round up of all the unused Soft Reserveds, and that has > to wait until after all regions are created, ie. all endpoints have arrived > and we've given up all hope of creating another region in that space. > That's the timing challenge. > > -- Alison > >> >> (As a side note I experimented changing cxl_acpi_init() to a late_initcall() >> and observed that it consistently avoided probe ordering issues in my setup. >> >> Additional note: I realized that even when cxl_acpi_probe() fails, the >> fallback DAX registration path (via cxl_softreserv_mem_update()) still waits >> on cxl_mem_active() and wait_for_device_probe(). I plan to address this in >> v6 by immediately triggering fallback DAX registration >> (hmem_register_device()) when the ACPI probe fails, instead of waiting.) >> >> Thanks >> Smita >> >>> >>>> >>>> As for the log: >>>> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for >>>> cxl_mem probing >>>> >>>> I’m still analyzing that. Here's what was my thought process so far. >>>> >>>> - This occurs when cxl_acpi_probe() runs significantly earlier than >>>> cxl_mem_probe(), so CXL region creation (which happens in >>>> cxl_port_endpoint_probe()) may or may not have completed by the time >>>> trimming is attempted. 
>>>> >>>> - Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does >>>> guarantee load order when all components are built as modules. So even if >>>> the timeout occurs and cxl_mem_probe() hasn’t run within the wait window, >>>> MODULE_SOFTDEP ensures that cxl_port is loaded before both cxl_acpi and >>>> cxl_mem in modular configurations. As a result, region creation is >>>> eventually guaranteed, and wait_for_device_probe() will succeed once the >>>> relevant probes complete. >>>> >>>> - However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no >>>> guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish >>>> before cxl_port_probe() even begins, which can cause wait_for_device_probe() >>>> to return prematurely and trigger the timeout. >>>> >>>> - In my local setup, I observed that a 30-second timeout was generally >>>> sufficient to catch this race, allowing cxl_port_probe() to load while >>>> cxl_acpi_probe() is still active. Since we cannot mix built-in and modular >>>> components (i.e., have cxl_acpi=y and cxl_port=m), the timeout serves as a >>>> best-effort mechanism. After the timeout, wait_for_device_probe() ensures >>>> cxl_port_probe() has completed before trimming proceeds, making the logic >>>> good enough to most boot-time races. >>>> >>>> One possible improvement I’m considering is to schedule a >>>> delayed_workqueue() from cxl_acpi_probe(). This deferred work could wait >>>> slightly longer for cxl_mem_probe() to complete (which itself softdeps on >>>> cxl_port) before initiating the soft reserve trimming. >>>> >>>> That said, I'm still evaluating better options to more robustly coordinate >>>> probe ordering between cxl_acpi, cxl_port, cxl_mem and cxl_region and >>>> looking for suggestions here. >>>> >>>> Thanks >>>> Smita >>>> >>>>> >>>>> >>>>> This isn't all the logs, I trimmed. Let me know if you need more or >>>>> other info to reproduce. 
>>>>> >>>>> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing >>>>> [ 53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321 >>>>> [ 53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1 >>>>> [ 53.653540] preempt_count: 1, expected: 0 >>>>> [ 53.653554] RCU nest depth: 0, expected: 0 >>>>> [ 53.653568] 3 locks held by kworker/46:1/1875: >>>>> [ 53.653569] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630 >>>>> [ 53.653583] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630 >>>>> [ 53.653589] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60 >>>>> [ 53.653598] Preemption disabled at: >>>>> [ 53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60 >>>>> [ 53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary) >>>>> [ 53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi] >>>>> [ 53.653648] Call Trace: >>>>> [ 53.653649] <TASK> >>>>> [ 53.653652] dump_stack_lvl+0xa8/0xd0 >>>>> [ 53.653658] dump_stack+0x14/0x20 >>>>> [ 53.653659] __might_resched+0x1ae/0x2d0 >>>>> [ 53.653666] __might_sleep+0x48/0x70 >>>>> [ 53.653668] __kmalloc_node_track_caller_noprof+0x349/0x510 >>>>> [ 53.653674] ? __devm_add_action+0x3d/0x160 >>>>> [ 53.653685] ? __pfx_devm_action_release+0x10/0x10 >>>>> [ 53.653688] __devres_alloc_node+0x4a/0x90 >>>>> [ 53.653689] ? __devres_alloc_node+0x4a/0x90 >>>>> [ 53.653691] ? __pfx_release_memregion+0x10/0x10 [dax_hmem] >>>>> [ 53.653693] __devm_add_action+0x3d/0x160 >>>>> [ 53.653696] hmem_register_device+0xea/0x230 [dax_hmem] >>>>> [ 53.653700] hmem_fallback_register_device+0x37/0x60 >>>>> [ 53.653703] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] >>>>> [ 53.653739] walk_iomem_res_desc+0x55/0xb0 >>>>> [ 53.653744] ? 
__pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] >>>>> [ 53.653755] cxl_region_softreserv_update+0x46/0x50 [cxl_core] >>>>> [ 53.653761] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] >>>>> [ 53.653763] ? __pfx_autoremove_wake_function+0x10/0x10 >>>>> [ 53.653768] process_one_work+0x1fa/0x630 >>>>> [ 53.653774] worker_thread+0x1b2/0x360 >>>>> [ 53.653777] kthread+0x128/0x250 >>>>> [ 53.653781] ? __pfx_worker_thread+0x10/0x10 >>>>> [ 53.653784] ? __pfx_kthread+0x10/0x10 >>>>> [ 53.653786] ret_from_fork+0x139/0x1e0 >>>>> [ 53.653790] ? __pfx_kthread+0x10/0x10 >>>>> [ 53.653792] ret_from_fork_asm+0x1a/0x30 >>>>> [ 53.653801] </TASK> >>>>> >>>>> [ 53.654193] ============================= >>>>> [ 53.654203] [ BUG: Invalid wait context ] >>>>> [ 53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G W >>>>> [ 53.654623] ----------------------------- >>>>> [ 53.654785] kworker/46:1/1875 is trying to lock: >>>>> [ 53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390 >>>>> [ 53.655115] other info that might help us debug this: >>>>> [ 53.655273] context-{5:5} >>>>> [ 53.655428] 3 locks held by kworker/46:1/1875: >>>>> [ 53.655579] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630 >>>>> [ 53.655739] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630 >>>>> [ 53.655900] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60 >>>>> [ 53.656062] stack backtrace: >>>>> [ 53.656224] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary) >>>>> [ 53.656227] Tainted: [W]=WARN >>>>> [ 53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi] >>>>> [ 53.656232] Call Trace: >>>>> [ 53.656232] <TASK> >>>>> [ 53.656234] dump_stack_lvl+0x85/0xd0 >>>>> [ 53.656238] dump_stack+0x14/0x20 >>>>> [ 53.656239] __lock_acquire+0xaf4/0x2200 >>>>> [ 53.656246] lock_acquire+0xd8/0x300 
>>>>> [ 53.656248] ? kernfs_add_one+0x34/0x390 >>>>> [ 53.656252] ? __might_resched+0x208/0x2d0 >>>>> [ 53.656257] down_write+0x44/0xe0 >>>>> [ 53.656262] ? kernfs_add_one+0x34/0x390 >>>>> [ 53.656263] kernfs_add_one+0x34/0x390 >>>>> [ 53.656265] kernfs_create_dir_ns+0x5a/0xa0 >>>>> [ 53.656268] sysfs_create_dir_ns+0x74/0xd0 >>>>> [ 53.656270] kobject_add_internal+0xb1/0x2f0 >>>>> [ 53.656273] kobject_add+0x7d/0xf0 >>>>> [ 53.656275] ? get_device_parent+0x28/0x1e0 >>>>> [ 53.656280] ? __pfx_klist_children_get+0x10/0x10 >>>>> [ 53.656282] device_add+0x124/0x8b0 >>>>> [ 53.656285] ? dev_set_name+0x56/0x70 >>>>> [ 53.656287] platform_device_add+0x102/0x260 >>>>> [ 53.656289] hmem_register_device+0x160/0x230 [dax_hmem] >>>>> [ 53.656291] hmem_fallback_register_device+0x37/0x60 >>>>> [ 53.656294] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] >>>>> [ 53.656323] walk_iomem_res_desc+0x55/0xb0 >>>>> [ 53.656326] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] >>>>> [ 53.656335] cxl_region_softreserv_update+0x46/0x50 [cxl_core] >>>>> [ 53.656342] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] >>>>> [ 53.656343] ? __pfx_autoremove_wake_function+0x10/0x10 >>>>> [ 53.656346] process_one_work+0x1fa/0x630 >>>>> [ 53.656350] worker_thread+0x1b2/0x360 >>>>> [ 53.656352] kthread+0x128/0x250 >>>>> [ 53.656354] ? __pfx_worker_thread+0x10/0x10 >>>>> [ 53.656356] ? __pfx_kthread+0x10/0x10 >>>>> [ 53.656357] ret_from_fork+0x139/0x1e0 >>>>> [ 53.656360] ? __pfx_kthread+0x10/0x10 >>>>> [ 53.656361] ret_from_fork_asm+0x1a/0x30 >>>>> [ 53.656366] </TASK> >>>>> [ 53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002 >>>>> [ 53.663552] schedule+0x4a/0x160 >>>>> [ 53.663553] schedule_timeout+0x10a/0x120 >>>>> [ 53.663555] ? debug_smp_processor_id+0x1b/0x30 >>>>> [ 53.663556] ? trace_hardirqs_on+0x5f/0xd0 >>>>> [ 53.663558] __wait_for_common+0xb9/0x1c0 >>>>> [ 53.663559] ? 
__pfx_schedule_timeout+0x10/0x10 >>>>> [ 53.663561] wait_for_completion+0x28/0x30 >>>>> [ 53.663562] __synchronize_srcu+0xbf/0x180 >>>>> [ 53.663566] ? __pfx_wakeme_after_rcu+0x10/0x10 >>>>> [ 53.663571] ? i2c_repstart+0x30/0x80 >>>>> [ 53.663576] synchronize_srcu+0x46/0x120 >>>>> [ 53.663577] kill_dax+0x47/0x70 >>>>> [ 53.663580] __devm_create_dev_dax+0x112/0x470 >>>>> [ 53.663582] devm_create_dev_dax+0x26/0x50 >>>>> [ 53.663584] dax_hmem_probe+0x87/0xd0 [dax_hmem] >>>>> [ 53.663585] platform_probe+0x61/0xd0 >>>>> [ 53.663589] really_probe+0xe2/0x390 >>>>> [ 53.663591] ? __pfx___device_attach_driver+0x10/0x10 >>>>> [ 53.663593] __driver_probe_device+0x7e/0x160 >>>>> [ 53.663594] driver_probe_device+0x23/0xa0 >>>>> [ 53.663596] __device_attach_driver+0x92/0x120 >>>>> [ 53.663597] bus_for_each_drv+0x8c/0xf0 >>>>> [ 53.663599] __device_attach+0xc2/0x1f0 >>>>> [ 53.663601] device_initial_probe+0x17/0x20 >>>>> [ 53.663603] bus_probe_device+0xa8/0xb0 >>>>> [ 53.663604] device_add+0x687/0x8b0 >>>>> [ 53.663607] ? dev_set_name+0x56/0x70 >>>>> [ 53.663609] platform_device_add+0x102/0x260 >>>>> [ 53.663610] hmem_register_device+0x160/0x230 [dax_hmem] >>>>> [ 53.663612] hmem_fallback_register_device+0x37/0x60 >>>>> [ 53.663614] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] >>>>> [ 53.663637] walk_iomem_res_desc+0x55/0xb0 >>>>> [ 53.663640] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] >>>>> [ 53.663647] cxl_region_softreserv_update+0x46/0x50 [cxl_core] >>>>> [ 53.663654] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] >>>>> [ 53.663655] ? __pfx_autoremove_wake_function+0x10/0x10 >>>>> [ 53.663658] process_one_work+0x1fa/0x630 >>>>> [ 53.663662] worker_thread+0x1b2/0x360 >>>>> [ 53.663664] kthread+0x128/0x250 >>>>> [ 53.663666] ? __pfx_worker_thread+0x10/0x10 >>>>> [ 53.663668] ? __pfx_kthread+0x10/0x10 >>>>> [ 53.663670] ret_from_fork+0x139/0x1e0 >>>>> [ 53.663672] ? 
__pfx_kthread+0x10/0x10 >>>>> [ 53.663673] ret_from_fork_asm+0x1a/0x30 >>>>> [ 53.663677] </TASK> >>>>> [ 53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002 >>>>> [ 53.700264] INFO: lockdep is turned off. >>>>> [ 53.701315] Preemption disabled at: >>>>> [ 53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60 >>>>> [ 53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary) >>>>> [ 53.701633] Tainted: [W]=WARN >>>>> [ 53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi] >>>>> [ 53.701638] Call Trace: >>>>> [ 53.701638] <TASK> >>>>> [ 53.701640] dump_stack_lvl+0xa8/0xd0 >>>>> [ 53.701644] dump_stack+0x14/0x20 >>>>> [ 53.701645] __schedule_bug+0xa2/0xd0 >>>>> [ 53.701649] __schedule+0xe6f/0x10d0 >>>>> [ 53.701652] ? debug_smp_processor_id+0x1b/0x30 >>>>> [ 53.701655] ? lock_release+0x1e6/0x2b0 >>>>> [ 53.701658] ? trace_hardirqs_on+0x5f/0xd0 >>>>> [ 53.701661] schedule+0x4a/0x160 >>>>> [ 53.701662] schedule_timeout+0x10a/0x120 >>>>> [ 53.701664] ? debug_smp_processor_id+0x1b/0x30 >>>>> [ 53.701666] ? trace_hardirqs_on+0x5f/0xd0 >>>>> [ 53.701667] __wait_for_common+0xb9/0x1c0 >>>>> [ 53.701668] ? __pfx_schedule_timeout+0x10/0x10 >>>>> [ 53.701670] wait_for_completion+0x28/0x30 >>>>> [ 53.701671] __synchronize_srcu+0xbf/0x180 >>>>> [ 53.701677] ? __pfx_wakeme_after_rcu+0x10/0x10 >>>>> [ 53.701682] ? i2c_repstart+0x30/0x80 >>>>> [ 53.701685] synchronize_srcu+0x46/0x120 >>>>> [ 53.701687] kill_dax+0x47/0x70 >>>>> [ 53.701689] __devm_create_dev_dax+0x112/0x470 >>>>> [ 53.701691] devm_create_dev_dax+0x26/0x50 >>>>> [ 53.701693] dax_hmem_probe+0x87/0xd0 [dax_hmem] >>>>> [ 53.701695] platform_probe+0x61/0xd0 >>>>> [ 53.701698] really_probe+0xe2/0x390 >>>>> [ 53.701700] ? 
__pfx___device_attach_driver+0x10/0x10 >>>>> [ 53.701701] __driver_probe_device+0x7e/0x160 >>>>> [ 53.701703] driver_probe_device+0x23/0xa0 >>>>> [ 53.701704] __device_attach_driver+0x92/0x120 >>>>> [ 53.701706] bus_for_each_drv+0x8c/0xf0 >>>>> [ 53.701708] __device_attach+0xc2/0x1f0 >>>>> [ 53.701710] device_initial_probe+0x17/0x20 >>>>> [ 53.701711] bus_probe_device+0xa8/0xb0 >>>>> [ 53.701712] device_add+0x687/0x8b0 >>>>> [ 53.701715] ? dev_set_name+0x56/0x70 >>>>> [ 53.701717] platform_device_add+0x102/0x260 >>>>> [ 53.701718] hmem_register_device+0x160/0x230 [dax_hmem] >>>>> [ 53.701720] hmem_fallback_register_device+0x37/0x60 >>>>> [ 53.701722] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] >>>>> [ 53.701734] walk_iomem_res_desc+0x55/0xb0 >>>>> [ 53.701738] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] >>>>> [ 53.701745] cxl_region_softreserv_update+0x46/0x50 [cxl_core] >>>>> [ 53.701751] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] >>>>> [ 53.701752] ? __pfx_autoremove_wake_function+0x10/0x10 >>>>> [ 53.701756] process_one_work+0x1fa/0x630 >>>>> [ 53.701760] worker_thread+0x1b2/0x360 >>>>> [ 53.701762] kthread+0x128/0x250 >>>>> [ 53.701765] ? __pfx_worker_thread+0x10/0x10 >>>>> [ 53.701766] ? __pfx_kthread+0x10/0x10 >>>>> [ 53.701768] ret_from_fork+0x139/0x1e0 >>>>> [ 53.701771] ? __pfx_kthread+0x10/0x10 >>>>> [ 53.701772] ret_from_fork_asm+0x1a/0x30 >>>>> [ 53.701777] </TASK> >>>>> >>>> >>
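The locking fix quoted earlier in this thread follows a common pattern: take the lock only long enough to snapshot the callback pointer, then call the handler with the lock dropped so that the handler is free to sleep. Below is a minimal userspace sketch of that pattern, assuming a pthread mutex as a stand-in for the kernel spinlock. The names `notify_lock`, `fallback_fn`, `register_fallback_handler()`, `fallback_register_device()` and `demo_handler()` are hypothetical stand-ins for illustration, not the code in hmem_notify.c.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for hmem_notify_lock / hmem_fallback_fn. */
typedef void (*fallback_fn_t)(int target_nid);

static pthread_mutex_t notify_lock = PTHREAD_MUTEX_INITIALIZER;
static fallback_fn_t fallback_fn;
static int last_nid = -1;	/* records what the demo handler saw */

/* Publish a handler; the lock protects only the pointer itself. */
void register_fallback_handler(fallback_fn_t fn)
{
	pthread_mutex_lock(&notify_lock);
	fallback_fn = fn;
	pthread_mutex_unlock(&notify_lock);
}

/* Snapshot the pointer under the lock, then call it with the lock dropped. */
void fallback_register_device(int target_nid)
{
	fallback_fn_t fn;

	pthread_mutex_lock(&notify_lock);
	fn = fallback_fn;
	pthread_mutex_unlock(&notify_lock);

	/* The handler may block, so it must not run under notify_lock. */
	if (fn)
		fn(target_nid);
}

/* Example handler: checks that the caller is not holding notify_lock. */
void demo_handler(int target_nid)
{
	int rc = pthread_mutex_trylock(&notify_lock);

	assert(rc == 0);	/* would be EBUSY if the caller still held it */
	pthread_mutex_unlock(&notify_lock);
	last_nid = target_nid;
}
```

In the kernel case the same shape applies with spin_lock()/spin_unlock(); the point is only that the potentially sleeping call happens after the unlock.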
On 7/17/25 10:58 AM, Koralahalli Channabasappa, Smita wrote: > > > On 7/16/2025 4:48 PM, Alison Schofield wrote: >> On Wed, Jul 16, 2025 at 02:29:52PM -0700, Koralahalli Channabasappa, Smita wrote: >>> On 7/16/2025 1:20 PM, Alison Schofield wrote: >>>> On Tue, Jul 15, 2025 at 11:01:23PM -0700, Koralahalli Channabasappa, Smita wrote: >>>>> Hi Alison, >>>>> >>>>> On 7/15/2025 2:07 PM, Alison Schofield wrote: >>>>>> On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote: >>>>>>> This series introduces the ability to manage SOFT RESERVED iomem >>>>>>> resources, enabling the CXL driver to remove any portions that >>>>>>> intersect with created CXL regions. >>>>>> >>>>>> Hi Smita, >>>>>> >>>>>> This set applied cleanly to todays cxl-next but fails like appended >>>>>> before region probe. >>>>>> >>>>>> BTW - there were sparse warnings in the build that look related: >>>>>> CHECK drivers/dax/hmem/hmem_notify.c >>>>>> drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit >>>>>> drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit >>>>> >>>>> Thanks for pointing this bug. I failed to release the spinlock before >>>>> calling hmem_register_device(), which internally calls platform_device_add() >>>>> and can sleep. The following fix addresses that bug. 
I’ll incorporate this >>>>> into v6: >>>>> >>>>> diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c >>>>> index 6c276c5bd51d..8f411f3fe7bd 100644 >>>>> --- a/drivers/dax/hmem/hmem_notify.c >>>>> +++ b/drivers/dax/hmem/hmem_notify.c >>>>> @@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const >>>>> struct resource *res) >>>>> { >>>>> walk_hmem_fn hmem_fn; >>>>> >>>>> - guard(spinlock)(&hmem_notify_lock); >>>>> + spin_lock(&hmem_notify_lock); >>>>> hmem_fn = hmem_fallback_fn; >>>>> + spin_unlock(&hmem_notify_lock); >>>>> >>>>> if (hmem_fn) >>>>> hmem_fn(target_nid, res); >>>>> -- >>>> >>>> Hi Smita, Adding the above got me past that, and doubling the timeout >>>> below stopped that from happening. After that, I haven't had time to >>>> trace so, I'll just dump on you for now: >>>> >>>> In /proc/iomem >>>> Here, we see a regions resource, no CXL Window, and no dax, and no >>>> actual region, not even disabled, is available. >>>> c080000000-c47fffffff : region0 >>>> >>>> And, here no CXL Window, no region, and a soft reserved. >>>> 68e80000000-70e7fffffff : Soft Reserved >>>> 68e80000000-70e7fffffff : dax1.0 >>>> 68e80000000-70e7fffffff : System RAM (kmem) >>>> >>>> I haven't yet walked through the v4 to v5 changes so I'll do that next. >>> >>> Hi Alison, >>> >>> To help better understand the current behavior, could you share more about >>> your platform configuration? specifically, are there two memory cards >>> involved? One at c080000000 (which appears as region0) and another at >>> 68e80000000 (which is falling back to kmem via dax1.0)? Additionally, how >>> are the Soft Reserved ranges laid out on your system for these cards? I'm >>> trying to understand the "before" state of the resources i.e, prior to >>> trimming applied by my patches. 
>>
>> Here are the soft reserveds -
>> [] BIOS-e820: [mem 0x000000c080000000-0x000000c47fffffff] soft reserved
>> [] BIOS-e820: [mem 0x0000068e80000000-0x0000070e7fffffff] soft reserved
>>
>> And this is what we expect -
>>
>> c080000000-17dbfffffff : CXL Window 0
>> c080000000-c47fffffff : region2
>> c080000000-c47fffffff : dax0.0
>> c080000000-c47fffffff : System RAM (kmem)
>>
>>
>> 68e80000000-8d37fffffff : CXL Window 1
>> 68e80000000-70e7fffffff : region5
>> 68e80000000-70e7fffffff : dax1.0
>> 68e80000000-70e7fffffff : System RAM (kmem)
>>
>> And, like in prev message, in v5 we get -
>>
>> c080000000-c47fffffff : region0
>>
>> 68e80000000-70e7fffffff : Soft Reserved
>> 68e80000000-70e7fffffff : dax1.0
>> 68e80000000-70e7fffffff : System RAM (kmem)
>>
>>
>> In v4, we 'almost' had what we expect, except that the HMEM driver
>> created those dax devices out of Soft Reserveds before region driver
>> could do same.
>>
>
> Yeah, the only part I’m uncertain about in v5 is scheduling the fallback work from the failure path of cxl_acpi_probe(). That doesn’t feel like the right place to do it, and I suspect it might be contributing to the unexpected behavior.
>
> v4 had most of the necessary pieces in place, but it didn’t handle situations well when the driver load order didn’t go as expected.
>
> Even if we modify v4 to avoid triggering hmem_register_device() directly from cxl_acpi_probe(), which helps avoid unresolved symbol errors when cxl_acpi_probe() loads too early, and instead only rely on dax_hmem to pick up Soft Reserved regions after cxl_acpi creates regions, we still run into timing issues.
> > Specifically, there's no guarantee that hmem_register_device() will correctly skip the following check if the region state isn't fully ready, even with MODULE_SOFTDEP("pre: cxl_acpi") or using late_initcall() (which I tried): > > if (IS_ENABLED(CONFIG_CXL_REGION) && > region_intersects(res->start, resource_size(res), IORESOURCE_MEM, IORES_DESC_CXL) != REGION_DISJOINT) {.. > > At this point, I’m running out of ideas on how to reliably coordinate this.. :( > > Thanks > Smita > >>> >>> Also, do you think it's feasible to change the direction of the soft reserve >>> trimming, that is, defer it until after CXL region or memdev creation is >>> complete? In this case it would be trimmed after but inline the existing >>> region or memdev creation. This might simplify the flow by removing the need >>> for wait_event_timeout(), wait_for_device_probe() and the workqueue logic >>> inside cxl_acpi_probe(). >> >> Yes that aligns with my simple thinking. There's the trimming after a region >> is successfully created, and it seems that could simply be called at the end >> of *that* region creation. >> >> Then, there's the round up of all the unused Soft Reserveds, and that has >> to wait until after all regions are created, ie. all endpoints have arrived >> and we've given up all hope of creating another region in that space. >> That's the timing challenge. >> >> -- Alison >> >>> >>> (As a side note I experimented changing cxl_acpi_init() to a late_initcall() >>> and observed that it consistently avoided probe ordering issues in my setup. >>> >>> Additional note: I realized that even when cxl_acpi_probe() fails, the >>> fallback DAX registration path (via cxl_softreserv_mem_update()) still waits >>> on cxl_mem_active() and wait_for_device_probe(). I plan to address this in >>> v6 by immediately triggering fallback DAX registration >>> (hmem_register_device()) when the ACPI probe fails, instead of waiting.) 
>>> >>> Thanks >>> Smita >>> >>>> >>>>> >>>>> As for the log: >>>>> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for >>>>> cxl_mem probing >>>>> >>>>> I’m still analyzing that. Here's what was my thought process so far. >>>>> >>>>> - This occurs when cxl_acpi_probe() runs significantly earlier than >>>>> cxl_mem_probe(), so CXL region creation (which happens in >>>>> cxl_port_endpoint_probe()) may or may not have completed by the time >>>>> trimming is attempted. >>>>> >>>>> - Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does >>>>> guarantee load order when all components are built as modules. So even if >>>>> the timeout occurs and cxl_mem_probe() hasn’t run within the wait window, >>>>> MODULE_SOFTDEP ensures that cxl_port is loaded before both cxl_acpi and >>>>> cxl_mem in modular configurations. As a result, region creation is >>>>> eventually guaranteed, and wait_for_device_probe() will succeed once the >>>>> relevant probes complete. >>>>> >>>>> - However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no >>>>> guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish >>>>> before cxl_port_probe() even begins, which can cause wait_for_device_probe() >>>>> to return prematurely and trigger the timeout. >>>>> >>>>> - In my local setup, I observed that a 30-second timeout was generally >>>>> sufficient to catch this race, allowing cxl_port_probe() to load while >>>>> cxl_acpi_probe() is still active. Since we cannot mix built-in and modular >>>>> components (i.e., have cxl_acpi=y and cxl_port=m), the timeout serves as a >>>>> best-effort mechanism. After the timeout, wait_for_device_probe() ensures >>>>> cxl_port_probe() has completed before trimming proceeds, making the logic >>>>> good enough to most boot-time races. >>>>> >>>>> One possible improvement I’m considering is to schedule a >>>>> delayed_workqueue() from cxl_acpi_probe(). 
This deferred work could wait
>>>>> slightly longer for cxl_mem_probe() to complete (which itself softdeps on
>>>>> cxl_port) before initiating the soft reserve trimming.
>>>>>
>>>>> That said, I'm still evaluating better options to more robustly coordinate
>>>>> probe ordering between cxl_acpi, cxl_port, cxl_mem and cxl_region and
>>>>> looking for suggestions here.

Hi Smita,

I've been reading this thread and thinking about what can be done to deal with this. Throwing out some ideas to see what you think.

My idea is to create two global counters that are protected by a lock. You have a delayed workqueue that checks these counters. If counter1 is 0, go back to sleep and check again later, repeating at a reasonable interval. Every time a memdev endpoint starts probe, increment counter1 and counter2 atomically. Every time the probe is successful, decrement counter2. When you reach the condition of 'if (counter1 && counter2 == 0)' I think you can start soft reserve discovery.

A different idea came from Dan. Arm a timer on the first memdev probe. Kick the timer forward every time a new memdev gets probed. At some point things settle and the timer goes off to trigger soft reserved discovery.

I think neither one requires any special ordering of the modules being loaded.

DJ

>>>>>
>>>>> Thanks
>>>>> Smita
>>>>>
>>>>>>
>>>>>>
>>>>>> This isn't all the logs, I trimmed. Let me know if you need more or
>>>>>> other info to reproduce.
>>>>>> >>>>>> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing >>>>>> [ 53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321 >>>>>> [ 53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1 >>>>>> [ 53.653540] preempt_count: 1, expected: 0 >>>>>> [ 53.653554] RCU nest depth: 0, expected: 0 >>>>>> [ 53.653568] 3 locks held by kworker/46:1/1875: >>>>>> [ 53.653569] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630 >>>>>> [ 53.653583] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630 >>>>>> [ 53.653589] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60 >>>>>> [ 53.653598] Preemption disabled at: >>>>>> [ 53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60 >>>>>> [ 53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary) >>>>>> [ 53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi] >>>>>> [ 53.653648] Call Trace: >>>>>> [ 53.653649] <TASK> >>>>>> [ 53.653652] dump_stack_lvl+0xa8/0xd0 >>>>>> [ 53.653658] dump_stack+0x14/0x20 >>>>>> [ 53.653659] __might_resched+0x1ae/0x2d0 >>>>>> [ 53.653666] __might_sleep+0x48/0x70 >>>>>> [ 53.653668] __kmalloc_node_track_caller_noprof+0x349/0x510 >>>>>> [ 53.653674] ? __devm_add_action+0x3d/0x160 >>>>>> [ 53.653685] ? __pfx_devm_action_release+0x10/0x10 >>>>>> [ 53.653688] __devres_alloc_node+0x4a/0x90 >>>>>> [ 53.653689] ? __devres_alloc_node+0x4a/0x90 >>>>>> [ 53.653691] ? 
__pfx_release_memregion+0x10/0x10 [dax_hmem] >>>>>> [ 53.653693] __devm_add_action+0x3d/0x160 >>>>>> [ 53.653696] hmem_register_device+0xea/0x230 [dax_hmem] >>>>>> [ 53.653700] hmem_fallback_register_device+0x37/0x60 >>>>>> [ 53.653703] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] >>>>>> [ 53.653739] walk_iomem_res_desc+0x55/0xb0 >>>>>> [ 53.653744] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] >>>>>> [ 53.653755] cxl_region_softreserv_update+0x46/0x50 [cxl_core] >>>>>> [ 53.653761] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] >>>>>> [ 53.653763] ? __pfx_autoremove_wake_function+0x10/0x10 >>>>>> [ 53.653768] process_one_work+0x1fa/0x630 >>>>>> [ 53.653774] worker_thread+0x1b2/0x360 >>>>>> [ 53.653777] kthread+0x128/0x250 >>>>>> [ 53.653781] ? __pfx_worker_thread+0x10/0x10 >>>>>> [ 53.653784] ? __pfx_kthread+0x10/0x10 >>>>>> [ 53.653786] ret_from_fork+0x139/0x1e0 >>>>>> [ 53.653790] ? __pfx_kthread+0x10/0x10 >>>>>> [ 53.653792] ret_from_fork_asm+0x1a/0x30 >>>>>> [ 53.653801] </TASK> >>>>>> >>>>>> [ 53.654193] ============================= >>>>>> [ 53.654203] [ BUG: Invalid wait context ] >>>>>> [ 53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G W >>>>>> [ 53.654623] ----------------------------- >>>>>> [ 53.654785] kworker/46:1/1875 is trying to lock: >>>>>> [ 53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390 >>>>>> [ 53.655115] other info that might help us debug this: >>>>>> [ 53.655273] context-{5:5} >>>>>> [ 53.655428] 3 locks held by kworker/46:1/1875: >>>>>> [ 53.655579] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630 >>>>>> [ 53.655739] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630 >>>>>> [ 53.655900] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60 >>>>>> [ 53.656062] stack backtrace: >>>>>> [ 53.656224] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 
6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary) >>>>>> [ 53.656227] Tainted: [W]=WARN >>>>>> [ 53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi] >>>>>> [ 53.656232] Call Trace: >>>>>> [ 53.656232] <TASK> >>>>>> [ 53.656234] dump_stack_lvl+0x85/0xd0 >>>>>> [ 53.656238] dump_stack+0x14/0x20 >>>>>> [ 53.656239] __lock_acquire+0xaf4/0x2200 >>>>>> [ 53.656246] lock_acquire+0xd8/0x300 >>>>>> [ 53.656248] ? kernfs_add_one+0x34/0x390 >>>>>> [ 53.656252] ? __might_resched+0x208/0x2d0 >>>>>> [ 53.656257] down_write+0x44/0xe0 >>>>>> [ 53.656262] ? kernfs_add_one+0x34/0x390 >>>>>> [ 53.656263] kernfs_add_one+0x34/0x390 >>>>>> [ 53.656265] kernfs_create_dir_ns+0x5a/0xa0 >>>>>> [ 53.656268] sysfs_create_dir_ns+0x74/0xd0 >>>>>> [ 53.656270] kobject_add_internal+0xb1/0x2f0 >>>>>> [ 53.656273] kobject_add+0x7d/0xf0 >>>>>> [ 53.656275] ? get_device_parent+0x28/0x1e0 >>>>>> [ 53.656280] ? __pfx_klist_children_get+0x10/0x10 >>>>>> [ 53.656282] device_add+0x124/0x8b0 >>>>>> [ 53.656285] ? dev_set_name+0x56/0x70 >>>>>> [ 53.656287] platform_device_add+0x102/0x260 >>>>>> [ 53.656289] hmem_register_device+0x160/0x230 [dax_hmem] >>>>>> [ 53.656291] hmem_fallback_register_device+0x37/0x60 >>>>>> [ 53.656294] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] >>>>>> [ 53.656323] walk_iomem_res_desc+0x55/0xb0 >>>>>> [ 53.656326] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] >>>>>> [ 53.656335] cxl_region_softreserv_update+0x46/0x50 [cxl_core] >>>>>> [ 53.656342] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] >>>>>> [ 53.656343] ? __pfx_autoremove_wake_function+0x10/0x10 >>>>>> [ 53.656346] process_one_work+0x1fa/0x630 >>>>>> [ 53.656350] worker_thread+0x1b2/0x360 >>>>>> [ 53.656352] kthread+0x128/0x250 >>>>>> [ 53.656354] ? __pfx_worker_thread+0x10/0x10 >>>>>> [ 53.656356] ? __pfx_kthread+0x10/0x10 >>>>>> [ 53.656357] ret_from_fork+0x139/0x1e0 >>>>>> [ 53.656360] ? 
__pfx_kthread+0x10/0x10 >>>>>> [ 53.656361] ret_from_fork_asm+0x1a/0x30 >>>>>> [ 53.656366] </TASK> >>>>>> [ 53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002 >>>>>> [ 53.663552] schedule+0x4a/0x160 >>>>>> [ 53.663553] schedule_timeout+0x10a/0x120 >>>>>> [ 53.663555] ? debug_smp_processor_id+0x1b/0x30 >>>>>> [ 53.663556] ? trace_hardirqs_on+0x5f/0xd0 >>>>>> [ 53.663558] __wait_for_common+0xb9/0x1c0 >>>>>> [ 53.663559] ? __pfx_schedule_timeout+0x10/0x10 >>>>>> [ 53.663561] wait_for_completion+0x28/0x30 >>>>>> [ 53.663562] __synchronize_srcu+0xbf/0x180 >>>>>> [ 53.663566] ? __pfx_wakeme_after_rcu+0x10/0x10 >>>>>> [ 53.663571] ? i2c_repstart+0x30/0x80 >>>>>> [ 53.663576] synchronize_srcu+0x46/0x120 >>>>>> [ 53.663577] kill_dax+0x47/0x70 >>>>>> [ 53.663580] __devm_create_dev_dax+0x112/0x470 >>>>>> [ 53.663582] devm_create_dev_dax+0x26/0x50 >>>>>> [ 53.663584] dax_hmem_probe+0x87/0xd0 [dax_hmem] >>>>>> [ 53.663585] platform_probe+0x61/0xd0 >>>>>> [ 53.663589] really_probe+0xe2/0x390 >>>>>> [ 53.663591] ? __pfx___device_attach_driver+0x10/0x10 >>>>>> [ 53.663593] __driver_probe_device+0x7e/0x160 >>>>>> [ 53.663594] driver_probe_device+0x23/0xa0 >>>>>> [ 53.663596] __device_attach_driver+0x92/0x120 >>>>>> [ 53.663597] bus_for_each_drv+0x8c/0xf0 >>>>>> [ 53.663599] __device_attach+0xc2/0x1f0 >>>>>> [ 53.663601] device_initial_probe+0x17/0x20 >>>>>> [ 53.663603] bus_probe_device+0xa8/0xb0 >>>>>> [ 53.663604] device_add+0x687/0x8b0 >>>>>> [ 53.663607] ? dev_set_name+0x56/0x70 >>>>>> [ 53.663609] platform_device_add+0x102/0x260 >>>>>> [ 53.663610] hmem_register_device+0x160/0x230 [dax_hmem] >>>>>> [ 53.663612] hmem_fallback_register_device+0x37/0x60 >>>>>> [ 53.663614] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] >>>>>> [ 53.663637] walk_iomem_res_desc+0x55/0xb0 >>>>>> [ 53.663640] ? 
__pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] >>>>>> [ 53.663647] cxl_region_softreserv_update+0x46/0x50 [cxl_core] >>>>>> [ 53.663654] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] >>>>>> [ 53.663655] ? __pfx_autoremove_wake_function+0x10/0x10 >>>>>> [ 53.663658] process_one_work+0x1fa/0x630 >>>>>> [ 53.663662] worker_thread+0x1b2/0x360 >>>>>> [ 53.663664] kthread+0x128/0x250 >>>>>> [ 53.663666] ? __pfx_worker_thread+0x10/0x10 >>>>>> [ 53.663668] ? __pfx_kthread+0x10/0x10 >>>>>> [ 53.663670] ret_from_fork+0x139/0x1e0 >>>>>> [ 53.663672] ? __pfx_kthread+0x10/0x10 >>>>>> [ 53.663673] ret_from_fork_asm+0x1a/0x30 >>>>>> [ 53.663677] </TASK> >>>>>> [ 53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002 >>>>>> [ 53.700264] INFO: lockdep is turned off. >>>>>> [ 53.701315] Preemption disabled at: >>>>>> [ 53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60 >>>>>> [ 53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary) >>>>>> [ 53.701633] Tainted: [W]=WARN >>>>>> [ 53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi] >>>>>> [ 53.701638] Call Trace: >>>>>> [ 53.701638] <TASK> >>>>>> [ 53.701640] dump_stack_lvl+0xa8/0xd0 >>>>>> [ 53.701644] dump_stack+0x14/0x20 >>>>>> [ 53.701645] __schedule_bug+0xa2/0xd0 >>>>>> [ 53.701649] __schedule+0xe6f/0x10d0 >>>>>> [ 53.701652] ? debug_smp_processor_id+0x1b/0x30 >>>>>> [ 53.701655] ? lock_release+0x1e6/0x2b0 >>>>>> [ 53.701658] ? trace_hardirqs_on+0x5f/0xd0 >>>>>> [ 53.701661] schedule+0x4a/0x160 >>>>>> [ 53.701662] schedule_timeout+0x10a/0x120 >>>>>> [ 53.701664] ? debug_smp_processor_id+0x1b/0x30 >>>>>> [ 53.701666] ? trace_hardirqs_on+0x5f/0xd0 >>>>>> [ 53.701667] __wait_for_common+0xb9/0x1c0 >>>>>> [ 53.701668] ? __pfx_schedule_timeout+0x10/0x10 >>>>>> [ 53.701670] wait_for_completion+0x28/0x30 >>>>>> [ 53.701671] __synchronize_srcu+0xbf/0x180 >>>>>> [ 53.701677] ? 
__pfx_wakeme_after_rcu+0x10/0x10 >>>>>> [ 53.701682] ? i2c_repstart+0x30/0x80 >>>>>> [ 53.701685] synchronize_srcu+0x46/0x120 >>>>>> [ 53.701687] kill_dax+0x47/0x70 >>>>>> [ 53.701689] __devm_create_dev_dax+0x112/0x470 >>>>>> [ 53.701691] devm_create_dev_dax+0x26/0x50 >>>>>> [ 53.701693] dax_hmem_probe+0x87/0xd0 [dax_hmem] >>>>>> [ 53.701695] platform_probe+0x61/0xd0 >>>>>> [ 53.701698] really_probe+0xe2/0x390 >>>>>> [ 53.701700] ? __pfx___device_attach_driver+0x10/0x10 >>>>>> [ 53.701701] __driver_probe_device+0x7e/0x160 >>>>>> [ 53.701703] driver_probe_device+0x23/0xa0 >>>>>> [ 53.701704] __device_attach_driver+0x92/0x120 >>>>>> [ 53.701706] bus_for_each_drv+0x8c/0xf0 >>>>>> [ 53.701708] __device_attach+0xc2/0x1f0 >>>>>> [ 53.701710] device_initial_probe+0x17/0x20 >>>>>> [ 53.701711] bus_probe_device+0xa8/0xb0 >>>>>> [ 53.701712] device_add+0x687/0x8b0 >>>>>> [ 53.701715] ? dev_set_name+0x56/0x70 >>>>>> [ 53.701717] platform_device_add+0x102/0x260 >>>>>> [ 53.701718] hmem_register_device+0x160/0x230 [dax_hmem] >>>>>> [ 53.701720] hmem_fallback_register_device+0x37/0x60 >>>>>> [ 53.701722] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] >>>>>> [ 53.701734] walk_iomem_res_desc+0x55/0xb0 >>>>>> [ 53.701738] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] >>>>>> [ 53.701745] cxl_region_softreserv_update+0x46/0x50 [cxl_core] >>>>>> [ 53.701751] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] >>>>>> [ 53.701752] ? __pfx_autoremove_wake_function+0x10/0x10 >>>>>> [ 53.701756] process_one_work+0x1fa/0x630 >>>>>> [ 53.701760] worker_thread+0x1b2/0x360 >>>>>> [ 53.701762] kthread+0x128/0x250 >>>>>> [ 53.701765] ? __pfx_worker_thread+0x10/0x10 >>>>>> [ 53.701766] ? __pfx_kthread+0x10/0x10 >>>>>> [ 53.701768] ret_from_fork+0x139/0x1e0 >>>>>> [ 53.701771] ? __pfx_kthread+0x10/0x10 >>>>>> [ 53.701772] ret_from_fork_asm+0x1a/0x30 >>>>>> [ 53.701777] </TASK> >>>>>> >>>>> >>> >
On 7/17/2025 12:06 PM, Dave Jiang wrote:
>
>
> On 7/17/25 10:58 AM, Koralahalli Channabasappa, Smita wrote:
>>
>>
>> On 7/16/2025 4:48 PM, Alison Schofield wrote:
>>> On Wed, Jul 16, 2025 at 02:29:52PM -0700, Koralahalli Channabasappa, Smita wrote:
>>>> On 7/16/2025 1:20 PM, Alison Schofield wrote:
>>>>> On Tue, Jul 15, 2025 at 11:01:23PM -0700, Koralahalli Channabasappa, Smita wrote:
>>>>>> Hi Alison,
>>>>>>
>>>>>> On 7/15/2025 2:07 PM, Alison Schofield wrote:
>>>>>>> On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
>>>>>>>> This series introduces the ability to manage SOFT RESERVED iomem
>>>>>>>> resources, enabling the CXL driver to remove any portions that
>>>>>>>> intersect with created CXL regions.
>>>>>>>
>>>>>>> Hi Smita,
>>>>>>>
>>>>>>> This set applied cleanly to today's cxl-next but fails like appended
>>>>>>> before region probe.
>>>>>>>
>>>>>>> BTW - there were sparse warnings in the build that look related:
>>>>>>> CHECK drivers/dax/hmem/hmem_notify.c
>>>>>>> drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
>>>>>>> drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit
>>>>>>
>>>>>> Thanks for pointing out this bug. I failed to release the spinlock before
>>>>>> calling hmem_register_device(), which internally calls platform_device_add()
>>>>>> and can sleep. The following fix addresses that bug. I’ll incorporate this
>>>>>> into v6:
>>>>>>
>>>>>> diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
>>>>>> index 6c276c5bd51d..8f411f3fe7bd 100644
>>>>>> --- a/drivers/dax/hmem/hmem_notify.c
>>>>>> +++ b/drivers/dax/hmem/hmem_notify.c
>>>>>> @@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const
>>>>>> struct resource *res)
>>>>>>  {
>>>>>>         walk_hmem_fn hmem_fn;
>>>>>>
>>>>>> -       guard(spinlock)(&hmem_notify_lock);
>>>>>> +       spin_lock(&hmem_notify_lock);
>>>>>>         hmem_fn = hmem_fallback_fn;
>>>>>> +       spin_unlock(&hmem_notify_lock);
>>>>>>
>>>>>>         if (hmem_fn)
>>>>>>                 hmem_fn(target_nid, res);
>>>>>> --
>>>>>
>>>>> Hi Smita, Adding the above got me past that, and doubling the timeout
>>>>> below stopped that from happening. After that, I haven't had time to
>>>>> trace so, I'll just dump on you for now:
>>>>>
>>>>> In /proc/iomem
>>>>> Here, we see a regions resource, no CXL Window, and no dax, and no
>>>>> actual region, not even disabled, is available.
>>>>> c080000000-c47fffffff : region0
>>>>>
>>>>> And, here no CXL Window, no region, and a soft reserved.
>>>>> 68e80000000-70e7fffffff : Soft Reserved
>>>>>   68e80000000-70e7fffffff : dax1.0
>>>>>     68e80000000-70e7fffffff : System RAM (kmem)
>>>>>
>>>>> I haven't yet walked through the v4 to v5 changes so I'll do that next.
>>>>
>>>> Hi Alison,
>>>>
>>>> To help better understand the current behavior, could you share more about
>>>> your platform configuration? Specifically, are there two memory cards
>>>> involved? One at c080000000 (which appears as region0) and another at
>>>> 68e80000000 (which is falling back to kmem via dax1.0)? Additionally, how
>>>> are the Soft Reserved ranges laid out on your system for these cards? I'm
>>>> trying to understand the "before" state of the resources, i.e., prior to
>>>> trimming applied by my patches.
>>>
>>> Here are the soft reserveds -
>>> [] BIOS-e820: [mem 0x000000c080000000-0x000000c47fffffff] soft reserved
>>> [] BIOS-e820: [mem 0x0000068e80000000-0x0000070e7fffffff] soft reserved
>>>
>>> And this is what we expect -
>>>
>>> c080000000-17dbfffffff : CXL Window 0
>>>   c080000000-c47fffffff : region2
>>>     c080000000-c47fffffff : dax0.0
>>>       c080000000-c47fffffff : System RAM (kmem)
>>>
>>>
>>> 68e80000000-8d37fffffff : CXL Window 1
>>>   68e80000000-70e7fffffff : region5
>>>     68e80000000-70e7fffffff : dax1.0
>>>       68e80000000-70e7fffffff : System RAM (kmem)
>>>
>>> And, like in prev message, in v5 we get -
>>>
>>> c080000000-c47fffffff : region0
>>>
>>> 68e80000000-70e7fffffff : Soft Reserved
>>>   68e80000000-70e7fffffff : dax1.0
>>>     68e80000000-70e7fffffff : System RAM (kmem)
>>>
>>>
>>> In v4, we 'almost' had what we expect, except that the HMEM driver
>>> created those dax devices out of Soft Reserveds before region driver
>>> could do same.
>>>
>>
>> Yeah, the only part I’m uncertain about in v5 is scheduling the fallback work from the failure path of cxl_acpi_probe(). That doesn’t feel like the right place to do it, and I suspect it might be contributing to the unexpected behavior.
>>
>> v4 had most of the necessary pieces in place, but it didn’t handle situations well when the driver load order didn’t go as expected.
>>
>> Even if we modify v4 to avoid triggering hmem_register_device() directly from cxl_acpi_probe(), which helps avoid unresolved symbol errors when cxl_acpi_probe() loads too early, and instead only rely on dax_hmem to pick up Soft Reserved regions after cxl_acpi creates regions, we still run into timing issues.
>>
>> Specifically, there's no guarantee that hmem_register_device() will correctly skip the following check if the region state isn't fully ready, even with MODULE_SOFTDEP("pre: cxl_acpi") or using late_initcall() (which I tried):
>>
>>     if (IS_ENABLED(CONFIG_CXL_REGION) &&
>>         region_intersects(res->start, resource_size(res), IORESOURCE_MEM, IORES_DESC_CXL) != REGION_DISJOINT) {..
>>
>> At this point, I’m running out of ideas on how to reliably coordinate this.. :(
>>
>> Thanks
>> Smita
>>
>>>>
>>>> Also, do you think it's feasible to change the direction of the soft reserve
>>>> trimming, that is, defer it until after CXL region or memdev creation is
>>>> complete? In this case it would be trimmed after but inline with the existing
>>>> region or memdev creation. This might simplify the flow by removing the need
>>>> for wait_event_timeout(), wait_for_device_probe() and the workqueue logic
>>>> inside cxl_acpi_probe().
>>>
>>> Yes that aligns with my simple thinking. There's the trimming after a region
>>> is successfully created, and it seems that could simply be called at the end
>>> of *that* region creation.
>>>
>>> Then, there's the round up of all the unused Soft Reserveds, and that has
>>> to wait until after all regions are created, i.e. all endpoints have arrived
>>> and we've given up all hope of creating another region in that space.
>>> That's the timing challenge.
>>>
>>> -- Alison
>>>
>>>>
>>>> (As a side note I experimented changing cxl_acpi_init() to a late_initcall()
>>>> and observed that it consistently avoided probe ordering issues in my setup.
>>>>
>>>> Additional note: I realized that even when cxl_acpi_probe() fails, the
>>>> fallback DAX registration path (via cxl_softreserv_mem_update()) still waits
>>>> on cxl_mem_active() and wait_for_device_probe(). I plan to address this in
>>>> v6 by immediately triggering fallback DAX registration
>>>> (hmem_register_device()) when the ACPI probe fails, instead of waiting.)
>>>>
>>>> Thanks
>>>> Smita
>>>>
>>>>>
>>>>>>
>>>>>> As for the log:
>>>>>> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for
>>>>>> cxl_mem probing
>>>>>>
>>>>>> I’m still analyzing that. Here's my thought process so far.
>>>>>>
>>>>>> - This occurs when cxl_acpi_probe() runs significantly earlier than
>>>>>> cxl_mem_probe(), so CXL region creation (which happens in
>>>>>> cxl_port_endpoint_probe()) may or may not have completed by the time
>>>>>> trimming is attempted.
>>>>>>
>>>>>> - Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does
>>>>>> guarantee load order when all components are built as modules. So even if
>>>>>> the timeout occurs and cxl_mem_probe() hasn’t run within the wait window,
>>>>>> MODULE_SOFTDEP ensures that cxl_port is loaded before both cxl_acpi and
>>>>>> cxl_mem in modular configurations. As a result, region creation is
>>>>>> eventually guaranteed, and wait_for_device_probe() will succeed once the
>>>>>> relevant probes complete.
>>>>>>
>>>>>> - However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no
>>>>>> guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish
>>>>>> before cxl_port_probe() even begins, which can cause wait_for_device_probe()
>>>>>> to return prematurely and trigger the timeout.
>>>>>>
>>>>>> - In my local setup, I observed that a 30-second timeout was generally
>>>>>> sufficient to catch this race, allowing cxl_port_probe() to load while
>>>>>> cxl_acpi_probe() is still active. Since we cannot mix built-in and modular
>>>>>> components (i.e., have cxl_acpi=y and cxl_port=m), the timeout serves as a
>>>>>> best-effort mechanism. After the timeout, wait_for_device_probe() ensures
>>>>>> cxl_port_probe() has completed before trimming proceeds, making the logic
>>>>>> good enough for most boot-time races.
>>>>>>
>>>>>> One possible improvement I’m considering is to schedule a
>>>>>> delayed_workqueue() from cxl_acpi_probe().
This deferred work could wait
>>>>>> slightly longer for cxl_mem_probe() to complete (which itself softdeps on
>>>>>> cxl_port) before initiating the soft reserve trimming.
>>>>>>
>>>>>> That said, I'm still evaluating better options to more robustly coordinate
>>>>>> probe ordering between cxl_acpi, cxl_port, cxl_mem and cxl_region and
>>>>>> looking for suggestions here.
>
> Hi Smita,
> Reading this thread and thinking about what can be done to deal with this. Throwing out some ideas to see what you think. My idea is to create two global counters that are protected by a lock. You have a delayed workqueue that checks these counters. If counter1 is 0, go back to sleep and check later continuously with a reasonable time period. Every time a memdev endpoint starts probe, increment counter1 and counter2 atomically. Every time the probe is successful, decrement counter2. When you reach the condition of 'if (counter1 && counter2 == 0)' I think you can start soft reserve discovery.
>
> A different idea came from Dan. Arm a timer on the first memdev probe. Kick the timer to increment every time a new memdev gets probed. At some point things settle and the timer goes off to trigger soft reserved discovery.
>
> I think either one will not require special ordering of the modules being loaded.
>
> DJ

I think we might need both, the counters and a settling timer, to coordinate Soft Reserved trimming and DAX registration. Here's the rough flow I'm thinking of. Let me know the flaws in this approach.

1. cxl_acpi_probe() schedules cxl_softreserv_work_fn() and exits early. This work item is responsible for trimming leftover Soft Reserved memory ranges once all cxl_mem devices have finished probing.

2. A delayed work is initialized for the settle timer:

   INIT_DELAYED_WORK(&cxl_probe_settle_work, cxl_probe_settle_fn);

3. In cxl_mem_probe():
   - Increment counter2 (memdevs in progress).
   - Increment counter1 (memdevs discovered).
   - On probe completion (success or failure), decrement counter2.
   - After each probe, re-arm the settle timer to extend the quiet period if more devices arrive (this might fail, I'm not sure, if cxl mem devices come in too late):

     mod_delayed_work(system_wq, &cxl_probe_settle_work, 30 * HZ);

   - Call wake_up(&cxl_softreserv_waitq); after each probe to notify listeners.

4. The settle timer callback (cxl_probe_settle_fn()) runs when no new devices have probed for a while (30s):

   timer_expired = true;
   wake_up(&cxl_softreserv_waitq);

5. In cxl_softreserv_work_fn():

   wait_event(cxl_softreserv_waitq,
              atomic_read(&cxl_mem_counter1) > 0 &&
              atomic_read(&cxl_mem_counter2) == 0 &&
              atomic_read(&timer_expired));

6. Once unblocked, cxl_softreserv_work_fn() trims Soft Reserved regions via cxl_region_softreserv_update(). (We do not perform any DAX fallback here, as we don't want to end up with unresolved symbols when DAX_HMEM loads too late.)

7. Separately, dax_hmem_platform_probe() runs independently on module load, but also blocks on the same wait_event() condition if CONFIG_CXL_ACPI is enabled. Once the condition is satisfied, it invokes hmem_register_device() to register leftover Soft Reserved memory.

Thanks
Smita

>
>>>>>>
>>>>>> Thanks
>>>>>> Smita
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> This isn't all the logs, I trimmed. Let me know if you need more or
>>>>>>> other info to reproduce.
>>>>>>> >>>>>>> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing >>>>>>> [ 53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321 >>>>>>> [ 53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1 >>>>>>> [ 53.653540] preempt_count: 1, expected: 0 >>>>>>> [ 53.653554] RCU nest depth: 0, expected: 0 >>>>>>> [ 53.653568] 3 locks held by kworker/46:1/1875: >>>>>>> [ 53.653569] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630 >>>>>>> [ 53.653583] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630 >>>>>>> [ 53.653589] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60 >>>>>>> [ 53.653598] Preemption disabled at: >>>>>>> [ 53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60 >>>>>>> [ 53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary) >>>>>>> [ 53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi] >>>>>>> [ 53.653648] Call Trace: >>>>>>> [ 53.653649] <TASK> >>>>>>> [ 53.653652] dump_stack_lvl+0xa8/0xd0 >>>>>>> [ 53.653658] dump_stack+0x14/0x20 >>>>>>> [ 53.653659] __might_resched+0x1ae/0x2d0 >>>>>>> [ 53.653666] __might_sleep+0x48/0x70 >>>>>>> [ 53.653668] __kmalloc_node_track_caller_noprof+0x349/0x510 >>>>>>> [ 53.653674] ? __devm_add_action+0x3d/0x160 >>>>>>> [ 53.653685] ? __pfx_devm_action_release+0x10/0x10 >>>>>>> [ 53.653688] __devres_alloc_node+0x4a/0x90 >>>>>>> [ 53.653689] ? __devres_alloc_node+0x4a/0x90 >>>>>>> [ 53.653691] ? 
__pfx_release_memregion+0x10/0x10 [dax_hmem] >>>>>>> [ 53.653693] __devm_add_action+0x3d/0x160 >>>>>>> [ 53.653696] hmem_register_device+0xea/0x230 [dax_hmem] >>>>>>> [ 53.653700] hmem_fallback_register_device+0x37/0x60 >>>>>>> [ 53.653703] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] >>>>>>> [ 53.653739] walk_iomem_res_desc+0x55/0xb0 >>>>>>> [ 53.653744] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] >>>>>>> [ 53.653755] cxl_region_softreserv_update+0x46/0x50 [cxl_core] >>>>>>> [ 53.653761] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] >>>>>>> [ 53.653763] ? __pfx_autoremove_wake_function+0x10/0x10 >>>>>>> [ 53.653768] process_one_work+0x1fa/0x630 >>>>>>> [ 53.653774] worker_thread+0x1b2/0x360 >>>>>>> [ 53.653777] kthread+0x128/0x250 >>>>>>> [ 53.653781] ? __pfx_worker_thread+0x10/0x10 >>>>>>> [ 53.653784] ? __pfx_kthread+0x10/0x10 >>>>>>> [ 53.653786] ret_from_fork+0x139/0x1e0 >>>>>>> [ 53.653790] ? __pfx_kthread+0x10/0x10 >>>>>>> [ 53.653792] ret_from_fork_asm+0x1a/0x30 >>>>>>> [ 53.653801] </TASK> >>>>>>> >>>>>>> [ 53.654193] ============================= >>>>>>> [ 53.654203] [ BUG: Invalid wait context ] >>>>>>> [ 53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G W >>>>>>> [ 53.654623] ----------------------------- >>>>>>> [ 53.654785] kworker/46:1/1875 is trying to lock: >>>>>>> [ 53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390 >>>>>>> [ 53.655115] other info that might help us debug this: >>>>>>> [ 53.655273] context-{5:5} >>>>>>> [ 53.655428] 3 locks held by kworker/46:1/1875: >>>>>>> [ 53.655579] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630 >>>>>>> [ 53.655739] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630 >>>>>>> [ 53.655900] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60 >>>>>>> [ 53.656062] stack backtrace: >>>>>>> [ 53.656224] CPU: 46 UID: 0 PID: 1875 Comm: 
kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary) >>>>>>> [ 53.656227] Tainted: [W]=WARN >>>>>>> [ 53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi] >>>>>>> [ 53.656232] Call Trace: >>>>>>> [ 53.656232] <TASK> >>>>>>> [ 53.656234] dump_stack_lvl+0x85/0xd0 >>>>>>> [ 53.656238] dump_stack+0x14/0x20 >>>>>>> [ 53.656239] __lock_acquire+0xaf4/0x2200 >>>>>>> [ 53.656246] lock_acquire+0xd8/0x300 >>>>>>> [ 53.656248] ? kernfs_add_one+0x34/0x390 >>>>>>> [ 53.656252] ? __might_resched+0x208/0x2d0 >>>>>>> [ 53.656257] down_write+0x44/0xe0 >>>>>>> [ 53.656262] ? kernfs_add_one+0x34/0x390 >>>>>>> [ 53.656263] kernfs_add_one+0x34/0x390 >>>>>>> [ 53.656265] kernfs_create_dir_ns+0x5a/0xa0 >>>>>>> [ 53.656268] sysfs_create_dir_ns+0x74/0xd0 >>>>>>> [ 53.656270] kobject_add_internal+0xb1/0x2f0 >>>>>>> [ 53.656273] kobject_add+0x7d/0xf0 >>>>>>> [ 53.656275] ? get_device_parent+0x28/0x1e0 >>>>>>> [ 53.656280] ? __pfx_klist_children_get+0x10/0x10 >>>>>>> [ 53.656282] device_add+0x124/0x8b0 >>>>>>> [ 53.656285] ? dev_set_name+0x56/0x70 >>>>>>> [ 53.656287] platform_device_add+0x102/0x260 >>>>>>> [ 53.656289] hmem_register_device+0x160/0x230 [dax_hmem] >>>>>>> [ 53.656291] hmem_fallback_register_device+0x37/0x60 >>>>>>> [ 53.656294] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] >>>>>>> [ 53.656323] walk_iomem_res_desc+0x55/0xb0 >>>>>>> [ 53.656326] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] >>>>>>> [ 53.656335] cxl_region_softreserv_update+0x46/0x50 [cxl_core] >>>>>>> [ 53.656342] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] >>>>>>> [ 53.656343] ? __pfx_autoremove_wake_function+0x10/0x10 >>>>>>> [ 53.656346] process_one_work+0x1fa/0x630 >>>>>>> [ 53.656350] worker_thread+0x1b2/0x360 >>>>>>> [ 53.656352] kthread+0x128/0x250 >>>>>>> [ 53.656354] ? __pfx_worker_thread+0x10/0x10 >>>>>>> [ 53.656356] ? __pfx_kthread+0x10/0x10 >>>>>>> [ 53.656357] ret_from_fork+0x139/0x1e0 >>>>>>> [ 53.656360] ? 
__pfx_kthread+0x10/0x10 >>>>>>> [ 53.656361] ret_from_fork_asm+0x1a/0x30 >>>>>>> [ 53.656366] </TASK> >>>>>>> [ 53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002 >>>>>>> [ 53.663552] schedule+0x4a/0x160 >>>>>>> [ 53.663553] schedule_timeout+0x10a/0x120 >>>>>>> [ 53.663555] ? debug_smp_processor_id+0x1b/0x30 >>>>>>> [ 53.663556] ? trace_hardirqs_on+0x5f/0xd0 >>>>>>> [ 53.663558] __wait_for_common+0xb9/0x1c0 >>>>>>> [ 53.663559] ? __pfx_schedule_timeout+0x10/0x10 >>>>>>> [ 53.663561] wait_for_completion+0x28/0x30 >>>>>>> [ 53.663562] __synchronize_srcu+0xbf/0x180 >>>>>>> [ 53.663566] ? __pfx_wakeme_after_rcu+0x10/0x10 >>>>>>> [ 53.663571] ? i2c_repstart+0x30/0x80 >>>>>>> [ 53.663576] synchronize_srcu+0x46/0x120 >>>>>>> [ 53.663577] kill_dax+0x47/0x70 >>>>>>> [ 53.663580] __devm_create_dev_dax+0x112/0x470 >>>>>>> [ 53.663582] devm_create_dev_dax+0x26/0x50 >>>>>>> [ 53.663584] dax_hmem_probe+0x87/0xd0 [dax_hmem] >>>>>>> [ 53.663585] platform_probe+0x61/0xd0 >>>>>>> [ 53.663589] really_probe+0xe2/0x390 >>>>>>> [ 53.663591] ? __pfx___device_attach_driver+0x10/0x10 >>>>>>> [ 53.663593] __driver_probe_device+0x7e/0x160 >>>>>>> [ 53.663594] driver_probe_device+0x23/0xa0 >>>>>>> [ 53.663596] __device_attach_driver+0x92/0x120 >>>>>>> [ 53.663597] bus_for_each_drv+0x8c/0xf0 >>>>>>> [ 53.663599] __device_attach+0xc2/0x1f0 >>>>>>> [ 53.663601] device_initial_probe+0x17/0x20 >>>>>>> [ 53.663603] bus_probe_device+0xa8/0xb0 >>>>>>> [ 53.663604] device_add+0x687/0x8b0 >>>>>>> [ 53.663607] ? dev_set_name+0x56/0x70 >>>>>>> [ 53.663609] platform_device_add+0x102/0x260 >>>>>>> [ 53.663610] hmem_register_device+0x160/0x230 [dax_hmem] >>>>>>> [ 53.663612] hmem_fallback_register_device+0x37/0x60 >>>>>>> [ 53.663614] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] >>>>>>> [ 53.663637] walk_iomem_res_desc+0x55/0xb0 >>>>>>> [ 53.663640] ? 
__pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] >>>>>>> [ 53.663647] cxl_region_softreserv_update+0x46/0x50 [cxl_core] >>>>>>> [ 53.663654] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] >>>>>>> [ 53.663655] ? __pfx_autoremove_wake_function+0x10/0x10 >>>>>>> [ 53.663658] process_one_work+0x1fa/0x630 >>>>>>> [ 53.663662] worker_thread+0x1b2/0x360 >>>>>>> [ 53.663664] kthread+0x128/0x250 >>>>>>> [ 53.663666] ? __pfx_worker_thread+0x10/0x10 >>>>>>> [ 53.663668] ? __pfx_kthread+0x10/0x10 >>>>>>> [ 53.663670] ret_from_fork+0x139/0x1e0 >>>>>>> [ 53.663672] ? __pfx_kthread+0x10/0x10 >>>>>>> [ 53.663673] ret_from_fork_asm+0x1a/0x30 >>>>>>> [ 53.663677] </TASK> >>>>>>> [ 53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002 >>>>>>> [ 53.700264] INFO: lockdep is turned off. >>>>>>> [ 53.701315] Preemption disabled at: >>>>>>> [ 53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60 >>>>>>> [ 53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary) >>>>>>> [ 53.701633] Tainted: [W]=WARN >>>>>>> [ 53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi] >>>>>>> [ 53.701638] Call Trace: >>>>>>> [ 53.701638] <TASK> >>>>>>> [ 53.701640] dump_stack_lvl+0xa8/0xd0 >>>>>>> [ 53.701644] dump_stack+0x14/0x20 >>>>>>> [ 53.701645] __schedule_bug+0xa2/0xd0 >>>>>>> [ 53.701649] __schedule+0xe6f/0x10d0 >>>>>>> [ 53.701652] ? debug_smp_processor_id+0x1b/0x30 >>>>>>> [ 53.701655] ? lock_release+0x1e6/0x2b0 >>>>>>> [ 53.701658] ? trace_hardirqs_on+0x5f/0xd0 >>>>>>> [ 53.701661] schedule+0x4a/0x160 >>>>>>> [ 53.701662] schedule_timeout+0x10a/0x120 >>>>>>> [ 53.701664] ? debug_smp_processor_id+0x1b/0x30 >>>>>>> [ 53.701666] ? trace_hardirqs_on+0x5f/0xd0 >>>>>>> [ 53.701667] __wait_for_common+0xb9/0x1c0 >>>>>>> [ 53.701668] ? __pfx_schedule_timeout+0x10/0x10 >>>>>>> [ 53.701670] wait_for_completion+0x28/0x30 >>>>>>> [ 53.701671] __synchronize_srcu+0xbf/0x180 >>>>>>> [ 53.701677] ? 
__pfx_wakeme_after_rcu+0x10/0x10 >>>>>>> [ 53.701682] ? i2c_repstart+0x30/0x80 >>>>>>> [ 53.701685] synchronize_srcu+0x46/0x120 >>>>>>> [ 53.701687] kill_dax+0x47/0x70 >>>>>>> [ 53.701689] __devm_create_dev_dax+0x112/0x470 >>>>>>> [ 53.701691] devm_create_dev_dax+0x26/0x50 >>>>>>> [ 53.701693] dax_hmem_probe+0x87/0xd0 [dax_hmem] >>>>>>> [ 53.701695] platform_probe+0x61/0xd0 >>>>>>> [ 53.701698] really_probe+0xe2/0x390 >>>>>>> [ 53.701700] ? __pfx___device_attach_driver+0x10/0x10 >>>>>>> [ 53.701701] __driver_probe_device+0x7e/0x160 >>>>>>> [ 53.701703] driver_probe_device+0x23/0xa0 >>>>>>> [ 53.701704] __device_attach_driver+0x92/0x120 >>>>>>> [ 53.701706] bus_for_each_drv+0x8c/0xf0 >>>>>>> [ 53.701708] __device_attach+0xc2/0x1f0 >>>>>>> [ 53.701710] device_initial_probe+0x17/0x20 >>>>>>> [ 53.701711] bus_probe_device+0xa8/0xb0 >>>>>>> [ 53.701712] device_add+0x687/0x8b0 >>>>>>> [ 53.701715] ? dev_set_name+0x56/0x70 >>>>>>> [ 53.701717] platform_device_add+0x102/0x260 >>>>>>> [ 53.701718] hmem_register_device+0x160/0x230 [dax_hmem] >>>>>>> [ 53.701720] hmem_fallback_register_device+0x37/0x60 >>>>>>> [ 53.701722] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] >>>>>>> [ 53.701734] walk_iomem_res_desc+0x55/0xb0 >>>>>>> [ 53.701738] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] >>>>>>> [ 53.701745] cxl_region_softreserv_update+0x46/0x50 [cxl_core] >>>>>>> [ 53.701751] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] >>>>>>> [ 53.701752] ? __pfx_autoremove_wake_function+0x10/0x10 >>>>>>> [ 53.701756] process_one_work+0x1fa/0x630 >>>>>>> [ 53.701760] worker_thread+0x1b2/0x360 >>>>>>> [ 53.701762] kthread+0x128/0x250 >>>>>>> [ 53.701765] ? __pfx_worker_thread+0x10/0x10 >>>>>>> [ 53.701766] ? __pfx_kthread+0x10/0x10 >>>>>>> [ 53.701768] ret_from_fork+0x139/0x1e0 >>>>>>> [ 53.701771] ? __pfx_kthread+0x10/0x10 >>>>>>> [ 53.701772] ret_from_fork_asm+0x1a/0x30 >>>>>>> [ 53.701777] </TASK> >>>>>>> >>>>>> >>>> >> >
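The gating condition in the counters-plus-settle-timer flow proposed in the reply above can be modelled in plain userspace C. This is only a rough sketch under stated assumptions, not kernel code: struct probe_tracker, probe_begin(), probe_end() and ready_to_trim() are hypothetical names, time is passed in as a plain number of seconds, the timer is represented by a deadline rather than a timer_expired flag, and locking or atomics are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Quiet period in seconds; mirrors the 30 * HZ delay in the proposal. */
#define SETTLE_SECS 30

/* Userspace stand-in for the proposed state; all names are hypothetical. */
struct probe_tracker {
	unsigned int discovered; /* "counter1": memdevs that started probing */
	unsigned int in_flight;  /* "counter2": probes not yet finished */
	long settle_deadline;    /* time at which the settle timer would fire */
};

/* Called where a memdev probe would start. */
void probe_begin(struct probe_tracker *t)
{
	t->discovered++;
	t->in_flight++;
}

/* Called when a probe finishes (success or failure); re-arms the timer. */
void probe_end(struct probe_tracker *t, long now)
{
	assert(t->in_flight > 0);
	t->in_flight--;
	/* Models re-arming the delayed work: the quiet period starts over. */
	t->settle_deadline = now + SETTLE_SECS;
}

/* The wait condition from step 5, evaluated at time 'now'. */
bool ready_to_trim(const struct probe_tracker *t, long now)
{
	return t->discovered > 0 && t->in_flight == 0 &&
	       now >= t->settle_deadline;
}
```

In this model a second probe that finishes at t=20 pushes the deadline from 35 to 50, so trimming is held off until the system has been quiet for the full period. It also makes one property of the condition visible: if no memdev ever probes, discovered stays 0 and the condition never becomes true, so a real worker would presumably need a separate way out for that case.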
On 7/17/25 4:20 PM, Koralahalli Channabasappa, Smita wrote:
> On 7/17/2025 12:06 PM, Dave Jiang wrote:
>>
>>
>> On 7/17/25 10:58 AM, Koralahalli Channabasappa, Smita wrote:
>>>
>>>
>>> On 7/16/2025 4:48 PM, Alison Schofield wrote:
>>>> On Wed, Jul 16, 2025 at 02:29:52PM -0700, Koralahalli Channabasappa, Smita wrote:
>>>>> On 7/16/2025 1:20 PM, Alison Schofield wrote:
>>>>>> On Tue, Jul 15, 2025 at 11:01:23PM -0700, Koralahalli Channabasappa, Smita wrote:
>>>>>>> Hi Alison,
>>>>>>>
>>>>>>> On 7/15/2025 2:07 PM, Alison Schofield wrote:
>>>>>>>> On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
>>>>>>>>> This series introduces the ability to manage SOFT RESERVED iomem
>>>>>>>>> resources, enabling the CXL driver to remove any portions that
>>>>>>>>> intersect with created CXL regions.
>>>>>>>>
>>>>>>>> Hi Smita,
>>>>>>>>
>>>>>>>> This set applied cleanly to today's cxl-next but fails like appended
>>>>>>>> before region probe.
>>>>>>>>
>>>>>>>> BTW - there were sparse warnings in the build that look related:
>>>>>>>> CHECK drivers/dax/hmem/hmem_notify.c
>>>>>>>> drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
>>>>>>>> drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit
>>>>>>>
>>>>>>> Thanks for pointing out this bug. I failed to release the spinlock before
>>>>>>> calling hmem_register_device(), which internally calls platform_device_add()
>>>>>>> and can sleep. The following fix addresses that bug. I’ll incorporate this
>>>>>>> into v6:
>>>>>>>
>>>>>>> diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
>>>>>>> index 6c276c5bd51d..8f411f3fe7bd 100644
>>>>>>> --- a/drivers/dax/hmem/hmem_notify.c
>>>>>>> +++ b/drivers/dax/hmem/hmem_notify.c
>>>>>>> @@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const
>>>>>>> struct resource *res)
>>>>>>>  {
>>>>>>>         walk_hmem_fn hmem_fn;
>>>>>>>
>>>>>>> -       guard(spinlock)(&hmem_notify_lock);
>>>>>>> +       spin_lock(&hmem_notify_lock);
>>>>>>>         hmem_fn = hmem_fallback_fn;
>>>>>>> +       spin_unlock(&hmem_notify_lock);
>>>>>>>
>>>>>>>         if (hmem_fn)
>>>>>>>                 hmem_fn(target_nid, res);
>>>>>>> --
>>>>>>
>>>>>> Hi Smita, Adding the above got me past that, and doubling the timeout
>>>>>> below stopped that from happening. After that, I haven't had time to
>>>>>> trace so, I'll just dump on you for now:
>>>>>>
>>>>>> In /proc/iomem
>>>>>> Here, we see a regions resource, no CXL Window, and no dax, and no
>>>>>> actual region, not even disabled, is available.
>>>>>> c080000000-c47fffffff : region0
>>>>>>
>>>>>> And, here no CXL Window, no region, and a soft reserved.
>>>>>> 68e80000000-70e7fffffff : Soft Reserved
>>>>>>   68e80000000-70e7fffffff : dax1.0
>>>>>>     68e80000000-70e7fffffff : System RAM (kmem)
>>>>>>
>>>>>> I haven't yet walked through the v4 to v5 changes so I'll do that next.
>>>>>
>>>>> Hi Alison,
>>>>>
>>>>> To help better understand the current behavior, could you share more about
>>>>> your platform configuration? Specifically, are there two memory cards
>>>>> involved? One at c080000000 (which appears as region0) and another at
>>>>> 68e80000000 (which is falling back to kmem via dax1.0)? Additionally, how
>>>>> are the Soft Reserved ranges laid out on your system for these cards? I'm
>>>>> trying to understand the "before" state of the resources, i.e., prior to
>>>>> trimming applied by my patches.
>>>>
>>>> Here are the soft reserveds -
>>>> [] BIOS-e820: [mem 0x000000c080000000-0x000000c47fffffff] soft reserved
>>>> [] BIOS-e820: [mem 0x0000068e80000000-0x0000070e7fffffff] soft reserved
>>>>
>>>> And this is what we expect -
>>>>
>>>> c080000000-17dbfffffff : CXL Window 0
>>>>   c080000000-c47fffffff : region2
>>>>     c080000000-c47fffffff : dax0.0
>>>>       c080000000-c47fffffff : System RAM (kmem)
>>>>
>>>>
>>>> 68e80000000-8d37fffffff : CXL Window 1
>>>>   68e80000000-70e7fffffff : region5
>>>>     68e80000000-70e7fffffff : dax1.0
>>>>       68e80000000-70e7fffffff : System RAM (kmem)
>>>>
>>>> And, like in prev message, in v5 we get -
>>>>
>>>> c080000000-c47fffffff : region0
>>>>
>>>> 68e80000000-70e7fffffff : Soft Reserved
>>>>   68e80000000-70e7fffffff : dax1.0
>>>>     68e80000000-70e7fffffff : System RAM (kmem)
>>>>
>>>>
>>>> In v4, we 'almost' had what we expect, except that the HMEM driver
>>>> created those dax devices out of Soft Reserveds before region driver
>>>> could do same.
>>>>
>>>
>>> Yeah, the only part I’m uncertain about in v5 is scheduling the fallback work from the failure path of cxl_acpi_probe(). That doesn’t feel like the right place to do it, and I suspect it might be contributing to the unexpected behavior.
>>>
>>> v4 had most of the necessary pieces in place, but it didn’t handle situations well when the driver load order didn’t go as expected.
>>>
>>> Even if we modify v4 to avoid triggering hmem_register_device() directly from cxl_acpi_probe(), which helps avoid unresolved symbol errors when cxl_acpi_probe() loads too early, and instead only rely on dax_hmem to pick up Soft Reserved regions after cxl_acpi creates regions, we still run into timing issues.
>>> >>> Specifically, there's no guarantee that hmem_register_device() will correctly skip the following check if the region state isn't fully ready, even with MODULE_SOFTDEP("pre: cxl_acpi") or using late_initcall() (which I tried): >>> >>> if (IS_ENABLED(CONFIG_CXL_REGION) && >>> region_intersects(res->start, resource_size(res), IORESOURCE_MEM, IORES_DESC_CXL) != REGION_DISJOINT) {.. >>> >>> At this point, I’m running out of ideas on how to reliably coordinate this.. :( >>> >>> Thanks >>> Smita >>> >>>>> >>>>> Also, do you think it's feasible to change the direction of the soft reserve >>>>> trimming, that is, defer it until after CXL region or memdev creation is >>>>> complete? In this case it would be trimmed after but inline the existing >>>>> region or memdev creation. This might simplify the flow by removing the need >>>>> for wait_event_timeout(), wait_for_device_probe() and the workqueue logic >>>>> inside cxl_acpi_probe(). >>>> >>>> Yes that aligns with my simple thinking. There's the trimming after a region >>>> is successfully created, and it seems that could simply be called at the end >>>> of *that* region creation. >>>> >>>> Then, there's the round up of all the unused Soft Reserveds, and that has >>>> to wait until after all regions are created, ie. all endpoints have arrived >>>> and we've given up all hope of creating another region in that space. >>>> That's the timing challenge. >>>> >>>> -- Alison >>>> >>>>> >>>>> (As a side note I experimented changing cxl_acpi_init() to a late_initcall() >>>>> and observed that it consistently avoided probe ordering issues in my setup. >>>>> >>>>> Additional note: I realized that even when cxl_acpi_probe() fails, the >>>>> fallback DAX registration path (via cxl_softreserv_mem_update()) still waits >>>>> on cxl_mem_active() and wait_for_device_probe(). 
I plan to address this in >>>>> v6 by immediately triggering fallback DAX registration >>>>> (hmem_register_device()) when the ACPI probe fails, instead of waiting.) >>>>> >>>>> Thanks >>>>> Smita >>>>> >>>>>> >>>>>>> >>>>>>> As for the log: >>>>>>> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for >>>>>>> cxl_mem probing >>>>>>> >>>>>>> I’m still analyzing that. Here's what was my thought process so far. >>>>>>> >>>>>>> - This occurs when cxl_acpi_probe() runs significantly earlier than >>>>>>> cxl_mem_probe(), so CXL region creation (which happens in >>>>>>> cxl_port_endpoint_probe()) may or may not have completed by the time >>>>>>> trimming is attempted. >>>>>>> >>>>>>> - Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does >>>>>>> guarantee load order when all components are built as modules. So even if >>>>>>> the timeout occurs and cxl_mem_probe() hasn’t run within the wait window, >>>>>>> MODULE_SOFTDEP ensures that cxl_port is loaded before both cxl_acpi and >>>>>>> cxl_mem in modular configurations. As a result, region creation is >>>>>>> eventually guaranteed, and wait_for_device_probe() will succeed once the >>>>>>> relevant probes complete. >>>>>>> >>>>>>> - However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no >>>>>>> guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish >>>>>>> before cxl_port_probe() even begins, which can cause wait_for_device_probe() >>>>>>> to return prematurely and trigger the timeout. >>>>>>> >>>>>>> - In my local setup, I observed that a 30-second timeout was generally >>>>>>> sufficient to catch this race, allowing cxl_port_probe() to load while >>>>>>> cxl_acpi_probe() is still active. Since we cannot mix built-in and modular >>>>>>> components (i.e., have cxl_acpi=y and cxl_port=m), the timeout serves as a >>>>>>> best-effort mechanism. 
After the timeout, wait_for_device_probe() ensures
>>>>>>> cxl_port_probe() has completed before trimming proceeds, making the logic
>>>>>>> good enough for most boot-time races.
>>>>>>>
>>>>>>> One possible improvement I’m considering is to schedule a
>>>>>>> delayed_workqueue() from cxl_acpi_probe(). This deferred work could wait
>>>>>>> slightly longer for cxl_mem_probe() to complete (which itself softdeps on
>>>>>>> cxl_port) before initiating the soft reserve trimming.
>>>>>>>
>>>>>>> That said, I'm still evaluating better options to more robustly coordinate
>>>>>>> probe ordering between cxl_acpi, cxl_port, cxl_mem and cxl_region and
>>>>>>> looking for suggestions here.
>>
>> Hi Smita,
>> Reading this thread and thinking about what can be done to deal with this.
>> Throwing out some ideas to see what you think. My idea is to create two
>> global counters that are protected by a lock. You have a delayed workqueue
>> that checks these counters. If counter1 is 0, go back to sleep and check
>> again later, repeating with a reasonable time period. Every time a memdev
>> endpoint starts probe, increment counter1 and counter2 atomically. Every
>> time the probe is successful, decrement counter2. When you reach the
>> condition of 'if (counter1 && counter2 == 0)' I think you can start soft
>> reserve discovery.
>>
>> A different idea came from Dan. Arm a timer on the first memdev probe. Kick
>> the timer to extend it every time a new memdev gets probed. At some point
>> things settle and the timer goes off to trigger soft reserved discovery.
>>
>> I think either one will not require special ordering of the modules being
>> loaded.
>>
>> DJ
>
> I think we might need both, the counters and a settling timer to coordinate
> Soft Reserved trimming and DAX registration.
>
> Here's the rough flow I'm thinking of. Let me know the flaws in this approach.

Seems reasonable to me. Don't forget to cancel the timer if your condition is
met and you are woken up early by a probe() finish. It really is best effort in
dealing with the situation.

DJ

>
> 1. cxl_acpi_probe() schedules cxl_softreserv_work_fn() and exits early.
>    This work item is responsible for trimming leftover Soft Reserved memory
>    ranges once all cxl_mem devices have finished probing.
>
> 2. A delayed work is initialized for the settle timer:
>
>    INIT_DELAYED_WORK(&cxl_probe_settle_work, cxl_probe_settle_fn);
>
> 3. In cxl_mem_probe():
>    - Increment counter2 (memdevs in progress).
>    - Increment counter1 (memdevs discovered).
>    - On probe completion (success or failure), decrement counter2.
>    - After each probe, re-arm the settle timer to extend the quiet
>      period if more devices arrive (this might fail, I'm not sure, if cxl
>      mem devices come in too late):
>        mod_delayed_work(system_wq, &cxl_probe_settle_work, 30 * HZ);
>    - Call wake_up(&cxl_softreserv_waitq); after each probe to notify
>      listeners.
>
> 4. The settle timer callback (cxl_probe_settle_fn()) runs when no new
>    devices have probed for a while (30s):
>        timer_expired = true;
>        wake_up(&cxl_softreserv_waitq);
>
> 5. In cxl_softreserv_work_fn():
>        wait_event(cxl_softreserv_waitq,
>                   atomic_read(&cxl_mem_counter1) > 0 &&
>                   atomic_read(&cxl_mem_counter2) == 0 &&
>                   atomic_read(&timer_expired));
>
> 6. Once unblocked, cxl_softreserv_work_fn() trims Soft Reserved regions via
>    cxl_region_softreserv_update().
>    (We do not perform any DAX fallback here as we don't want to end up with
>    unresolved symbols when DAX_HMEM loads too late.)
>
> 7. Separately, dax_hmem_platform_probe() runs independently on module load,
>    but also blocks on the same wait_event() condition if CONFIG_CXL_ACPI is
>    enabled. Once the condition is satisfied, it invokes
>    hmem_register_device() to register leftover Soft Reserved memory.
>
> Thanks
> Smita
>
>>
>>>>>>>
>>>>>>> Thanks
>>>>>>> Smita
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> This isn't all the logs, I trimmed. Let me know if you need more or
>>>>>>>> other info to reproduce.
>>>>>>>> >>>>>>>> [ 53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing >>>>>>>> [ 53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321 >>>>>>>> [ 53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1 >>>>>>>> [ 53.653540] preempt_count: 1, expected: 0 >>>>>>>> [ 53.653554] RCU nest depth: 0, expected: 0 >>>>>>>> [ 53.653568] 3 locks held by kworker/46:1/1875: >>>>>>>> [ 53.653569] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630 >>>>>>>> [ 53.653583] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630 >>>>>>>> [ 53.653589] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60 >>>>>>>> [ 53.653598] Preemption disabled at: >>>>>>>> [ 53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60 >>>>>>>> [ 53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary) >>>>>>>> [ 53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi] >>>>>>>> [ 53.653648] Call Trace: >>>>>>>> [ 53.653649] <TASK> >>>>>>>> [ 53.653652] dump_stack_lvl+0xa8/0xd0 >>>>>>>> [ 53.653658] dump_stack+0x14/0x20 >>>>>>>> [ 53.653659] __might_resched+0x1ae/0x2d0 >>>>>>>> [ 53.653666] __might_sleep+0x48/0x70 >>>>>>>> [ 53.653668] __kmalloc_node_track_caller_noprof+0x349/0x510 >>>>>>>> [ 53.653674] ? __devm_add_action+0x3d/0x160 >>>>>>>> [ 53.653685] ? __pfx_devm_action_release+0x10/0x10 >>>>>>>> [ 53.653688] __devres_alloc_node+0x4a/0x90 >>>>>>>> [ 53.653689] ? __devres_alloc_node+0x4a/0x90 >>>>>>>> [ 53.653691] ? 
__pfx_release_memregion+0x10/0x10 [dax_hmem] >>>>>>>> [ 53.653693] __devm_add_action+0x3d/0x160 >>>>>>>> [ 53.653696] hmem_register_device+0xea/0x230 [dax_hmem] >>>>>>>> [ 53.653700] hmem_fallback_register_device+0x37/0x60 >>>>>>>> [ 53.653703] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] >>>>>>>> [ 53.653739] walk_iomem_res_desc+0x55/0xb0 >>>>>>>> [ 53.653744] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] >>>>>>>> [ 53.653755] cxl_region_softreserv_update+0x46/0x50 [cxl_core] >>>>>>>> [ 53.653761] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] >>>>>>>> [ 53.653763] ? __pfx_autoremove_wake_function+0x10/0x10 >>>>>>>> [ 53.653768] process_one_work+0x1fa/0x630 >>>>>>>> [ 53.653774] worker_thread+0x1b2/0x360 >>>>>>>> [ 53.653777] kthread+0x128/0x250 >>>>>>>> [ 53.653781] ? __pfx_worker_thread+0x10/0x10 >>>>>>>> [ 53.653784] ? __pfx_kthread+0x10/0x10 >>>>>>>> [ 53.653786] ret_from_fork+0x139/0x1e0 >>>>>>>> [ 53.653790] ? __pfx_kthread+0x10/0x10 >>>>>>>> [ 53.653792] ret_from_fork_asm+0x1a/0x30 >>>>>>>> [ 53.653801] </TASK> >>>>>>>> >>>>>>>> [ 53.654193] ============================= >>>>>>>> [ 53.654203] [ BUG: Invalid wait context ] >>>>>>>> [ 53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G W >>>>>>>> [ 53.654623] ----------------------------- >>>>>>>> [ 53.654785] kworker/46:1/1875 is trying to lock: >>>>>>>> [ 53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390 >>>>>>>> [ 53.655115] other info that might help us debug this: >>>>>>>> [ 53.655273] context-{5:5} >>>>>>>> [ 53.655428] 3 locks held by kworker/46:1/1875: >>>>>>>> [ 53.655579] #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630 >>>>>>>> [ 53.655739] #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630 >>>>>>>> [ 53.655900] #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60 >>>>>>>> [ 53.656062] stack backtrace: >>>>>>>> [ 53.656224] 
CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary) >>>>>>>> [ 53.656227] Tainted: [W]=WARN >>>>>>>> [ 53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi] >>>>>>>> [ 53.656232] Call Trace: >>>>>>>> [ 53.656232] <TASK> >>>>>>>> [ 53.656234] dump_stack_lvl+0x85/0xd0 >>>>>>>> [ 53.656238] dump_stack+0x14/0x20 >>>>>>>> [ 53.656239] __lock_acquire+0xaf4/0x2200 >>>>>>>> [ 53.656246] lock_acquire+0xd8/0x300 >>>>>>>> [ 53.656248] ? kernfs_add_one+0x34/0x390 >>>>>>>> [ 53.656252] ? __might_resched+0x208/0x2d0 >>>>>>>> [ 53.656257] down_write+0x44/0xe0 >>>>>>>> [ 53.656262] ? kernfs_add_one+0x34/0x390 >>>>>>>> [ 53.656263] kernfs_add_one+0x34/0x390 >>>>>>>> [ 53.656265] kernfs_create_dir_ns+0x5a/0xa0 >>>>>>>> [ 53.656268] sysfs_create_dir_ns+0x74/0xd0 >>>>>>>> [ 53.656270] kobject_add_internal+0xb1/0x2f0 >>>>>>>> [ 53.656273] kobject_add+0x7d/0xf0 >>>>>>>> [ 53.656275] ? get_device_parent+0x28/0x1e0 >>>>>>>> [ 53.656280] ? __pfx_klist_children_get+0x10/0x10 >>>>>>>> [ 53.656282] device_add+0x124/0x8b0 >>>>>>>> [ 53.656285] ? dev_set_name+0x56/0x70 >>>>>>>> [ 53.656287] platform_device_add+0x102/0x260 >>>>>>>> [ 53.656289] hmem_register_device+0x160/0x230 [dax_hmem] >>>>>>>> [ 53.656291] hmem_fallback_register_device+0x37/0x60 >>>>>>>> [ 53.656294] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] >>>>>>>> [ 53.656323] walk_iomem_res_desc+0x55/0xb0 >>>>>>>> [ 53.656326] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] >>>>>>>> [ 53.656335] cxl_region_softreserv_update+0x46/0x50 [cxl_core] >>>>>>>> [ 53.656342] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] >>>>>>>> [ 53.656343] ? __pfx_autoremove_wake_function+0x10/0x10 >>>>>>>> [ 53.656346] process_one_work+0x1fa/0x630 >>>>>>>> [ 53.656350] worker_thread+0x1b2/0x360 >>>>>>>> [ 53.656352] kthread+0x128/0x250 >>>>>>>> [ 53.656354] ? __pfx_worker_thread+0x10/0x10 >>>>>>>> [ 53.656356] ? 
__pfx_kthread+0x10/0x10 >>>>>>>> [ 53.656357] ret_from_fork+0x139/0x1e0 >>>>>>>> [ 53.656360] ? __pfx_kthread+0x10/0x10 >>>>>>>> [ 53.656361] ret_from_fork_asm+0x1a/0x30 >>>>>>>> [ 53.656366] </TASK> >>>>>>>> [ 53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002 >>>>>>>> [ 53.663552] schedule+0x4a/0x160 >>>>>>>> [ 53.663553] schedule_timeout+0x10a/0x120 >>>>>>>> [ 53.663555] ? debug_smp_processor_id+0x1b/0x30 >>>>>>>> [ 53.663556] ? trace_hardirqs_on+0x5f/0xd0 >>>>>>>> [ 53.663558] __wait_for_common+0xb9/0x1c0 >>>>>>>> [ 53.663559] ? __pfx_schedule_timeout+0x10/0x10 >>>>>>>> [ 53.663561] wait_for_completion+0x28/0x30 >>>>>>>> [ 53.663562] __synchronize_srcu+0xbf/0x180 >>>>>>>> [ 53.663566] ? __pfx_wakeme_after_rcu+0x10/0x10 >>>>>>>> [ 53.663571] ? i2c_repstart+0x30/0x80 >>>>>>>> [ 53.663576] synchronize_srcu+0x46/0x120 >>>>>>>> [ 53.663577] kill_dax+0x47/0x70 >>>>>>>> [ 53.663580] __devm_create_dev_dax+0x112/0x470 >>>>>>>> [ 53.663582] devm_create_dev_dax+0x26/0x50 >>>>>>>> [ 53.663584] dax_hmem_probe+0x87/0xd0 [dax_hmem] >>>>>>>> [ 53.663585] platform_probe+0x61/0xd0 >>>>>>>> [ 53.663589] really_probe+0xe2/0x390 >>>>>>>> [ 53.663591] ? __pfx___device_attach_driver+0x10/0x10 >>>>>>>> [ 53.663593] __driver_probe_device+0x7e/0x160 >>>>>>>> [ 53.663594] driver_probe_device+0x23/0xa0 >>>>>>>> [ 53.663596] __device_attach_driver+0x92/0x120 >>>>>>>> [ 53.663597] bus_for_each_drv+0x8c/0xf0 >>>>>>>> [ 53.663599] __device_attach+0xc2/0x1f0 >>>>>>>> [ 53.663601] device_initial_probe+0x17/0x20 >>>>>>>> [ 53.663603] bus_probe_device+0xa8/0xb0 >>>>>>>> [ 53.663604] device_add+0x687/0x8b0 >>>>>>>> [ 53.663607] ? dev_set_name+0x56/0x70 >>>>>>>> [ 53.663609] platform_device_add+0x102/0x260 >>>>>>>> [ 53.663610] hmem_register_device+0x160/0x230 [dax_hmem] >>>>>>>> [ 53.663612] hmem_fallback_register_device+0x37/0x60 >>>>>>>> [ 53.663614] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] >>>>>>>> [ 53.663637] walk_iomem_res_desc+0x55/0xb0 >>>>>>>> [ 53.663640] ? 
__pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] >>>>>>>> [ 53.663647] cxl_region_softreserv_update+0x46/0x50 [cxl_core] >>>>>>>> [ 53.663654] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] >>>>>>>> [ 53.663655] ? __pfx_autoremove_wake_function+0x10/0x10 >>>>>>>> [ 53.663658] process_one_work+0x1fa/0x630 >>>>>>>> [ 53.663662] worker_thread+0x1b2/0x360 >>>>>>>> [ 53.663664] kthread+0x128/0x250 >>>>>>>> [ 53.663666] ? __pfx_worker_thread+0x10/0x10 >>>>>>>> [ 53.663668] ? __pfx_kthread+0x10/0x10 >>>>>>>> [ 53.663670] ret_from_fork+0x139/0x1e0 >>>>>>>> [ 53.663672] ? __pfx_kthread+0x10/0x10 >>>>>>>> [ 53.663673] ret_from_fork_asm+0x1a/0x30 >>>>>>>> [ 53.663677] </TASK> >>>>>>>> [ 53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002 >>>>>>>> [ 53.700264] INFO: lockdep is turned off. >>>>>>>> [ 53.701315] Preemption disabled at: >>>>>>>> [ 53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60 >>>>>>>> [ 53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G W 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary) >>>>>>>> [ 53.701633] Tainted: [W]=WARN >>>>>>>> [ 53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi] >>>>>>>> [ 53.701638] Call Trace: >>>>>>>> [ 53.701638] <TASK> >>>>>>>> [ 53.701640] dump_stack_lvl+0xa8/0xd0 >>>>>>>> [ 53.701644] dump_stack+0x14/0x20 >>>>>>>> [ 53.701645] __schedule_bug+0xa2/0xd0 >>>>>>>> [ 53.701649] __schedule+0xe6f/0x10d0 >>>>>>>> [ 53.701652] ? debug_smp_processor_id+0x1b/0x30 >>>>>>>> [ 53.701655] ? lock_release+0x1e6/0x2b0 >>>>>>>> [ 53.701658] ? trace_hardirqs_on+0x5f/0xd0 >>>>>>>> [ 53.701661] schedule+0x4a/0x160 >>>>>>>> [ 53.701662] schedule_timeout+0x10a/0x120 >>>>>>>> [ 53.701664] ? debug_smp_processor_id+0x1b/0x30 >>>>>>>> [ 53.701666] ? trace_hardirqs_on+0x5f/0xd0 >>>>>>>> [ 53.701667] __wait_for_common+0xb9/0x1c0 >>>>>>>> [ 53.701668] ? 
__pfx_schedule_timeout+0x10/0x10 >>>>>>>> [ 53.701670] wait_for_completion+0x28/0x30 >>>>>>>> [ 53.701671] __synchronize_srcu+0xbf/0x180 >>>>>>>> [ 53.701677] ? __pfx_wakeme_after_rcu+0x10/0x10 >>>>>>>> [ 53.701682] ? i2c_repstart+0x30/0x80 >>>>>>>> [ 53.701685] synchronize_srcu+0x46/0x120 >>>>>>>> [ 53.701687] kill_dax+0x47/0x70 >>>>>>>> [ 53.701689] __devm_create_dev_dax+0x112/0x470 >>>>>>>> [ 53.701691] devm_create_dev_dax+0x26/0x50 >>>>>>>> [ 53.701693] dax_hmem_probe+0x87/0xd0 [dax_hmem] >>>>>>>> [ 53.701695] platform_probe+0x61/0xd0 >>>>>>>> [ 53.701698] really_probe+0xe2/0x390 >>>>>>>> [ 53.701700] ? __pfx___device_attach_driver+0x10/0x10 >>>>>>>> [ 53.701701] __driver_probe_device+0x7e/0x160 >>>>>>>> [ 53.701703] driver_probe_device+0x23/0xa0 >>>>>>>> [ 53.701704] __device_attach_driver+0x92/0x120 >>>>>>>> [ 53.701706] bus_for_each_drv+0x8c/0xf0 >>>>>>>> [ 53.701708] __device_attach+0xc2/0x1f0 >>>>>>>> [ 53.701710] device_initial_probe+0x17/0x20 >>>>>>>> [ 53.701711] bus_probe_device+0xa8/0xb0 >>>>>>>> [ 53.701712] device_add+0x687/0x8b0 >>>>>>>> [ 53.701715] ? dev_set_name+0x56/0x70 >>>>>>>> [ 53.701717] platform_device_add+0x102/0x260 >>>>>>>> [ 53.701718] hmem_register_device+0x160/0x230 [dax_hmem] >>>>>>>> [ 53.701720] hmem_fallback_register_device+0x37/0x60 >>>>>>>> [ 53.701722] cxl_softreserv_mem_register+0x24/0x30 [cxl_core] >>>>>>>> [ 53.701734] walk_iomem_res_desc+0x55/0xb0 >>>>>>>> [ 53.701738] ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core] >>>>>>>> [ 53.701745] cxl_region_softreserv_update+0x46/0x50 [cxl_core] >>>>>>>> [ 53.701751] cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi] >>>>>>>> [ 53.701752] ? __pfx_autoremove_wake_function+0x10/0x10 >>>>>>>> [ 53.701756] process_one_work+0x1fa/0x630 >>>>>>>> [ 53.701760] worker_thread+0x1b2/0x360 >>>>>>>> [ 53.701762] kthread+0x128/0x250 >>>>>>>> [ 53.701765] ? __pfx_worker_thread+0x10/0x10 >>>>>>>> [ 53.701766] ? 
__pfx_kthread+0x10/0x10 >>>>>>>> [ 53.701768] ret_from_fork+0x139/0x1e0 >>>>>>>> [ 53.701771] ? __pfx_kthread+0x10/0x10 >>>>>>>> [ 53.701772] ret_from_fork_asm+0x1a/0x30 >>>>>>>> [ 53.701777] </TASK> >>>>>>>> >>>>>>> >>>>> >>> >> > >
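The counter and settle-timer flow proposed in the message above can be modeled in user space roughly as follows. This is only a sketch under stated assumptions: struct probe_tracker, its helpers, and SETTLE_TICKS are hypothetical names, time is a plain tick count rather than jiffies, and the settle timer is modeled as a deadline rather than the timer_expired flag set from cxl_probe_settle_fn(). Locking and wait queues are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
#define SETTLE_TICKS 30UL	/* stands in for the 30 * HZ quiet period */

struct probe_tracker {
	unsigned int discovered;	/* "counter1": probes ever started */
	unsigned int in_flight;		/* "counter2": probes not yet finished */
	unsigned long settle_at;	/* tick at which the quiet period ends */
	bool timer_armed;		/* false until the first probe finishes */
};

static void tracker_init(struct probe_tracker *t)
{
	t->discovered = 0;
	t->in_flight = 0;
	t->settle_at = 0;
	t->timer_armed = false;
}

/* A memdev probe starts: bump both counters. */
static void probe_begin(struct probe_tracker *t)
{
	t->discovered++;
	t->in_flight++;
}

/*
 * A memdev probe finishes (success or failure): drop the in-flight
 * count and push the settle deadline out, as re-arming the delayed
 * work would.
 */
static void probe_end(struct probe_tracker *t, unsigned long now)
{
	assert(t->in_flight > 0);
	t->in_flight--;
	t->settle_at = now + SETTLE_TICKS;
	t->timer_armed = true;
}

/*
 * The condition the trimming work would wait on: at least one memdev
 * was seen, none is still probing, and the quiet period has elapsed.
 */
static bool ready_to_trim(const struct probe_tracker *t, unsigned long now)
{
	bool settled = t->timer_armed && now >= t->settle_at;

	return t->discovered > 0 && t->in_flight == 0 && settled;
}
```

One property the model makes visible: if no memdev ever probes, ready_to_trim() never becomes true, so a real implementation would still need some outer bound for systems without CXL memory devices.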
Smita Koralahalli wrote:
> This series introduces the ability to manage SOFT RESERVED iomem
> resources, enabling the CXL driver to remove any portions that
> intersect with created CXL regions.
>
> The current approach of leaving SOFT RESERVED entries as is can result
> in failures during device hotplug such as CXL because the address range
> remains reserved and unavailable for reuse even after region teardown.

I will go through the patches, but the main concern here is not hotplug,
it is region assembly failure. We have a constant drip of surprising
platform behaviors that trip up the driver, leaving memory stranded.

Specifically, device-dax defers to CXL to assemble the region
representing the soft-reserve range, CXL fails to complete that assembly
due to being confused by the platform, and the end user wonders why their
platform BIOS sees memory capacity that Linux does not see.

So the priority order of solutions needed here is:

1/ Fix all shipping platform "quirks", and try to prevent new ones from
   being created. I.e. ideally, long term, Linux does not need a
   soft-reserve fallback and just always ignores Soft Reserve in CXL
   Windows because the CXL subsystem will handle it.

2/ In the near-term foreseeable future, for all yet to be solved or yet to
   be discovered platform quirks, provide a device-dax fallback to recover
   baseline device-dax behavior (equivalent to putting cxl_acpi on a
   modprobe deny-list).

3/ For hotplug, remove the conflicting resource.

> To address this, the CXL driver now uses a background worker that waits
> for cxl_mem driver probe to complete before scanning for intersecting
> resources. Then the driver walks through created CXL regions to trim any
> intersections with SOFT RESERVED resources in the iomem tree.

The precision of this gives me pause. I think it is fine to make this
more coarse because any mismatch between Soft Reserve and a CXL Window
resource should be cause to give up on the CXL side.

If a Soft Reserve range straddles a CXL window and "System RAM", give up
on trying to use the CXL driver on that system. If CXL does not
completely cover a soft-reserve region, give up on trying to use the CXL
driver on that system.

Effectively, any time we detect unexpected platform shenanigans it likely
indicates missing understanding in the Linux driver.

> The following scenarios have been tested:

Nice! Appreciate you including the test case results.

[..]
> Example 3: No alignment
> |---------- "Soft Reserved" ----------|
>         |---- "Region #" ----|

Per above, the CXL subsystem should completely give up in this scenario.
The BIOS said that all of the range is Conventional memory and CXL is only
creating a region for part of it. Somebody is wrong. Given the fact that
non-CXL aware OSes would try to use the entirety of the Soft Reserved
region, this scenario is "disable CXL, it clearly does not understand this
platform".
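The coarse "give up on any mismatch" rule described above can be illustrated with a small user-space check. This is a sketch of one possible reading, not code from the series: struct range, range_fully_covered() and cxl_should_give_up() are hypothetical names, ranges are inclusive, and the CXL ranges are assumed to be sorted by start address and non-overlapping. Whether coverage should be judged against CXL windows or against assembled regions is also an assumption left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inclusive address range, as printed in /proc/iomem. */
struct range {
	uint64_t start;
	uint64_t end;
};

static bool range_overlaps(const struct range *a, const struct range *b)
{
	return a->start <= b->end && b->start <= a->end;
}

/*
 * True if every address in @sr is covered by the union of @cxl[].
 * Assumes @cxl[] is sorted by start and its entries do not overlap.
 */
static bool range_fully_covered(const struct range *sr,
				const struct range *cxl, size_t nr_cxl)
{
	uint64_t next = sr->start;	/* first address not yet covered */

	assert(sr->start <= sr->end);
	for (size_t i = 0; i < nr_cxl; i++) {
		if (cxl[i].end < next)
			continue;	/* lies entirely below @next */
		if (cxl[i].start > next)
			return false;	/* hole before this range */
		if (cxl[i].end >= sr->end)
			return true;	/* reached the end of @sr */
		next = cxl[i].end + 1;
	}
	return false;			/* ran out of CXL ranges */
}

/*
 * Coarse policy: if any Soft Reserved range touches CXL address space
 * without being completely covered by it, stop using the CXL driver
 * and let the device-dax fallback own the memory.
 */
static bool cxl_should_give_up(const struct range *sr, size_t nr_sr,
			       const struct range *cxl, size_t nr_cxl)
{
	for (size_t i = 0; i < nr_sr; i++) {
		bool touches = false;

		for (size_t j = 0; j < nr_cxl; j++)
			touches = touches || range_overlaps(&sr[i], &cxl[j]);
		if (touches && !range_fully_covered(&sr[i], cxl, nr_cxl))
			return true;
	}
	return false;
}
```

Under this reading, Example 1 (exact alignment) and Example 2 (one Soft Reserved range spanning three back-to-back regions) from the cover letter keep CXL enabled, while an Example 3 style layout, where a region covers only the middle of a Soft Reserved range, makes cxl_should_give_up() return true.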