[v5] Add managed SOFT RESERVE resource handling

[PATCH v5 0/7] Add managed SOFT RESERVE resource handling

Posted by Smita Koralahalli 2 months, 3 weeks ago

This series introduces the ability to manage SOFT RESERVED iomem
resources, enabling the CXL driver to remove any portions that
intersect with created CXL regions.

The current approach of leaving SOFT RESERVED entries as is can cause
failures when hotplugging devices such as CXL, because the address range
remains reserved and unavailable for reuse even after region teardown.

To address this, the CXL driver now uses a background worker that waits
for cxl_mem driver probe to complete before scanning for intersecting
resources. Then the driver walks through created CXL regions to trim any
intersections with SOFT RESERVED resources in the iomem tree.

The following scenarios have been tested:

Example 1: Exact alignment, soft reserved is a child of the region

|---------- "Soft Reserved" -----------|
|-------------- "Region #" ------------|

Before:
  1050000000-304fffffff : CXL Window 0
    1050000000-304fffffff : region0
      1050000000-304fffffff : Soft Reserved
        1080000000-2fffffffff : dax0.0
          1080000000-2fffffffff : System RAM (kmem)

After:
  1050000000-304fffffff : CXL Window 0
    1050000000-304fffffff : region0
      1080000000-2fffffffff : dax0.0
        1080000000-2fffffffff : System RAM (kmem)

Example 2: Start and/or end aligned and soft reserved spans multiple
regions
|----------- "Soft Reserved" -----------|
|-------- "Region #" -------|
or
|----------- "Soft Reserved" -----------|
            |-------- "Region #" -------|

Before:
  850000000-684fffffff : Soft Reserved
    850000000-284fffffff : CXL Window 0
      850000000-284fffffff : region3
        850000000-284fffffff : dax0.0
          850000000-284fffffff : System RAM (kmem)
    2850000000-484fffffff : CXL Window 1
      2850000000-484fffffff : region4
        2850000000-484fffffff : dax1.0
          2850000000-484fffffff : System RAM (kmem)
    4850000000-684fffffff : CXL Window 2
      4850000000-684fffffff : region5
        4850000000-684fffffff : dax2.0
          4850000000-684fffffff : System RAM (kmem)

After:
  850000000-284fffffff : CXL Window 0
    850000000-284fffffff : region3
      850000000-284fffffff : dax0.0
        850000000-284fffffff : System RAM (kmem)
  2850000000-484fffffff : CXL Window 1
    2850000000-484fffffff : region4
      2850000000-484fffffff : dax1.0
        2850000000-484fffffff : System RAM (kmem)
  4850000000-684fffffff : CXL Window 2
    4850000000-684fffffff : region5
      4850000000-684fffffff : dax2.0
        4850000000-684fffffff : System RAM (kmem)

Example 3: No alignment
|---------- "Soft Reserved" ----------|
	|---- "Region #" ----|

Before:
  00000000-3050000ffd : Soft Reserved
    ..
    ..
    1050000000-304fffffff : CXL Window 0
      1050000000-304fffffff : region1
        1080000000-2fffffffff : dax0.0
          1080000000-2fffffffff : System RAM (kmem)

After:
  00000000-104fffffff : Soft Reserved
    ..
    ..
  1050000000-304fffffff : CXL Window 0
    1050000000-304fffffff : region1
      1080000000-2fffffffff : dax0.0
        1080000000-2fffffffff : System RAM (kmem)
  3050000000-3050000ffd : Soft Reserved
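
For illustration only: the three cases above reduce to subtracting a
region's range from a Soft Reserved range, which leaves zero, one, or
two remainders. The sketch below is a standalone userspace model, not
the code in this series; struct range and trim_soft_reserved() are
hypothetical names, while the series itself works on resources in the
iomem tree.

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive address range, like one line of /proc/iomem. */
struct range {
	uint64_t start;
	uint64_t end;
};

/*
 * Hypothetical helper (not a kernel API): remove 'region' from 'sr' and
 * store what is left of 'sr' in out[0..1]. Returns the number of
 * remaining pieces: 0, 1 or 2.
 */
static int trim_soft_reserved(struct range sr, struct range region,
			      struct range out[2])
{
	int n = 0;

	assert(sr.start <= sr.end && region.start <= region.end);

	/* No intersection: the soft reserved range is left untouched. */
	if (region.end < sr.start || region.start > sr.end) {
		out[n++] = sr;
		return n;
	}

	/* Piece below the region, if any. */
	if (sr.start < region.start)
		out[n++] = (struct range){ sr.start, region.start - 1 };

	/* Piece above the region, if any. */
	if (sr.end > region.end)
		out[n++] = (struct range){ region.end + 1, sr.end };

	return n;
}
```

With Example 3's values (Soft Reserved 0x0-0x3050000ffd, region
0x1050000000-0x304fffffff) the helper returns 2, with the pieces
0x0-0x104fffffff and 0x3050000000-0x3050000ffd, matching the "After"
listing. Example 1 (exact alignment) returns 0.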

Link to v4:
https://lore.kernel.org/linux-cxl/20250603221949.53272-1-Smita.KoralahalliChannabasappa@amd.com

v5 updates:
 - Handled cases where CXL driver loads early even before HMEM driver is
   initialized.
 - Introduced callback functions to resolve dependencies.
 - Rename suspend.c to probe_state.c.
 - Refactor cxl_acpi_probe() to use a single exit path.
 - Commit description update to justify cxl_mem_active() usage.
 - Change from kmalloc -> kzalloc in add_soft_reserved().
 - Change from goto to if else blocks inside remove_soft_reserved().
 - DEFINE_RES_MEM_NAMED -> DEFINE_RES_NAMED_DESC.
 - Comments for flags inside remove_soft_reserved().
 - Add resource_lock inside normalize_resource().
 - bus_find_next_device -> bus_find_device.
 - Skip DAX consumption of soft reserves inside hmat with
   CONFIG_CXL_ACPI checks.

v4 updates:
 - Split first patch into 4 smaller patches.
 - Correct the logic for cxl_pci_loaded() and cxl_mem_active() to return
   false by default instead of true.
 - Cleanup cxl_wait_for_pci_mem() to remove config checks for cxl_pci
   and cxl_mem.
 - Fixed multiple bugs and build issues, including correcting
   walk_iomem_res_desc() and the alignment calculations.
 
v3 updates:
 - Remove srmem resource tree from kernel/resource.c; it is no longer
   needed in the current implementation. All SOFT RESERVE resources are
   now put on the iomem resource tree.
 - Remove the no longer needed SOFT_RESERVED_MANAGED kernel config option.
 - Add the 'nid' parameter back to hmem_register_resource();
 - Remove the no longer used soft reserve notification chain (introduced
   in v2). The dax driver is now notified of SOFT RESERVED resources by
   the CXL driver.

v2 updates:
 - Add config option SOFT_RESERVE_MANAGED to control use of the
   separate srmem resource tree at boot.
 - Only add SOFT RESERVE resources to the soft reserve tree during
   boot; they go to the iomem resource tree after boot.
 - Remove the resource trimming code in the previous patch to re-use
   the existing code in kernel/resource.c
 - Add functionality for the cxl acpi driver to wait for the cxl PCI
   and mem drivers to load.

Smita Koralahalli (7):
  cxl/acpi: Refactor cxl_acpi_probe() to always schedule fallback DAX
    registration
  cxl/core: Rename suspend.c to probe_state.c and remove
    CONFIG_CXL_SUSPEND
  cxl/acpi: Add background worker to coordinate with cxl_mem probe
    completion
  cxl/region: Introduce SOFT RESERVED resource removal on region
    teardown
  dax/hmem: Save the DAX HMEM platform device pointer
  dax/hmem, cxl: Defer DAX consumption of SOFT RESERVED resources until
    after CXL region creation
  dax/hmem: Preserve fallback SOFT RESERVED regions if DAX HMEM loads
    late

 drivers/acpi/numa/hmat.c                      |   4 +
 drivers/cxl/Kconfig                           |   4 -
 drivers/cxl/acpi.c                            |  50 +++++--
 drivers/cxl/core/Makefile                     |   2 +-
 drivers/cxl/core/{suspend.c => probe_state.c} |  10 +-
 drivers/cxl/core/region.c                     | 135 ++++++++++++++++++
 drivers/cxl/cxl.h                             |   4 +
 drivers/cxl/cxlmem.h                          |   9 --
 drivers/dax/hmem/Makefile                     |   1 +
 drivers/dax/hmem/device.c                     |  62 ++++----
 drivers/dax/hmem/hmem.c                       |  14 +-
 drivers/dax/hmem/hmem_notify.c                |  29 ++++
 include/linux/dax.h                           |   7 +-
 include/linux/ioport.h                        |   1 +
 include/linux/pm.h                            |   7 -
 kernel/resource.c                             |  34 +++++
 16 files changed, 307 insertions(+), 66 deletions(-)
 rename drivers/cxl/core/{suspend.c => probe_state.c} (62%)
 create mode 100644 drivers/dax/hmem/hmem_notify.c

-- 
2.17.1

Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling

Posted by Zhijian Li (Fujitsu) 2 months, 2 weeks ago

Smita,

I have not yet completed all of my local test patterns. Nonetheless, in addition to the issues highlighted by Alison, I have also encountered some regressions.

Based on your conversation with Alison, it appears you have decided to refactor the series. I therefore intend to stop testing this version until the updated iteration is available.

Here is what I have verified thus far (kernel built upon the cxl/next 20250718):

A) No Soft reserved (BIOS did not expose EFI_SPECIAL_PURPOSE)
- A.1 Decoder not committed (default QEMU emulation)

Before:
```
fffc0000-ffffffff : Reserved
100000000-27fffffff : System RAM
5c0001128-5c00011b7 : port1
5d0000000-6cfffffff : CXL Window 0
6d0000000-7cfffffff : CXL Window 1
7000000000-700000ffff : PCI Bus 0000:0c
   7000000000-700000ffff : 0000:0c:00.0
7000010000-700001ffff : PCI Bus 0000:0e
   7000010000-700001ffff : 0000:0e:00.0
     7000011080-70000110d7 : mem0
```
After (CXL window is absent):
```
fed00000-fed003ff : PNP0103:00
fed1c000-fed1ffff : Reserved
feffc000-feffffff : Reserved
fffc0000-ffffffff : Reserved
100000000-27fffffff : System RAM
7000000000-700000ffff : PCI Bus 0000:0c
   7000000000-700000ffff : 0000:0c:00.0
7000010000-700001ffff : PCI Bus 0000:0e
   7000010000-700001ffff : 0000:0e:00.0
7000020000-703fffffff : PCI Bus 0000:00
```
- A.2 Decoder is committed

Before:
```
100000000-27fffffff : System RAM
5c0001128-5c00011b7 : port1
5d0000000-6cfffffff : CXL Window 0
   5d0000000-6cfffffff : region0
     5d0000000-6cfffffff : dax0.0
       5d0000000-6cfffffff : System RAM (kmem)
7000000000-700000ffff : PCI Bus 0000:0c
   7000000000-700000ffff : 0000:0c:00.0
```
After (CXL window is absent):
```
feffc000-feffffff : Reserved
fffc0000-ffffffff : Reserved
100000000-27fffffff : System RAM
7000000000-700000ffff : PCI Bus 0000:0c
   7000000000-700000ffff : 0000:0c:00.0
7000010000-700001ffff : PCI Bus 0000:0e
   7000010000-700001ffff : 0000:0e:00.0
7000020000-703fffffff : PCI Bus 0000:00
```

B) EFI_SPECIAL_PURPOSE is set
- B.1 Decoder not committed

Before:
```
5d0000000-7cfffffff : Soft Reserved
   5d0000000-6cfffffff : CXL Window 0
   6d0000000-7cfffffff : CXL Window 1
```
After (fallback to hmem):
```
5d0000000-7cfffffff : Soft Reserved
   5d0000000-7cfffffff : dax0.0
     5d0000000-7cfffffff : System RAM (kmem)
```

- B.2 Decoder is committed

Before:
```
5d0000000-6cfffffff : CXL Window 0
   5d0000000-6cfffffff : region0
     5d0000000-6cfffffff : Soft Reserved
       5d0000000-6cfffffff : dax0.0
         5d0000000-6cfffffff : System RAM (kmem)
```
After (fallback to hmem):
```
5d0000000-6cfffffff : Soft Reserved
   5d0000000-6cfffffff : dax0.0
     5d0000000-6cfffffff : System RAM (kmem)
```
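
For anyone repeating the before/after comparison above, a small parser for the listing format can make it easier to check whether a "CXL Window" entry survived. This is only a userspace sketch under the assumption that each line has the form "start-end : name" with space indentation; struct iomem_entry, parse_iomem_line() and is_cxl_window() are hypothetical names, not part of the series.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct iomem_entry {
	int indent;			/* number of leading spaces */
	unsigned long long start;	/* first address, inclusive */
	unsigned long long end;		/* last address, inclusive */
	char name[64];			/* resource name, e.g. "region0" */
};

/* Returns 1 on success, 0 if the line is not "start-end : name". */
static int parse_iomem_line(const char *line, struct iomem_entry *e)
{
	int indent = 0;

	assert(line && e);

	/* The nesting level is encoded as leading spaces. */
	while (line[indent] == ' ')
		indent++;

	if (sscanf(line + indent, "%llx-%llx : %63[^\n]",
		   &e->start, &e->end, e->name) != 3)
		return 0;

	e->indent = indent;
	return 1;
}

/* True if the entry is named like the CXL windows in the listings. */
static int is_cxl_window(const struct iomem_entry *e)
{
	return strncmp(e->name, "CXL Window", strlen("CXL Window")) == 0;
}
```

Feeding every line of a "Before" and an "After" listing through parse_iomem_line() and counting the entries for which is_cxl_window() is true would flag cases such as A.1 and A.2, where the windows disappear.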

Thanks
Zhijian



On 16/07/2025 02:04, Smita Koralahalli wrote:
> This series introduces the ability to manage SOFT RESERVED iomem
> resources, enabling the CXL driver to remove any portions that
> intersect with created CXL regions.

Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling

Posted by Alison Schofield 2 months, 3 weeks ago

On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
> This series introduces the ability to manage SOFT RESERVED iomem
> resources, enabling the CXL driver to remove any portions that
> intersect with created CXL regions.

Hi Smita,

This set applied cleanly to today's cxl-next but fails before region
probe, as shown in the log appended below.

BTW - there were sparse warnings in the build that look related:
  CHECK   drivers/dax/hmem/hmem_notify.c
drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit


This isn't all the logs, I trimmed. Let me know if you need more or
other info to reproduce.

[   53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing
[   53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
[   53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1
[   53.653540] preempt_count: 1, expected: 0
[   53.653554] RCU nest depth: 0, expected: 0
[   53.653568] 3 locks held by kworker/46:1/1875:
[   53.653569]  #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
[   53.653583]  #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
[   53.653589]  #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
[   53.653598] Preemption disabled at:
[   53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
[   53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary) 
[   53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
[   53.653648] Call Trace:
[   53.653649]  <TASK>
[   53.653652]  dump_stack_lvl+0xa8/0xd0
[   53.653658]  dump_stack+0x14/0x20
[   53.653659]  __might_resched+0x1ae/0x2d0
[   53.653666]  __might_sleep+0x48/0x70
[   53.653668]  __kmalloc_node_track_caller_noprof+0x349/0x510
[   53.653674]  ? __devm_add_action+0x3d/0x160
[   53.653685]  ? __pfx_devm_action_release+0x10/0x10
[   53.653688]  __devres_alloc_node+0x4a/0x90
[   53.653689]  ? __devres_alloc_node+0x4a/0x90
[   53.653691]  ? __pfx_release_memregion+0x10/0x10 [dax_hmem]
[   53.653693]  __devm_add_action+0x3d/0x160
[   53.653696]  hmem_register_device+0xea/0x230 [dax_hmem]
[   53.653700]  hmem_fallback_register_device+0x37/0x60
[   53.653703]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
[   53.653739]  walk_iomem_res_desc+0x55/0xb0
[   53.653744]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
[   53.653755]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
[   53.653761]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
[   53.653763]  ? __pfx_autoremove_wake_function+0x10/0x10
[   53.653768]  process_one_work+0x1fa/0x630
[   53.653774]  worker_thread+0x1b2/0x360
[   53.653777]  kthread+0x128/0x250
[   53.653781]  ? __pfx_worker_thread+0x10/0x10
[   53.653784]  ? __pfx_kthread+0x10/0x10
[   53.653786]  ret_from_fork+0x139/0x1e0
[   53.653790]  ? __pfx_kthread+0x10/0x10
[   53.653792]  ret_from_fork_asm+0x1a/0x30
[   53.653801]  </TASK>

[   53.654193] =============================
[   53.654203] [ BUG: Invalid wait context ]
[   53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G        W          
[   53.654623] -----------------------------
[   53.654785] kworker/46:1/1875 is trying to lock:
[   53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390
[   53.655115] other info that might help us debug this:
[   53.655273] context-{5:5}
[   53.655428] 3 locks held by kworker/46:1/1875:
[   53.655579]  #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
[   53.655739]  #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
[   53.655900]  #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
[   53.656062] stack backtrace:
[   53.656224] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G        W           6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary) 
[   53.656227] Tainted: [W]=WARN
[   53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
[   53.656232] Call Trace:
[   53.656232]  <TASK>
[   53.656234]  dump_stack_lvl+0x85/0xd0
[   53.656238]  dump_stack+0x14/0x20
[   53.656239]  __lock_acquire+0xaf4/0x2200
[   53.656246]  lock_acquire+0xd8/0x300
[   53.656248]  ? kernfs_add_one+0x34/0x390
[   53.656252]  ? __might_resched+0x208/0x2d0
[   53.656257]  down_write+0x44/0xe0
[   53.656262]  ? kernfs_add_one+0x34/0x390
[   53.656263]  kernfs_add_one+0x34/0x390
[   53.656265]  kernfs_create_dir_ns+0x5a/0xa0
[   53.656268]  sysfs_create_dir_ns+0x74/0xd0
[   53.656270]  kobject_add_internal+0xb1/0x2f0
[   53.656273]  kobject_add+0x7d/0xf0
[   53.656275]  ? get_device_parent+0x28/0x1e0
[   53.656280]  ? __pfx_klist_children_get+0x10/0x10
[   53.656282]  device_add+0x124/0x8b0
[   53.656285]  ? dev_set_name+0x56/0x70
[   53.656287]  platform_device_add+0x102/0x260
[   53.656289]  hmem_register_device+0x160/0x230 [dax_hmem]
[   53.656291]  hmem_fallback_register_device+0x37/0x60
[   53.656294]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
[   53.656323]  walk_iomem_res_desc+0x55/0xb0
[   53.656326]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
[   53.656335]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
[   53.656342]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
[   53.656343]  ? __pfx_autoremove_wake_function+0x10/0x10
[   53.656346]  process_one_work+0x1fa/0x630
[   53.656350]  worker_thread+0x1b2/0x360
[   53.656352]  kthread+0x128/0x250
[   53.656354]  ? __pfx_worker_thread+0x10/0x10
[   53.656356]  ? __pfx_kthread+0x10/0x10
[   53.656357]  ret_from_fork+0x139/0x1e0
[   53.656360]  ? __pfx_kthread+0x10/0x10
[   53.656361]  ret_from_fork_asm+0x1a/0x30
[   53.656366]  </TASK>
[   53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
[   53.663552]  schedule+0x4a/0x160
[   53.663553]  schedule_timeout+0x10a/0x120
[   53.663555]  ? debug_smp_processor_id+0x1b/0x30
[   53.663556]  ? trace_hardirqs_on+0x5f/0xd0
[   53.663558]  __wait_for_common+0xb9/0x1c0
[   53.663559]  ? __pfx_schedule_timeout+0x10/0x10
[   53.663561]  wait_for_completion+0x28/0x30
[   53.663562]  __synchronize_srcu+0xbf/0x180
[   53.663566]  ? __pfx_wakeme_after_rcu+0x10/0x10
[   53.663571]  ? i2c_repstart+0x30/0x80
[   53.663576]  synchronize_srcu+0x46/0x120
[   53.663577]  kill_dax+0x47/0x70
[   53.663580]  __devm_create_dev_dax+0x112/0x470
[   53.663582]  devm_create_dev_dax+0x26/0x50
[   53.663584]  dax_hmem_probe+0x87/0xd0 [dax_hmem]
[   53.663585]  platform_probe+0x61/0xd0
[   53.663589]  really_probe+0xe2/0x390
[   53.663591]  ? __pfx___device_attach_driver+0x10/0x10
[   53.663593]  __driver_probe_device+0x7e/0x160
[   53.663594]  driver_probe_device+0x23/0xa0
[   53.663596]  __device_attach_driver+0x92/0x120
[   53.663597]  bus_for_each_drv+0x8c/0xf0
[   53.663599]  __device_attach+0xc2/0x1f0
[   53.663601]  device_initial_probe+0x17/0x20
[   53.663603]  bus_probe_device+0xa8/0xb0
[   53.663604]  device_add+0x687/0x8b0
[   53.663607]  ? dev_set_name+0x56/0x70
[   53.663609]  platform_device_add+0x102/0x260
[   53.663610]  hmem_register_device+0x160/0x230 [dax_hmem]
[   53.663612]  hmem_fallback_register_device+0x37/0x60
[   53.663614]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
[   53.663637]  walk_iomem_res_desc+0x55/0xb0
[   53.663640]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
[   53.663647]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
[   53.663654]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
[   53.663655]  ? __pfx_autoremove_wake_function+0x10/0x10
[   53.663658]  process_one_work+0x1fa/0x630
[   53.663662]  worker_thread+0x1b2/0x360
[   53.663664]  kthread+0x128/0x250
[   53.663666]  ? __pfx_worker_thread+0x10/0x10
[   53.663668]  ? __pfx_kthread+0x10/0x10
[   53.663670]  ret_from_fork+0x139/0x1e0
[   53.663672]  ? __pfx_kthread+0x10/0x10
[   53.663673]  ret_from_fork_asm+0x1a/0x30
[   53.663677]  </TASK>
[   53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
[   53.700264] INFO: lockdep is turned off.
[   53.701315] Preemption disabled at:
[   53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
[   53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G        W           6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary) 
[   53.701633] Tainted: [W]=WARN
[   53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
[   53.701638] Call Trace:
[   53.701638]  <TASK>
[   53.701640]  dump_stack_lvl+0xa8/0xd0
[   53.701644]  dump_stack+0x14/0x20
[   53.701645]  __schedule_bug+0xa2/0xd0
[   53.701649]  __schedule+0xe6f/0x10d0
[   53.701652]  ? debug_smp_processor_id+0x1b/0x30
[   53.701655]  ? lock_release+0x1e6/0x2b0
[   53.701658]  ? trace_hardirqs_on+0x5f/0xd0
[   53.701661]  schedule+0x4a/0x160
[   53.701662]  schedule_timeout+0x10a/0x120
[   53.701664]  ? debug_smp_processor_id+0x1b/0x30
[   53.701666]  ? trace_hardirqs_on+0x5f/0xd0
[   53.701667]  __wait_for_common+0xb9/0x1c0
[   53.701668]  ? __pfx_schedule_timeout+0x10/0x10
[   53.701670]  wait_for_completion+0x28/0x30
[   53.701671]  __synchronize_srcu+0xbf/0x180
[   53.701677]  ? __pfx_wakeme_after_rcu+0x10/0x10
[   53.701682]  ? i2c_repstart+0x30/0x80
[   53.701685]  synchronize_srcu+0x46/0x120
[   53.701687]  kill_dax+0x47/0x70
[   53.701689]  __devm_create_dev_dax+0x112/0x470
[   53.701691]  devm_create_dev_dax+0x26/0x50
[   53.701693]  dax_hmem_probe+0x87/0xd0 [dax_hmem]
[   53.701695]  platform_probe+0x61/0xd0
[   53.701698]  really_probe+0xe2/0x390
[   53.701700]  ? __pfx___device_attach_driver+0x10/0x10
[   53.701701]  __driver_probe_device+0x7e/0x160
[   53.701703]  driver_probe_device+0x23/0xa0
[   53.701704]  __device_attach_driver+0x92/0x120
[   53.701706]  bus_for_each_drv+0x8c/0xf0
[   53.701708]  __device_attach+0xc2/0x1f0
[   53.701710]  device_initial_probe+0x17/0x20
[   53.701711]  bus_probe_device+0xa8/0xb0
[   53.701712]  device_add+0x687/0x8b0
[   53.701715]  ? dev_set_name+0x56/0x70
[   53.701717]  platform_device_add+0x102/0x260
[   53.701718]  hmem_register_device+0x160/0x230 [dax_hmem]
[   53.701720]  hmem_fallback_register_device+0x37/0x60
[   53.701722]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
[   53.701734]  walk_iomem_res_desc+0x55/0xb0
[   53.701738]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
[   53.701745]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
[   53.701751]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
[   53.701752]  ? __pfx_autoremove_wake_function+0x10/0x10
[   53.701756]  process_one_work+0x1fa/0x630
[   53.701760]  worker_thread+0x1b2/0x360
[   53.701762]  kthread+0x128/0x250
[   53.701765]  ? __pfx_worker_thread+0x10/0x10
[   53.701766]  ? __pfx_kthread+0x10/0x10
[   53.701768]  ret_from_fork+0x139/0x1e0
[   53.701771]  ? __pfx_kthread+0x10/0x10
[   53.701772]  ret_from_fork_asm+0x1a/0x30
[   53.701777]  </TASK>

Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling

Posted by Koralahalli Channabasappa, Smita 2 months, 3 weeks ago

Hi Alison,

On 7/15/2025 2:07 PM, Alison Schofield wrote:
> On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
>> This series introduces the ability to manage SOFT RESERVED iomem
>> resources, enabling the CXL driver to remove any portions that
>> intersect with created CXL regions.
> 
> Hi Smita,
> 
> This set applied cleanly to todays cxl-next but fails like appended
> before region probe.
> 
> BTW - there were sparse warnings in the build that look related:
>    CHECK   drivers/dax/hmem/hmem_notify.c
> drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
> drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit

Thanks for pointing out this bug. I failed to release the spinlock before 
calling hmem_register_device(), which internally calls 
platform_device_add() and can sleep. The following fix addresses that 
bug. I’ll incorporate this into v6:

diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
index 6c276c5bd51d..8f411f3fe7bd 100644
--- a/drivers/dax/hmem/hmem_notify.c
+++ b/drivers/dax/hmem/hmem_notify.c
@@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const struct resource *res)
  {
         walk_hmem_fn hmem_fn;

-       guard(spinlock)(&hmem_notify_lock);
+       spin_lock(&hmem_notify_lock);
         hmem_fn = hmem_fallback_fn;
+       spin_unlock(&hmem_notify_lock);

         if (hmem_fn)
                 hmem_fn(target_nid, res);
--
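
The pattern in the fix above (snapshot the callback pointer under the lock, then invoke it only after the lock is dropped) can be modelled in userspace as follows. This is a sketch, not the driver code: the spinlock is emulated with a C11 atomic_flag, the callback signature is simplified, and every name in it (notify_lock_acquire(), register_fallback(), call_fallback(), recording_handler()) is hypothetical. The notify_lock_held flag is a single-threaded debugging aid used only to show that the handler runs unlocked.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef int (*fallback_fn)(int target_nid);

static atomic_flag notify_lock = ATOMIC_FLAG_INIT;
static int notify_lock_held;		/* single-threaded debugging aid */
static fallback_fn fallback;		/* protected by notify_lock */
static int handler_saw_lock_held = -1;	/* set by recording_handler() */

static void notify_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&notify_lock,
						 memory_order_acquire))
		;	/* spin until the flag is released */
	notify_lock_held = 1;
}

static void notify_lock_release(void)
{
	notify_lock_held = 0;
	atomic_flag_clear_explicit(&notify_lock, memory_order_release);
}

static void register_fallback(fallback_fn fn)
{
	notify_lock_acquire();
	fallback = fn;
	notify_lock_release();
}

/* Snapshot the pointer under the lock, call it after dropping the lock. */
static int call_fallback(int target_nid)
{
	fallback_fn fn;

	notify_lock_acquire();
	fn = fallback;
	notify_lock_release();

	/* The handler may block, so it must not run with the lock held. */
	return fn ? fn(target_nid) : -1;
}

/* Example handler: records whether the lock was held when it ran. */
static int recording_handler(int target_nid)
{
	handler_saw_lock_held = notify_lock_held;
	return target_nid;
}
```

After register_fallback(recording_handler), a call to call_fallback(3) returns 3 and leaves handler_saw_lock_held at 0. Note that the pattern assumes the handler remains valid after the lock is released.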

As for the log:
[   53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing

I’m still analyzing that. Here is my thought process so far.

- This occurs when cxl_acpi_probe() runs significantly earlier than 
cxl_mem_probe(), so CXL region creation (which happens in 
cxl_port_endpoint_probe()) may or may not have completed by the time 
trimming is attempted.

- Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does 
guarantee load order when all components are built as modules. So even 
if the timeout occurs and cxl_mem_probe() hasn’t run within the wait 
window, MODULE_SOFTDEP ensures that cxl_port is loaded before both 
cxl_acpi and cxl_mem in modular configurations. As a result, region 
creation is eventually guaranteed, and wait_for_device_probe() will 
succeed once the relevant probes complete.

- However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no 
guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish 
before cxl_port_probe() even begins, which can cause 
wait_for_device_probe() to return prematurely and trigger the timeout.

- In my local setup, I observed that a 30-second timeout was generally 
sufficient to catch this race, allowing cxl_port_probe() to load while 
cxl_acpi_probe() is still active. Since we cannot mix built-in and 
modular components (i.e., have cxl_acpi=y and cxl_port=m), the timeout 
serves as a best-effort mechanism. After the timeout, 
wait_for_device_probe() ensures cxl_port_probe() has completed before 
trimming proceeds, making the logic good enough for most boot-time races.

One possible improvement I’m considering is to schedule delayed work 
(e.g. with schedule_delayed_work()) from cxl_acpi_probe(). This deferred 
work could wait slightly longer for cxl_mem_probe() to complete (which 
itself softdeps on cxl_port) before initiating the soft reserve trimming.

That said, I'm still evaluating better options to coordinate probe 
ordering more robustly between cxl_acpi, cxl_port, cxl_mem and 
cxl_region, and I'm looking for suggestions here.

Thanks
Smita

> 
> 
> This isn't all the logs, I trimmed. Let me know if you need more or
> other info to reproduce.
> 
> [   53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing
> [   53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
> [   53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1
> [   53.653540] preempt_count: 1, expected: 0
> [   53.653554] RCU nest depth: 0, expected: 0
> [   53.653568] 3 locks held by kworker/46:1/1875:
> [   53.653569]  #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
> [   53.653583]  #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
> [   53.653589]  #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
> [   53.653598] Preemption disabled at:
> [   53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
> [   53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
> [   53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
> [   53.653648] Call Trace:
> [   53.653649]  <TASK>
> [   53.653652]  dump_stack_lvl+0xa8/0xd0
> [   53.653658]  dump_stack+0x14/0x20
> [   53.653659]  __might_resched+0x1ae/0x2d0
> [   53.653666]  __might_sleep+0x48/0x70
> [   53.653668]  __kmalloc_node_track_caller_noprof+0x349/0x510
> [   53.653674]  ? __devm_add_action+0x3d/0x160
> [   53.653685]  ? __pfx_devm_action_release+0x10/0x10
> [   53.653688]  __devres_alloc_node+0x4a/0x90
> [   53.653689]  ? __devres_alloc_node+0x4a/0x90
> [   53.653691]  ? __pfx_release_memregion+0x10/0x10 [dax_hmem]
> [   53.653693]  __devm_add_action+0x3d/0x160
> [   53.653696]  hmem_register_device+0xea/0x230 [dax_hmem]
> [   53.653700]  hmem_fallback_register_device+0x37/0x60
> [   53.653703]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
> [   53.653739]  walk_iomem_res_desc+0x55/0xb0
> [   53.653744]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
> [   53.653755]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
> [   53.653761]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
> [   53.653763]  ? __pfx_autoremove_wake_function+0x10/0x10
> [   53.653768]  process_one_work+0x1fa/0x630
> [   53.653774]  worker_thread+0x1b2/0x360
> [   53.653777]  kthread+0x128/0x250
> [   53.653781]  ? __pfx_worker_thread+0x10/0x10
> [   53.653784]  ? __pfx_kthread+0x10/0x10
> [   53.653786]  ret_from_fork+0x139/0x1e0
> [   53.653790]  ? __pfx_kthread+0x10/0x10
> [   53.653792]  ret_from_fork_asm+0x1a/0x30
> [   53.653801]  </TASK>
> 
> [   53.654193] =============================
> [   53.654203] [ BUG: Invalid wait context ]
> [   53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G        W
> [   53.654623] -----------------------------
> [   53.654785] kworker/46:1/1875 is trying to lock:
> [   53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390
> [   53.655115] other info that might help us debug this:
> [   53.655273] context-{5:5}
> [   53.655428] 3 locks held by kworker/46:1/1875:
> [   53.655579]  #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
> [   53.655739]  #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
> [   53.655900]  #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
> [   53.656062] stack backtrace:
> [   53.656224] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G        W           6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
> [   53.656227] Tainted: [W]=WARN
> [   53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
> [   53.656232] Call Trace:
> [   53.656232]  <TASK>
> [   53.656234]  dump_stack_lvl+0x85/0xd0
> [   53.656238]  dump_stack+0x14/0x20
> [   53.656239]  __lock_acquire+0xaf4/0x2200
> [   53.656246]  lock_acquire+0xd8/0x300
> [   53.656248]  ? kernfs_add_one+0x34/0x390
> [   53.656252]  ? __might_resched+0x208/0x2d0
> [   53.656257]  down_write+0x44/0xe0
> [   53.656262]  ? kernfs_add_one+0x34/0x390
> [   53.656263]  kernfs_add_one+0x34/0x390
> [   53.656265]  kernfs_create_dir_ns+0x5a/0xa0
> [   53.656268]  sysfs_create_dir_ns+0x74/0xd0
> [   53.656270]  kobject_add_internal+0xb1/0x2f0
> [   53.656273]  kobject_add+0x7d/0xf0
> [   53.656275]  ? get_device_parent+0x28/0x1e0
> [   53.656280]  ? __pfx_klist_children_get+0x10/0x10
> [   53.656282]  device_add+0x124/0x8b0
> [   53.656285]  ? dev_set_name+0x56/0x70
> [   53.656287]  platform_device_add+0x102/0x260
> [   53.656289]  hmem_register_device+0x160/0x230 [dax_hmem]
> [   53.656291]  hmem_fallback_register_device+0x37/0x60
> [   53.656294]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
> [   53.656323]  walk_iomem_res_desc+0x55/0xb0
> [   53.656326]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
> [   53.656335]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
> [   53.656342]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
> [   53.656343]  ? __pfx_autoremove_wake_function+0x10/0x10
> [   53.656346]  process_one_work+0x1fa/0x630
> [   53.656350]  worker_thread+0x1b2/0x360
> [   53.656352]  kthread+0x128/0x250
> [   53.656354]  ? __pfx_worker_thread+0x10/0x10
> [   53.656356]  ? __pfx_kthread+0x10/0x10
> [   53.656357]  ret_from_fork+0x139/0x1e0
> [   53.656360]  ? __pfx_kthread+0x10/0x10
> [   53.656361]  ret_from_fork_asm+0x1a/0x30
> [   53.656366]  </TASK>
> [   53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
> [   53.663552]  schedule+0x4a/0x160
> [   53.663553]  schedule_timeout+0x10a/0x120
> [   53.663555]  ? debug_smp_processor_id+0x1b/0x30
> [   53.663556]  ? trace_hardirqs_on+0x5f/0xd0
> [   53.663558]  __wait_for_common+0xb9/0x1c0
> [   53.663559]  ? __pfx_schedule_timeout+0x10/0x10
> [   53.663561]  wait_for_completion+0x28/0x30
> [   53.663562]  __synchronize_srcu+0xbf/0x180
> [   53.663566]  ? __pfx_wakeme_after_rcu+0x10/0x10
> [   53.663571]  ? i2c_repstart+0x30/0x80
> [   53.663576]  synchronize_srcu+0x46/0x120
> [   53.663577]  kill_dax+0x47/0x70
> [   53.663580]  __devm_create_dev_dax+0x112/0x470
> [   53.663582]  devm_create_dev_dax+0x26/0x50
> [   53.663584]  dax_hmem_probe+0x87/0xd0 [dax_hmem]
> [   53.663585]  platform_probe+0x61/0xd0
> [   53.663589]  really_probe+0xe2/0x390
> [   53.663591]  ? __pfx___device_attach_driver+0x10/0x10
> [   53.663593]  __driver_probe_device+0x7e/0x160
> [   53.663594]  driver_probe_device+0x23/0xa0
> [   53.663596]  __device_attach_driver+0x92/0x120
> [   53.663597]  bus_for_each_drv+0x8c/0xf0
> [   53.663599]  __device_attach+0xc2/0x1f0
> [   53.663601]  device_initial_probe+0x17/0x20
> [   53.663603]  bus_probe_device+0xa8/0xb0
> [   53.663604]  device_add+0x687/0x8b0
> [   53.663607]  ? dev_set_name+0x56/0x70
> [   53.663609]  platform_device_add+0x102/0x260
> [   53.663610]  hmem_register_device+0x160/0x230 [dax_hmem]
> [   53.663612]  hmem_fallback_register_device+0x37/0x60
> [   53.663614]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
> [   53.663637]  walk_iomem_res_desc+0x55/0xb0
> [   53.663640]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
> [   53.663647]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
> [   53.663654]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
> [   53.663655]  ? __pfx_autoremove_wake_function+0x10/0x10
> [   53.663658]  process_one_work+0x1fa/0x630
> [   53.663662]  worker_thread+0x1b2/0x360
> [   53.663664]  kthread+0x128/0x250
> [   53.663666]  ? __pfx_worker_thread+0x10/0x10
> [   53.663668]  ? __pfx_kthread+0x10/0x10
> [   53.663670]  ret_from_fork+0x139/0x1e0
> [   53.663672]  ? __pfx_kthread+0x10/0x10
> [   53.663673]  ret_from_fork_asm+0x1a/0x30
> [   53.663677]  </TASK>
> [   53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
> [   53.700264] INFO: lockdep is turned off.
> [   53.701315] Preemption disabled at:
> [   53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
> [   53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G        W           6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
> [   53.701633] Tainted: [W]=WARN
> [   53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
> [   53.701638] Call Trace:
> [   53.701638]  <TASK>
> [   53.701640]  dump_stack_lvl+0xa8/0xd0
> [   53.701644]  dump_stack+0x14/0x20
> [   53.701645]  __schedule_bug+0xa2/0xd0
> [   53.701649]  __schedule+0xe6f/0x10d0
> [   53.701652]  ? debug_smp_processor_id+0x1b/0x30
> [   53.701655]  ? lock_release+0x1e6/0x2b0
> [   53.701658]  ? trace_hardirqs_on+0x5f/0xd0
> [   53.701661]  schedule+0x4a/0x160
> [   53.701662]  schedule_timeout+0x10a/0x120
> [   53.701664]  ? debug_smp_processor_id+0x1b/0x30
> [   53.701666]  ? trace_hardirqs_on+0x5f/0xd0
> [   53.701667]  __wait_for_common+0xb9/0x1c0
> [   53.701668]  ? __pfx_schedule_timeout+0x10/0x10
> [   53.701670]  wait_for_completion+0x28/0x30
> [   53.701671]  __synchronize_srcu+0xbf/0x180
> [   53.701677]  ? __pfx_wakeme_after_rcu+0x10/0x10
> [   53.701682]  ? i2c_repstart+0x30/0x80
> [   53.701685]  synchronize_srcu+0x46/0x120
> [   53.701687]  kill_dax+0x47/0x70
> [   53.701689]  __devm_create_dev_dax+0x112/0x470
> [   53.701691]  devm_create_dev_dax+0x26/0x50
> [   53.701693]  dax_hmem_probe+0x87/0xd0 [dax_hmem]
> [   53.701695]  platform_probe+0x61/0xd0
> [   53.701698]  really_probe+0xe2/0x390
> [   53.701700]  ? __pfx___device_attach_driver+0x10/0x10
> [   53.701701]  __driver_probe_device+0x7e/0x160
> [   53.701703]  driver_probe_device+0x23/0xa0
> [   53.701704]  __device_attach_driver+0x92/0x120
> [   53.701706]  bus_for_each_drv+0x8c/0xf0
> [   53.701708]  __device_attach+0xc2/0x1f0
> [   53.701710]  device_initial_probe+0x17/0x20
> [   53.701711]  bus_probe_device+0xa8/0xb0
> [   53.701712]  device_add+0x687/0x8b0
> [   53.701715]  ? dev_set_name+0x56/0x70
> [   53.701717]  platform_device_add+0x102/0x260
> [   53.701718]  hmem_register_device+0x160/0x230 [dax_hmem]
> [   53.701720]  hmem_fallback_register_device+0x37/0x60
> [   53.701722]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
> [   53.701734]  walk_iomem_res_desc+0x55/0xb0
> [   53.701738]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
> [   53.701745]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
> [   53.701751]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
> [   53.701752]  ? __pfx_autoremove_wake_function+0x10/0x10
> [   53.701756]  process_one_work+0x1fa/0x630
> [   53.701760]  worker_thread+0x1b2/0x360
> [   53.701762]  kthread+0x128/0x250
> [   53.701765]  ? __pfx_worker_thread+0x10/0x10
> [   53.701766]  ? __pfx_kthread+0x10/0x10
> [   53.701768]  ret_from_fork+0x139/0x1e0
> [   53.701771]  ? __pfx_kthread+0x10/0x10
> [   53.701772]  ret_from_fork_asm+0x1a/0x30
> [   53.701777]  </TASK>
>

Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling

Posted by dan.j.williams@intel.com 2 months, 2 weeks ago

Koralahalli Channabasappa, Smita wrote:
[..]
> That said, I'm still evaluating better options to more robustly 
> coordinate probe ordering between cxl_acpi, cxl_port, cxl_mem and 
> cxl_region and looking for suggestions here.

I never quite understood the arguments around why
wait_for_device_probe() does not work, but I did find a bug in my prior
thinking on the way towards this RFC [1]. What I had missed is that
MODULE_SOFTDEP() only guarantees that the module gets loaded eventually;
it does not guarantee that the softdep has completed init before the
caller performs its own init.

It works sometimes, and that is probably what misled me about that
contract. In contrast, request_module() is synchronous. With that in
place I now see that wait_for_device_probe() does the right thing. It
flushes cxl_pci attach for devices present at boot, and all follow-on
probe work gets flushed as well.

With that in hand the RFC now has a stable quiesce point to walk the CXL
topology and make decisions. The RFC is effectively a fix for platforms
where CXL loses the MODULE_SOFTDEP() race.

[1]: http://lore.kernel.org/68808fb4e4cbf_137e6b100cc@dwillia2-xfh.jf.intel.com.notmuch

Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling

Posted by Alison Schofield 2 months, 3 weeks ago

On Tue, Jul 15, 2025 at 11:01:23PM -0700, Koralahalli Channabasappa, Smita wrote:
> Hi Alison,
> 
> On 7/15/2025 2:07 PM, Alison Schofield wrote:
> > On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
> > > This series introduces the ability to manage SOFT RESERVED iomem
> > > resources, enabling the CXL driver to remove any portions that
> > > intersect with created CXL regions.
> > 
> > Hi Smita,
> > 
> > This set applied cleanly to todays cxl-next but fails like appended
> > before region probe.
> > 
> > BTW - there were sparse warnings in the build that look related:
> >    CHECK   drivers/dax/hmem/hmem_notify.c
> > drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
> > drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit
> 
> Thanks for pointing this bug. I failed to release the spinlock before
> calling hmem_register_device(), which internally calls platform_device_add()
> and can sleep. The following fix addresses that bug. I’ll incorporate this
> into v6:
> 
> diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
> index 6c276c5bd51d..8f411f3fe7bd 100644
> --- a/drivers/dax/hmem/hmem_notify.c
> +++ b/drivers/dax/hmem/hmem_notify.c
> @@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const
> struct resource *res)
>  {
>         walk_hmem_fn hmem_fn;
> 
> -       guard(spinlock)(&hmem_notify_lock);
> +       spin_lock(&hmem_notify_lock);
>         hmem_fn = hmem_fallback_fn;
> +       spin_unlock(&hmem_notify_lock);
> 
>         if (hmem_fn)
>                 hmem_fn(target_nid, res);
> --

Hi Smita,

Adding the above got me past that, and doubling the timeout below stopped
the timeout from triggering. After that, I haven't had time to trace, so
I'll just dump what I see on you for now:

In /proc/iomem:
Here we see a region resource, but no CXL Window and no dax, and no
actual region, not even a disabled one, is available.
c080000000-c47fffffff : region0

And here there is no CXL Window and no region, only a Soft Reserved.
68e80000000-70e7fffffff : Soft Reserved
  68e80000000-70e7fffffff : dax1.0
    68e80000000-70e7fffffff : System RAM (kmem)

I haven't yet walked through the v4 to v5 changes, so I'll do that next.

> 
> As for the log:
> [   53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for
> cxl_mem probing
> 
> I’m still analyzing that. Here's what was my thought process so far.
> 
> - This occurs when cxl_acpi_probe() runs significantly earlier than
> cxl_mem_probe(), so CXL region creation (which happens in
> cxl_port_endpoint_probe()) may or may not have completed by the time
> trimming is attempted.
> 
> - Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does
> guarantee load order when all components are built as modules. So even if
> the timeout occurs and cxl_mem_probe() hasn’t run within the wait window,
> MODULE_SOFTDEP ensures that cxl_port is loaded before both cxl_acpi and
> cxl_mem in modular configurations. As a result, region creation is
> eventually guaranteed, and wait_for_device_probe() will succeed once the
> relevant probes complete.
> 
> - However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no
> guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish
> before cxl_port_probe() even begins, which can cause wait_for_device_probe()
> to return prematurely and trigger the timeout.
> 
> - In my local setup, I observed that a 30-second timeout was generally
> sufficient to catch this race, allowing cxl_port_probe() to load while
> cxl_acpi_probe() is still active. Since we cannot mix built-in and modular
> components (i.e., have cxl_acpi=y and cxl_port=m), the timeout serves as a
> best-effort mechanism. After the timeout, wait_for_device_probe() ensures
> cxl_port_probe() has completed before trimming proceeds, making the logic
> good enough to most boot-time races.
> 
> One possible improvement I’m considering is to schedule a
> delayed_workqueue() from cxl_acpi_probe(). This deferred work could wait
> slightly longer for cxl_mem_probe() to complete (which itself softdeps on
> cxl_port) before initiating the soft reserve trimming.
> 
> That said, I'm still evaluating better options to more robustly coordinate
> probe ordering between cxl_acpi, cxl_port, cxl_mem and cxl_region and
> looking for suggestions here.
> 
> Thanks
> Smita
> 
> [..]

Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling

Posted by Koralahalli Channabasappa, Smita 2 months, 3 weeks ago

On 7/16/2025 1:20 PM, Alison Schofield wrote:
> On Tue, Jul 15, 2025 at 11:01:23PM -0700, Koralahalli Channabasappa, Smita wrote:
>> Hi Alison,
>>
>> On 7/15/2025 2:07 PM, Alison Schofield wrote:
>>> On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
>>>> This series introduces the ability to manage SOFT RESERVED iomem
>>>> resources, enabling the CXL driver to remove any portions that
>>>> intersect with created CXL regions.
>>>
>>> Hi Smita,
>>>
>>> This set applied cleanly to todays cxl-next but fails like appended
>>> before region probe.
>>>
>>> BTW - there were sparse warnings in the build that look related:
>>>     CHECK   drivers/dax/hmem/hmem_notify.c
>>> drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
>>> drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit
>>
>> Thanks for pointing this bug. I failed to release the spinlock before
>> calling hmem_register_device(), which internally calls platform_device_add()
>> and can sleep. The following fix addresses that bug. I’ll incorporate this
>> into v6:
>>
>> diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
>> index 6c276c5bd51d..8f411f3fe7bd 100644
>> --- a/drivers/dax/hmem/hmem_notify.c
>> +++ b/drivers/dax/hmem/hmem_notify.c
>> @@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const
>> struct resource *res)
>>   {
>>          walk_hmem_fn hmem_fn;
>>
>> -       guard(spinlock)(&hmem_notify_lock);
>> +       spin_lock(&hmem_notify_lock);
>>          hmem_fn = hmem_fallback_fn;
>> +       spin_unlock(&hmem_notify_lock);
>>
>>          if (hmem_fn)
>>                  hmem_fn(target_nid, res);
>> --
> 
> Hi Smita,  Adding the above got me past that, and doubling the timeout
> below stopped that from happening. After that, I haven't had time to
> trace so, I'll just dump on you for now:
> 
> In /proc/iomem
> Here, we see a regions resource, no CXL Window, and no dax, and no
> actual region, not even disabled, is available.
> c080000000-c47fffffff : region0
> 
> And, here no CXL Window, no region, and a soft reserved.
> 68e80000000-70e7fffffff : Soft Reserved
>    68e80000000-70e7fffffff : dax1.0
>      68e80000000-70e7fffffff : System RAM (kmem)
> 
> I haven't yet walked through the v4 to v5 changes so I'll do that next.

Hi Alison,

To help me better understand the current behavior, could you share more 
about your platform configuration? Specifically, are there two memory 
cards involved: one at c080000000 (which appears as region0) and another 
at 68e80000000 (which is falling back to kmem via dax1.0)? Additionally, 
how are the Soft Reserved ranges laid out on your system for these 
cards? I'm trying to understand the "before" state of the resources, 
i.e., prior to the trimming applied by my patches.

Also, do you think it's feasible to change the direction of the soft 
reserve trimming, that is, defer it until after CXL region or memdev 
creation is complete? In that case the trimming would happen afterwards, 
but inline with the existing region or memdev creation. This might 
simplify the flow by removing the need for wait_event_timeout(), 
wait_for_device_probe() and the workqueue logic inside cxl_acpi_probe().
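
Whichever point in the flow the trimming runs at, the trimming step itself
is plain interval arithmetic. A minimal userspace sketch, not the kernel
code: struct range and trim_range() are hypothetical names, and ranges are
inclusive in the same way as the /proc/iomem output.

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive address range, e.g. 0x850000000-0x684fffffff. */
struct range {
	uint64_t start;
	uint64_t end;
};

/*
 * Remove the part of the Soft Reserved range 'sr' that intersects 'region'.
 * Writes the leftover pieces (0, 1 or 2 of them) to out[] and returns
 * how many were written.
 */
int trim_range(struct range sr, struct range region, struct range out[2])
{
	int n = 0;

	assert(sr.start <= sr.end && region.start <= region.end);

	/* No intersection: the Soft Reserved range is left untouched. */
	if (region.end < sr.start || region.start > sr.end) {
		out[n++] = sr;
		return n;
	}

	/* Piece of 'sr' below the region, if any. */
	if (region.start > sr.start)
		out[n++] = (struct range){ sr.start, region.start - 1 };

	/* Piece of 'sr' above the region, if any. */
	if (region.end < sr.end)
		out[n++] = (struct range){ region.end + 1, sr.end };

	return n;
}
```

With the ranges from the cover letter, an exactly aligned region leaves
nothing behind, while a start-aligned region3 (850000000-284fffffff) inside
850000000-684fffffff leaves the single piece 2850000000-684fffffff.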

(As a side note, I experimented with changing cxl_acpi_init() to a
late_initcall() and observed that it consistently avoided probe ordering
issues in my setup.

Additional note: I realized that even when cxl_acpi_probe() fails, the 
fallback DAX registration path (via cxl_softreserv_mem_update()) still 
waits on cxl_mem_active() and wait_for_device_probe(). I plan to address 
this in v6 by immediately triggering fallback DAX registration 
(hmem_register_device()) when the ACPI probe fails, instead of waiting.)

Thanks
Smita

> 
>>
>> As for the log:
>> [   53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing
>>
>> I’m still analyzing that. Here's what was my thought process so far.
>>
>> - This occurs when cxl_acpi_probe() runs significantly earlier than
>> cxl_mem_probe(), so CXL region creation (which happens in
>> cxl_port_endpoint_probe()) may or may not have completed by the time
>> trimming is attempted.
>>
>> - Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does
>> guarantee load order when all components are built as modules. So even if
>> the timeout occurs and cxl_mem_probe() hasn’t run within the wait window,
>> MODULE_SOFTDEP ensures that cxl_port is loaded before both cxl_acpi and
>> cxl_mem in modular configurations. As a result, region creation is
>> eventually guaranteed, and wait_for_device_probe() will succeed once the
>> relevant probes complete.
>>
>> - However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no
>> guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish
>> before cxl_port_probe() even begins, which can cause wait_for_device_probe()
>> to return prematurely and trigger the timeout.
>>
>> - In my local setup, I observed that a 30-second timeout was generally
>> sufficient to catch this race, allowing cxl_port_probe() to load while
>> cxl_acpi_probe() is still active. Since we cannot mix built-in and modular
>> components (i.e., have cxl_acpi=y and cxl_port=m), the timeout serves as a
>> best-effort mechanism. After the timeout, wait_for_device_probe() ensures
>> cxl_port_probe() has completed before trimming proceeds, making the logic
>> good enough for most boot-time races.
>>
>> One possible improvement I’m considering is to schedule a
>> delayed_workqueue() from cxl_acpi_probe(). This deferred work could wait
>> slightly longer for cxl_mem_probe() to complete (which itself softdeps on
>> cxl_port) before initiating the soft reserve trimming.
>>
>> That said, I'm still evaluating better options to more robustly coordinate
>> probe ordering between cxl_acpi, cxl_port, cxl_mem and cxl_region and
>> looking for suggestions here.
>>
>> Thanks
>> Smita
>>
>>>
>>>
>>> This isn't all the logs, I trimmed. Let me know if you need more or
>>> other info to reproduce.
>>>
>>> [   53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing
>>> [   53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
>>> [   53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1
>>> [   53.653540] preempt_count: 1, expected: 0
>>> [   53.653554] RCU nest depth: 0, expected: 0
>>> [   53.653568] 3 locks held by kworker/46:1/1875:
>>> [   53.653569]  #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
>>> [   53.653583]  #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
>>> [   53.653589]  #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
>>> [   53.653598] Preemption disabled at:
>>> [   53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
>>> [   53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>> [   53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>> [   53.653648] Call Trace:
>>> [   53.653649]  <TASK>
>>> [   53.653652]  dump_stack_lvl+0xa8/0xd0
>>> [   53.653658]  dump_stack+0x14/0x20
>>> [   53.653659]  __might_resched+0x1ae/0x2d0
>>> [   53.653666]  __might_sleep+0x48/0x70
>>> [   53.653668]  __kmalloc_node_track_caller_noprof+0x349/0x510
>>> [   53.653674]  ? __devm_add_action+0x3d/0x160
>>> [   53.653685]  ? __pfx_devm_action_release+0x10/0x10
>>> [   53.653688]  __devres_alloc_node+0x4a/0x90
>>> [   53.653689]  ? __devres_alloc_node+0x4a/0x90
>>> [   53.653691]  ? __pfx_release_memregion+0x10/0x10 [dax_hmem]
>>> [   53.653693]  __devm_add_action+0x3d/0x160
>>> [   53.653696]  hmem_register_device+0xea/0x230 [dax_hmem]
>>> [   53.653700]  hmem_fallback_register_device+0x37/0x60
>>> [   53.653703]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>> [   53.653739]  walk_iomem_res_desc+0x55/0xb0
>>> [   53.653744]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>> [   53.653755]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>> [   53.653761]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>> [   53.653763]  ? __pfx_autoremove_wake_function+0x10/0x10
>>> [   53.653768]  process_one_work+0x1fa/0x630
>>> [   53.653774]  worker_thread+0x1b2/0x360
>>> [   53.653777]  kthread+0x128/0x250
>>> [   53.653781]  ? __pfx_worker_thread+0x10/0x10
>>> [   53.653784]  ? __pfx_kthread+0x10/0x10
>>> [   53.653786]  ret_from_fork+0x139/0x1e0
>>> [   53.653790]  ? __pfx_kthread+0x10/0x10
>>> [   53.653792]  ret_from_fork_asm+0x1a/0x30
>>> [   53.653801]  </TASK>
>>>
>>> [   53.654193] =============================
>>> [   53.654203] [ BUG: Invalid wait context ]
>>> [   53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G        W
>>> [   53.654623] -----------------------------
>>> [   53.654785] kworker/46:1/1875 is trying to lock:
>>> [   53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390
>>> [   53.655115] other info that might help us debug this:
>>> [   53.655273] context-{5:5}
>>> [   53.655428] 3 locks held by kworker/46:1/1875:
>>> [   53.655579]  #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
>>> [   53.655739]  #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
>>> [   53.655900]  #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
>>> [   53.656062] stack backtrace:
>>> [   53.656224] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G        W           6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>> [   53.656227] Tainted: [W]=WARN
>>> [   53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>> [   53.656232] Call Trace:
>>> [   53.656232]  <TASK>
>>> [   53.656234]  dump_stack_lvl+0x85/0xd0
>>> [   53.656238]  dump_stack+0x14/0x20
>>> [   53.656239]  __lock_acquire+0xaf4/0x2200
>>> [   53.656246]  lock_acquire+0xd8/0x300
>>> [   53.656248]  ? kernfs_add_one+0x34/0x390
>>> [   53.656252]  ? __might_resched+0x208/0x2d0
>>> [   53.656257]  down_write+0x44/0xe0
>>> [   53.656262]  ? kernfs_add_one+0x34/0x390
>>> [   53.656263]  kernfs_add_one+0x34/0x390
>>> [   53.656265]  kernfs_create_dir_ns+0x5a/0xa0
>>> [   53.656268]  sysfs_create_dir_ns+0x74/0xd0
>>> [   53.656270]  kobject_add_internal+0xb1/0x2f0
>>> [   53.656273]  kobject_add+0x7d/0xf0
>>> [   53.656275]  ? get_device_parent+0x28/0x1e0
>>> [   53.656280]  ? __pfx_klist_children_get+0x10/0x10
>>> [   53.656282]  device_add+0x124/0x8b0
>>> [   53.656285]  ? dev_set_name+0x56/0x70
>>> [   53.656287]  platform_device_add+0x102/0x260
>>> [   53.656289]  hmem_register_device+0x160/0x230 [dax_hmem]
>>> [   53.656291]  hmem_fallback_register_device+0x37/0x60
>>> [   53.656294]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>> [   53.656323]  walk_iomem_res_desc+0x55/0xb0
>>> [   53.656326]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>> [   53.656335]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>> [   53.656342]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>> [   53.656343]  ? __pfx_autoremove_wake_function+0x10/0x10
>>> [   53.656346]  process_one_work+0x1fa/0x630
>>> [   53.656350]  worker_thread+0x1b2/0x360
>>> [   53.656352]  kthread+0x128/0x250
>>> [   53.656354]  ? __pfx_worker_thread+0x10/0x10
>>> [   53.656356]  ? __pfx_kthread+0x10/0x10
>>> [   53.656357]  ret_from_fork+0x139/0x1e0
>>> [   53.656360]  ? __pfx_kthread+0x10/0x10
>>> [   53.656361]  ret_from_fork_asm+0x1a/0x30
>>> [   53.656366]  </TASK>
>>> [   53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
>>> [   53.663552]  schedule+0x4a/0x160
>>> [   53.663553]  schedule_timeout+0x10a/0x120
>>> [   53.663555]  ? debug_smp_processor_id+0x1b/0x30
>>> [   53.663556]  ? trace_hardirqs_on+0x5f/0xd0
>>> [   53.663558]  __wait_for_common+0xb9/0x1c0
>>> [   53.663559]  ? __pfx_schedule_timeout+0x10/0x10
>>> [   53.663561]  wait_for_completion+0x28/0x30
>>> [   53.663562]  __synchronize_srcu+0xbf/0x180
>>> [   53.663566]  ? __pfx_wakeme_after_rcu+0x10/0x10
>>> [   53.663571]  ? i2c_repstart+0x30/0x80
>>> [   53.663576]  synchronize_srcu+0x46/0x120
>>> [   53.663577]  kill_dax+0x47/0x70
>>> [   53.663580]  __devm_create_dev_dax+0x112/0x470
>>> [   53.663582]  devm_create_dev_dax+0x26/0x50
>>> [   53.663584]  dax_hmem_probe+0x87/0xd0 [dax_hmem]
>>> [   53.663585]  platform_probe+0x61/0xd0
>>> [   53.663589]  really_probe+0xe2/0x390
>>> [   53.663591]  ? __pfx___device_attach_driver+0x10/0x10
>>> [   53.663593]  __driver_probe_device+0x7e/0x160
>>> [   53.663594]  driver_probe_device+0x23/0xa0
>>> [   53.663596]  __device_attach_driver+0x92/0x120
>>> [   53.663597]  bus_for_each_drv+0x8c/0xf0
>>> [   53.663599]  __device_attach+0xc2/0x1f0
>>> [   53.663601]  device_initial_probe+0x17/0x20
>>> [   53.663603]  bus_probe_device+0xa8/0xb0
>>> [   53.663604]  device_add+0x687/0x8b0
>>> [   53.663607]  ? dev_set_name+0x56/0x70
>>> [   53.663609]  platform_device_add+0x102/0x260
>>> [   53.663610]  hmem_register_device+0x160/0x230 [dax_hmem]
>>> [   53.663612]  hmem_fallback_register_device+0x37/0x60
>>> [   53.663614]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>> [   53.663637]  walk_iomem_res_desc+0x55/0xb0
>>> [   53.663640]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>> [   53.663647]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>> [   53.663654]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>> [   53.663655]  ? __pfx_autoremove_wake_function+0x10/0x10
>>> [   53.663658]  process_one_work+0x1fa/0x630
>>> [   53.663662]  worker_thread+0x1b2/0x360
>>> [   53.663664]  kthread+0x128/0x250
>>> [   53.663666]  ? __pfx_worker_thread+0x10/0x10
>>> [   53.663668]  ? __pfx_kthread+0x10/0x10
>>> [   53.663670]  ret_from_fork+0x139/0x1e0
>>> [   53.663672]  ? __pfx_kthread+0x10/0x10
>>> [   53.663673]  ret_from_fork_asm+0x1a/0x30
>>> [   53.663677]  </TASK>
>>> [   53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
>>> [   53.700264] INFO: lockdep is turned off.
>>> [   53.701315] Preemption disabled at:
>>> [   53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
>>> [   53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G        W           6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>> [   53.701633] Tainted: [W]=WARN
>>> [   53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>> [   53.701638] Call Trace:
>>> [   53.701638]  <TASK>
>>> [   53.701640]  dump_stack_lvl+0xa8/0xd0
>>> [   53.701644]  dump_stack+0x14/0x20
>>> [   53.701645]  __schedule_bug+0xa2/0xd0
>>> [   53.701649]  __schedule+0xe6f/0x10d0
>>> [   53.701652]  ? debug_smp_processor_id+0x1b/0x30
>>> [   53.701655]  ? lock_release+0x1e6/0x2b0
>>> [   53.701658]  ? trace_hardirqs_on+0x5f/0xd0
>>> [   53.701661]  schedule+0x4a/0x160
>>> [   53.701662]  schedule_timeout+0x10a/0x120
>>> [   53.701664]  ? debug_smp_processor_id+0x1b/0x30
>>> [   53.701666]  ? trace_hardirqs_on+0x5f/0xd0
>>> [   53.701667]  __wait_for_common+0xb9/0x1c0
>>> [   53.701668]  ? __pfx_schedule_timeout+0x10/0x10
>>> [   53.701670]  wait_for_completion+0x28/0x30
>>> [   53.701671]  __synchronize_srcu+0xbf/0x180
>>> [   53.701677]  ? __pfx_wakeme_after_rcu+0x10/0x10
>>> [   53.701682]  ? i2c_repstart+0x30/0x80
>>> [   53.701685]  synchronize_srcu+0x46/0x120
>>> [   53.701687]  kill_dax+0x47/0x70
>>> [   53.701689]  __devm_create_dev_dax+0x112/0x470
>>> [   53.701691]  devm_create_dev_dax+0x26/0x50
>>> [   53.701693]  dax_hmem_probe+0x87/0xd0 [dax_hmem]
>>> [   53.701695]  platform_probe+0x61/0xd0
>>> [   53.701698]  really_probe+0xe2/0x390
>>> [   53.701700]  ? __pfx___device_attach_driver+0x10/0x10
>>> [   53.701701]  __driver_probe_device+0x7e/0x160
>>> [   53.701703]  driver_probe_device+0x23/0xa0
>>> [   53.701704]  __device_attach_driver+0x92/0x120
>>> [   53.701706]  bus_for_each_drv+0x8c/0xf0
>>> [   53.701708]  __device_attach+0xc2/0x1f0
>>> [   53.701710]  device_initial_probe+0x17/0x20
>>> [   53.701711]  bus_probe_device+0xa8/0xb0
>>> [   53.701712]  device_add+0x687/0x8b0
>>> [   53.701715]  ? dev_set_name+0x56/0x70
>>> [   53.701717]  platform_device_add+0x102/0x260
>>> [   53.701718]  hmem_register_device+0x160/0x230 [dax_hmem]
>>> [   53.701720]  hmem_fallback_register_device+0x37/0x60
>>> [   53.701722]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>> [   53.701734]  walk_iomem_res_desc+0x55/0xb0
>>> [   53.701738]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>> [   53.701745]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>> [   53.701751]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>> [   53.701752]  ? __pfx_autoremove_wake_function+0x10/0x10
>>> [   53.701756]  process_one_work+0x1fa/0x630
>>> [   53.701760]  worker_thread+0x1b2/0x360
>>> [   53.701762]  kthread+0x128/0x250
>>> [   53.701765]  ? __pfx_worker_thread+0x10/0x10
>>> [   53.701766]  ? __pfx_kthread+0x10/0x10
>>> [   53.701768]  ret_from_fork+0x139/0x1e0
>>> [   53.701771]  ? __pfx_kthread+0x10/0x10
>>> [   53.701772]  ret_from_fork_asm+0x1a/0x30
>>> [   53.701777]  </TASK>
>>>
>>
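
The splats above all trace back to hmem_fallback_register_device() calling hmem_register_device() while hmem_notify_lock is still held. The hmem_notify.c change quoted in this thread avoids that by copying the handler pointer under the lock and invoking it only after the lock is dropped. A minimal userspace sketch of that pattern follows, assuming a toy C11 atomic_flag lock in place of the kernel spinlock. Every name in it (toy_lock, toy_fallback_register_device, and the toy_atomic_depth counter standing in for preempt_count) is hypothetical and not part of the series.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the kernel API. */
struct toy_resource { unsigned long long start, end; };
typedef int (*toy_walk_fn)(int target_nid, const struct toy_resource *res);

static atomic_flag toy_notify_lock = ATOMIC_FLAG_INIT;
/* Models preempt_count for the current thread: >0 means "must not sleep". */
static _Thread_local int toy_atomic_depth;
static toy_walk_fn toy_fallback_fn; /* protected by toy_notify_lock */

static void toy_lock(void)
{
	while (atomic_flag_test_and_set_explicit(&toy_notify_lock,
						 memory_order_acquire))
		; /* spin until the flag is released */
	toy_atomic_depth++;
}

static void toy_unlock(void)
{
	toy_atomic_depth--;
	atomic_flag_clear_explicit(&toy_notify_lock, memory_order_release);
}

static void toy_register_fallback_handler(toy_walk_fn fn)
{
	toy_lock();
	toy_fallback_fn = fn;
	toy_unlock();
}

/* Snapshot the handler under the lock, then call it with the lock dropped. */
static int toy_fallback_register_device(int target_nid,
					const struct toy_resource *res)
{
	toy_walk_fn fn;

	toy_lock();
	fn = toy_fallback_fn;
	toy_unlock();

	if (!fn)
		return -1; /* no handler registered */
	return fn(target_nid, res);
}

/* A handler that may "sleep": it reports -2 if entered in atomic context. */
static int toy_sleeping_handler(int target_nid, const struct toy_resource *res)
{
	(void)target_nid;
	assert(res != NULL);
	return toy_atomic_depth == 0 ? 0 : -2;
}
```

Calling toy_fallback_register_device() before any handler is registered returns -1. After toy_register_fallback_handler(toy_sleeping_handler) it returns 0, because the handler runs with the depth counter back at zero. The sketch does not cover handler lifetime: once the lock is dropped, something else would have to keep the handler valid while it runs.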

Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling

Posted by Alison Schofield 2 months, 3 weeks ago

On Wed, Jul 16, 2025 at 02:29:52PM -0700, Koralahalli Channabasappa, Smita wrote:
> On 7/16/2025 1:20 PM, Alison Schofield wrote:
> > On Tue, Jul 15, 2025 at 11:01:23PM -0700, Koralahalli Channabasappa, Smita wrote:
> > > Hi Alison,
> > > 
> > > On 7/15/2025 2:07 PM, Alison Schofield wrote:
> > > > On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
> > > > > This series introduces the ability to manage SOFT RESERVED iomem
> > > > > resources, enabling the CXL driver to remove any portions that
> > > > > intersect with created CXL regions.
> > > > 
> > > > Hi Smita,
> > > > 
> > > > > This set applied cleanly to today's cxl-next but fails like appended
> > > > before region probe.
> > > > 
> > > > BTW - there were sparse warnings in the build that look related:
> > > >     CHECK   drivers/dax/hmem/hmem_notify.c
> > > > drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
> > > > drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit
> > > 
> > > > Thanks for pointing out this bug. I failed to release the spinlock before
> > > calling hmem_register_device(), which internally calls platform_device_add()
> > > and can sleep. The following fix addresses that bug. I’ll incorporate this
> > > into v6:
> > > 
> > > diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
> > > index 6c276c5bd51d..8f411f3fe7bd 100644
> > > --- a/drivers/dax/hmem/hmem_notify.c
> > > +++ b/drivers/dax/hmem/hmem_notify.c
> > > @@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const
> > > struct resource *res)
> > >   {
> > >          walk_hmem_fn hmem_fn;
> > > 
> > > -       guard(spinlock)(&hmem_notify_lock);
> > > +       spin_lock(&hmem_notify_lock);
> > >          hmem_fn = hmem_fallback_fn;
> > > +       spin_unlock(&hmem_notify_lock);
> > > 
> > >          if (hmem_fn)
> > >                  hmem_fn(target_nid, res);
> > > --
> > 
> > Hi Smita,  Adding the above got me past that, and doubling the timeout
> > below stopped that from happening. After that, I haven't had time to
> > trace so, I'll just dump on you for now:
> > 
> > In /proc/iomem
> > Here, we see a regions resource, no CXL Window, and no dax, and no
> > actual region, not even disabled, is available.
> > c080000000-c47fffffff : region0
> > 
> > And, here no CXL Window, no region, and a soft reserved.
> > 68e80000000-70e7fffffff : Soft Reserved
> >    68e80000000-70e7fffffff : dax1.0
> >      68e80000000-70e7fffffff : System RAM (kmem)
> > 
> > I haven't yet walked through the v4 to v5 changes so I'll do that next.
> 
> Hi Alison,
> 
> To help better understand the current behavior, could you share more about
> your platform configuration? Specifically, are there two memory cards
> involved: one at c080000000 (which appears as region0) and another at
> 68e80000000 (which is falling back to kmem via dax1.0)? Additionally, how
> are the Soft Reserved ranges laid out on your system for these cards? I'm
> trying to understand the "before" state of the resources, i.e., prior to
> the trimming applied by my patches.

Here are the soft reserveds -
[] BIOS-e820: [mem 0x000000c080000000-0x000000c47fffffff] soft reserved
[] BIOS-e820: [mem 0x0000068e80000000-0x0000070e7fffffff] soft reserved

And this is what we expect -

c080000000-17dbfffffff : CXL Window 0
  c080000000-c47fffffff : region2
    c080000000-c47fffffff : dax0.0
      c080000000-c47fffffff : System RAM (kmem)


68e80000000-8d37fffffff : CXL Window 1
  68e80000000-70e7fffffff : region5
    68e80000000-70e7fffffff : dax1.0
      68e80000000-70e7fffffff : System RAM (kmem)

And, like in the previous message, in v5 we get -

c080000000-c47fffffff : region0

68e80000000-70e7fffffff : Soft Reserved
  68e80000000-70e7fffffff : dax1.0
    68e80000000-70e7fffffff : System RAM (kmem)


In v4, we 'almost' had what we expect, except that the HMEM driver
created those dax devices out of Soft Reserveds before the region driver
could do the same.

> 
> Also, do you think it's feasible to change the direction of the soft reserve
> trimming, that is, defer it until after CXL region or memdev creation is
> complete? In that case the trimming would happen afterward, but inline with
> the existing region or memdev creation. This might simplify the flow by
> removing the need for wait_event_timeout(), wait_for_device_probe() and the
> workqueue logic inside cxl_acpi_probe().

Yes, that aligns with my simple thinking. There's the trimming after a region
is successfully created, and it seems that could simply be called at the end
of *that* region creation.

Then, there's the roundup of all the unused Soft Reserveds, and that has
to wait until after all regions are created, i.e., all endpoints have arrived
and we've given up all hope of creating another region in that space.
That's the timing challenge.
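
The per-region trim in the first step comes down to range arithmetic: remove the region's span from a Soft Reserved span and keep whatever is left over. Here is a minimal userspace sketch of that step, assuming inclusive [start, end] ranges as printed in /proc/iomem. struct toy_range and toy_trim() are hypothetical names for illustration, not helpers from the series or from the kernel resource API.

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive address range, as printed in /proc/iomem. Hypothetical type. */
struct toy_range { uint64_t start, end; };

/*
 * Remove 'cut' (e.g. a created region) from 'sr' (a Soft Reserved range).
 * Writes the 0, 1 or 2 leftover pieces to out[] and returns how many.
 */
static int toy_trim(struct toy_range sr, struct toy_range cut,
		    struct toy_range out[2])
{
	int n = 0;

	assert(sr.start <= sr.end && cut.start <= cut.end);

	/* No overlap: the Soft Reserved range is left untouched. */
	if (cut.end < sr.start || cut.start > sr.end) {
		out[n++] = sr;
		return n;
	}

	/* Piece below the cut, if the cut starts inside 'sr'. */
	if (cut.start > sr.start)
		out[n++] = (struct toy_range){ sr.start, cut.start - 1 };

	/* Piece above the cut, if the cut ends inside 'sr'. */
	if (cut.end < sr.end)
		out[n++] = (struct toy_range){ cut.end + 1, sr.end };

	return n;
}
```

With the first e820 range above (c080000000-c47fffffff) and a region covering the same span, toy_trim() returns 0 pieces, so nothing remains as Soft Reserved. The second step, rounding up what is still unclaimed, would only be meaningful once no further regions can appear, which is the ordering problem described above and is not modeled here.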

-- Alison

> 
> (As a side note, I experimented with changing cxl_acpi_init() to a
> late_initcall() and observed that it consistently avoided probe ordering
> issues in my setup.
> 
> Additional note: I realized that even when cxl_acpi_probe() fails, the
> fallback DAX registration path (via cxl_softreserv_mem_update()) still waits
> on cxl_mem_active() and wait_for_device_probe(). I plan to address this in
> v6 by immediately triggering fallback DAX registration
> (hmem_register_device()) when the ACPI probe fails, instead of waiting.)
> 
> Thanks
> Smita
> 
> > 
> > > 
> > > As for the log:
> > > > [   53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing
> > > 
> > > I’m still analyzing that. Here's what was my thought process so far.
> > > 
> > > - This occurs when cxl_acpi_probe() runs significantly earlier than
> > > cxl_mem_probe(), so CXL region creation (which happens in
> > > cxl_port_endpoint_probe()) may or may not have completed by the time
> > > trimming is attempted.
> > > 
> > > - Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does
> > > guarantee load order when all components are built as modules. So even if
> > > the timeout occurs and cxl_mem_probe() hasn’t run within the wait window,
> > > MODULE_SOFTDEP ensures that cxl_port is loaded before both cxl_acpi and
> > > cxl_mem in modular configurations. As a result, region creation is
> > > eventually guaranteed, and wait_for_device_probe() will succeed once the
> > > relevant probes complete.
> > > 
> > > - However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no
> > > guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish
> > > before cxl_port_probe() even begins, which can cause wait_for_device_probe()
> > > to return prematurely and trigger the timeout.
> > > 
> > > - In my local setup, I observed that a 30-second timeout was generally
> > > sufficient to catch this race, allowing cxl_port_probe() to load while
> > > cxl_acpi_probe() is still active. Since we cannot mix built-in and modular
> > > components (i.e., have cxl_acpi=y and cxl_port=m), the timeout serves as a
> > > best-effort mechanism. After the timeout, wait_for_device_probe() ensures
> > > cxl_port_probe() has completed before trimming proceeds, making the logic
> > > > good enough for most boot-time races.
> > > 
> > > One possible improvement I’m considering is to schedule a
> > > delayed_workqueue() from cxl_acpi_probe(). This deferred work could wait
> > > slightly longer for cxl_mem_probe() to complete (which itself softdeps on
> > > cxl_port) before initiating the soft reserve trimming.
> > > 
> > > That said, I'm still evaluating better options to more robustly coordinate
> > > probe ordering between cxl_acpi, cxl_port, cxl_mem and cxl_region and
> > > looking for suggestions here.
> > > 
> > > Thanks
> > > Smita
> > > 
> > > > 
> > > > 
> > > > This isn't all the logs, I trimmed. Let me know if you need more or
> > > > other info to reproduce.
> > > > 
> > > > [   53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing
> > > > [   53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
> > > > [   53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1
> > > > [   53.653540] preempt_count: 1, expected: 0
> > > > [   53.653554] RCU nest depth: 0, expected: 0
> > > > [   53.653568] 3 locks held by kworker/46:1/1875:
> > > > [   53.653569]  #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
> > > > [   53.653583]  #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
> > > > [   53.653589]  #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
> > > > [   53.653598] Preemption disabled at:
> > > > [   53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
> > > > [   53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
> > > > [   53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
> > > > [   53.653648] Call Trace:
> > > > [   53.653649]  <TASK>
> > > > [   53.653652]  dump_stack_lvl+0xa8/0xd0
> > > > [   53.653658]  dump_stack+0x14/0x20
> > > > [   53.653659]  __might_resched+0x1ae/0x2d0
> > > > [   53.653666]  __might_sleep+0x48/0x70
> > > > [   53.653668]  __kmalloc_node_track_caller_noprof+0x349/0x510
> > > > [   53.653674]  ? __devm_add_action+0x3d/0x160
> > > > [   53.653685]  ? __pfx_devm_action_release+0x10/0x10
> > > > [   53.653688]  __devres_alloc_node+0x4a/0x90
> > > > [   53.653689]  ? __devres_alloc_node+0x4a/0x90
> > > > [   53.653691]  ? __pfx_release_memregion+0x10/0x10 [dax_hmem]
> > > > [   53.653693]  __devm_add_action+0x3d/0x160
> > > > [   53.653696]  hmem_register_device+0xea/0x230 [dax_hmem]
> > > > [   53.653700]  hmem_fallback_register_device+0x37/0x60
> > > > [   53.653703]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
> > > > [   53.653739]  walk_iomem_res_desc+0x55/0xb0
> > > > [   53.653744]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
> > > > [   53.653755]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
> > > > [   53.653761]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
> > > > [   53.653763]  ? __pfx_autoremove_wake_function+0x10/0x10
> > > > [   53.653768]  process_one_work+0x1fa/0x630
> > > > [   53.653774]  worker_thread+0x1b2/0x360
> > > > [   53.653777]  kthread+0x128/0x250
> > > > [   53.653781]  ? __pfx_worker_thread+0x10/0x10
> > > > [   53.653784]  ? __pfx_kthread+0x10/0x10
> > > > [   53.653786]  ret_from_fork+0x139/0x1e0
> > > > [   53.653790]  ? __pfx_kthread+0x10/0x10
> > > > [   53.653792]  ret_from_fork_asm+0x1a/0x30
> > > > [   53.653801]  </TASK>
> > > > 
> > > > [   53.654193] =============================
> > > > [   53.654203] [ BUG: Invalid wait context ]
> > > > [   53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G        W
> > > > [   53.654623] -----------------------------
> > > > [   53.654785] kworker/46:1/1875 is trying to lock:
> > > > [   53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390
> > > > [   53.655115] other info that might help us debug this:
> > > > [   53.655273] context-{5:5}
> > > > [   53.655428] 3 locks held by kworker/46:1/1875:
> > > > [   53.655579]  #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
> > > > [   53.655739]  #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
> > > > [   53.655900]  #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
> > > > [   53.656062] stack backtrace:
> > > > [   53.656224] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G        W           6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
> > > > [   53.656227] Tainted: [W]=WARN
> > > > [   53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
> > > > [   53.656232] Call Trace:
> > > > [   53.656232]  <TASK>
> > > > [   53.656234]  dump_stack_lvl+0x85/0xd0
> > > > [   53.656238]  dump_stack+0x14/0x20
> > > > [   53.656239]  __lock_acquire+0xaf4/0x2200
> > > > [   53.656246]  lock_acquire+0xd8/0x300
> > > > [   53.656248]  ? kernfs_add_one+0x34/0x390
> > > > [   53.656252]  ? __might_resched+0x208/0x2d0
> > > > [   53.656257]  down_write+0x44/0xe0
> > > > [   53.656262]  ? kernfs_add_one+0x34/0x390
> > > > [   53.656263]  kernfs_add_one+0x34/0x390
> > > > [   53.656265]  kernfs_create_dir_ns+0x5a/0xa0
> > > > [   53.656268]  sysfs_create_dir_ns+0x74/0xd0
> > > > [   53.656270]  kobject_add_internal+0xb1/0x2f0
> > > > [   53.656273]  kobject_add+0x7d/0xf0
> > > > [   53.656275]  ? get_device_parent+0x28/0x1e0
> > > > [   53.656280]  ? __pfx_klist_children_get+0x10/0x10
> > > > [   53.656282]  device_add+0x124/0x8b0
> > > > [   53.656285]  ? dev_set_name+0x56/0x70
> > > > [   53.656287]  platform_device_add+0x102/0x260
> > > > [   53.656289]  hmem_register_device+0x160/0x230 [dax_hmem]
> > > > [   53.656291]  hmem_fallback_register_device+0x37/0x60
> > > > [   53.656294]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
> > > > [   53.656323]  walk_iomem_res_desc+0x55/0xb0
> > > > [   53.656326]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
> > > > [   53.656335]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
> > > > [   53.656342]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
> > > > [   53.656343]  ? __pfx_autoremove_wake_function+0x10/0x10
> > > > [   53.656346]  process_one_work+0x1fa/0x630
> > > > [   53.656350]  worker_thread+0x1b2/0x360
> > > > [   53.656352]  kthread+0x128/0x250
> > > > [   53.656354]  ? __pfx_worker_thread+0x10/0x10
> > > > [   53.656356]  ? __pfx_kthread+0x10/0x10
> > > > [   53.656357]  ret_from_fork+0x139/0x1e0
> > > > [   53.656360]  ? __pfx_kthread+0x10/0x10
> > > > [   53.656361]  ret_from_fork_asm+0x1a/0x30
> > > > [   53.656366]  </TASK>
> > > > [   53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
> > > > [   53.663552]  schedule+0x4a/0x160
> > > > [   53.663553]  schedule_timeout+0x10a/0x120
> > > > [   53.663555]  ? debug_smp_processor_id+0x1b/0x30
> > > > [   53.663556]  ? trace_hardirqs_on+0x5f/0xd0
> > > > [   53.663558]  __wait_for_common+0xb9/0x1c0
> > > > [   53.663559]  ? __pfx_schedule_timeout+0x10/0x10
> > > > [   53.663561]  wait_for_completion+0x28/0x30
> > > > [   53.663562]  __synchronize_srcu+0xbf/0x180
> > > > [   53.663566]  ? __pfx_wakeme_after_rcu+0x10/0x10
> > > > [   53.663571]  ? i2c_repstart+0x30/0x80
> > > > [   53.663576]  synchronize_srcu+0x46/0x120
> > > > [   53.663577]  kill_dax+0x47/0x70
> > > > [   53.663580]  __devm_create_dev_dax+0x112/0x470
> > > > [   53.663582]  devm_create_dev_dax+0x26/0x50
> > > > [   53.663584]  dax_hmem_probe+0x87/0xd0 [dax_hmem]
> > > > [   53.663585]  platform_probe+0x61/0xd0
> > > > [   53.663589]  really_probe+0xe2/0x390
> > > > [   53.663591]  ? __pfx___device_attach_driver+0x10/0x10
> > > > [   53.663593]  __driver_probe_device+0x7e/0x160
> > > > [   53.663594]  driver_probe_device+0x23/0xa0
> > > > [   53.663596]  __device_attach_driver+0x92/0x120
> > > > [   53.663597]  bus_for_each_drv+0x8c/0xf0
> > > > [   53.663599]  __device_attach+0xc2/0x1f0
> > > > [   53.663601]  device_initial_probe+0x17/0x20
> > > > [   53.663603]  bus_probe_device+0xa8/0xb0
> > > > [   53.663604]  device_add+0x687/0x8b0
> > > > [   53.663607]  ? dev_set_name+0x56/0x70
> > > > [   53.663609]  platform_device_add+0x102/0x260
> > > > [   53.663610]  hmem_register_device+0x160/0x230 [dax_hmem]
> > > > [   53.663612]  hmem_fallback_register_device+0x37/0x60
> > > > [   53.663614]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
> > > > [   53.663637]  walk_iomem_res_desc+0x55/0xb0
> > > > [   53.663640]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
> > > > [   53.663647]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
> > > > [   53.663654]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
> > > > [   53.663655]  ? __pfx_autoremove_wake_function+0x10/0x10
> > > > [   53.663658]  process_one_work+0x1fa/0x630
> > > > [   53.663662]  worker_thread+0x1b2/0x360
> > > > [   53.663664]  kthread+0x128/0x250
> > > > [   53.663666]  ? __pfx_worker_thread+0x10/0x10
> > > > [   53.663668]  ? __pfx_kthread+0x10/0x10
> > > > [   53.663670]  ret_from_fork+0x139/0x1e0
> > > > [   53.663672]  ? __pfx_kthread+0x10/0x10
> > > > [   53.663673]  ret_from_fork_asm+0x1a/0x30
> > > > [   53.663677]  </TASK>
> > > > [   53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
> > > > [   53.700264] INFO: lockdep is turned off.
> > > > [   53.701315] Preemption disabled at:
> > > > [   53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
> > > > [   53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G        W           6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
> > > > [   53.701633] Tainted: [W]=WARN
> > > > [   53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
> > > > [   53.701638] Call Trace:
> > > > [   53.701638]  <TASK>
> > > > [   53.701640]  dump_stack_lvl+0xa8/0xd0
> > > > [   53.701644]  dump_stack+0x14/0x20
> > > > [   53.701645]  __schedule_bug+0xa2/0xd0
> > > > [   53.701649]  __schedule+0xe6f/0x10d0
> > > > [   53.701652]  ? debug_smp_processor_id+0x1b/0x30
> > > > [   53.701655]  ? lock_release+0x1e6/0x2b0
> > > > [   53.701658]  ? trace_hardirqs_on+0x5f/0xd0
> > > > [   53.701661]  schedule+0x4a/0x160
> > > > [   53.701662]  schedule_timeout+0x10a/0x120
> > > > [   53.701664]  ? debug_smp_processor_id+0x1b/0x30
> > > > [   53.701666]  ? trace_hardirqs_on+0x5f/0xd0
> > > > [   53.701667]  __wait_for_common+0xb9/0x1c0
> > > > [   53.701668]  ? __pfx_schedule_timeout+0x10/0x10
> > > > [   53.701670]  wait_for_completion+0x28/0x30
> > > > [   53.701671]  __synchronize_srcu+0xbf/0x180
> > > > [   53.701677]  ? __pfx_wakeme_after_rcu+0x10/0x10
> > > > [   53.701682]  ? i2c_repstart+0x30/0x80
> > > > [   53.701685]  synchronize_srcu+0x46/0x120
> > > > [   53.701687]  kill_dax+0x47/0x70
> > > > [   53.701689]  __devm_create_dev_dax+0x112/0x470
> > > > [   53.701691]  devm_create_dev_dax+0x26/0x50
> > > > [   53.701693]  dax_hmem_probe+0x87/0xd0 [dax_hmem]
> > > > [   53.701695]  platform_probe+0x61/0xd0
> > > > [   53.701698]  really_probe+0xe2/0x390
> > > > [   53.701700]  ? __pfx___device_attach_driver+0x10/0x10
> > > > [   53.701701]  __driver_probe_device+0x7e/0x160
> > > > [   53.701703]  driver_probe_device+0x23/0xa0
> > > > [   53.701704]  __device_attach_driver+0x92/0x120
> > > > [   53.701706]  bus_for_each_drv+0x8c/0xf0
> > > > [   53.701708]  __device_attach+0xc2/0x1f0
> > > > [   53.701710]  device_initial_probe+0x17/0x20
> > > > [   53.701711]  bus_probe_device+0xa8/0xb0
> > > > [   53.701712]  device_add+0x687/0x8b0
> > > > [   53.701715]  ? dev_set_name+0x56/0x70
> > > > [   53.701717]  platform_device_add+0x102/0x260
> > > > [   53.701718]  hmem_register_device+0x160/0x230 [dax_hmem]
> > > > [   53.701720]  hmem_fallback_register_device+0x37/0x60
> > > > [   53.701722]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
> > > > [   53.701734]  walk_iomem_res_desc+0x55/0xb0
> > > > [   53.701738]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
> > > > [   53.701745]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
> > > > [   53.701751]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
> > > > [   53.701752]  ? __pfx_autoremove_wake_function+0x10/0x10
> > > > [   53.701756]  process_one_work+0x1fa/0x630
> > > > [   53.701760]  worker_thread+0x1b2/0x360
> > > > [   53.701762]  kthread+0x128/0x250
> > > > [   53.701765]  ? __pfx_worker_thread+0x10/0x10
> > > > [   53.701766]  ? __pfx_kthread+0x10/0x10
> > > > [   53.701768]  ret_from_fork+0x139/0x1e0
> > > > [   53.701771]  ? __pfx_kthread+0x10/0x10
> > > > [   53.701772]  ret_from_fork_asm+0x1a/0x30
> > > > [   53.701777]  </TASK>
> > > > 
> > > 
>

Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling

Posted by Koralahalli Channabasappa, Smita 2 months, 3 weeks ago


On 7/16/2025 4:48 PM, Alison Schofield wrote:
> On Wed, Jul 16, 2025 at 02:29:52PM -0700, Koralahalli Channabasappa, Smita wrote:
>> On 7/16/2025 1:20 PM, Alison Schofield wrote:
>>> On Tue, Jul 15, 2025 at 11:01:23PM -0700, Koralahalli Channabasappa, Smita wrote:
>>>> Hi Alison,
>>>>
>>>> On 7/15/2025 2:07 PM, Alison Schofield wrote:
>>>>> On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
>>>>>> This series introduces the ability to manage SOFT RESERVED iomem
>>>>>> resources, enabling the CXL driver to remove any portions that
>>>>>> intersect with created CXL regions.
>>>>>
>>>>> Hi Smita,
>>>>>
>>>>> This set applied cleanly to todays cxl-next but fails like appended
>>>>> before region probe.
>>>>>
>>>>> BTW - there were sparse warnings in the build that look related:
>>>>>      CHECK   drivers/dax/hmem/hmem_notify.c
>>>>> drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
>>>>> drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit
>>>>
>>>> Thanks for pointing this bug. I failed to release the spinlock before
>>>> calling hmem_register_device(), which internally calls platform_device_add()
>>>> and can sleep. The following fix addresses that bug. I’ll incorporate this
>>>> into v6:
>>>>
>>>> diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
>>>> index 6c276c5bd51d..8f411f3fe7bd 100644
>>>> --- a/drivers/dax/hmem/hmem_notify.c
>>>> +++ b/drivers/dax/hmem/hmem_notify.c
>>>> @@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const
>>>> struct resource *res)
>>>>    {
>>>>           walk_hmem_fn hmem_fn;
>>>>
>>>> -       guard(spinlock)(&hmem_notify_lock);
>>>> +       spin_lock(&hmem_notify_lock);
>>>>           hmem_fn = hmem_fallback_fn;
>>>> +       spin_unlock(&hmem_notify_lock);
>>>>
>>>>           if (hmem_fn)
>>>>                   hmem_fn(target_nid, res);
>>>> --
>>>
>>> Hi Smita,  Adding the above got me past that, and doubling the timeout
>>> below stopped that from happening. After that, I haven't had time to
>>> trace so, I'll just dump on you for now:
>>>
>>> In /proc/iomem
>>> Here, we see a regions resource, no CXL Window, and no dax, and no
>>> actual region, not even disabled, is available.
>>> c080000000-c47fffffff : region0
>>>
>>> And, here no CXL Window, no region, and a soft reserved.
>>> 68e80000000-70e7fffffff : Soft Reserved
>>>     68e80000000-70e7fffffff : dax1.0
>>>       68e80000000-70e7fffffff : System RAM (kmem)
>>>
>>> I haven't yet walked through the v4 to v5 changes so I'll do that next.
>>
>> Hi Alison,
>>
>> To help better understand the current behavior, could you share more about
>> your platform configuration? specifically, are there two memory cards
>> involved? One at c080000000 (which appears as region0) and another at
>> 68e80000000 (which is falling back to kmem via dax1.0)? Additionally, how
>> are the Soft Reserved ranges laid out on your system for these cards? I'm
>> trying to understand the "before" state of the resources i.e, prior to
>> trimming applied by my patches.
> 
> Here are the soft reserveds -
> [] BIOS-e820: [mem 0x000000c080000000-0x000000c47fffffff] soft reserved
> [] BIOS-e820: [mem 0x0000068e80000000-0x0000070e7fffffff] soft reserved
> 
> And this is what we expect -
> 
> c080000000-17dbfffffff : CXL Window 0
>    c080000000-c47fffffff : region2
>      c080000000-c47fffffff : dax0.0
>        c080000000-c47fffffff : System RAM (kmem)
> 
> 
> 68e80000000-8d37fffffff : CXL Window 1
>    68e80000000-70e7fffffff : region5
>      68e80000000-70e7fffffff : dax1.0
>        68e80000000-70e7fffffff : System RAM (kmem)
> 
> And, like in prev message, iv v5 we get -
> 
> c080000000-c47fffffff : region0
> 
> 68e80000000-70e7fffffff : Soft Reserved
>    68e80000000-70e7fffffff : dax1.0
>      68e80000000-70e7fffffff : System RAM (kmem)
> 
> 
> In v4, we 'almost' had what we expect, except that the HMEM driver
> created those dax devices our of Soft Reserveds before region driver
> could do same.
> 

Yeah, the only part I’m uncertain about in v5 is scheduling the fallback 
work from the failure path of cxl_acpi_probe(). That doesn’t feel like 
the right place to do it, and I suspect it might be contributing to the 
unexpected behavior.

v4 had most of the necessary pieces in place, but it didn’t handle 
situations well when the driver load order didn’t go as expected.

Even if we modify v4 to avoid triggering hmem_register_device() directly
from cxl_acpi_probe(), which helps avoid unresolved symbol errors when
cxl_acpi loads too early, and instead rely only on dax_hmem to pick up
Soft Reserved regions after cxl_acpi creates regions, we still run into
timing issues.

Specifically, there's no guarantee that hmem_register_device() will
correctly skip the range via the following check if the region state
isn't fully ready, even with MODULE_SOFTDEP("pre: cxl_acpi") or using
late_initcall() (which I tried):

if (IS_ENABLED(CONFIG_CXL_REGION) &&
    region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
                      IORES_DESC_CXL) != REGION_DISJOINT) {..

At this point, I’m running out of ideas on how to reliably coordinate 
this.. :(

Thanks
Smita

>>
>> Also, do you think it's feasible to change the direction of the soft reserve
>> trimming, that is, defer it until after CXL region or memdev creation is
>> complete? In this case it would be trimmed after but inline the existing
>> region or memdev creation. This might simplify the flow by removing the need
>> for wait_event_timeout(), wait_for_device_probe() and the workqueue logic
>> inside cxl_acpi_probe().
> 
> Yes that aligns with my simple thinking. There's the trimming after a region
> is successfully created, and it seems that could simply be called at the end
> of *that* region creation.
> 
> Then, there's the round up of all the unused Soft Reserveds, and that has
> to wait until after all regions are created, ie. all endpoints have arrived
> and we've given up all hope of creating another region in that space.
> That's the timing challenge.
> 
> -- Alison
> 
>>
>> (As a side note I experimented changing cxl_acpi_init() to a late_initcall()
>> and observed that it consistently avoided probe ordering issues in my setup.
>>
>> Additional note: I realized that even when cxl_acpi_probe() fails, the
>> fallback DAX registration path (via cxl_softreserv_mem_update()) still waits
>> on cxl_mem_active() and wait_for_device_probe(). I plan to address this in
>> v6 by immediately triggering fallback DAX registration
>> (hmem_register_device()) when the ACPI probe fails, instead of waiting.)
>>
>> Thanks
>> Smita
>>
>>>
>>>>
>>>> As for the log:
>>>> [   53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for
>>>> cxl_mem probing
>>>>
>>>> I’m still analyzing that. Here's what was my thought process so far.
>>>>
>>>> - This occurs when cxl_acpi_probe() runs significantly earlier than
>>>> cxl_mem_probe(), so CXL region creation (which happens in
>>>> cxl_port_endpoint_probe()) may or may not have completed by the time
>>>> trimming is attempted.
>>>>
>>>> - Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does
>>>> guarantee load order when all components are built as modules. So even if
>>>> the timeout occurs and cxl_mem_probe() hasn’t run within the wait window,
>>>> MODULE_SOFTDEP ensures that cxl_port is loaded before both cxl_acpi and
>>>> cxl_mem in modular configurations. As a result, region creation is
>>>> eventually guaranteed, and wait_for_device_probe() will succeed once the
>>>> relevant probes complete.
>>>>
>>>> - However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no
>>>> guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish
>>>> before cxl_port_probe() even begins, which can cause wait_for_device_probe()
>>>> to return prematurely and trigger the timeout.
>>>>
>>>> - In my local setup, I observed that a 30-second timeout was generally
>>>> sufficient to catch this race, allowing cxl_port_probe() to load while
>>>> cxl_acpi_probe() is still active. Since we cannot mix built-in and modular
>>>> components (i.e., have cxl_acpi=y and cxl_port=m), the timeout serves as a
>>>> best-effort mechanism. After the timeout, wait_for_device_probe() ensures
>>>> cxl_port_probe() has completed before trimming proceeds, making the logic
>>>> good enough to most boot-time races.
>>>>
>>>> One possible improvement I’m considering is to schedule a
>>>> delayed_workqueue() from cxl_acpi_probe(). This deferred work could wait
>>>> slightly longer for cxl_mem_probe() to complete (which itself softdeps on
>>>> cxl_port) before initiating the soft reserve trimming.
>>>>
>>>> That said, I'm still evaluating better options to more robustly coordinate
>>>> probe ordering between cxl_acpi, cxl_port, cxl_mem and cxl_region and
>>>> looking for suggestions here.
>>>>
>>>> Thanks
>>>> Smita
>>>>
>>>>>
>>>>>
>>>>> This isn't all the logs, I trimmed. Let me know if you need more or
>>>>> other info to reproduce.
>>>>>
>>>>> [   53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing
>>>>> [   53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
>>>>> [   53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1
>>>>> [   53.653540] preempt_count: 1, expected: 0
>>>>> [   53.653554] RCU nest depth: 0, expected: 0
>>>>> [   53.653568] 3 locks held by kworker/46:1/1875:
>>>>> [   53.653569]  #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
>>>>> [   53.653583]  #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
>>>>> [   53.653589]  #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
>>>>> [   53.653598] Preemption disabled at:
>>>>> [   53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
>>>>> [   53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>> [   53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>> [   53.653648] Call Trace:
>>>>> [   53.653649]  <TASK>
>>>>> [   53.653652]  dump_stack_lvl+0xa8/0xd0
>>>>> [   53.653658]  dump_stack+0x14/0x20
>>>>> [   53.653659]  __might_resched+0x1ae/0x2d0
>>>>> [   53.653666]  __might_sleep+0x48/0x70
>>>>> [   53.653668]  __kmalloc_node_track_caller_noprof+0x349/0x510
>>>>> [   53.653674]  ? __devm_add_action+0x3d/0x160
>>>>> [   53.653685]  ? __pfx_devm_action_release+0x10/0x10
>>>>> [   53.653688]  __devres_alloc_node+0x4a/0x90
>>>>> [   53.653689]  ? __devres_alloc_node+0x4a/0x90
>>>>> [   53.653691]  ? __pfx_release_memregion+0x10/0x10 [dax_hmem]
>>>>> [   53.653693]  __devm_add_action+0x3d/0x160
>>>>> [   53.653696]  hmem_register_device+0xea/0x230 [dax_hmem]
>>>>> [   53.653700]  hmem_fallback_register_device+0x37/0x60
>>>>> [   53.653703]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>> [   53.653739]  walk_iomem_res_desc+0x55/0xb0
>>>>> [   53.653744]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>> [   53.653755]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>> [   53.653761]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>> [   53.653763]  ? __pfx_autoremove_wake_function+0x10/0x10
>>>>> [   53.653768]  process_one_work+0x1fa/0x630
>>>>> [   53.653774]  worker_thread+0x1b2/0x360
>>>>> [   53.653777]  kthread+0x128/0x250
>>>>> [   53.653781]  ? __pfx_worker_thread+0x10/0x10
>>>>> [   53.653784]  ? __pfx_kthread+0x10/0x10
>>>>> [   53.653786]  ret_from_fork+0x139/0x1e0
>>>>> [   53.653790]  ? __pfx_kthread+0x10/0x10
>>>>> [   53.653792]  ret_from_fork_asm+0x1a/0x30
>>>>> [   53.653801]  </TASK>
>>>>>
>>>>> [   53.654193] =============================
>>>>> [   53.654203] [ BUG: Invalid wait context ]
>>>>> [   53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G        W
>>>>> [   53.654623] -----------------------------
>>>>> [   53.654785] kworker/46:1/1875 is trying to lock:
>>>>> [   53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390
>>>>> [   53.655115] other info that might help us debug this:
>>>>> [   53.655273] context-{5:5}
>>>>> [   53.655428] 3 locks held by kworker/46:1/1875:
>>>>> [   53.655579]  #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
>>>>> [   53.655739]  #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
>>>>> [   53.655900]  #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
>>>>> [   53.656062] stack backtrace:
>>>>> [   53.656224] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G        W           6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>> [   53.656227] Tainted: [W]=WARN
>>>>> [   53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>> [   53.656232] Call Trace:
>>>>> [   53.656232]  <TASK>
>>>>> [   53.656234]  dump_stack_lvl+0x85/0xd0
>>>>> [   53.656238]  dump_stack+0x14/0x20
>>>>> [   53.656239]  __lock_acquire+0xaf4/0x2200
>>>>> [   53.656246]  lock_acquire+0xd8/0x300
>>>>> [   53.656248]  ? kernfs_add_one+0x34/0x390
>>>>> [   53.656252]  ? __might_resched+0x208/0x2d0
>>>>> [   53.656257]  down_write+0x44/0xe0
>>>>> [   53.656262]  ? kernfs_add_one+0x34/0x390
>>>>> [   53.656263]  kernfs_add_one+0x34/0x390
>>>>> [   53.656265]  kernfs_create_dir_ns+0x5a/0xa0
>>>>> [   53.656268]  sysfs_create_dir_ns+0x74/0xd0
>>>>> [   53.656270]  kobject_add_internal+0xb1/0x2f0
>>>>> [   53.656273]  kobject_add+0x7d/0xf0
>>>>> [   53.656275]  ? get_device_parent+0x28/0x1e0
>>>>> [   53.656280]  ? __pfx_klist_children_get+0x10/0x10
>>>>> [   53.656282]  device_add+0x124/0x8b0
>>>>> [   53.656285]  ? dev_set_name+0x56/0x70
>>>>> [   53.656287]  platform_device_add+0x102/0x260
>>>>> [   53.656289]  hmem_register_device+0x160/0x230 [dax_hmem]
>>>>> [   53.656291]  hmem_fallback_register_device+0x37/0x60
>>>>> [   53.656294]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>> [   53.656323]  walk_iomem_res_desc+0x55/0xb0
>>>>> [   53.656326]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>> [   53.656335]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>> [   53.656342]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>> [   53.656343]  ? __pfx_autoremove_wake_function+0x10/0x10
>>>>> [   53.656346]  process_one_work+0x1fa/0x630
>>>>> [   53.656350]  worker_thread+0x1b2/0x360
>>>>> [   53.656352]  kthread+0x128/0x250
>>>>> [   53.656354]  ? __pfx_worker_thread+0x10/0x10
>>>>> [   53.656356]  ? __pfx_kthread+0x10/0x10
>>>>> [   53.656357]  ret_from_fork+0x139/0x1e0
>>>>> [   53.656360]  ? __pfx_kthread+0x10/0x10
>>>>> [   53.656361]  ret_from_fork_asm+0x1a/0x30
>>>>> [   53.656366]  </TASK>
>>>>> [   53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
>>>>> [   53.663552]  schedule+0x4a/0x160
>>>>> [   53.663553]  schedule_timeout+0x10a/0x120
>>>>> [   53.663555]  ? debug_smp_processor_id+0x1b/0x30
>>>>> [   53.663556]  ? trace_hardirqs_on+0x5f/0xd0
>>>>> [   53.663558]  __wait_for_common+0xb9/0x1c0
>>>>> [   53.663559]  ? __pfx_schedule_timeout+0x10/0x10
>>>>> [   53.663561]  wait_for_completion+0x28/0x30
>>>>> [   53.663562]  __synchronize_srcu+0xbf/0x180
>>>>> [   53.663566]  ? __pfx_wakeme_after_rcu+0x10/0x10
>>>>> [   53.663571]  ? i2c_repstart+0x30/0x80
>>>>> [   53.663576]  synchronize_srcu+0x46/0x120
>>>>> [   53.663577]  kill_dax+0x47/0x70
>>>>> [   53.663580]  __devm_create_dev_dax+0x112/0x470
>>>>> [   53.663582]  devm_create_dev_dax+0x26/0x50
>>>>> [   53.663584]  dax_hmem_probe+0x87/0xd0 [dax_hmem]
>>>>> [   53.663585]  platform_probe+0x61/0xd0
>>>>> [   53.663589]  really_probe+0xe2/0x390
>>>>> [   53.663591]  ? __pfx___device_attach_driver+0x10/0x10
>>>>> [   53.663593]  __driver_probe_device+0x7e/0x160
>>>>> [   53.663594]  driver_probe_device+0x23/0xa0
>>>>> [   53.663596]  __device_attach_driver+0x92/0x120
>>>>> [   53.663597]  bus_for_each_drv+0x8c/0xf0
>>>>> [   53.663599]  __device_attach+0xc2/0x1f0
>>>>> [   53.663601]  device_initial_probe+0x17/0x20
>>>>> [   53.663603]  bus_probe_device+0xa8/0xb0
>>>>> [   53.663604]  device_add+0x687/0x8b0
>>>>> [   53.663607]  ? dev_set_name+0x56/0x70
>>>>> [   53.663609]  platform_device_add+0x102/0x260
>>>>> [   53.663610]  hmem_register_device+0x160/0x230 [dax_hmem]
>>>>> [   53.663612]  hmem_fallback_register_device+0x37/0x60
>>>>> [   53.663614]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>> [   53.663637]  walk_iomem_res_desc+0x55/0xb0
>>>>> [   53.663640]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>> [   53.663647]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>> [   53.663654]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>> [   53.663655]  ? __pfx_autoremove_wake_function+0x10/0x10
>>>>> [   53.663658]  process_one_work+0x1fa/0x630
>>>>> [   53.663662]  worker_thread+0x1b2/0x360
>>>>> [   53.663664]  kthread+0x128/0x250
>>>>> [   53.663666]  ? __pfx_worker_thread+0x10/0x10
>>>>> [   53.663668]  ? __pfx_kthread+0x10/0x10
>>>>> [   53.663670]  ret_from_fork+0x139/0x1e0
>>>>> [   53.663672]  ? __pfx_kthread+0x10/0x10
>>>>> [   53.663673]  ret_from_fork_asm+0x1a/0x30
>>>>> [   53.663677]  </TASK>
>>>>> [   53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
>>>>> [   53.700264] INFO: lockdep is turned off.
>>>>> [   53.701315] Preemption disabled at:
>>>>> [   53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
>>>>> [   53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G        W           6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>> [   53.701633] Tainted: [W]=WARN
>>>>> [   53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>> [   53.701638] Call Trace:
>>>>> [   53.701638]  <TASK>
>>>>> [   53.701640]  dump_stack_lvl+0xa8/0xd0
>>>>> [   53.701644]  dump_stack+0x14/0x20
>>>>> [   53.701645]  __schedule_bug+0xa2/0xd0
>>>>> [   53.701649]  __schedule+0xe6f/0x10d0
>>>>> [   53.701652]  ? debug_smp_processor_id+0x1b/0x30
>>>>> [   53.701655]  ? lock_release+0x1e6/0x2b0
>>>>> [   53.701658]  ? trace_hardirqs_on+0x5f/0xd0
>>>>> [   53.701661]  schedule+0x4a/0x160
>>>>> [   53.701662]  schedule_timeout+0x10a/0x120
>>>>> [   53.701664]  ? debug_smp_processor_id+0x1b/0x30
>>>>> [   53.701666]  ? trace_hardirqs_on+0x5f/0xd0
>>>>> [   53.701667]  __wait_for_common+0xb9/0x1c0
>>>>> [   53.701668]  ? __pfx_schedule_timeout+0x10/0x10
>>>>> [   53.701670]  wait_for_completion+0x28/0x30
>>>>> [   53.701671]  __synchronize_srcu+0xbf/0x180
>>>>> [   53.701677]  ? __pfx_wakeme_after_rcu+0x10/0x10
>>>>> [   53.701682]  ? i2c_repstart+0x30/0x80
>>>>> [   53.701685]  synchronize_srcu+0x46/0x120
>>>>> [   53.701687]  kill_dax+0x47/0x70
>>>>> [   53.701689]  __devm_create_dev_dax+0x112/0x470
>>>>> [   53.701691]  devm_create_dev_dax+0x26/0x50
>>>>> [   53.701693]  dax_hmem_probe+0x87/0xd0 [dax_hmem]
>>>>> [   53.701695]  platform_probe+0x61/0xd0
>>>>> [   53.701698]  really_probe+0xe2/0x390
>>>>> [   53.701700]  ? __pfx___device_attach_driver+0x10/0x10
>>>>> [   53.701701]  __driver_probe_device+0x7e/0x160
>>>>> [   53.701703]  driver_probe_device+0x23/0xa0
>>>>> [   53.701704]  __device_attach_driver+0x92/0x120
>>>>> [   53.701706]  bus_for_each_drv+0x8c/0xf0
>>>>> [   53.701708]  __device_attach+0xc2/0x1f0
>>>>> [   53.701710]  device_initial_probe+0x17/0x20
>>>>> [   53.701711]  bus_probe_device+0xa8/0xb0
>>>>> [   53.701712]  device_add+0x687/0x8b0
>>>>> [   53.701715]  ? dev_set_name+0x56/0x70
>>>>> [   53.701717]  platform_device_add+0x102/0x260
>>>>> [   53.701718]  hmem_register_device+0x160/0x230 [dax_hmem]
>>>>> [   53.701720]  hmem_fallback_register_device+0x37/0x60
>>>>> [   53.701722]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>> [   53.701734]  walk_iomem_res_desc+0x55/0xb0
>>>>> [   53.701738]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>> [   53.701745]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>> [   53.701751]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>> [   53.701752]  ? __pfx_autoremove_wake_function+0x10/0x10
>>>>> [   53.701756]  process_one_work+0x1fa/0x630
>>>>> [   53.701760]  worker_thread+0x1b2/0x360
>>>>> [   53.701762]  kthread+0x128/0x250
>>>>> [   53.701765]  ? __pfx_worker_thread+0x10/0x10
>>>>> [   53.701766]  ? __pfx_kthread+0x10/0x10
>>>>> [   53.701768]  ret_from_fork+0x139/0x1e0
>>>>> [   53.701771]  ? __pfx_kthread+0x10/0x10
>>>>> [   53.701772]  ret_from_fork_asm+0x1a/0x30
>>>>> [   53.701777]  </TASK>
>>>>>
>>>>
>>

Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling

Posted by Dave Jiang 2 months, 3 weeks ago


On 7/17/25 10:58 AM, Koralahalli Channabasappa, Smita wrote:
> 
> 
> On 7/16/2025 4:48 PM, Alison Schofield wrote:
>> On Wed, Jul 16, 2025 at 02:29:52PM -0700, Koralahalli Channabasappa, Smita wrote:
>>> On 7/16/2025 1:20 PM, Alison Schofield wrote:
>>>> On Tue, Jul 15, 2025 at 11:01:23PM -0700, Koralahalli Channabasappa, Smita wrote:
>>>>> Hi Alison,
>>>>>
>>>>> On 7/15/2025 2:07 PM, Alison Schofield wrote:
>>>>>> On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
>>>>>>> This series introduces the ability to manage SOFT RESERVED iomem
>>>>>>> resources, enabling the CXL driver to remove any portions that
>>>>>>> intersect with created CXL regions.
>>>>>>
>>>>>> Hi Smita,
>>>>>>
>>>>>> This set applied cleanly to todays cxl-next but fails like appended
>>>>>> before region probe.
>>>>>>
>>>>>> BTW - there were sparse warnings in the build that look related:
>>>>>>      CHECK   drivers/dax/hmem/hmem_notify.c
>>>>>> drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
>>>>>> drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit
>>>>>
>>>>> Thanks for pointing this bug. I failed to release the spinlock before
>>>>> calling hmem_register_device(), which internally calls platform_device_add()
>>>>> and can sleep. The following fix addresses that bug. I’ll incorporate this
>>>>> into v6:
>>>>>
>>>>> diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
>>>>> index 6c276c5bd51d..8f411f3fe7bd 100644
>>>>> --- a/drivers/dax/hmem/hmem_notify.c
>>>>> +++ b/drivers/dax/hmem/hmem_notify.c
>>>>> @@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const
>>>>> struct resource *res)
>>>>>    {
>>>>>           walk_hmem_fn hmem_fn;
>>>>>
>>>>> -       guard(spinlock)(&hmem_notify_lock);
>>>>> +       spin_lock(&hmem_notify_lock);
>>>>>           hmem_fn = hmem_fallback_fn;
>>>>> +       spin_unlock(&hmem_notify_lock);
>>>>>
>>>>>           if (hmem_fn)
>>>>>                   hmem_fn(target_nid, res);
>>>>> -- 
>>>>
>>>> Hi Smita,  Adding the above got me past that, and doubling the timeout
>>>> below stopped that from happening. After that, I haven't had time to
>>>> trace so, I'll just dump on you for now:
>>>>
>>>> In /proc/iomem
>>>> Here, we see a regions resource, no CXL Window, and no dax, and no
>>>> actual region, not even disabled, is available.
>>>> c080000000-c47fffffff : region0
>>>>
>>>> And, here no CXL Window, no region, and a soft reserved.
>>>> 68e80000000-70e7fffffff : Soft Reserved
>>>>     68e80000000-70e7fffffff : dax1.0
>>>>       68e80000000-70e7fffffff : System RAM (kmem)
>>>>
>>>> I haven't yet walked through the v4 to v5 changes so I'll do that next.
>>>
>>> Hi Alison,
>>>
>>> To help better understand the current behavior, could you share more about
>>> your platform configuration? specifically, are there two memory cards
>>> involved? One at c080000000 (which appears as region0) and another at
>>> 68e80000000 (which is falling back to kmem via dax1.0)? Additionally, how
>>> are the Soft Reserved ranges laid out on your system for these cards? I'm
>>> trying to understand the "before" state of the resources i.e, prior to
>>> trimming applied by my patches.
>>
>> Here are the soft reserveds -
>> [] BIOS-e820: [mem 0x000000c080000000-0x000000c47fffffff] soft reserved
>> [] BIOS-e820: [mem 0x0000068e80000000-0x0000070e7fffffff] soft reserved
>>
>> And this is what we expect -
>>
>> c080000000-17dbfffffff : CXL Window 0
>>    c080000000-c47fffffff : region2
>>      c080000000-c47fffffff : dax0.0
>>        c080000000-c47fffffff : System RAM (kmem)
>>
>>
>> 68e80000000-8d37fffffff : CXL Window 1
>>    68e80000000-70e7fffffff : region5
>>      68e80000000-70e7fffffff : dax1.0
>>        68e80000000-70e7fffffff : System RAM (kmem)
>>
>> And, like in prev message, iv v5 we get -
>>
>> c080000000-c47fffffff : region0
>>
>> 68e80000000-70e7fffffff : Soft Reserved
>>    68e80000000-70e7fffffff : dax1.0
>>      68e80000000-70e7fffffff : System RAM (kmem)
>>
>>
>> In v4, we 'almost' had what we expect, except that the HMEM driver
>> created those dax devices our of Soft Reserveds before region driver
>> could do same.
>>
> 
> Yeah, the only part I’m uncertain about in v5 is scheduling the fallback work from the failure path of cxl_acpi_probe(). That doesn’t feel like the right place to do it, and I suspect it might be contributing to the unexpected behavior.
> 
> v4 had most of the necessary pieces in place, but it didn’t handle situations well when the driver load order didn’t go as expected.
> 
> Even if we modify v4 to avoid triggering hmem_register_device() directly from cxl_acpi_probe() which helps avoid unresolved symbol errors when cxl_acpi_probe() loads too early, and instead only rely on dax_hmem to pick up Soft Reserved regions after cxl_acpi creates regions, we still run into timing issues..
> 
> Specifically, there's no guarantee that hmem_register_device() will correctly skip the following check if the region state isn't fully ready, even with MODULE_SOFTDEP("pre: cxl_acpi") or using late_initcall() (which I tried):
> 
> if (IS_ENABLED(CONFIG_CXL_REGION) &&
>         region_intersects(res->start, resource_size(res), IORESOURCE_MEM, IORES_DESC_CXL) != REGION_DISJOINT) {..
> 
> At this point, I’m running out of ideas on how to reliably coordinate this.. :(
> 
> Thanks
> Smita
> 
>>>
>>> Also, do you think it's feasible to change the direction of the soft reserve
>>> trimming, that is, defer it until after CXL region or memdev creation is
>>> complete? In this case it would be trimmed after but inline the existing
>>> region or memdev creation. This might simplify the flow by removing the need
>>> for wait_event_timeout(), wait_for_device_probe() and the workqueue logic
>>> inside cxl_acpi_probe().
>>
>> Yes that aligns with my simple thinking. There's the trimming after a region
>> is successfully created, and it seems that could simply be called at the end
>> of *that* region creation.
>>
>> Then, there's the round up of all the unused Soft Reserveds, and that has
>> to wait until after all regions are created, ie. all endpoints have arrived
>> and we've given up all hope of creating another region in that space.
>> That's the timing challenge.
>>
>> -- Alison
>>
>>>
>>> (As a side note I experimented changing cxl_acpi_init() to a late_initcall()
>>> and observed that it consistently avoided probe ordering issues in my setup.
>>>
>>> Additional note: I realized that even when cxl_acpi_probe() fails, the
>>> fallback DAX registration path (via cxl_softreserv_mem_update()) still waits
>>> on cxl_mem_active() and wait_for_device_probe(). I plan to address this in
>>> v6 by immediately triggering fallback DAX registration
>>> (hmem_register_device()) when the ACPI probe fails, instead of waiting.)
>>>
>>> Thanks
>>> Smita
>>>
>>>>
>>>>>
>>>>> As for the log:
>>>>> [   53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for
>>>>> cxl_mem probing
>>>>>
>>>>> I’m still analyzing that. Here's what was my thought process so far.
>>>>>
>>>>> - This occurs when cxl_acpi_probe() runs significantly earlier than
>>>>> cxl_mem_probe(), so CXL region creation (which happens in
>>>>> cxl_port_endpoint_probe()) may or may not have completed by the time
>>>>> trimming is attempted.
>>>>>
>>>>> - Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does
>>>>> guarantee load order when all components are built as modules. So even if
>>>>> the timeout occurs and cxl_mem_probe() hasn’t run within the wait window,
>>>>> MODULE_SOFTDEP ensures that cxl_port is loaded before both cxl_acpi and
>>>>> cxl_mem in modular configurations. As a result, region creation is
>>>>> eventually guaranteed, and wait_for_device_probe() will succeed once the
>>>>> relevant probes complete.
>>>>>
>>>>> - However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no
>>>>> guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish
>>>>> before cxl_port_probe() even begins, which can cause wait_for_device_probe()
>>>>> to return prematurely and trigger the timeout.
>>>>>
>>>>> - In my local setup, I observed that a 30-second timeout was generally
>>>>> sufficient to catch this race, allowing cxl_port_probe() to load while
>>>>> cxl_acpi_probe() is still active. Since we cannot mix built-in and modular
>>>>> components (i.e., have cxl_acpi=y and cxl_port=m), the timeout serves as a
>>>>> best-effort mechanism. After the timeout, wait_for_device_probe() ensures
>>>>> cxl_port_probe() has completed before trimming proceeds, making the logic
>>>>> good enough to most boot-time races.
>>>>>
>>>>> One possible improvement I’m considering is to schedule a
>>>>> delayed_workqueue() from cxl_acpi_probe(). This deferred work could wait
>>>>> slightly longer for cxl_mem_probe() to complete (which itself softdeps on
>>>>> cxl_port) before initiating the soft reserve trimming.
>>>>>
>>>>> That said, I'm still evaluating better options to more robustly coordinate
>>>>> probe ordering between cxl_acpi, cxl_port, cxl_mem and cxl_region and
>>>>> looking for suggestions here.

Hi Smita,
Reading this thread and thinking about what can be done to deal with this. Throwing out some ideas to see what you think. My idea is to create two global counters that are protected by a lock. You have a delayed work item that checks these counters. If counter1 is 0, go back to sleep and check again later, polling at a reasonable interval. Every time a memdev endpoint starts probe, increment counter1 and counter2 atomically. Every time the probe is successful, decrement counter2. When you reach the condition of 'if (counter1 && counter2 == 0)' I think you can start soft reserve discovery.

A different idea came from Dan. Arm a timer on the first memdev probe. Kick the timer to extend it every time a new memdev gets probed. At some point things settle and the timer goes off to trigger soft reserved discovery.

I think either one will not require special ordering of the modules being loaded. 
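
The settle-timer idea can be modeled in plain userspace C as a debounce over a logical clock. This is only a sketch under stated assumptions: `settle_timer`, `settle_init()`, `settle_kick()`, `settle_fired()` and the tick-based clock are hypothetical names for illustration, not kernel APIs, and a real implementation would presumably use a kernel timer or delayed work item instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a settle (debounce) timer on a logical clock. */
struct settle_timer {
	bool armed;             /* set by the first memdev probe */
	unsigned long deadline; /* tick at which the timer would expire */
	unsigned long quiet;    /* required quiet period, in ticks */
};

static void settle_init(struct settle_timer *t, unsigned long quiet)
{
	assert(quiet > 0);
	t->armed = false;
	t->deadline = 0;
	t->quiet = quiet;
}

/* Called on every memdev probe: arms on the first call, pushes the
 * deadline out on every later call. */
static void settle_kick(struct settle_timer *t, unsigned long now)
{
	t->armed = true;
	t->deadline = now + t->quiet;
}

/* True once a full quiet period has passed since the last probe. */
static bool settle_fired(const struct settle_timer *t, unsigned long now)
{
	return t->armed && now >= t->deadline;
}
```

In this model the timer never fires if no memdev ever probes, so a caller would still need a separate policy for systems without CXL memory devices.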

DJ   

>>>>>
>>>>> Thanks
>>>>> Smita
>>>>>
>>>>>>
>>>>>>
>>>>>> This isn't all the logs, I trimmed. Let me know if you need more or
>>>>>> other info to reproduce.
>>>>>>
>>>>>> [   53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing
>>>>>> [   53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
>>>>>> [   53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1
>>>>>> [   53.653540] preempt_count: 1, expected: 0
>>>>>> [   53.653554] RCU nest depth: 0, expected: 0
>>>>>> [   53.653568] 3 locks held by kworker/46:1/1875:
>>>>>> [   53.653569]  #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
>>>>>> [   53.653583]  #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
>>>>>> [   53.653589]  #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
>>>>>> [   53.653598] Preemption disabled at:
>>>>>> [   53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
>>>>>> [   53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>>> [   53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>>> [   53.653648] Call Trace:
>>>>>> [   53.653649]  <TASK>
>>>>>> [   53.653652]  dump_stack_lvl+0xa8/0xd0
>>>>>> [   53.653658]  dump_stack+0x14/0x20
>>>>>> [   53.653659]  __might_resched+0x1ae/0x2d0
>>>>>> [   53.653666]  __might_sleep+0x48/0x70
>>>>>> [   53.653668]  __kmalloc_node_track_caller_noprof+0x349/0x510
>>>>>> [   53.653674]  ? __devm_add_action+0x3d/0x160
>>>>>> [   53.653685]  ? __pfx_devm_action_release+0x10/0x10
>>>>>> [   53.653688]  __devres_alloc_node+0x4a/0x90
>>>>>> [   53.653689]  ? __devres_alloc_node+0x4a/0x90
>>>>>> [   53.653691]  ? __pfx_release_memregion+0x10/0x10 [dax_hmem]
>>>>>> [   53.653693]  __devm_add_action+0x3d/0x160
>>>>>> [   53.653696]  hmem_register_device+0xea/0x230 [dax_hmem]
>>>>>> [   53.653700]  hmem_fallback_register_device+0x37/0x60
>>>>>> [   53.653703]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>> [   53.653739]  walk_iomem_res_desc+0x55/0xb0
>>>>>> [   53.653744]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>> [   53.653755]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>> [   53.653761]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>> [   53.653763]  ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>> [   53.653768]  process_one_work+0x1fa/0x630
>>>>>> [   53.653774]  worker_thread+0x1b2/0x360
>>>>>> [   53.653777]  kthread+0x128/0x250
>>>>>> [   53.653781]  ? __pfx_worker_thread+0x10/0x10
>>>>>> [   53.653784]  ? __pfx_kthread+0x10/0x10
>>>>>> [   53.653786]  ret_from_fork+0x139/0x1e0
>>>>>> [   53.653790]  ? __pfx_kthread+0x10/0x10
>>>>>> [   53.653792]  ret_from_fork_asm+0x1a/0x30
>>>>>> [   53.653801]  </TASK>
>>>>>>
>>>>>> [   53.654193] =============================
>>>>>> [   53.654203] [ BUG: Invalid wait context ]
>>>>>> [   53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G        W
>>>>>> [   53.654623] -----------------------------
>>>>>> [   53.654785] kworker/46:1/1875 is trying to lock:
>>>>>> [   53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390
>>>>>> [   53.655115] other info that might help us debug this:
>>>>>> [   53.655273] context-{5:5}
>>>>>> [   53.655428] 3 locks held by kworker/46:1/1875:
>>>>>> [   53.655579]  #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
>>>>>> [   53.655739]  #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
>>>>>> [   53.655900]  #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
>>>>>> [   53.656062] stack backtrace:
>>>>>> [   53.656224] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G        W           6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>>> [   53.656227] Tainted: [W]=WARN
>>>>>> [   53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>>> [   53.656232] Call Trace:
>>>>>> [   53.656232]  <TASK>
>>>>>> [   53.656234]  dump_stack_lvl+0x85/0xd0
>>>>>> [   53.656238]  dump_stack+0x14/0x20
>>>>>> [   53.656239]  __lock_acquire+0xaf4/0x2200
>>>>>> [   53.656246]  lock_acquire+0xd8/0x300
>>>>>> [   53.656248]  ? kernfs_add_one+0x34/0x390
>>>>>> [   53.656252]  ? __might_resched+0x208/0x2d0
>>>>>> [   53.656257]  down_write+0x44/0xe0
>>>>>> [   53.656262]  ? kernfs_add_one+0x34/0x390
>>>>>> [   53.656263]  kernfs_add_one+0x34/0x390
>>>>>> [   53.656265]  kernfs_create_dir_ns+0x5a/0xa0
>>>>>> [   53.656268]  sysfs_create_dir_ns+0x74/0xd0
>>>>>> [   53.656270]  kobject_add_internal+0xb1/0x2f0
>>>>>> [   53.656273]  kobject_add+0x7d/0xf0
>>>>>> [   53.656275]  ? get_device_parent+0x28/0x1e0
>>>>>> [   53.656280]  ? __pfx_klist_children_get+0x10/0x10
>>>>>> [   53.656282]  device_add+0x124/0x8b0
>>>>>> [   53.656285]  ? dev_set_name+0x56/0x70
>>>>>> [   53.656287]  platform_device_add+0x102/0x260
>>>>>> [   53.656289]  hmem_register_device+0x160/0x230 [dax_hmem]
>>>>>> [   53.656291]  hmem_fallback_register_device+0x37/0x60
>>>>>> [   53.656294]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>> [   53.656323]  walk_iomem_res_desc+0x55/0xb0
>>>>>> [   53.656326]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>> [   53.656335]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>> [   53.656342]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>> [   53.656343]  ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>> [   53.656346]  process_one_work+0x1fa/0x630
>>>>>> [   53.656350]  worker_thread+0x1b2/0x360
>>>>>> [   53.656352]  kthread+0x128/0x250
>>>>>> [   53.656354]  ? __pfx_worker_thread+0x10/0x10
>>>>>> [   53.656356]  ? __pfx_kthread+0x10/0x10
>>>>>> [   53.656357]  ret_from_fork+0x139/0x1e0
>>>>>> [   53.656360]  ? __pfx_kthread+0x10/0x10
>>>>>> [   53.656361]  ret_from_fork_asm+0x1a/0x30
>>>>>> [   53.656366]  </TASK>
>>>>>> [   53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
>>>>>> [   53.663552]  schedule+0x4a/0x160
>>>>>> [   53.663553]  schedule_timeout+0x10a/0x120
>>>>>> [   53.663555]  ? debug_smp_processor_id+0x1b/0x30
>>>>>> [   53.663556]  ? trace_hardirqs_on+0x5f/0xd0
>>>>>> [   53.663558]  __wait_for_common+0xb9/0x1c0
>>>>>> [   53.663559]  ? __pfx_schedule_timeout+0x10/0x10
>>>>>> [   53.663561]  wait_for_completion+0x28/0x30
>>>>>> [   53.663562]  __synchronize_srcu+0xbf/0x180
>>>>>> [   53.663566]  ? __pfx_wakeme_after_rcu+0x10/0x10
>>>>>> [   53.663571]  ? i2c_repstart+0x30/0x80
>>>>>> [   53.663576]  synchronize_srcu+0x46/0x120
>>>>>> [   53.663577]  kill_dax+0x47/0x70
>>>>>> [   53.663580]  __devm_create_dev_dax+0x112/0x470
>>>>>> [   53.663582]  devm_create_dev_dax+0x26/0x50
>>>>>> [   53.663584]  dax_hmem_probe+0x87/0xd0 [dax_hmem]
>>>>>> [   53.663585]  platform_probe+0x61/0xd0
>>>>>> [   53.663589]  really_probe+0xe2/0x390
>>>>>> [   53.663591]  ? __pfx___device_attach_driver+0x10/0x10
>>>>>> [   53.663593]  __driver_probe_device+0x7e/0x160
>>>>>> [   53.663594]  driver_probe_device+0x23/0xa0
>>>>>> [   53.663596]  __device_attach_driver+0x92/0x120
>>>>>> [   53.663597]  bus_for_each_drv+0x8c/0xf0
>>>>>> [   53.663599]  __device_attach+0xc2/0x1f0
>>>>>> [   53.663601]  device_initial_probe+0x17/0x20
>>>>>> [   53.663603]  bus_probe_device+0xa8/0xb0
>>>>>> [   53.663604]  device_add+0x687/0x8b0
>>>>>> [   53.663607]  ? dev_set_name+0x56/0x70
>>>>>> [   53.663609]  platform_device_add+0x102/0x260
>>>>>> [   53.663610]  hmem_register_device+0x160/0x230 [dax_hmem]
>>>>>> [   53.663612]  hmem_fallback_register_device+0x37/0x60
>>>>>> [   53.663614]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>> [   53.663637]  walk_iomem_res_desc+0x55/0xb0
>>>>>> [   53.663640]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>> [   53.663647]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>> [   53.663654]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>> [   53.663655]  ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>> [   53.663658]  process_one_work+0x1fa/0x630
>>>>>> [   53.663662]  worker_thread+0x1b2/0x360
>>>>>> [   53.663664]  kthread+0x128/0x250
>>>>>> [   53.663666]  ? __pfx_worker_thread+0x10/0x10
>>>>>> [   53.663668]  ? __pfx_kthread+0x10/0x10
>>>>>> [   53.663670]  ret_from_fork+0x139/0x1e0
>>>>>> [   53.663672]  ? __pfx_kthread+0x10/0x10
>>>>>> [   53.663673]  ret_from_fork_asm+0x1a/0x30
>>>>>> [   53.663677]  </TASK>
>>>>>> [   53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
>>>>>> [   53.700264] INFO: lockdep is turned off.
>>>>>> [   53.701315] Preemption disabled at:
>>>>>> [   53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
>>>>>> [   53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G        W           6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>>> [   53.701633] Tainted: [W]=WARN
>>>>>> [   53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>>> [   53.701638] Call Trace:
>>>>>> [   53.701638]  <TASK>
>>>>>> [   53.701640]  dump_stack_lvl+0xa8/0xd0
>>>>>> [   53.701644]  dump_stack+0x14/0x20
>>>>>> [   53.701645]  __schedule_bug+0xa2/0xd0
>>>>>> [   53.701649]  __schedule+0xe6f/0x10d0
>>>>>> [   53.701652]  ? debug_smp_processor_id+0x1b/0x30
>>>>>> [   53.701655]  ? lock_release+0x1e6/0x2b0
>>>>>> [   53.701658]  ? trace_hardirqs_on+0x5f/0xd0
>>>>>> [   53.701661]  schedule+0x4a/0x160
>>>>>> [   53.701662]  schedule_timeout+0x10a/0x120
>>>>>> [   53.701664]  ? debug_smp_processor_id+0x1b/0x30
>>>>>> [   53.701666]  ? trace_hardirqs_on+0x5f/0xd0
>>>>>> [   53.701667]  __wait_for_common+0xb9/0x1c0
>>>>>> [   53.701668]  ? __pfx_schedule_timeout+0x10/0x10
>>>>>> [   53.701670]  wait_for_completion+0x28/0x30
>>>>>> [   53.701671]  __synchronize_srcu+0xbf/0x180
>>>>>> [   53.701677]  ? __pfx_wakeme_after_rcu+0x10/0x10
>>>>>> [   53.701682]  ? i2c_repstart+0x30/0x80
>>>>>> [   53.701685]  synchronize_srcu+0x46/0x120
>>>>>> [   53.701687]  kill_dax+0x47/0x70
>>>>>> [   53.701689]  __devm_create_dev_dax+0x112/0x470
>>>>>> [   53.701691]  devm_create_dev_dax+0x26/0x50
>>>>>> [   53.701693]  dax_hmem_probe+0x87/0xd0 [dax_hmem]
>>>>>> [   53.701695]  platform_probe+0x61/0xd0
>>>>>> [   53.701698]  really_probe+0xe2/0x390
>>>>>> [   53.701700]  ? __pfx___device_attach_driver+0x10/0x10
>>>>>> [   53.701701]  __driver_probe_device+0x7e/0x160
>>>>>> [   53.701703]  driver_probe_device+0x23/0xa0
>>>>>> [   53.701704]  __device_attach_driver+0x92/0x120
>>>>>> [   53.701706]  bus_for_each_drv+0x8c/0xf0
>>>>>> [   53.701708]  __device_attach+0xc2/0x1f0
>>>>>> [   53.701710]  device_initial_probe+0x17/0x20
>>>>>> [   53.701711]  bus_probe_device+0xa8/0xb0
>>>>>> [   53.701712]  device_add+0x687/0x8b0
>>>>>> [   53.701715]  ? dev_set_name+0x56/0x70
>>>>>> [   53.701717]  platform_device_add+0x102/0x260
>>>>>> [   53.701718]  hmem_register_device+0x160/0x230 [dax_hmem]
>>>>>> [   53.701720]  hmem_fallback_register_device+0x37/0x60
>>>>>> [   53.701722]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>> [   53.701734]  walk_iomem_res_desc+0x55/0xb0
>>>>>> [   53.701738]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>> [   53.701745]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>> [   53.701751]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>> [   53.701752]  ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>> [   53.701756]  process_one_work+0x1fa/0x630
>>>>>> [   53.701760]  worker_thread+0x1b2/0x360
>>>>>> [   53.701762]  kthread+0x128/0x250
>>>>>> [   53.701765]  ? __pfx_worker_thread+0x10/0x10
>>>>>> [   53.701766]  ? __pfx_kthread+0x10/0x10
>>>>>> [   53.701768]  ret_from_fork+0x139/0x1e0
>>>>>> [   53.701771]  ? __pfx_kthread+0x10/0x10
>>>>>> [   53.701772]  ret_from_fork_asm+0x1a/0x30
>>>>>> [   53.701777]  </TASK>
>>>>>>
>>>>>
>>>
>

Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling

Posted by Koralahalli Channabasappa, Smita 2 months, 3 weeks ago

On 7/17/2025 12:06 PM, Dave Jiang wrote:
> 
> 
> On 7/17/25 10:58 AM, Koralahalli Channabasappa, Smita wrote:
>>
>>
>> On 7/16/2025 4:48 PM, Alison Schofield wrote:
>>> On Wed, Jul 16, 2025 at 02:29:52PM -0700, Koralahalli Channabasappa, Smita wrote:
>>>> On 7/16/2025 1:20 PM, Alison Schofield wrote:
>>>>> On Tue, Jul 15, 2025 at 11:01:23PM -0700, Koralahalli Channabasappa, Smita wrote:
>>>>>> Hi Alison,
>>>>>>
>>>>>> On 7/15/2025 2:07 PM, Alison Schofield wrote:
>>>>>>> On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
>>>>>>>> This series introduces the ability to manage SOFT RESERVED iomem
>>>>>>>> resources, enabling the CXL driver to remove any portions that
>>>>>>>> intersect with created CXL regions.
>>>>>>>
>>>>>>> Hi Smita,
>>>>>>>
>>>>>>> This set applied cleanly to todays cxl-next but fails like appended
>>>>>>> before region probe.
>>>>>>>
>>>>>>> BTW - there were sparse warnings in the build that look related:
>>>>>>>       CHECK   drivers/dax/hmem/hmem_notify.c
>>>>>>> drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
>>>>>>> drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit
>>>>>>
>>>>>> Thanks for pointing out this bug. I failed to release the spinlock before
>>>>>> calling hmem_register_device(), which internally calls platform_device_add()
>>>>>> and can sleep. The following fix addresses that bug. I’ll incorporate this
>>>>>> into v6:
>>>>>>
>>>>>> diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
>>>>>> index 6c276c5bd51d..8f411f3fe7bd 100644
>>>>>> --- a/drivers/dax/hmem/hmem_notify.c
>>>>>> +++ b/drivers/dax/hmem/hmem_notify.c
>>>>>> @@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const struct resource *res)
>>>>>>     {
>>>>>>            walk_hmem_fn hmem_fn;
>>>>>>
>>>>>> -       guard(spinlock)(&hmem_notify_lock);
>>>>>> +       spin_lock(&hmem_notify_lock);
>>>>>>            hmem_fn = hmem_fallback_fn;
>>>>>> +       spin_unlock(&hmem_notify_lock);
>>>>>>
>>>>>>            if (hmem_fn)
>>>>>>                    hmem_fn(target_nid, res);
>>>>>> -- 
>>>>>
>>>>> Hi Smita,  Adding the above got me past that, and doubling the timeout
>>>>> below stopped that from happening. After that, I haven't had time to
>>>>> trace so, I'll just dump on you for now:
>>>>>
>>>>> In /proc/iomem
>>>>> Here, we see a regions resource, no CXL Window, and no dax, and no
>>>>> actual region, not even disabled, is available.
>>>>> c080000000-c47fffffff : region0
>>>>>
>>>>> And, here no CXL Window, no region, and a soft reserved.
>>>>> 68e80000000-70e7fffffff : Soft Reserved
>>>>>      68e80000000-70e7fffffff : dax1.0
>>>>>        68e80000000-70e7fffffff : System RAM (kmem)
>>>>>
>>>>> I haven't yet walked through the v4 to v5 changes so I'll do that next.
>>>>
>>>> Hi Alison,
>>>>
>>>> To help better understand the current behavior, could you share more about
>>>> your platform configuration? specifically, are there two memory cards
>>>> involved? One at c080000000 (which appears as region0) and another at
>>>> 68e80000000 (which is falling back to kmem via dax1.0)? Additionally, how
>>>> are the Soft Reserved ranges laid out on your system for these cards? I'm
>>>> trying to understand the "before" state of the resources i.e, prior to
>>>> trimming applied by my patches.
>>>
>>> Here are the soft reserveds -
>>> [] BIOS-e820: [mem 0x000000c080000000-0x000000c47fffffff] soft reserved
>>> [] BIOS-e820: [mem 0x0000068e80000000-0x0000070e7fffffff] soft reserved
>>>
>>> And this is what we expect -
>>>
>>> c080000000-17dbfffffff : CXL Window 0
>>>     c080000000-c47fffffff : region2
>>>       c080000000-c47fffffff : dax0.0
>>>         c080000000-c47fffffff : System RAM (kmem)
>>>
>>>
>>> 68e80000000-8d37fffffff : CXL Window 1
>>>     68e80000000-70e7fffffff : region5
>>>       68e80000000-70e7fffffff : dax1.0
>>>         68e80000000-70e7fffffff : System RAM (kmem)
>>>
>>> And, like in prev message, in v5 we get -
>>>
>>> c080000000-c47fffffff : region0
>>>
>>> 68e80000000-70e7fffffff : Soft Reserved
>>>     68e80000000-70e7fffffff : dax1.0
>>>       68e80000000-70e7fffffff : System RAM (kmem)
>>>
>>>
>>> In v4, we 'almost' had what we expect, except that the HMEM driver
>>> created those dax devices out of Soft Reserveds before the region driver
>>> could do the same.
>>>
>>
>> Yeah, the only part I’m uncertain about in v5 is scheduling the fallback work from the failure path of cxl_acpi_probe(). That doesn’t feel like the right place to do it, and I suspect it might be contributing to the unexpected behavior.
>>
>> v4 had most of the necessary pieces in place, but it didn’t handle situations well when the driver load order didn’t go as expected.
>>
>> Even if we modify v4 to avoid triggering hmem_register_device() directly from cxl_acpi_probe() which helps avoid unresolved symbol errors when cxl_acpi_probe() loads too early, and instead only rely on dax_hmem to pick up Soft Reserved regions after cxl_acpi creates regions, we still run into timing issues..
>>
>> Specifically, there's no guarantee that hmem_register_device() will correctly skip the following check if the region state isn't fully ready, even with MODULE_SOFTDEP("pre: cxl_acpi") or using late_initcall() (which I tried):
>>
>> if (IS_ENABLED(CONFIG_CXL_REGION) &&
>>          region_intersects(res->start, resource_size(res), IORESOURCE_MEM, IORES_DESC_CXL) != REGION_DISJOINT) {..
>>
>> At this point, I’m running out of ideas on how to reliably coordinate this.. :(
>>
>> Thanks
>> Smita
>>
>>>>
>>>> Also, do you think it's feasible to change the direction of the soft reserve
>>>> trimming, that is, defer it until after CXL region or memdev creation is
>>>> complete? In this case it would be trimmed after but inline the existing
>>>> region or memdev creation. This might simplify the flow by removing the need
>>>> for wait_event_timeout(), wait_for_device_probe() and the workqueue logic
>>>> inside cxl_acpi_probe().
>>>
>>> Yes that aligns with my simple thinking. There's the trimming after a region
>>> is successfully created, and it seems that could simply be called at the end
>>> of *that* region creation.
>>>
>>> Then, there's the round up of all the unused Soft Reserveds, and that has
>>> to wait until after all regions are created, ie. all endpoints have arrived
>>> and we've given up all hope of creating another region in that space.
>>> That's the timing challenge.
>>>
>>> -- Alison
>>>
>>>>
>>>> (As a side note I experimented changing cxl_acpi_init() to a late_initcall()
>>>> and observed that it consistently avoided probe ordering issues in my setup.
>>>>
>>>> Additional note: I realized that even when cxl_acpi_probe() fails, the
>>>> fallback DAX registration path (via cxl_softreserv_mem_update()) still waits
>>>> on cxl_mem_active() and wait_for_device_probe(). I plan to address this in
>>>> v6 by immediately triggering fallback DAX registration
>>>> (hmem_register_device()) when the ACPI probe fails, instead of waiting.)
>>>>
>>>> Thanks
>>>> Smita
>>>>
>>>>>
>>>>>>
>>>>>> As for the log:
>>>>>> [   53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for
>>>>>> cxl_mem probing
>>>>>>
>>>>>> I’m still analyzing that. Here's what was my thought process so far.
>>>>>>
>>>>>> - This occurs when cxl_acpi_probe() runs significantly earlier than
>>>>>> cxl_mem_probe(), so CXL region creation (which happens in
>>>>>> cxl_port_endpoint_probe()) may or may not have completed by the time
>>>>>> trimming is attempted.
>>>>>>
>>>>>> - Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does
>>>>>> guarantee load order when all components are built as modules. So even if
>>>>>> the timeout occurs and cxl_mem_probe() hasn’t run within the wait window,
>>>>>> MODULE_SOFTDEP ensures that cxl_port is loaded before both cxl_acpi and
>>>>>> cxl_mem in modular configurations. As a result, region creation is
>>>>>> eventually guaranteed, and wait_for_device_probe() will succeed once the
>>>>>> relevant probes complete.
>>>>>>
>>>>>> - However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no
>>>>>> guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish
>>>>>> before cxl_port_probe() even begins, which can cause wait_for_device_probe()
>>>>>> to return prematurely and trigger the timeout.
>>>>>>
>>>>>> - In my local setup, I observed that a 30-second timeout was generally
>>>>>> sufficient to catch this race, allowing cxl_port_probe() to load while
>>>>>> cxl_acpi_probe() is still active. Since we cannot mix built-in and modular
>>>>>> components (i.e., have cxl_acpi=y and cxl_port=m), the timeout serves as a
>>>>>> best-effort mechanism. After the timeout, wait_for_device_probe() ensures
>>>>>> cxl_port_probe() has completed before trimming proceeds, making the logic
>>>>>> good enough to most boot-time races.
>>>>>>
>>>>>> One possible improvement I’m considering is to schedule a
>>>>>> delayed_workqueue() from cxl_acpi_probe(). This deferred work could wait
>>>>>> slightly longer for cxl_mem_probe() to complete (which itself softdeps on
>>>>>> cxl_port) before initiating the soft reserve trimming.
>>>>>>
>>>>>> That said, I'm still evaluating better options to more robustly coordinate
>>>>>> probe ordering between cxl_acpi, cxl_port, cxl_mem and cxl_region and
>>>>>> looking for suggestions here.
> 
> Hi Smita,
> Reading this thread and thinking about what can be done to deal with this. Throwing out some ideas to see what you think. My idea is to create two global counters that are protected by a lock. You have a delayed work item that checks these counters. If counter1 is 0, go back to sleep and check again later, polling at a reasonable interval. Every time a memdev endpoint starts probe, increment counter1 and counter2 atomically. Every time the probe is successful, decrement counter2. When you reach the condition of 'if (counter1 && counter2 == 0)' I think you can start soft reserve discovery.
> 
> A different idea came from Dan. Arm a timer on the first memdev probe. Kick the timer to extend it every time a new memdev gets probed. At some point things settle and the timer goes off to trigger soft reserved discovery.
> 
> I think either one will not require special ordering of the modules being loaded.
> 
> DJ

I think we might need both the counters and a settling timer to
coordinate Soft Reserved trimming and DAX registration.

Here's the rough flow I'm thinking of. Let me know the flaws in this 
approach.

1. cxl_acpi_probe() schedules cxl_softreserv_work_fn() and exits early.
This work item is responsible for trimming leftover Soft Reserved memory 
ranges once all cxl_mem devices have finished probing.

2. A delayed work is initialized for the settle timer:

INIT_DELAYED_WORK(&cxl_probe_settle_work, cxl_probe_settle_fn);

3. In cxl_mem_probe():
      - Increment counter2 (memdevs in progress).
      - Increment counter1 (memdevs discovered).
      - On probe completion (success or failure), decrement counter2.
      - After each probe, re-arm the settle timer to extend the quiet
        period if more devices arrive (this might fail; I'm not sure what
        happens if cxl_mem devices come in too late):
        mod_delayed_work(system_wq, &cxl_probe_settle_work, 30 * HZ);
      - Call wake_up(&cxl_softreserv_waitq); after each probe to notify
        listeners.

4. The settle timer callback (cxl_probe_settle_fn()) runs when no new
devices have probed for a while (30s):
      atomic_set(&timer_expired, 1);
      wake_up(&cxl_softreserv_waitq);

5. In cxl_softreserv_work_fn()
     wait_event(cxl_softreserv_waitq,
     atomic_read(&cxl_mem_counter1) > 0 &&
     atomic_read(&cxl_mem_counter2) == 0 &&
     atomic_read(&timer_expired));

6. Once unblocked, cxl_softreserv_work_fn() trims Soft Reserved regions 
via cxl_region_softreserv_update().
(We do not perform any DAX fallback here, as we don't want to end up with
unresolved symbols when DAX_HMEM loads too late.)

7. Separately, dax_hmem_platform_probe() runs independently on module 
load, but also blocks on the same wait_event() condition if 
CONFIG_CXL_ACPI is enabled. Once the condition is satisfied, it invokes 
hmem_register_device() to register leftover Soft Reserved memory.
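
The wait condition from steps 3 to 5 can be sketched as a small userspace C model. Everything here is an illustrative assumption rather than the planned kernel code: the `sr_gate` type and the `sr_*` helpers are hypothetical names, C11 atomics stand in for the kernel's atomic_t, and clearing the settled flag at the end of each probe is my reading of "re-arm the settle timer", which the flow above does not spell out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the gate that both waiters would block on. */
struct sr_gate {
	atomic_int discovered; /* "counter1": memdevs that started probing */
	atomic_int in_flight;  /* "counter2": probes currently running */
	atomic_bool settled;   /* set when the settle timer expires */
};

static void sr_gate_init(struct sr_gate *g)
{
	atomic_init(&g->discovered, 0);
	atomic_init(&g->in_flight, 0);
	atomic_init(&g->settled, false);
}

/* Entry of a memdev probe (step 3, first two bullets). */
static void sr_probe_begin(struct sr_gate *g)
{
	atomic_fetch_add(&g->in_flight, 1);
	atomic_fetch_add(&g->discovered, 1);
}

/* Exit of a memdev probe, success or failure. */
static void sr_probe_end(struct sr_gate *g)
{
	int prev = atomic_fetch_sub(&g->in_flight, 1);

	assert(prev > 0);
	(void)prev;
	/* Assumption: re-arming the timer restarts the quiet period. */
	atomic_store(&g->settled, false);
}

/* Settle timer callback (step 4). */
static void sr_settle_expired(struct sr_gate *g)
{
	atomic_store(&g->settled, true);
}

/* Condition evaluated by the waiters (step 5). */
static bool sr_gate_open(struct sr_gate *g)
{
	return atomic_load(&g->discovered) > 0 &&
	       atomic_load(&g->in_flight) == 0 &&
	       atomic_load(&g->settled);
}
```

Two things stand out in this model. The three loads in `sr_gate_open()` are not a single snapshot, so a probe could start between them. And the gate never opens if no cxl_mem device ever probes, so the waiters in steps 5 and 7 would presumably need a fallback for that case.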

Thanks
Smita

> 
>>>>>>
>>>>>> Thanks
>>>>>> Smita
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> This isn't all the logs, I trimmed. Let me know if you need more or
>>>>>>> other info to reproduce.
>>>>>>>
>>>>>>> [   53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing
>>>>>>> [   53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
>>>>>>> [   53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1
>>>>>>> [   53.653540] preempt_count: 1, expected: 0
>>>>>>> [   53.653554] RCU nest depth: 0, expected: 0
>>>>>>> [   53.653568] 3 locks held by kworker/46:1/1875:
>>>>>>> [   53.653569]  #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
>>>>>>> [   53.653583]  #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
>>>>>>> [   53.653589]  #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
>>>>>>> [   53.653598] Preemption disabled at:
>>>>>>> [   53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
>>>>>>> [   53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>>>> [   53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>>>> [   53.653648] Call Trace:
>>>>>>> [   53.653649]  <TASK>
>>>>>>> [   53.653652]  dump_stack_lvl+0xa8/0xd0
>>>>>>> [   53.653658]  dump_stack+0x14/0x20
>>>>>>> [   53.653659]  __might_resched+0x1ae/0x2d0
>>>>>>> [   53.653666]  __might_sleep+0x48/0x70
>>>>>>> [   53.653668]  __kmalloc_node_track_caller_noprof+0x349/0x510
>>>>>>> [   53.653674]  ? __devm_add_action+0x3d/0x160
>>>>>>> [   53.653685]  ? __pfx_devm_action_release+0x10/0x10
>>>>>>> [   53.653688]  __devres_alloc_node+0x4a/0x90
>>>>>>> [   53.653689]  ? __devres_alloc_node+0x4a/0x90
>>>>>>> [   53.653691]  ? __pfx_release_memregion+0x10/0x10 [dax_hmem]
>>>>>>> [   53.653693]  __devm_add_action+0x3d/0x160
>>>>>>> [   53.653696]  hmem_register_device+0xea/0x230 [dax_hmem]
>>>>>>> [   53.653700]  hmem_fallback_register_device+0x37/0x60
>>>>>>> [   53.653703]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>>> [   53.653739]  walk_iomem_res_desc+0x55/0xb0
>>>>>>> [   53.653744]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>>> [   53.653755]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>>> [   53.653761]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>>> [   53.653763]  ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>>> [   53.653768]  process_one_work+0x1fa/0x630
>>>>>>> [   53.653774]  worker_thread+0x1b2/0x360
>>>>>>> [   53.653777]  kthread+0x128/0x250
>>>>>>> [   53.653781]  ? __pfx_worker_thread+0x10/0x10
>>>>>>> [   53.653784]  ? __pfx_kthread+0x10/0x10
>>>>>>> [   53.653786]  ret_from_fork+0x139/0x1e0
>>>>>>> [   53.653790]  ? __pfx_kthread+0x10/0x10
>>>>>>> [   53.653792]  ret_from_fork_asm+0x1a/0x30
>>>>>>> [   53.653801]  </TASK>
>>>>>>>
>>>>>>> [   53.654193] =============================
>>>>>>> [   53.654203] [ BUG: Invalid wait context ]
>>>>>>> [   53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G        W
>>>>>>> [   53.654623] -----------------------------
>>>>>>> [   53.654785] kworker/46:1/1875 is trying to lock:
>>>>>>> [   53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390
>>>>>>> [   53.655115] other info that might help us debug this:
>>>>>>> [   53.655273] context-{5:5}
>>>>>>> [   53.655428] 3 locks held by kworker/46:1/1875:
>>>>>>> [   53.655579]  #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
>>>>>>> [   53.655739]  #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
>>>>>>> [   53.655900]  #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
>>>>>>> [   53.656062] stack backtrace:
>>>>>>> [   53.656224] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G        W           6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>>>> [   53.656227] Tainted: [W]=WARN
>>>>>>> [   53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>>>> [   53.656232] Call Trace:
>>>>>>> [   53.656232]  <TASK>
>>>>>>> [   53.656234]  dump_stack_lvl+0x85/0xd0
>>>>>>> [   53.656238]  dump_stack+0x14/0x20
>>>>>>> [   53.656239]  __lock_acquire+0xaf4/0x2200
>>>>>>> [   53.656246]  lock_acquire+0xd8/0x300
>>>>>>> [   53.656248]  ? kernfs_add_one+0x34/0x390
>>>>>>> [   53.656252]  ? __might_resched+0x208/0x2d0
>>>>>>> [   53.656257]  down_write+0x44/0xe0
>>>>>>> [   53.656262]  ? kernfs_add_one+0x34/0x390
>>>>>>> [   53.656263]  kernfs_add_one+0x34/0x390
>>>>>>> [   53.656265]  kernfs_create_dir_ns+0x5a/0xa0
>>>>>>> [   53.656268]  sysfs_create_dir_ns+0x74/0xd0
>>>>>>> [   53.656270]  kobject_add_internal+0xb1/0x2f0
>>>>>>> [   53.656273]  kobject_add+0x7d/0xf0
>>>>>>> [   53.656275]  ? get_device_parent+0x28/0x1e0
>>>>>>> [   53.656280]  ? __pfx_klist_children_get+0x10/0x10
>>>>>>> [   53.656282]  device_add+0x124/0x8b0
>>>>>>> [   53.656285]  ? dev_set_name+0x56/0x70
>>>>>>> [   53.656287]  platform_device_add+0x102/0x260
>>>>>>> [   53.656289]  hmem_register_device+0x160/0x230 [dax_hmem]
>>>>>>> [   53.656291]  hmem_fallback_register_device+0x37/0x60
>>>>>>> [   53.656294]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>>> [   53.656323]  walk_iomem_res_desc+0x55/0xb0
>>>>>>> [   53.656326]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>>> [   53.656335]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>>> [   53.656342]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>>> [   53.656343]  ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>>> [   53.656346]  process_one_work+0x1fa/0x630
>>>>>>> [   53.656350]  worker_thread+0x1b2/0x360
>>>>>>> [   53.656352]  kthread+0x128/0x250
>>>>>>> [   53.656354]  ? __pfx_worker_thread+0x10/0x10
>>>>>>> [   53.656356]  ? __pfx_kthread+0x10/0x10
>>>>>>> [   53.656357]  ret_from_fork+0x139/0x1e0
>>>>>>> [   53.656360]  ? __pfx_kthread+0x10/0x10
>>>>>>> [   53.656361]  ret_from_fork_asm+0x1a/0x30
>>>>>>> [   53.656366]  </TASK>
>>>>>>> [   53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
>>>>>>> [   53.663552]  schedule+0x4a/0x160
>>>>>>> [   53.663553]  schedule_timeout+0x10a/0x120
>>>>>>> [   53.663555]  ? debug_smp_processor_id+0x1b/0x30
>>>>>>> [   53.663556]  ? trace_hardirqs_on+0x5f/0xd0
>>>>>>> [   53.663558]  __wait_for_common+0xb9/0x1c0
>>>>>>> [   53.663559]  ? __pfx_schedule_timeout+0x10/0x10
>>>>>>> [   53.663561]  wait_for_completion+0x28/0x30
>>>>>>> [   53.663562]  __synchronize_srcu+0xbf/0x180
>>>>>>> [   53.663566]  ? __pfx_wakeme_after_rcu+0x10/0x10
>>>>>>> [   53.663571]  ? i2c_repstart+0x30/0x80
>>>>>>> [   53.663576]  synchronize_srcu+0x46/0x120
>>>>>>> [   53.663577]  kill_dax+0x47/0x70
>>>>>>> [   53.663580]  __devm_create_dev_dax+0x112/0x470
>>>>>>> [   53.663582]  devm_create_dev_dax+0x26/0x50
>>>>>>> [   53.663584]  dax_hmem_probe+0x87/0xd0 [dax_hmem]
>>>>>>> [   53.663585]  platform_probe+0x61/0xd0
>>>>>>> [   53.663589]  really_probe+0xe2/0x390
>>>>>>> [   53.663591]  ? __pfx___device_attach_driver+0x10/0x10
>>>>>>> [   53.663593]  __driver_probe_device+0x7e/0x160
>>>>>>> [   53.663594]  driver_probe_device+0x23/0xa0
>>>>>>> [   53.663596]  __device_attach_driver+0x92/0x120
>>>>>>> [   53.663597]  bus_for_each_drv+0x8c/0xf0
>>>>>>> [   53.663599]  __device_attach+0xc2/0x1f0
>>>>>>> [   53.663601]  device_initial_probe+0x17/0x20
>>>>>>> [   53.663603]  bus_probe_device+0xa8/0xb0
>>>>>>> [   53.663604]  device_add+0x687/0x8b0
>>>>>>> [   53.663607]  ? dev_set_name+0x56/0x70
>>>>>>> [   53.663609]  platform_device_add+0x102/0x260
>>>>>>> [   53.663610]  hmem_register_device+0x160/0x230 [dax_hmem]
>>>>>>> [   53.663612]  hmem_fallback_register_device+0x37/0x60
>>>>>>> [   53.663614]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>>> [   53.663637]  walk_iomem_res_desc+0x55/0xb0
>>>>>>> [   53.663640]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>>> [   53.663647]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>>> [   53.663654]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>>> [   53.663655]  ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>>> [   53.663658]  process_one_work+0x1fa/0x630
>>>>>>> [   53.663662]  worker_thread+0x1b2/0x360
>>>>>>> [   53.663664]  kthread+0x128/0x250
>>>>>>> [   53.663666]  ? __pfx_worker_thread+0x10/0x10
>>>>>>> [   53.663668]  ? __pfx_kthread+0x10/0x10
>>>>>>> [   53.663670]  ret_from_fork+0x139/0x1e0
>>>>>>> [   53.663672]  ? __pfx_kthread+0x10/0x10
>>>>>>> [   53.663673]  ret_from_fork_asm+0x1a/0x30
>>>>>>> [   53.663677]  </TASK>
>>>>>>> [   53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
>>>>>>> [   53.700264] INFO: lockdep is turned off.
>>>>>>> [   53.701315] Preemption disabled at:
>>>>>>> [   53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
>>>>>>> [   53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G        W           6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>>>> [   53.701633] Tainted: [W]=WARN
>>>>>>> [   53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>>>> [   53.701638] Call Trace:
>>>>>>> [   53.701638]  <TASK>
>>>>>>> [   53.701640]  dump_stack_lvl+0xa8/0xd0
>>>>>>> [   53.701644]  dump_stack+0x14/0x20
>>>>>>> [   53.701645]  __schedule_bug+0xa2/0xd0
>>>>>>> [   53.701649]  __schedule+0xe6f/0x10d0
>>>>>>> [   53.701652]  ? debug_smp_processor_id+0x1b/0x30
>>>>>>> [   53.701655]  ? lock_release+0x1e6/0x2b0
>>>>>>> [   53.701658]  ? trace_hardirqs_on+0x5f/0xd0
>>>>>>> [   53.701661]  schedule+0x4a/0x160
>>>>>>> [   53.701662]  schedule_timeout+0x10a/0x120
>>>>>>> [   53.701664]  ? debug_smp_processor_id+0x1b/0x30
>>>>>>> [   53.701666]  ? trace_hardirqs_on+0x5f/0xd0
>>>>>>> [   53.701667]  __wait_for_common+0xb9/0x1c0
>>>>>>> [   53.701668]  ? __pfx_schedule_timeout+0x10/0x10
>>>>>>> [   53.701670]  wait_for_completion+0x28/0x30
>>>>>>> [   53.701671]  __synchronize_srcu+0xbf/0x180
>>>>>>> [   53.701677]  ? __pfx_wakeme_after_rcu+0x10/0x10
>>>>>>> [   53.701682]  ? i2c_repstart+0x30/0x80
>>>>>>> [   53.701685]  synchronize_srcu+0x46/0x120
>>>>>>> [   53.701687]  kill_dax+0x47/0x70
>>>>>>> [   53.701689]  __devm_create_dev_dax+0x112/0x470
>>>>>>> [   53.701691]  devm_create_dev_dax+0x26/0x50
>>>>>>> [   53.701693]  dax_hmem_probe+0x87/0xd0 [dax_hmem]
>>>>>>> [   53.701695]  platform_probe+0x61/0xd0
>>>>>>> [   53.701698]  really_probe+0xe2/0x390
>>>>>>> [   53.701700]  ? __pfx___device_attach_driver+0x10/0x10
>>>>>>> [   53.701701]  __driver_probe_device+0x7e/0x160
>>>>>>> [   53.701703]  driver_probe_device+0x23/0xa0
>>>>>>> [   53.701704]  __device_attach_driver+0x92/0x120
>>>>>>> [   53.701706]  bus_for_each_drv+0x8c/0xf0
>>>>>>> [   53.701708]  __device_attach+0xc2/0x1f0
>>>>>>> [   53.701710]  device_initial_probe+0x17/0x20
>>>>>>> [   53.701711]  bus_probe_device+0xa8/0xb0
>>>>>>> [   53.701712]  device_add+0x687/0x8b0
>>>>>>> [   53.701715]  ? dev_set_name+0x56/0x70
>>>>>>> [   53.701717]  platform_device_add+0x102/0x260
>>>>>>> [   53.701718]  hmem_register_device+0x160/0x230 [dax_hmem]
>>>>>>> [   53.701720]  hmem_fallback_register_device+0x37/0x60
>>>>>>> [   53.701722]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>>> [   53.701734]  walk_iomem_res_desc+0x55/0xb0
>>>>>>> [   53.701738]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>>> [   53.701745]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>>> [   53.701751]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>>> [   53.701752]  ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>>> [   53.701756]  process_one_work+0x1fa/0x630
>>>>>>> [   53.701760]  worker_thread+0x1b2/0x360
>>>>>>> [   53.701762]  kthread+0x128/0x250
>>>>>>> [   53.701765]  ? __pfx_worker_thread+0x10/0x10
>>>>>>> [   53.701766]  ? __pfx_kthread+0x10/0x10
>>>>>>> [   53.701768]  ret_from_fork+0x139/0x1e0
>>>>>>> [   53.701771]  ? __pfx_kthread+0x10/0x10
>>>>>>> [   53.701772]  ret_from_fork_asm+0x1a/0x30
>>>>>>> [   53.701777]  </TASK>
>>>>>>>
>>>>>>
>>>>
>>
>

Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling

Posted by Dave Jiang 2 months, 3 weeks ago


On 7/17/25 4:20 PM, Koralahalli Channabasappa, Smita wrote:
> On 7/17/2025 12:06 PM, Dave Jiang wrote:
>>
>>
>> On 7/17/25 10:58 AM, Koralahalli Channabasappa, Smita wrote:
>>>
>>>
>>> On 7/16/2025 4:48 PM, Alison Schofield wrote:
>>>> On Wed, Jul 16, 2025 at 02:29:52PM -0700, Koralahalli Channabasappa, Smita wrote:
>>>>> On 7/16/2025 1:20 PM, Alison Schofield wrote:
>>>>>> On Tue, Jul 15, 2025 at 11:01:23PM -0700, Koralahalli Channabasappa, Smita wrote:
>>>>>>> Hi Alison,
>>>>>>>
>>>>>>> On 7/15/2025 2:07 PM, Alison Schofield wrote:
>>>>>>>> On Tue, Jul 15, 2025 at 06:04:00PM +0000, Smita Koralahalli wrote:
>>>>>>>>> This series introduces the ability to manage SOFT RESERVED iomem
>>>>>>>>> resources, enabling the CXL driver to remove any portions that
>>>>>>>>> intersect with created CXL regions.
>>>>>>>>
>>>>>>>> Hi Smita,
>>>>>>>>
>>>>>>>> This set applied cleanly to today's cxl-next but fails, as appended
>>>>>>>> below, before region probe.
>>>>>>>>
>>>>>>>> BTW - there were sparse warnings in the build that look related:
>>>>>>>>       CHECK   drivers/dax/hmem/hmem_notify.c
>>>>>>>> drivers/dax/hmem/hmem_notify.c:10:6: warning: context imbalance in 'hmem_register_fallback_handler' - wrong count at exit
>>>>>>>> drivers/dax/hmem/hmem_notify.c:24:9: warning: context imbalance in 'hmem_fallback_register_device' - wrong count at exit
>>>>>>>
>>>>>>> Thanks for pointing out this bug. I failed to release the spinlock before
>>>>>>> calling hmem_register_device(), which internally calls platform_device_add()
>>>>>>> and can sleep. The following fix addresses that bug. I’ll incorporate this
>>>>>>> into v6:
>>>>>>>
>>>>>>> diff --git a/drivers/dax/hmem/hmem_notify.c b/drivers/dax/hmem/hmem_notify.c
>>>>>>> index 6c276c5bd51d..8f411f3fe7bd 100644
>>>>>>> --- a/drivers/dax/hmem/hmem_notify.c
>>>>>>> +++ b/drivers/dax/hmem/hmem_notify.c
>>>>>>> @@ -18,8 +18,9 @@ void hmem_fallback_register_device(int target_nid, const
>>>>>>> struct resource *res)
>>>>>>>     {
>>>>>>>            walk_hmem_fn hmem_fn;
>>>>>>>
>>>>>>> -       guard(spinlock)(&hmem_notify_lock);
>>>>>>> +       spin_lock(&hmem_notify_lock);
>>>>>>>            hmem_fn = hmem_fallback_fn;
>>>>>>> +       spin_unlock(&hmem_notify_lock);
>>>>>>>
>>>>>>>            if (hmem_fn)
>>>>>>>                    hmem_fn(target_nid, res);
>>>>>>> -- 
>>>>>>
>>>>>> Hi Smita,  Adding the above got me past that, and doubling the timeout
>>>>>> below stopped that from happening. After that, I haven't had time to
>>>>>> trace so, I'll just dump on you for now:
>>>>>>
>>>>>> In /proc/iomem
>>>>>> Here, we see a region resource, no CXL Window, no dax, and no
>>>>>> actual region, not even disabled, is available.
>>>>>> c080000000-c47fffffff : region0
>>>>>>
>>>>>> And, here no CXL Window, no region, and a soft reserved.
>>>>>> 68e80000000-70e7fffffff : Soft Reserved
>>>>>>      68e80000000-70e7fffffff : dax1.0
>>>>>>        68e80000000-70e7fffffff : System RAM (kmem)
>>>>>>
>>>>>> I haven't yet walked through the v4 to v5 changes so I'll do that next.
>>>>>
>>>>> Hi Alison,
>>>>>
>>>>> To help better understand the current behavior, could you share more about
>>>>> your platform configuration? Specifically, are there two memory cards
>>>>> involved? One at c080000000 (which appears as region0) and another at
>>>>> 68e80000000 (which is falling back to kmem via dax1.0)? Additionally, how
>>>>> are the Soft Reserved ranges laid out on your system for these cards? I'm
>>>>> trying to understand the "before" state of the resources i.e, prior to
>>>>> trimming applied by my patches.
>>>>
>>>> Here are the soft reserveds -
>>>> [] BIOS-e820: [mem 0x000000c080000000-0x000000c47fffffff] soft reserved
>>>> [] BIOS-e820: [mem 0x0000068e80000000-0x0000070e7fffffff] soft reserved
>>>>
>>>> And this is what we expect -
>>>>
>>>> c080000000-17dbfffffff : CXL Window 0
>>>>     c080000000-c47fffffff : region2
>>>>       c080000000-c47fffffff : dax0.0
>>>>         c080000000-c47fffffff : System RAM (kmem)
>>>>
>>>>
>>>> 68e80000000-8d37fffffff : CXL Window 1
>>>>     68e80000000-70e7fffffff : region5
>>>>       68e80000000-70e7fffffff : dax1.0
>>>>         68e80000000-70e7fffffff : System RAM (kmem)
>>>>
>>>> And, like in the previous message, in v5 we get -
>>>>
>>>> c080000000-c47fffffff : region0
>>>>
>>>> 68e80000000-70e7fffffff : Soft Reserved
>>>>     68e80000000-70e7fffffff : dax1.0
>>>>       68e80000000-70e7fffffff : System RAM (kmem)
>>>>
>>>>
>>>> In v4, we 'almost' had what we expect, except that the HMEM driver
>>>> created those dax devices out of Soft Reserveds before the region driver
>>>> could do the same.
>>>>
>>>
>>> Yeah, the only part I’m uncertain about in v5 is scheduling the fallback work from the failure path of cxl_acpi_probe(). That doesn’t feel like the right place to do it, and I suspect it might be contributing to the unexpected behavior.
>>>
>>> v4 had most of the necessary pieces in place, but it didn’t handle situations well when the driver load order didn’t go as expected.
>>>
>>> Even if we modify v4 to avoid triggering hmem_register_device() directly from cxl_acpi_probe() which helps avoid unresolved symbol errors when cxl_acpi_probe() loads too early, and instead only rely on dax_hmem to pick up Soft Reserved regions after cxl_acpi creates regions, we still run into timing issues..
>>>
>>> Specifically, there's no guarantee that hmem_register_device() will correctly skip the following check if the region state isn't fully ready, even with MODULE_SOFTDEP("pre: cxl_acpi") or using late_initcall() (which I tried):
>>>
>>> if (IS_ENABLED(CONFIG_CXL_REGION) &&
>>>          region_intersects(res->start, resource_size(res), IORESOURCE_MEM, IORES_DESC_CXL) != REGION_DISJOINT) {..
>>>
>>> At this point, I’m running out of ideas on how to reliably coordinate this.. :(
>>>
>>> Thanks
>>> Smita
>>>
>>>>>
>>>>> Also, do you think it's feasible to change the direction of the soft reserve
>>>>> trimming, that is, defer it until after CXL region or memdev creation is
>>>>> complete? In this case it would be trimmed after but inline the existing
>>>>> region or memdev creation. This might simplify the flow by removing the need
>>>>> for wait_event_timeout(), wait_for_device_probe() and the workqueue logic
>>>>> inside cxl_acpi_probe().
>>>>
>>>> Yes that aligns with my simple thinking. There's the trimming after a region
>>>> is successfully created, and it seems that could simply be called at the end
>>>> of *that* region creation.
>>>>
>>>> Then, there's the round up of all the unused Soft Reserveds, and that has
>>>> to wait until after all regions are created, ie. all endpoints have arrived
>>>> and we've given up all hope of creating another region in that space.
>>>> That's the timing challenge.
>>>>
>>>> -- Alison
>>>>
>>>>>
>>>>> (As a side note I experimented changing cxl_acpi_init() to a late_initcall()
>>>>> and observed that it consistently avoided probe ordering issues in my setup.
>>>>>
>>>>> Additional note: I realized that even when cxl_acpi_probe() fails, the
>>>>> fallback DAX registration path (via cxl_softreserv_mem_update()) still waits
>>>>> on cxl_mem_active() and wait_for_device_probe(). I plan to address this in
>>>>> v6 by immediately triggering fallback DAX registration
>>>>> (hmem_register_device()) when the ACPI probe fails, instead of waiting.)
>>>>>
>>>>> Thanks
>>>>> Smita
>>>>>
>>>>>>
>>>>>>>
>>>>>>> As for the log:
>>>>>>> [   53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for
>>>>>>> cxl_mem probing
>>>>>>>
>>>>>>> I’m still analyzing that. Here's what was my thought process so far.
>>>>>>>
>>>>>>> - This occurs when cxl_acpi_probe() runs significantly earlier than
>>>>>>> cxl_mem_probe(), so CXL region creation (which happens in
>>>>>>> cxl_port_endpoint_probe()) may or may not have completed by the time
>>>>>>> trimming is attempted.
>>>>>>>
>>>>>>> - Both cxl_acpi and cxl_mem have MODULE_SOFTDEPs on cxl_port. This does
>>>>>>> guarantee load order when all components are built as modules. So even if
>>>>>>> the timeout occurs and cxl_mem_probe() hasn’t run within the wait window,
>>>>>>> MODULE_SOFTDEP ensures that cxl_port is loaded before both cxl_acpi and
>>>>>>> cxl_mem in modular configurations. As a result, region creation is
>>>>>>> eventually guaranteed, and wait_for_device_probe() will succeed once the
>>>>>>> relevant probes complete.
>>>>>>>
>>>>>>> - However, when both CONFIG_CXL_PORT=y and CONFIG_CXL_ACPI=y, there's no
>>>>>>> guarantee of probe ordering. In such cases, cxl_acpi_probe() may finish
>>>>>>> before cxl_port_probe() even begins, which can cause wait_for_device_probe()
>>>>>>> to return prematurely and trigger the timeout.
>>>>>>>
>>>>>>> - In my local setup, I observed that a 30-second timeout was generally
>>>>>>> sufficient to catch this race, allowing cxl_port_probe() to load while
>>>>>>> cxl_acpi_probe() is still active. Since we cannot mix built-in and modular
>>>>>>> components (i.e., have cxl_acpi=y and cxl_port=m), the timeout serves as a
>>>>>>> best-effort mechanism. After the timeout, wait_for_device_probe() ensures
>>>>>>> cxl_port_probe() has completed before trimming proceeds, making the logic
>>>>>>> good enough for most boot-time races.
>>>>>>>
>>>>>>> One possible improvement I’m considering is to schedule a
>>>>>>> delayed_workqueue() from cxl_acpi_probe(). This deferred work could wait
>>>>>>> slightly longer for cxl_mem_probe() to complete (which itself softdeps on
>>>>>>> cxl_port) before initiating the soft reserve trimming.
>>>>>>>
>>>>>>> That said, I'm still evaluating better options to more robustly coordinate
>>>>>>> probe ordering between cxl_acpi, cxl_port, cxl_mem and cxl_region and
>>>>>>> looking for suggestions here.
>>
>> Hi Smita,
>> Reading this thread and thinking about what can be done to deal with this. Throwing out some ideas to see what you think. My idea is to create two global counters that are protected by a lock. You have a delayed workqueue that checks these counters. If counter1 is 0, go back to sleep and check again later, repeating at a reasonable interval. Every time a memdev endpoint starts probe, increment counter1 and counter2 atomically. Every time the probe is successful, decrement counter2. When you reach the condition of 'if (counter1 && counter2 == 0)' I think you can start soft reserve discovery.
>>
>> A different idea came from Dan. Arm a timer on the first memdev probe. Kick the timer to extend it every time a new memdev gets probed. At some point things settle and the timer goes off to trigger soft reserved discovery.
>>
>> I think either one will not require special ordering of the modules being loaded.
>>
>> DJ
> 
> I think we might need both, the counters and a settling timer to coordinate Soft Reserved trimming and DAX registration.
> 
> Here's the rough flow I'm thinking of. Let me know the flaws in this approach.

Seems reasonable to me. Don't forget to cancel the timer if your condition is met and you are woken up early by a probe() finish. It really is best effort in dealing with the situation.
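
For readers following along, the counters-plus-settle-timer flow quoted below can be modeled in plain userspace C. This is only a sketch under stated assumptions: struct probe_tracker and the tracker_*/probe_* helpers are hypothetical names, time is an abstract tick count supplied by the caller, and none of this is kernel API or code from the series. It illustrates the wait condition from step 5 (counter1 > 0, counter2 == 0, settle timer expired) and a cancel step for a pending timer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; these names do not exist in the kernel. */
struct probe_tracker {
	int discovered;   /* "counter1": memdev probes seen so far */
	int in_flight;    /* "counter2": probes started but not yet finished */
	bool armed;       /* settle timer is pending */
	bool expired;     /* settle timer fired without being re-armed */
	long deadline;    /* tick at which the settle timer fires */
	long settle;      /* quiet period, in ticks */
};

void tracker_init(struct probe_tracker *t, long settle)
{
	assert(settle > 0);
	*t = (struct probe_tracker){ .settle = settle };
}

/* A memdev probe begins: bump both counters. */
void probe_start(struct probe_tracker *t)
{
	t->discovered++;
	t->in_flight++;
}

/* A probe ends (success or failure): drop in_flight, re-arm the timer. */
void probe_finish(struct probe_tracker *t, long now)
{
	assert(t->in_flight > 0);
	t->in_flight--;
	t->armed = true;
	t->expired = false;
	t->deadline = now + t->settle;
}

/* Models the settle-timer callback being evaluated at tick 'now'. */
void tracker_tick(struct probe_tracker *t, long now)
{
	if (t->armed && now >= t->deadline) {
		t->armed = false;
		t->expired = true;
	}
}

/* The waiter's condition: something probed, nothing pending, quiet period over. */
bool tracker_ready(const struct probe_tracker *t)
{
	return t->discovered > 0 && t->in_flight == 0 && t->expired;
}

/* Disarm a pending timer once the waiter has decided to proceed. */
void tracker_cancel(struct probe_tracker *t)
{
	t->armed = false;
}
```

In the kernel, the re-arm and cancel steps would presumably map onto mod_delayed_work() and cancel_delayed_work_sync(), with the counters protected by a lock or kept as atomics; the model above leaves locking out entirely.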

DJ

> 
> 1. cxl_acpi_probe() schedules cxl_softreserv_work_fn() and exits early.
> This work item is responsible for trimming leftover Soft Reserved memory ranges once all cxl_mem devices have finished probing.
> 
> 2. A delayed work is initialized for the settle timer:
> 
> INIT_DELAYED_WORK(&cxl_probe_settle_work, cxl_probe_settle_fn);
> 
> 3. In cxl_mem_probe():
>      - Increment counter2 (memdevs in progress).
>      - Increment counter1 (memdevs discovered).
>      - On probe completion (success or failure), decrement counter2.
>      - After each probe, re-arm the settle timer to extend the quiet
>        period if more devices arrive (this might fail, I'm not sure, if
>        cxl mem devices come in too late).
>        mod_delayed_work(system_wq, &cxl_probe_settle_work, 30 * HZ);
>      - Call wake_up(&cxl_softreserv_waitq); after each probe to notify
>        listeners.
> 
> 4. The settle timer callback (cxl_probe_settle_fn()) runs when no new devices have probed for a while (30s)
>      timer_expired = true;
>      wake_up(&cxl_softreserv_waitq);
> 
> 5. In cxl_softreserv_work_fn()
>     wait_event(cxl_softreserv_waitq,
>     atomic_read(&cxl_mem_counter1) > 0 &&
>     atomic_read(&cxl_mem_counter2) == 0 &&
>     atomic_read(&timer_expired));
> 
> 6. Once unblocked, cxl_softreserv_work_fn() trims Soft Reserved regions via cxl_region_softreserv_update().
> (We do not perform any DAX fallback here as we don't want to end up with unresolved symbols when DAX_HMEM loads too late.)
> 
> 7. Separately, dax_hmem_platform_probe() runs independently on module load, but also blocks on the same wait_event() condition if CONFIG_CXL_ACPI is enabled. Once the condition is satisfied, it invokes hmem_register_device() to register leftover Soft Reserved memory.
> 
> Thanks
> Smita
> 
>>
>>>>>>>
>>>>>>> Thanks
>>>>>>> Smita
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> This isn't all the logs, I trimmed. Let me know if you need more or
>>>>>>>> other info to reproduce.
>>>>>>>>
>>>>>>>> [   53.652454] cxl_acpi:cxl_softreserv_mem_work_fn:888: Timeout waiting for cxl_mem probing
>>>>>>>> [   53.653293] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
>>>>>>>> [   53.653513] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1875, name: kworker/46:1
>>>>>>>> [   53.653540] preempt_count: 1, expected: 0
>>>>>>>> [   53.653554] RCU nest depth: 0, expected: 0
>>>>>>>> [   53.653568] 3 locks held by kworker/46:1/1875:
>>>>>>>> [   53.653569]  #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
>>>>>>>> [   53.653583]  #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
>>>>>>>> [   53.653589]  #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
>>>>>>>> [   53.653598] Preemption disabled at:
>>>>>>>> [   53.653599] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
>>>>>>>> [   53.653640] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Not tainted 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>>>>> [   53.653643] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>>>>> [   53.653648] Call Trace:
>>>>>>>> [   53.653649]  <TASK>
>>>>>>>> [   53.653652]  dump_stack_lvl+0xa8/0xd0
>>>>>>>> [   53.653658]  dump_stack+0x14/0x20
>>>>>>>> [   53.653659]  __might_resched+0x1ae/0x2d0
>>>>>>>> [   53.653666]  __might_sleep+0x48/0x70
>>>>>>>> [   53.653668]  __kmalloc_node_track_caller_noprof+0x349/0x510
>>>>>>>> [   53.653674]  ? __devm_add_action+0x3d/0x160
>>>>>>>> [   53.653685]  ? __pfx_devm_action_release+0x10/0x10
>>>>>>>> [   53.653688]  __devres_alloc_node+0x4a/0x90
>>>>>>>> [   53.653689]  ? __devres_alloc_node+0x4a/0x90
>>>>>>>> [   53.653691]  ? __pfx_release_memregion+0x10/0x10 [dax_hmem]
>>>>>>>> [   53.653693]  __devm_add_action+0x3d/0x160
>>>>>>>> [   53.653696]  hmem_register_device+0xea/0x230 [dax_hmem]
>>>>>>>> [   53.653700]  hmem_fallback_register_device+0x37/0x60
>>>>>>>> [   53.653703]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>>>> [   53.653739]  walk_iomem_res_desc+0x55/0xb0
>>>>>>>> [   53.653744]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>>>> [   53.653755]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>>>> [   53.653761]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>>>> [   53.653763]  ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>>>> [   53.653768]  process_one_work+0x1fa/0x630
>>>>>>>> [   53.653774]  worker_thread+0x1b2/0x360
>>>>>>>> [   53.653777]  kthread+0x128/0x250
>>>>>>>> [   53.653781]  ? __pfx_worker_thread+0x10/0x10
>>>>>>>> [   53.653784]  ? __pfx_kthread+0x10/0x10
>>>>>>>> [   53.653786]  ret_from_fork+0x139/0x1e0
>>>>>>>> [   53.653790]  ? __pfx_kthread+0x10/0x10
>>>>>>>> [   53.653792]  ret_from_fork_asm+0x1a/0x30
>>>>>>>> [   53.653801]  </TASK>
>>>>>>>>
>>>>>>>> [   53.654193] =============================
>>>>>>>> [   53.654203] [ BUG: Invalid wait context ]
>>>>>>>> [   53.654451] 6.16.0CXL-NEXT-ALISON-SR-V5+ #5 Tainted: G        W
>>>>>>>> [   53.654623] -----------------------------
>>>>>>>> [   53.654785] kworker/46:1/1875 is trying to lock:
>>>>>>>> [   53.654946] ff37d7824096d588 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x34/0x390
>>>>>>>> [   53.655115] other info that might help us debug this:
>>>>>>>> [   53.655273] context-{5:5}
>>>>>>>> [   53.655428] 3 locks held by kworker/46:1/1875:
>>>>>>>> [   53.655579]  #0: ff37d78240041548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x578/0x630
>>>>>>>> [   53.655739]  #1: ff6b0385dedf3e38 (cxl_sr_work){+.+.}-{0:0}, at: process_one_work+0x1bd/0x630
>>>>>>>> [   53.655900]  #2: ffffffffb33476d8 (hmem_notify_lock){+.+.}-{3:3}, at: hmem_fallback_register_device+0x23/0x60
>>>>>>>> [   53.656062] stack backtrace:
>>>>>>>> [   53.656224] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G        W           6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>>>>> [   53.656227] Tainted: [W]=WARN
>>>>>>>> [   53.656228] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>>>>> [   53.656232] Call Trace:
>>>>>>>> [   53.656232]  <TASK>
>>>>>>>> [   53.656234]  dump_stack_lvl+0x85/0xd0
>>>>>>>> [   53.656238]  dump_stack+0x14/0x20
>>>>>>>> [   53.656239]  __lock_acquire+0xaf4/0x2200
>>>>>>>> [   53.656246]  lock_acquire+0xd8/0x300
>>>>>>>> [   53.656248]  ? kernfs_add_one+0x34/0x390
>>>>>>>> [   53.656252]  ? __might_resched+0x208/0x2d0
>>>>>>>> [   53.656257]  down_write+0x44/0xe0
>>>>>>>> [   53.656262]  ? kernfs_add_one+0x34/0x390
>>>>>>>> [   53.656263]  kernfs_add_one+0x34/0x390
>>>>>>>> [   53.656265]  kernfs_create_dir_ns+0x5a/0xa0
>>>>>>>> [   53.656268]  sysfs_create_dir_ns+0x74/0xd0
>>>>>>>> [   53.656270]  kobject_add_internal+0xb1/0x2f0
>>>>>>>> [   53.656273]  kobject_add+0x7d/0xf0
>>>>>>>> [   53.656275]  ? get_device_parent+0x28/0x1e0
>>>>>>>> [   53.656280]  ? __pfx_klist_children_get+0x10/0x10
>>>>>>>> [   53.656282]  device_add+0x124/0x8b0
>>>>>>>> [   53.656285]  ? dev_set_name+0x56/0x70
>>>>>>>> [   53.656287]  platform_device_add+0x102/0x260
>>>>>>>> [   53.656289]  hmem_register_device+0x160/0x230 [dax_hmem]
>>>>>>>> [   53.656291]  hmem_fallback_register_device+0x37/0x60
>>>>>>>> [   53.656294]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>>>> [   53.656323]  walk_iomem_res_desc+0x55/0xb0
>>>>>>>> [   53.656326]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>>>> [   53.656335]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>>>> [   53.656342]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>>>> [   53.656343]  ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>>>> [   53.656346]  process_one_work+0x1fa/0x630
>>>>>>>> [   53.656350]  worker_thread+0x1b2/0x360
>>>>>>>> [   53.656352]  kthread+0x128/0x250
>>>>>>>> [   53.656354]  ? __pfx_worker_thread+0x10/0x10
>>>>>>>> [   53.656356]  ? __pfx_kthread+0x10/0x10
>>>>>>>> [   53.656357]  ret_from_fork+0x139/0x1e0
>>>>>>>> [   53.656360]  ? __pfx_kthread+0x10/0x10
>>>>>>>> [   53.656361]  ret_from_fork_asm+0x1a/0x30
>>>>>>>> [   53.656366]  </TASK>
>>>>>>>> [   53.662274] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
>>>>>>>> [   53.663552]  schedule+0x4a/0x160
>>>>>>>> [   53.663553]  schedule_timeout+0x10a/0x120
>>>>>>>> [   53.663555]  ? debug_smp_processor_id+0x1b/0x30
>>>>>>>> [   53.663556]  ? trace_hardirqs_on+0x5f/0xd0
>>>>>>>> [   53.663558]  __wait_for_common+0xb9/0x1c0
>>>>>>>> [   53.663559]  ? __pfx_schedule_timeout+0x10/0x10
>>>>>>>> [   53.663561]  wait_for_completion+0x28/0x30
>>>>>>>> [   53.663562]  __synchronize_srcu+0xbf/0x180
>>>>>>>> [   53.663566]  ? __pfx_wakeme_after_rcu+0x10/0x10
>>>>>>>> [   53.663571]  ? i2c_repstart+0x30/0x80
>>>>>>>> [   53.663576]  synchronize_srcu+0x46/0x120
>>>>>>>> [   53.663577]  kill_dax+0x47/0x70
>>>>>>>> [   53.663580]  __devm_create_dev_dax+0x112/0x470
>>>>>>>> [   53.663582]  devm_create_dev_dax+0x26/0x50
>>>>>>>> [   53.663584]  dax_hmem_probe+0x87/0xd0 [dax_hmem]
>>>>>>>> [   53.663585]  platform_probe+0x61/0xd0
>>>>>>>> [   53.663589]  really_probe+0xe2/0x390
>>>>>>>> [   53.663591]  ? __pfx___device_attach_driver+0x10/0x10
>>>>>>>> [   53.663593]  __driver_probe_device+0x7e/0x160
>>>>>>>> [   53.663594]  driver_probe_device+0x23/0xa0
>>>>>>>> [   53.663596]  __device_attach_driver+0x92/0x120
>>>>>>>> [   53.663597]  bus_for_each_drv+0x8c/0xf0
>>>>>>>> [   53.663599]  __device_attach+0xc2/0x1f0
>>>>>>>> [   53.663601]  device_initial_probe+0x17/0x20
>>>>>>>> [   53.663603]  bus_probe_device+0xa8/0xb0
>>>>>>>> [   53.663604]  device_add+0x687/0x8b0
>>>>>>>> [   53.663607]  ? dev_set_name+0x56/0x70
>>>>>>>> [   53.663609]  platform_device_add+0x102/0x260
>>>>>>>> [   53.663610]  hmem_register_device+0x160/0x230 [dax_hmem]
>>>>>>>> [   53.663612]  hmem_fallback_register_device+0x37/0x60
>>>>>>>> [   53.663614]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>>>> [   53.663637]  walk_iomem_res_desc+0x55/0xb0
>>>>>>>> [   53.663640]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>>>> [   53.663647]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>>>> [   53.663654]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>>>> [   53.663655]  ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>>>> [   53.663658]  process_one_work+0x1fa/0x630
>>>>>>>> [   53.663662]  worker_thread+0x1b2/0x360
>>>>>>>> [   53.663664]  kthread+0x128/0x250
>>>>>>>> [   53.663666]  ? __pfx_worker_thread+0x10/0x10
>>>>>>>> [   53.663668]  ? __pfx_kthread+0x10/0x10
>>>>>>>> [   53.663670]  ret_from_fork+0x139/0x1e0
>>>>>>>> [   53.663672]  ? __pfx_kthread+0x10/0x10
>>>>>>>> [   53.663673]  ret_from_fork_asm+0x1a/0x30
>>>>>>>> [   53.663677]  </TASK>
>>>>>>>> [   53.700107] BUG: scheduling while atomic: kworker/46:1/1875/0x00000002
>>>>>>>> [   53.700264] INFO: lockdep is turned off.
>>>>>>>> [   53.701315] Preemption disabled at:
>>>>>>>> [   53.701316] [<ffffffffb1e23993>] hmem_fallback_register_device+0x23/0x60
>>>>>>>> [   53.701631] CPU: 46 UID: 0 PID: 1875 Comm: kworker/46:1 Tainted: G        W           6.16.0CXL-NEXT-ALISON-SR-V5+ #5 PREEMPT(voluntary)
>>>>>>>> [   53.701633] Tainted: [W]=WARN
>>>>>>>> [   53.701635] Workqueue: events cxl_softreserv_mem_work_fn [cxl_acpi]
>>>>>>>> [   53.701638] Call Trace:
>>>>>>>> [   53.701638]  <TASK>
>>>>>>>> [   53.701640]  dump_stack_lvl+0xa8/0xd0
>>>>>>>> [   53.701644]  dump_stack+0x14/0x20
>>>>>>>> [   53.701645]  __schedule_bug+0xa2/0xd0
>>>>>>>> [   53.701649]  __schedule+0xe6f/0x10d0
>>>>>>>> [   53.701652]  ? debug_smp_processor_id+0x1b/0x30
>>>>>>>> [   53.701655]  ? lock_release+0x1e6/0x2b0
>>>>>>>> [   53.701658]  ? trace_hardirqs_on+0x5f/0xd0
>>>>>>>> [   53.701661]  schedule+0x4a/0x160
>>>>>>>> [   53.701662]  schedule_timeout+0x10a/0x120
>>>>>>>> [   53.701664]  ? debug_smp_processor_id+0x1b/0x30
>>>>>>>> [   53.701666]  ? trace_hardirqs_on+0x5f/0xd0
>>>>>>>> [   53.701667]  __wait_for_common+0xb9/0x1c0
>>>>>>>> [   53.701668]  ? __pfx_schedule_timeout+0x10/0x10
>>>>>>>> [   53.701670]  wait_for_completion+0x28/0x30
>>>>>>>> [   53.701671]  __synchronize_srcu+0xbf/0x180
>>>>>>>> [   53.701677]  ? __pfx_wakeme_after_rcu+0x10/0x10
>>>>>>>> [   53.701682]  ? i2c_repstart+0x30/0x80
>>>>>>>> [   53.701685]  synchronize_srcu+0x46/0x120
>>>>>>>> [   53.701687]  kill_dax+0x47/0x70
>>>>>>>> [   53.701689]  __devm_create_dev_dax+0x112/0x470
>>>>>>>> [   53.701691]  devm_create_dev_dax+0x26/0x50
>>>>>>>> [   53.701693]  dax_hmem_probe+0x87/0xd0 [dax_hmem]
>>>>>>>> [   53.701695]  platform_probe+0x61/0xd0
>>>>>>>> [   53.701698]  really_probe+0xe2/0x390
>>>>>>>> [   53.701700]  ? __pfx___device_attach_driver+0x10/0x10
>>>>>>>> [   53.701701]  __driver_probe_device+0x7e/0x160
>>>>>>>> [   53.701703]  driver_probe_device+0x23/0xa0
>>>>>>>> [   53.701704]  __device_attach_driver+0x92/0x120
>>>>>>>> [   53.701706]  bus_for_each_drv+0x8c/0xf0
>>>>>>>> [   53.701708]  __device_attach+0xc2/0x1f0
>>>>>>>> [   53.701710]  device_initial_probe+0x17/0x20
>>>>>>>> [   53.701711]  bus_probe_device+0xa8/0xb0
>>>>>>>> [   53.701712]  device_add+0x687/0x8b0
>>>>>>>> [   53.701715]  ? dev_set_name+0x56/0x70
>>>>>>>> [   53.701717]  platform_device_add+0x102/0x260
>>>>>>>> [   53.701718]  hmem_register_device+0x160/0x230 [dax_hmem]
>>>>>>>> [   53.701720]  hmem_fallback_register_device+0x37/0x60
>>>>>>>> [   53.701722]  cxl_softreserv_mem_register+0x24/0x30 [cxl_core]
>>>>>>>> [   53.701734]  walk_iomem_res_desc+0x55/0xb0
>>>>>>>> [   53.701738]  ? __pfx_cxl_softreserv_mem_register+0x10/0x10 [cxl_core]
>>>>>>>> [   53.701745]  cxl_region_softreserv_update+0x46/0x50 [cxl_core]
>>>>>>>> [   53.701751]  cxl_softreserv_mem_work_fn+0x4a/0x110 [cxl_acpi]
>>>>>>>> [   53.701752]  ? __pfx_autoremove_wake_function+0x10/0x10
>>>>>>>> [   53.701756]  process_one_work+0x1fa/0x630
>>>>>>>> [   53.701760]  worker_thread+0x1b2/0x360
>>>>>>>> [   53.701762]  kthread+0x128/0x250
>>>>>>>> [   53.701765]  ? __pfx_worker_thread+0x10/0x10
>>>>>>>> [   53.701766]  ? __pfx_kthread+0x10/0x10
>>>>>>>> [   53.701768]  ret_from_fork+0x139/0x1e0
>>>>>>>> [   53.701771]  ? __pfx_kthread+0x10/0x10
>>>>>>>> [   53.701772]  ret_from_fork_asm+0x1a/0x30
>>>>>>>> [   53.701777]  </TASK>
>>>>>>>>
>>>>>>>
>>>>>
>>>
>>
> 
>

Re: [PATCH v5 0/7] Add managed SOFT RESERVE resource handling

Posted by dan.j.williams@intel.com 2 months, 2 weeks ago

Smita Koralahalli wrote:
> This series introduces the ability to manage SOFT RESERVED iomem
> resources, enabling the CXL driver to remove any portions that
> intersect with created CXL regions.
> 
> The current approach of leaving SOFT RESERVED entries as is can result
> in failures during device hotplug such as CXL because the address range
> remains reserved and unavailable for reuse even after region teardown.

I will go through the patches, but the main concern here is not hotplug,
it is region assembly failure.

We have a constant drip of surprising platform behaviors that trip up
the driver leaving memory stranded. Specifically, device-dax defers to
CXL to assemble the region representing the soft-reserve range, CXL
fails to complete that assembly due to being confused by the platform,
end user wonders why their platform BIOS sees memory capacity that Linux
does not see.

So the priority order of solutions needed here is:

1/ Fix all shipping platform "quirks", try to prevent new ones from
   being created. I.e. ideally, long term, Linux does not need a
   soft-reserve fallback and just always ignores Soft Reserve in
   CXL Windows because the CXL subsystem will handle it.

2/ In the near term foreseeable future, for all yet to be solved or yet
   to be discovered platform quirks, provide a device-dax fallback to
   recover baseline device-dax behavior (equivalent to putting cxl_acpi on
   a modprobe deny-list).

3/ For hotplug, remove the conflicting resource.

> To address this, the CXL driver now uses a background worker that waits
> for cxl_mem driver probe to complete before scanning for intersecting
> resources. Then the driver walks through created CXL regions to trim any
> intersections with SOFT RESERVED resources in the iomem tree.

The precision of this gives me pause. I think it is fine to make this
more coarse because any mismatch between Soft Reserve and a CXL Window
resource should be cause to give up on the CXL side.

If a Soft Reserve range straddles a CXL window and "System RAM", give up
on trying to use the CXL driver on that system.

If CXL does not completely cover a soft-reserve region, give up on trying
to use the CXL driver on that system.

Effectively anytime we detect unexpected platform shenanigans it is
likely indicating missing understanding in the Linux driver.
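
As a rough illustration of the coarse check described above, here is a
minimal userspace sketch, not kernel code. The names `struct span`,
`enum sr_verdict`, and `classify_soft_reserve()` are hypothetical and do
not exist in the series. It assumes inclusive ranges, as printed in
/proc/iomem, and treats any partial overlap between a Soft Reserve range
and a CXL window as a mismatch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Inclusive address range, as shown in /proc/iomem (hypothetical type). */
struct span {
	uint64_t start;
	uint64_t end;
};

enum sr_verdict {
	SR_OUTSIDE_CXL,	/* no overlap with the window: not a CXL concern */
	SR_INSIDE_CXL,	/* fully contained in the window: CXL can own it */
	SR_MISMATCH,	/* partial overlap: give up on the CXL driver */
};

static bool span_overlaps(struct span a, struct span b)
{
	return a.start <= b.end && b.start <= a.end;
}

static bool span_contains(struct span outer, struct span inner)
{
	return outer.start <= inner.start && inner.end <= outer.end;
}

/* Classify one Soft Reserve range against one CXL window. */
enum sr_verdict classify_soft_reserve(struct span window, struct span sr)
{
	assert(window.start <= window.end && sr.start <= sr.end);

	if (!span_overlaps(window, sr))
		return SR_OUTSIDE_CXL;
	if (span_contains(window, sr))
		return SR_INSIDE_CXL;
	return SR_MISMATCH;
}
```

Under this reading, a caller would stop using the CXL driver as soon as
any Soft Reserve range comes back as SR_MISMATCH, rather than trying to
trim the overlapping part.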

> The following scenarios have been tested:

Nice! Appreciate you including the test case results.

[..]
> Example 3: No alignment
> |---------- "Soft Reserved" ----------|
> 	|---- "Region #" ----|

Per above, CXL subsystem should completely give up in this scenario. The
BIOS said that all of the range is Conventional memory and CXL is only
creating a region for part of it. Somebody is wrong. Given the fact that
non-CXL aware OSes would try to use the entirety of the Soft Reserved
region, then this scenario is "disable CXL, it clearly does not
understand this platform".
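
To make the "CXL is only creating a region for part of it" case
concrete, here is a small userspace sketch of a coverage test. The
names `struct addr_range` and `regions_cover_soft_reserve()` are
hypothetical, the ranges are inclusive, and the regions are assumed to
be sorted by start address. It only shows the arithmetic, not how the
driver would walk the iomem tree.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive address range (hypothetical type for this sketch). */
struct addr_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Return true only if the regions, sorted by start address, cover every
 * address in the Soft Reserve range with no gap at the start, in the
 * middle, or at the end.
 */
bool regions_cover_soft_reserve(struct addr_range sr,
				const struct addr_range *regions, size_t n)
{
	uint64_t next = sr.start;	/* first address not yet covered */

	for (size_t i = 0; i < n; i++) {
		if (regions[i].start > next)
			return false;	/* gap before this region */
		if (regions[i].end >= sr.end)
			return true;	/* covered through the end */
		if (regions[i].end >= next)
			next = regions[i].end + 1;
	}
	return false;			/* ran out of regions early */
}
```

With a Soft Reserve range of 0x1000-0x4fff and a single region at
0x2000-0x2fff (the "no alignment" shape), the function returns false,
which under the policy above would mean disabling CXL for that system.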