[PATCH v2] cxl/core: Return error when cxl_endpoint_gather_bandwidth() handles a non-PCI device

Li Zhijian posted 1 patch 1 month ago
There is a newer version of this series
Posted by Li Zhijian 1 month ago
cxl_endpoint_gather_bandwidth() ends up calling pci_bus_read/write_XXX(),
but not every CXL device is implemented as a PCI device. In particular,
cxl_test implements its CXL devices as platform devices.

Calling pci_bus_read/write_XXX() in cxl_test causes a kernel panic:
 platform cxl_host_bridge.3: host supports CXL (restricted)
 Oops: general protection fault, probably for non-canonical address 0x3ef17856fcae4fbd: 0000 [#1] PREEMPT SMP PTI
 CPU: 1 UID: 0 PID: 9167 Comm: cxl Kdump: loaded Tainted: G           OE      6.12.0-rc3-master+ #66
 Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
 Hardware name: LENOVO 90CXCTO1WW/, BIOS FCKT70AUS 04/23/2015
 RIP: 0010:pci_bus_read_config_word+0x1c/0x60
 Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 53 b8 87 00 00 00 48 83 ec 08 c7 44 24 04 00 00 00 00 f6 c2 01 75 29 <48> 8b 87 c0 00 00 00 48 89 cb 4c 8d 44 24 04 b9 02 00 00 00 48 8b
 RSP: 0018:ffffa115034dfbb8 EFLAGS: 00010246
 RAX: 0000000000000087 RBX: 0000000000000012 RCX: ffffa115034dfbfe
 RDX: 0000000000000016 RSI: 000000006f4e2f4e RDI: 3ef17856fcae4efd
 RBP: ffff8cc229121b48 R08: 0000000000000010 R09: 0000000000000000
 R10: 0000000000000001 R11: ffff8cc225434360 R12: ffffa115034dfbfe
 R13: 0000000000000000 R14: ffff8cc2f119a080 R15: ffffa115034dfc50
 FS:  00007f31d93537c0(0000) GS:ffff8cc510a80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f31d95f3370 CR3: 00000001163ea001 CR4: 00000000001726f0
 Call Trace:
  <TASK>
  ? __die_body.cold+0x19/0x27
  ? die_addr+0x38/0x60
  ? exc_general_protection+0x1f5/0x4b0
  ? asm_exc_general_protection+0x22/0x30
  ? pci_bus_read_config_word+0x1c/0x60
  pcie_capability_read_word+0x93/0xb0
  pcie_link_speed_mbps+0x18/0x50
  cxl_pci_get_bandwidth+0x18/0x60 [cxl_core]
  cxl_endpoint_gather_bandwidth.constprop.0+0xf4/0x230 [cxl_core]
  ? xas_store+0x54/0x660
  ? preempt_count_add+0x69/0xa0
  ? _raw_spin_lock+0x13/0x40
  ? __kmalloc_cache_noprof+0xe7/0x270
  cxl_region_shared_upstream_bandwidth_update+0x9c/0x790 [cxl_core]
  cxl_region_attach+0x520/0x7e0 [cxl_core]
  store_targetN+0xf2/0x120 [cxl_core]
  kernfs_fop_write_iter+0x13a/0x1f0
  vfs_write+0x23b/0x410
  ksys_write+0x53/0xd0
  do_syscall_64+0x62/0x180
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Ying also reported a KASAN error with a similar call trace.

Reported-by: "Huang, Ying" <ying.huang@intel.com>
Closes: https://lore.kernel.org/linux-cxl/87y12w9vp5.fsf@yhuang6-desk2.ccr.corp.intel.com/
Fixes: a5ab0de0ebaa ("cxl: Calculate region bandwidth of targets with shared upstream link")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
V2:
  Check device type in original cxl_endpoint_gather_bandwidth() instead of mocking a new one. # Dan
  Also noticed that the existing cxl_switch_gather_bandwidth() already has the same check.
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
 drivers/cxl/core/cdat.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
index ef1621d40f05..1a510e692ac0 100644
--- a/drivers/cxl/core/cdat.c
+++ b/drivers/cxl/core/cdat.c
@@ -641,6 +641,9 @@ static int cxl_endpoint_gather_bandwidth(struct cxl_region *cxlr,
 	void *ptr;
 	int rc;
 
+	if (!dev_is_pci(cxlds->dev))
+		return -EINVAL;
+
 	if (cxlds->rcd)
 		return -ENODEV;
 
-- 
2.44.0
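
The guard added by the hunk above can be modelled in userspace. The following is only a minimal sketch under stated assumptions: every `mock_*` name is a hypothetical stand-in, not a kernel symbol, and the bandwidth value is a plain struct field rather than a config-space read. It relies on the fact that the kernel's dev_is_pci() compares the device's bus pointer against the PCI bus type, and it returns -EINVAL as this v2 does.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's bus_type and device structs. */
struct mock_bus_type {
	const char *name;
};

struct mock_device {
	const struct mock_bus_type *bus;
	int link_speed_mbps;	/* only meaningful for devices on the PCI bus */
};

static const struct mock_bus_type mock_pci_bus_type = { .name = "pci" };
static const struct mock_bus_type mock_platform_bus_type = { .name = "platform" };

/* Models dev_is_pci(): true only when the device sits on the PCI bus. */
static bool mock_dev_is_pci(const struct mock_device *dev)
{
	return dev->bus == &mock_pci_bus_type;
}

/*
 * Models the guarded function: bail out before touching any PCI-only
 * state when the device is not a PCI device (for example a platform
 * device, as used by cxl_test).
 */
static int mock_gather_bandwidth(const struct mock_device *dev, int *mbps)
{
	if (!mock_dev_is_pci(dev))
		return -EINVAL;	/* error code used in this v2 */

	*mbps = dev->link_speed_mbps;
	return 0;
}
```

With a device whose bus is `mock_platform_bus_type`, `mock_gather_bandwidth()` returns early and leaves `*mbps` untouched, which is the behaviour the three added lines aim for in the real function.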
Re: [PATCH v2] cxl/core: Return error when cxl_endpoint_gather_bandwidth() handles a non-PCI device
Posted by Dan Williams 1 month ago
Li Zhijian wrote:
> cxl_endpoint_gather_bandwidth() ends up calling pci_bus_read/write_XXX(),
> but not every CXL device is implemented as a PCI device. In particular,
> cxl_test implements its CXL devices as platform devices.


> Calling pci_bus_read/write_XXX() in cxl_test causes a kernel panic:

I like that you include the failure info. I would also trim it to just
the salient information, like this:

 platform cxl_host_bridge.3: host supports CXL (restricted)
 Oops: general protection fault, probably for non-canonical address 0x3ef17856fcae4fbd: 0000 [#1] PREEMPT SMP PTI
 RIP: 0010:pci_bus_read_config_word+0x1c/0x60
 Call Trace:
  <TASK>
  ? __die_body.cold+0x19/0x27
  ? die_addr+0x38/0x60
  ? exc_general_protection+0x1f5/0x4b0
  ? asm_exc_general_protection+0x22/0x30
  ? pci_bus_read_config_word+0x1c/0x60
  pcie_capability_read_word+0x93/0xb0
  pcie_link_speed_mbps+0x18/0x50
  cxl_pci_get_bandwidth+0x18/0x60 [cxl_core]
  cxl_endpoint_gather_bandwidth.constprop.0+0xf4/0x230 [cxl_core]
  ? xas_store+0x54/0x660
  ? preempt_count_add+0x69/0xa0
  ? _raw_spin_lock+0x13/0x40
  ? __kmalloc_cache_noprof+0xe7/0x270
  cxl_region_shared_upstream_bandwidth_update+0x9c/0x790 [cxl_core]
  cxl_region_attach+0x520/0x7e0 [cxl_core]
  store_targetN+0xf2/0x120 [cxl_core]

> Ying also reported a KASAN error with a similar call trace.
> 
> Reported-by: "Huang, Ying" <ying.huang@intel.com>
> Closes: https://lore.kernel.org/linux-cxl/87y12w9vp5.fsf@yhuang6-desk2.ccr.corp.intel.com/

Minor, but this can also be trimmed:

Closes: http://lore.kernel.org/87y12w9vp5.fsf@yhuang6-desk2.ccr.corp.intel.com

> Fixes: a5ab0de0ebaa ("cxl: Calculate region bandwidth of targets with shared upstream link")
> 
> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> ---
> V2:
>   Check device type in original cxl_endpoint_gather_bandwidth() instead of mocking a new one. # Dan
>   Also noticed that the existing cxl_switch_gather_bandwidth() already has the same check.
> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> ---
>  drivers/cxl/core/cdat.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
> index ef1621d40f05..1a510e692ac0 100644
> --- a/drivers/cxl/core/cdat.c
> +++ b/drivers/cxl/core/cdat.c
> @@ -641,6 +641,9 @@ static int cxl_endpoint_gather_bandwidth(struct cxl_region *cxlr,
>  	void *ptr;
>  	int rc;
>  
> +	if (!dev_is_pci(cxlds->dev))
> +		return -EINVAL;

This should be -ENODEV or -ENXIO. If this error code ever leaked out to
userspace, the application would think it had passed an "invalid argument"
rather than encountered "no such device".

Feel free to add:
Reviewed-by: Dan Williams <dan.j.williams@intel.com>

...after fixing that up.
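
To illustrate the error-code point from the review, here is a small userspace sketch that maps kernel-style negative return codes to the messages a tool would usually print through strerror(). The helper names `format_kernel_rc()` and `print_candidates()` are hypothetical and exist only for this example. The exact message strings depend on the C library and locale.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy the strerror() text for a kernel-style
 * negative return code into buf. Returns 0 on success, -1 if rc is
 * not a negative error code or the buffer is empty.
 */
static int format_kernel_rc(int rc, char *buf, size_t len)
{
	if (rc >= 0 || len == 0)
		return -1;

	snprintf(buf, len, "%s", strerror(-rc));
	return 0;
}

/* Print the message for the code used in v2 and the two suggested ones. */
static void print_candidates(void)
{
	const int codes[] = { -EINVAL, -ENODEV, -ENXIO };
	char msg[128];

	for (size_t i = 0; i < sizeof(codes) / sizeof(codes[0]); i++)
		if (format_kernel_rc(codes[i], msg, sizeof(msg)) == 0)
			printf("%d: %s\n", codes[i], msg);
}
```

On a typical glibc system in the C locale, the -EINVAL message reads "Invalid argument" while the -ENODEV message reads "No such device", so a user who sees the former would likely look for a mistake in what they wrote rather than at the device.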