If a kernel module registers a firmware upload API ops set, then it's
unable to be moved due to effectively a cyclic reference that the module
depends on the upload which depends on the module.
Instead, only require the try_module_get when an upload is requested to
disallow unloading a module only while the upload is in progress.
Fixes: 97730bbb242c ("firmware_loader: Add firmware-upload support")
CC: Sean Christopherson <seanjc@google.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: Ashish Kalra <ashish.kalra@amd.com>
CC: Tom Lendacky <thomas.lendacky@amd.com>
CC: John Allen <john.allen@amd.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: "David S. Miller" <davem@davemloft.net>
CC: Michael Roth <michael.roth@amd.com>
CC: Luis Chamberlain <mcgrof@kernel.org>
CC: Russ Weight <russ.weight@linux.dev>
CC: Danilo Krummrich <dakr@redhat.com>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: "Rafael J. Wysocki" <rafael@kernel.org>
CC: Tianfei zhang <tianfei.zhang@intel.com>
CC: Alexey Kardashevskiy <aik@amd.com>
Tested-by: Ashish Kalra <ashish.kalra@amd.com>
Reviewed-by: Russ Weight <russ.weight@linux.dev>
Signed-off-by: Dionna Glaze <dionnaglaze@google.com>
---
drivers/base/firmware_loader/sysfs_upload.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/drivers/base/firmware_loader/sysfs_upload.c b/drivers/base/firmware_loader/sysfs_upload.c
index 829270067d163..7d9c6aef7720a 100644
--- a/drivers/base/firmware_loader/sysfs_upload.c
+++ b/drivers/base/firmware_loader/sysfs_upload.c
@@ -204,6 +204,7 @@ static void fw_upload_main(struct work_struct *work)
fwlp->ops->cleanup(fwl);
putdev_exit:
+ module_put(fwlp->module);
put_device(fw_dev->parent);
/*
@@ -239,6 +240,9 @@ int fw_upload_start(struct fw_sysfs *fw_sysfs)
}
fwlp = fw_sysfs->fw_upload_priv;
+ if (!try_module_get(fwlp->module)) /* released in fw_upload_main */
+ return -EFAULT;
+
mutex_lock(&fwlp->lock);
/* Do not interfere with an on-going fw_upload */
@@ -310,13 +314,10 @@ firmware_upload_register(struct module *module, struct device *parent,
return ERR_PTR(-EINVAL);
}
- if (!try_module_get(module))
- return ERR_PTR(-EFAULT);
-
fw_upload = kzalloc(sizeof(*fw_upload), GFP_KERNEL);
if (!fw_upload) {
ret = -ENOMEM;
- goto exit_module_put;
+ goto exit_err;
}
fw_upload_priv = kzalloc(sizeof(*fw_upload_priv), GFP_KERNEL);
@@ -358,7 +359,7 @@ firmware_upload_register(struct module *module, struct device *parent,
if (ret) {
dev_err(fw_dev, "%s: device_register failed\n", __func__);
put_device(fw_dev);
- goto exit_module_put;
+ goto exit_err;
}
return fw_upload;
@@ -372,8 +373,7 @@ firmware_upload_register(struct module *module, struct device *parent,
free_fw_upload:
kfree(fw_upload);
-exit_module_put:
- module_put(module);
+exit_err:
return ERR_PTR(ret);
}
@@ -387,7 +387,6 @@ void firmware_upload_unregister(struct fw_upload *fw_upload)
{
struct fw_sysfs *fw_sysfs = fw_upload->priv;
struct fw_upload_priv *fw_upload_priv = fw_sysfs->fw_upload_priv;
- struct module *module = fw_upload_priv->module;
mutex_lock(&fw_upload_priv->lock);
if (fw_upload_priv->progress == FW_UPLOAD_PROG_IDLE) {
@@ -403,6 +402,5 @@ void firmware_upload_unregister(struct fw_upload *fw_upload)
unregister:
device_unregister(&fw_sysfs->dev);
- module_put(module);
}
EXPORT_SYMBOL_GPL(firmware_upload_unregister);
--
2.47.0.277.g8800431eea-goog
Dionna Glaze wrote: > If a kernel module registers a firmware upload API ops set, then it's > unable to be moved due to effectively a cyclic reference that the module > depends on the upload which depends on the module. > > Instead, only require the try_module_get when an upload is requested to > disallow unloading a module only while the upload is in progress. Oh, interesting, I wondered why CXL did not uncover this loop in its usage only to realize that CXL calls firmware registration from the cxl_pci module, but the @module paramter passed to firmware_upload_register() is the cxl_core module. I.e. we are accidentally avoiding the problem. I assume other CONFIG_FW_UPLOAD users simply do not test module removal. However, I think the fix is simply to remove all module reference taking by the firmware_loader core. It is the consumer's responsibility to call firmware_upload_unregister() in its module removal path and that should flush any and all future usage of the passed in ops structure.
On Tue, Nov 12, 2024 at 06:40:28PM -0800, Dan Williams wrote: > Dionna Glaze wrote: > > If a kernel module registers a firmware upload API ops set, then it's > > unable to be moved due to effectively a cyclic reference that the module > > depends on the upload which depends on the module. > > > > Instead, only require the try_module_get when an upload is requested to > > disallow unloading a module only while the upload is in progress. > > Oh, interesting, I wondered why CXL did not uncover this loop in its > usage only to realize that CXL calls firmware registration from the > cxl_pci module, but the @module paramter passed to > firmware_upload_register() is the cxl_core module. I.e. we are > accidentally avoiding the problem. I assume other CONFIG_FW_UPLOAD users > simply do not test module removal. > > However, I think the fix is simply to remove all module reference taking > by the firmware_loader core. It is the consumer's responsibility to call > firmware_upload_unregister() in its module removal path and that should > flush any and all future usage of the passed in ops structure. As I understand it, if a module directly references symbols in another module, then the reference count is automatically incremented to ensure that the dependent symbols are available to the consumer. In this case, the firmware_loader does not directly reference symbol names in the device driver that registered it. The call-back function pointers are provided during registration. Without explicitly incrementing the module reference count, it is possible to remove the device driver while leaving the firmware loader instance (and sysfs entries) intact. Accessing those sysfs nodes would result in references to pointers that are no longer valid. Clearly this would be an unexpected/unusual case. Someone with root access would have to remove the device driver. I'm not sure how much effort should be expended in preventing it - but this is the reasoning behind the incrementing/decrementing of the module reference counts. - Russ
Russ Weight wrote: [..] > Clearly this would be an unexpected/unusual case. Someone with root > access would have to remove the device driver. I'm not sure how much > effort should be expended in preventing it - but this is the reasoning > behind the incrementing/decrementing of the module reference counts. The module reference needs to be held only if the producer of those symbols can be removed without triggering some coordinated removal with action consumer. A driver that fails to call firmware_upload_unregister() in its module removal path is simply a driver with a memory-leak and use-after-free bug, not something the firmware upload core needs to worry about. So, the prevention mechanism is "thou shalt use firmware_upload_unregister() correctly", and when that is in place explicit module references are not only redundant, but trying to implement them causes circular dependency loops.
On 11/14/24 12:17, Dan Williams wrote: > Russ Weight wrote: > [..] >> Clearly this would be an unexpected/unusual case. Someone with root >> access would have to remove the device driver. I'm not sure how much >> effort should be expended in preventing it - but this is the reasoning >> behind the incrementing/decrementing of the module reference counts. > > The module reference needs to be held only if the producer of those > symbols can be removed without triggering some coordinated removal with > action consumer. A driver that fails to call > firmware_upload_unregister() in its module removal path is simply a driver > with a memory-leak and use-after-free bug, not something the firmware > upload core needs to worry about. > > So, the prevention mechanism is "thou shalt use > firmware_upload_unregister() correctly", and when that is in place > explicit module references are not only redundant, but trying to > implement them causes circular dependency loops. I believe that is how other similar services, like debugfs, work, the module is responsible for cleaning up. Thanks, Tom
On Thu, Nov 14, 2024 at 01:30:16PM -0600, Tom Lendacky wrote: > On 11/14/24 12:17, Dan Williams wrote: > > Russ Weight wrote: > > [..] > >> Clearly this would be an unexpected/unusual case. Someone with root > >> access would have to remove the device driver. I'm not sure how much > >> effort should be expended in preventing it - but this is the reasoning > >> behind the incrementing/decrementing of the module reference counts. > > > > The module reference needs to be held only if the producer of those > > symbols can be removed without triggering some coordinated removal with > > action consumer. A driver that fails to call > > firmware_upload_unregister() in its module removal path is simply a driver > > with a memory-leak and use-after-free bug, not something the firmware > > upload core needs to worry about. > > > > So, the prevention mechanism is "thou shalt use > > firmware_upload_unregister() correctly", and when that is in place > > explicit module references are not only redundant, but trying to > > implement them causes circular dependency loops. Thanks for the explanation. Makes total sense to me. I agree that the module reference counts can/should be removed. - Russ > > I believe that is how other similar services, like debugfs, work, the > module is responsible for cleaning up. > > Thanks, > Tom
On Tue, Nov 12, 2024 at 6:40 PM Dan Williams <dan.j.williams@intel.com> wrote: >... > However, I think the fix is simply to remove all module reference taking > by the firmware_loader core. It is the consumer's responsibility to call > firmware_upload_unregister() in its module removal path and that should > flush any and all future usage of the passed in ops structure. That would suggest the addition of the refcounting in v1 to fix a test means the test_firmware is wrong? https://lore.kernel.org/all/20220421212204.36052-5-russell.h.weight@intel.com/ Adding Kees in case he knows better. -- -Dionna Glaze, PhD, CISSP, CCSP (she/her)
© 2016 - 2024 Red Hat, Inc.