Hi,

In the following sequence:
  of_platform_depopulate(); /* Remove devices from a DT overlay node */
  of_overlay_remove(); /* Remove the DT overlay node itself */

Some warnings are raised by __of_changeset_entry_destroy() which was
called from of_overlay_remove():
  ERROR: memory leak, expected refcount 1 instead of 2 ...

The issue is that, during the device devlink removals triggered from the
of_platform_depopulate(), jobs are put in a workqueue.
These jobs drop the reference to the devices. When a device is no more
referenced (refcount == 0), it is released and the reference to its
of_node is dropped by a call to of_node_put().
These operations are fully correct except that, because of the
workqueue, they are done asynchronously with respect to function calls.

In the sequence provided, the jobs are run too late, after the call to
__of_changeset_entry_destroy() and so a missing of_node_put() call is
detected by __of_changeset_entry_destroy().

This series fixes this issue introducing device_link_wait_removal() in
order to wait for the end of jobs execution (patch 1) and using this
function to synchronize the overlay removal with the end of jobs
execution (patch 2).

Best regards,
Hervé

Herve Codina (2):
  driver core: Introduce device_link_wait_removal()
  of: overlay: Synchronize of_overlay_remove() with the devlink removals

 drivers/base/core.c    | 26 +++++++++++++++++++++++---
 drivers/of/overlay.c   |  6 ++++++
 include/linux/device.h |  1 +
 3 files changed, 30 insertions(+), 3 deletions(-)

--
2.42.0
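The race described in the cover letter can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the driver core implementation: every name below (model_node, model_workqueue, model_queue_put, model_wait_removal, model_destroy_reports_leak) is hypothetical, and the "workqueue" is a plain array that is only drained on request. It shows why a reference drop that is merely queued is still visible as an extra reference when the destroy-time check runs, and why waiting for the queued jobs first makes the check pass.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a device tree node with a reference count. */
struct model_node {
	int refcount;
};

/* Hypothetical stand-in for one queued devlink release job. */
struct model_job {
	struct model_node *node;
};

#define MODEL_MAX_JOBS 8

/* A trivial "workqueue": jobs are stored now and run later. */
struct model_workqueue {
	struct model_job jobs[MODEL_MAX_JOBS];
	size_t count;
};

/* Queue a job that will drop one reference on @node when it runs. */
static int model_queue_put(struct model_workqueue *wq, struct model_node *node)
{
	if (wq->count >= MODEL_MAX_JOBS)
		return -1;
	wq->jobs[wq->count++].node = node;
	return 0;
}

/* Run every pending job; models waiting for the end of job execution. */
static void model_wait_removal(struct model_workqueue *wq)
{
	for (size_t i = 0; i < wq->count; i++) {
		assert(wq->jobs[i].node->refcount > 0);
		wq->jobs[i].node->refcount--;
	}
	wq->count = 0;
}

/* Models the destroy-time check: returns 1 if a leak would be reported. */
static int model_destroy_reports_leak(const struct model_node *node)
{
	return node->refcount != 1;
}
```

In this model, a node with refcount 2 and one queued put is reported as leaked if the check runs first, and is fine if model_wait_removal() is called before the check. In the real kernel the jobs may also run on their own before the check, which is why the message is not systematic.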
On Thu, Nov 30, 2023 at 06:41:07PM +0100, Herve Codina wrote:
> Hi,

+Saravana for comment

Looks okay to me though.

>
> In the following sequence:
>   of_platform_depopulate(); /* Remove devices from a DT overlay node */
>   of_overlay_remove(); /* Remove the DT overlay node itself */
>
> Some warnings are raised by __of_changeset_entry_destroy() which was
> called from of_overlay_remove():
>   ERROR: memory leak, expected refcount 1 instead of 2 ...
>
> The issue is that, during the device devlink removals triggered from the
> of_platform_depopulate(), jobs are put in a workqueue.
> These jobs drop the reference to the devices. When a device is no more
> referenced (refcount == 0), it is released and the reference to its
> of_node is dropped by a call to of_node_put().
> These operations are fully correct except that, because of the
> workqueue, they are done asynchronously with respect to function calls.
>
> In the sequence provided, the jobs are run too late, after the call to
> __of_changeset_entry_destroy() and so a missing of_node_put() call is
> detected by __of_changeset_entry_destroy().
>
> This series fixes this issue introducing device_link_wait_removal() in
> order to wait for the end of jobs execution (patch 1) and using this
> function to synchronize the overlay removal with the end of jobs
> execution (patch 2).
>
> Best regards,
> Hervé
>
> Herve Codina (2):
>   driver core: Introduce device_link_wait_removal()
>   of: overlay: Synchronize of_overlay_remove() with the devlink removals
>
>  drivers/base/core.c    | 26 +++++++++++++++++++++++---
>  drivers/of/overlay.c   |  6 ++++++
>  include/linux/device.h |  1 +
>  3 files changed, 30 insertions(+), 3 deletions(-)
>
> --
> 2.42.0
>
On Wed, Dec 6, 2023 at 9:15 AM Rob Herring <robh@kernel.org> wrote:
>
> On Thu, Nov 30, 2023 at 06:41:07PM +0100, Herve Codina wrote:
> > Hi,
>
> +Saravana for comment

I'll respond to this within a week -- very swamped at the moment. The
main thing I want to make sure is that we don't cause an indirect
deadlock with this wait(). I'll go back and look at why we added the
work queue and then check for device/devlink locking issues.

-Saravana

>
> Looks okay to me though.
>
> >
> > In the following sequence:
> >   of_platform_depopulate(); /* Remove devices from a DT overlay node */
> >   of_overlay_remove(); /* Remove the DT overlay node itself */
> >
> > Some warnings are raised by __of_changeset_entry_destroy() which was
> > called from of_overlay_remove():
> >   ERROR: memory leak, expected refcount 1 instead of 2 ...
> >
> > The issue is that, during the device devlink removals triggered from the
> > of_platform_depopulate(), jobs are put in a workqueue.
> > These jobs drop the reference to the devices. When a device is no more
> > referenced (refcount == 0), it is released and the reference to its
> > of_node is dropped by a call to of_node_put().
> > These operations are fully correct except that, because of the
> > workqueue, they are done asynchronously with respect to function calls.
> >
> > In the sequence provided, the jobs are run too late, after the call to
> > __of_changeset_entry_destroy() and so a missing of_node_put() call is
> > detected by __of_changeset_entry_destroy().
> >
> > This series fixes this issue introducing device_link_wait_removal() in
> > order to wait for the end of jobs execution (patch 1) and using this
> > function to synchronize the overlay removal with the end of jobs
> > execution (patch 2).
> >
> > Best regards,
> > Hervé
> >
> > Herve Codina (2):
> >   driver core: Introduce device_link_wait_removal()
> >   of: overlay: Synchronize of_overlay_remove() with the devlink removals
> >
> >  drivers/base/core.c    | 26 +++++++++++++++++++++++---
> >  drivers/of/overlay.c   |  6 ++++++
> >  include/linux/device.h |  1 +
> >  3 files changed, 30 insertions(+), 3 deletions(-)
> >
> > --
> > 2.42.0
> >
On Wed, Dec 6, 2023 at 7:09 PM Saravana Kannan <saravanak@google.com> wrote:
>
> On Wed, Dec 6, 2023 at 9:15 AM Rob Herring <robh@kernel.org> wrote:
> >
> > On Thu, Nov 30, 2023 at 06:41:07PM +0100, Herve Codina wrote:
> > > Hi,
> >
> > +Saravana for comment
>
> I'll respond to this within a week -- very swamped at the moment. The
> main thing I want to make sure is that we don't cause an indirect
> deadlock with this wait(). I'll go back and look at why we added the
> work queue and then check for device/devlink locking issues.
>

Sorry about the long delay, but I finally got back to this because
Nuno nudged me to review a similar patch they sent.

I'll leave some easy to address comments in the patches.

-Saravana

> -Saravana
>
> >
> > Looks okay to me though.
> >
> > >
> > > In the following sequence:
> > >   of_platform_depopulate(); /* Remove devices from a DT overlay node */
> > >   of_overlay_remove(); /* Remove the DT overlay node itself */
> > >
> > > Some warnings are raised by __of_changeset_entry_destroy() which was
> > > called from of_overlay_remove():
> > >   ERROR: memory leak, expected refcount 1 instead of 2 ...
> > >
> > > The issue is that, during the device devlink removals triggered from the
> > > of_platform_depopulate(), jobs are put in a workqueue.
> > > These jobs drop the reference to the devices. When a device is no more
> > > referenced (refcount == 0), it is released and the reference to its
> > > of_node is dropped by a call to of_node_put().
> > > These operations are fully correct except that, because of the
> > > workqueue, they are done asynchronously with respect to function calls.
> > >
> > > In the sequence provided, the jobs are run too late, after the call to
> > > __of_changeset_entry_destroy() and so a missing of_node_put() call is
> > > detected by __of_changeset_entry_destroy().
> > >
> > > This series fixes this issue introducing device_link_wait_removal() in
> > > order to wait for the end of jobs execution (patch 1) and using this
> > > function to synchronize the overlay removal with the end of jobs
> > > execution (patch 2).
> > >
> > > Best regards,
> > > Hervé
> > >
> > > Herve Codina (2):
> > >   driver core: Introduce device_link_wait_removal()
> > >   of: overlay: Synchronize of_overlay_remove() with the devlink removals
> > >
> > >  drivers/base/core.c    | 26 +++++++++++++++++++++++---
> > >  drivers/of/overlay.c   |  6 ++++++
> > >  include/linux/device.h |  1 +
> > >  3 files changed, 30 insertions(+), 3 deletions(-)
> > >
> > > --
> > > 2.42.0
> > >
Hello Saravana, Rob, Hervé,
[+Miquèl, who contributed to the discussion with Hervé and me]
On Wed, 6 Dec 2023 19:09:06 -0800
Saravana Kannan <saravanak@google.com> wrote:
> On Wed, Dec 6, 2023 at 9:15 AM Rob Herring <robh@kernel.org> wrote:
> >
> > On Thu, Nov 30, 2023 at 06:41:07PM +0100, Herve Codina wrote:
> > > Hi,
> >
> > +Saravana for comment
>
> I'll respond to this within a week -- very swamped at the moment. The
> main thing I want to make sure is that we don't cause an indirect
> deadlock with this wait(). I'll go back and look at why we added the
> work queue and then check for device/devlink locking issues.

While working on a project unrelated to Hervé's work, I also ended up
getting sporadic but frequent "ERROR: memory leak, expected refcount
1 instead of..." messages, which persisted even after adding this patch
series to my tree.

My use case is the insertion and removal of a simple overlay describing
a regulator-fixed and an I2C GPIO expander using it. The messages appear
regardless of whether the insertion and removal are done from kernel code
or via the configfs interface (out-of-tree patches from [0]).

I reconstructed the sequence of operations, all of which stem from
of_overlay_remove():

int of_overlay_remove(int *ovcs_id)
{
    ...

    device_link_wait_removal(); // proposed by this patch series

    mutex_lock(&of_mutex);

    ...

    ret = __of_changeset_revert_notify(&ovcs->cset);
    // this ends up calling (excerpt from a long stack trace):
    // -> of_i2c_notify
    //  -> device_remove
    //   -> devm_regulator_release
    //    -> device_link_remove
    //     -> devlink_dev_release, which queues work for
    //        device_link_release_fn, which in turn calls:
    //      -> device_put
    //       -> device_release
    //        -> {platform,regulator,...}_dev*_release
    //         -> of_node_put() [**]

    ...

    free_overlay_changeset(ovcs);
    // calls:
    // -> of_changeset_destroy
    //  -> __of_changeset_entry_destroy
    //   -> pr_err("ERROR: memory leak, expected refcount 1 instead of %d...
    // The error appears or not, based on when the workqueue runs

err_unlock:
    mutex_unlock(&of_mutex);

    ...
}

So this adds up to the question of whether devlink removal should actually
be run asynchronously or not.
A simple short-term solution is to move the call to
device_link_wait_removal() later, just before free_overlay_changeset():

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index 1a8a6620748c..eccf08cf2160 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -1375,12 +1375,6 @@ int of_overlay_remove(int *ovcs_id)
 		goto out;
 	}
 
-	/*
-	 * Wait for any ongoing device link removals before removing some of
-	 * nodes
-	 */
-	device_link_wait_removal();
-
 	mutex_lock(&of_mutex);
 
 	ovcs = idr_find(&ovcs_idr, *ovcs_id);
@@ -1427,6 +1421,14 @@ int of_overlay_remove(int *ovcs_id)
 	if (!ret)
 		ret = ret_tmp;
 
+	/*
+	 * Wait for any ongoing device link removals before removing some of
+	 * nodes
+	 */
+	mutex_unlock(&of_mutex);
+	device_link_wait_removal();
+	mutex_lock(&of_mutex);
+
 	free_overlay_changeset(ovcs);
 
 err_unlock:

This obviously raises the question of whether unlocking and re-locking
the mutex is potentially dangerous. I have no answer to this right away,
but I tested this change with CONFIG_PROVE_LOCKING=y and no issue showed
up after several overlay load/unload sequences so I am not aware of any
actual issues with this change.
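
The effect of where the wait is placed can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions, not kernel code: overlay_model, wait_point, model_flush, model_revert_notify and model_overlay_remove are hypothetical names, and the model takes the worst case in which the queued job never runs on its own before the final check. It returns the node refcount that the destroy-time check would see for each placement of the wait.

```c
#include <assert.h>

/* Hypothetical: where the wait for queued jobs is placed. */
enum wait_point {
	WAIT_BEFORE_REVERT,	/* at the top, as in the posted series */
	WAIT_BEFORE_FREE,	/* just before the changeset is freed */
};

/* Hypothetical model of the overlay node and the pending release jobs. */
struct overlay_model {
	int node_refcount;	/* references currently held on the node */
	int pending_puts;	/* queued jobs, each dropping one reference */
};

/* Models waiting for the queued jobs: every pending put is applied. */
static void model_flush(struct overlay_model *m)
{
	assert(m->pending_puts <= m->node_refcount);
	m->node_refcount -= m->pending_puts;
	m->pending_puts = 0;
}

/* Models the revert notification: a device goes away, its put is queued. */
static void model_revert_notify(struct overlay_model *m)
{
	m->pending_puts++;
}

/* Returns the refcount seen by the final check (1 means no leak report). */
static int model_overlay_remove(enum wait_point where)
{
	/* One reference for the changeset, one held through the device. */
	struct overlay_model m = { .node_refcount = 2, .pending_puts = 0 };

	if (where == WAIT_BEFORE_REVERT)
		model_flush(&m);	/* nothing is queued yet */

	model_revert_notify(&m);

	if (where == WAIT_BEFORE_FREE)
		model_flush(&m);	/* the queued put is applied here */

	return m.node_refcount;
}
```

With the wait at the top, the put queued by the notification is still pending at the check and the model reports a refcount of 2; with the wait just before the free, it reports 1. The model says nothing about the locking question raised above.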
[0] https://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-drivers.git/log/?h=topic/overlays
Luca
--
Luca Ceresoli, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
Hi,
On Wed, 20 Dec 2023 18:16:27 +0100
Luca Ceresoli <luca.ceresoli@bootlin.com> wrote:
> Hello Saravana, Rob, Hervé,
>
> [+Miquèl, who contributed to the discussion with Hervé and me]
>
> On Wed, 6 Dec 2023 19:09:06 -0800
> Saravana Kannan <saravanak@google.com> wrote:
>
> > On Wed, Dec 6, 2023 at 9:15 AM Rob Herring <robh@kernel.org> wrote:
> > >
> > > On Thu, Nov 30, 2023 at 06:41:07PM +0100, Herve Codina wrote:
> > > > Hi,
> > >
> > > +Saravana for comment
> >
> > I'll respond to this within a week -- very swamped at the moment. The
> > main thing I want to make sure is that we don't cause an indirect
> > deadlock with this wait(). I'll go back and look at why we added the
> > work queue and then check for device/devlink locking issues.
>
> While working on a project unrelated to Hervé's work, I also ended up
> in getting sporadic but frequent "ERROR: memory leak, expected refcount
> 1 instead of..." messages, which persisted even after adding this patch
> series on my tree.
>
> My use case is the insertion and removal of a simple overlay describing
> a regulator-fixed and an I2C GPIO expander using it. The messages appear
> regardless of whether the insertion and removal is done from kernel code
> or via the configfs interface (out-of-tree patches from [0]).
>
> I reconstructed the sequence of operations, all of which stem from
> of_overlay_remove():
>
> int of_overlay_remove(int *ovcs_id)
> {
> ...
>
> device_link_wait_removal(); // proposed by this patch series
>
> mutex_lock(&of_mutex);
>
> ...
>
> ret = __of_changeset_revert_notify(&ovcs->cset);
> // this ends up calling (excerpt from a long stack trace):
> // -> of_i2c_notify
> // -> device_remove
> // -> devm_regulator_release
> // -> device_link_remove
> // -> devlink_dev_release, which queues work for
> // device_link_release_fn, which in turn calls:
> // -> device_put
> // -> device_release
> // -> {platform,regulator,...}_dev*_release
> // -> of_node_put() [**]
>
> ...
>
> free_overlay_changeset(ovcs);
> // calls:
> // -> of_changeset_destroy
> // -> __of_changeset_entry_destroy
> // -> pr_err("ERROR: memory leak, expected refcount 1 instead of %d...
> // The error appears or not, based on when the workqueue runs
>
> err_unlock:
> mutex_unlock(&of_mutex);
>
> ...
> }
>
> So this adds up to the question of whether devlink removal should actually
> be run asynchronously or not.
>
> A simple short-term solution is to move the call to
> device_link_wait_removal() later, just before free_overlay_changeset():

Indeed, of_overlay_remove() can send notifications and, in Luca's
use case, these lead to device removals and therefore to devlink removals.
That is why we move the synchronization point, i.e. the call to
device_link_wait_removal(), after the notifications, just before
free_overlay_changeset().

>
>
> diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
> index 1a8a6620748c..eccf08cf2160 100644
> --- a/drivers/of/overlay.c
> +++ b/drivers/of/overlay.c
> @@ -1375,12 +1375,6 @@ int of_overlay_remove(int *ovcs_id)
> goto out;
> }
>
> - /*
> - * Wait for any ongoing device link removals before removing some of
> - * nodes
> - */
> - device_link_wait_removal();
> -
> mutex_lock(&of_mutex);
>
> ovcs = idr_find(&ovcs_idr, *ovcs_id);
> @@ -1427,6 +1421,14 @@ int of_overlay_remove(int *ovcs_id)
> if (!ret)
> ret = ret_tmp;
>
> + /*
> + * Wait for any ongoing device link removals before removing some of
> + * nodes
> + */
> + mutex_unlock(&of_mutex);
> + device_link_wait_removal();
> + mutex_lock(&of_mutex);
> +
> free_overlay_changeset(ovcs);
>
> err_unlock:
>
>
> This obviously raises the question of whether unlocking and re-locking
> the mutex is potentially dangerous. I have no answer to this right away,
> but I tested this change with CONFIG_PROVE_LOCKING=y and no issue showed
> up after several overlay load/unload sequences so I am not aware of any
> actual issues with this change.
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-drivers.git/log/?h=topic/overlays
>
> Luca
Thanks Luca for this complementary use-case related to this issue.
Hervé
--
Hervé Codina, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com