[PATCH v2] drm/amdkfd: fix a memory leak in device_queue_manager_init()

Haoxiang Li posted 1 patch 1 month ago
There is a newer version of this series
.../drm/amd/amdkfd/kfd_device_queue_manager.c | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)
[PATCH v2] drm/amdkfd: fix a memory leak in device_queue_manager_init()
Posted by Haoxiang Li 1 month ago
If dqm->ops.initialize() fails, add deallocate_hiq_sdma_mqd()
to release the memory allocated by allocate_hiq_sdma_mqd().
Move deallocate_hiq_sdma_mqd() up to ensure proper function
visibility at the point of use.

Fixes: 11614c36bc8f ("drm/amdkfd: Allocate MQD trunk for HIQ and SDMA")
Cc: stable@vger.kernel.org
Signed-off-by: Haoxiang Li <lihaoxiang@isrc.iscas.ac.cn>
---
Changes in v2:
- Move deallocate_hiq_sdma_mqd() up. Thanks, Felix!
- Add a Fixes tag.
---
 .../drm/amd/amdkfd/kfd_device_queue_manager.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index d7a2e7178ea9..8af0929ca40a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -2919,6 +2919,14 @@ static int allocate_hiq_sdma_mqd(struct device_queue_manager *dqm)
 	return retval;
 }
 
+static void deallocate_hiq_sdma_mqd(struct kfd_node *dev,
+				    struct kfd_mem_obj *mqd)
+{
+	WARN(!mqd, "No hiq sdma mqd trunk to free");
+
+	amdgpu_amdkfd_free_gtt_mem(dev->adev, &mqd->gtt_mem);
+}
+
 struct device_queue_manager *device_queue_manager_init(struct kfd_node *dev)
 {
 	struct device_queue_manager *dqm;
@@ -3042,19 +3050,14 @@ struct device_queue_manager *device_queue_manager_init(struct kfd_node *dev)
 		return dqm;
 	}
 
+	if (!dev->kfd->shared_resources.enable_mes)
+		deallocate_hiq_sdma_mqd(dev, &dqm->hiq_sdma_mqd);
+
 out_free:
 	kfree(dqm);
 	return NULL;
 }
 
-static void deallocate_hiq_sdma_mqd(struct kfd_node *dev,
-				    struct kfd_mem_obj *mqd)
-{
-	WARN(!mqd, "No hiq sdma mqd trunk to free");
-
-	amdgpu_amdkfd_free_gtt_mem(dev->adev, &mqd->gtt_mem);
-}
-
 void device_queue_manager_uninit(struct device_queue_manager *dqm)
 {
 	dqm->ops.stop(dqm);
-- 
2.25.1
Re: [PATCH v2] drm/amdkfd: fix a memory leak in device_queue_manager_init()
Posted by Markus Elfring 1 month ago
…
> Move deallocate_hiq_sdma_mqd() up to ensure proper function
> visibility at the point of use.

…
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -2919,6 +2919,14 @@ static int allocate_hiq_sdma_mqd(struct device_queue_manager *dqm)
>  	return retval;
>  }
>  
> +static void deallocate_hiq_sdma_mqd(struct kfd_node *dev,
> +				    struct kfd_mem_obj *mqd)
> +{
> +	WARN(!mqd, "No hiq sdma mqd trunk to free");
> +
> +	amdgpu_amdkfd_free_gtt_mem(dev->adev, &mqd->gtt_mem);
> +}

Is there also a need to reconsider the implementation of the applied
null pointer check here?

Regards,
Markus
Re: [PATCH v2] drm/amdkfd: fix a memory leak in device_queue_manager_init()
Posted by Felix Kuehling 1 month ago
On 2026-01-06 04:30, Markus Elfring wrote:
> …
>> Move deallocate_hiq_sdma_mqd() up to ensure proper function
>> visibility at the point of use.
> …
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>> @@ -2919,6 +2919,14 @@ static int allocate_hiq_sdma_mqd(struct device_queue_manager *dqm)
>>   	return retval;
>>   }
>>   
>> +static void deallocate_hiq_sdma_mqd(struct kfd_node *dev,
>> +				    struct kfd_mem_obj *mqd)
>> +{
>> +	WARN(!mqd, "No hiq sdma mqd trunk to free");
>> +
>> +	amdgpu_amdkfd_free_gtt_mem(dev->adev, &mqd->gtt_mem);
>> +}
> Is there also a need to reconsider the implementation of the applied
> null pointer check here?

Yeah, we have a WARN if mqd is NULL but then we still call 
amdgpu_amdkfd_free_gtt_mem. There is a NULL pointer check in 
amdgpu_amdkfd_free_gtt_mem, but &mqd->gtt_mem won't be NULL because it's 
not at the start of struct kfd_mem_obj. So you'd get a kernel oops if 
mqd is ever NULL here.

That said, I've never seen anyone complain about this WARN and a 
subsequent kernel oops.

The only other place that calls deallocate_hiq_sdma_mqd is 
device_queue_manager_uninit (only if not using MES). It can only be 
called with a valid dqm if device_queue_manager_init succeeded, which 
always ends with a valid dqm->hiq_sdma_mqd if not using MES.

My conclusion is that this WARN is just unnecessary. But it's also harmless.

Regards,
   Felix


>
> Regards,
> Markus
Re: [v2] drm/amdkfd: fix a memory leak in device_queue_manager_init()
Posted by Markus Elfring 1 month ago
> My conclusion is that this WARN is just unnecessary.

Would you like to omit such a questionable macro call then?


> But it's also harmless.

How do you think about to avoid special development concerns here?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/coding-style.rst?h=v6.19-rc4#n1242

Regards,
Markus
Re: [v2] drm/amdkfd: fix a memory leak in device_queue_manager_init()
Posted by Kuehling, Felix 1 month ago
On 2026-01-09 03:37, Markus Elfring wrote:
>> My conclusion is that this WARN is just unnecessary.
> Would you like to omit such a questionable macro call then?

I don't feel strongly about it. I already submitted Haoxiang's patch.


>
>
>> But it's also harmless.
> How do you think about to avoid special development concerns here?
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/coding-style.rst?h=v6.19-rc4#n1242

No. I think the WARN is used exactly as it was meant to be here: to 
check for something that should never happen.

Regards,
   Felix


>
> Regards,
> Markus
Re: [v2] drm/amdkfd: fix a memory leak in device_queue_manager_init()
Posted by Markus Elfring 1 month ago
>>> But it's also harmless.
>> How do you think about to avoid special development concerns here?
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/coding-style.rst?h=v6.19-rc4#n1242
> 
> No. I think the WARN is used exactly as it was meant to be here: to check for something that should never happen.

Do we stumble on another target conflict at such a source code place?

Would you like to avoid undefined behaviour here?
https://wiki.sei.cmu.edu/confluence/spaces/c/pages/87152449/EXP34-C.+Do+not+dereference+null+pointers

Regards,
Markus