When exporting dma-bufs to other devices, even when it is allowed to use
move_notify in some drivers, performance will degrade severely when
eviction happens.

A particular example where this can happen is in a multi-card setup,
where PCI-E peer-to-peer is used to avoid going through system memory.
If the buffer is evicted to system memory, not only is the evicting GPU
where the buffer resided affected, but the GPU waiting on the buffer
will also stall.

It also makes sense for long-running jobs not to be preempted by having
their buffers evicted, so it makes sense to have the ability to pin
system memory too. This depends on patches by Dave Airlie, so it's not
part of this series yet; I'm planning to extend pinning to the memory
cgroup controller in the future to handle this case.

Implementation details:

For each cgroup up to the root cgroup, the 'min' limit is checked
against the currently effectively pinned value. If the value would go
above 'min', the pinning attempt is rejected.

Pinned memory is handled slightly differently and affects the
calculation of effective min/low values. Pinned memory is subtracted
from both, and needs to be added back afterwards when calculating.
This is because as the amount of pinned memory increases, the amount of
free min/low memory decreases for all cgroups in the hierarchy.

Maarten Lankhorst (3):
  page_counter: Allow for pinning some amount of memory
  cgroup/dmem: Implement pinning device memory
  drm/xe: Add DRM_XE_GEM_CREATE_FLAG_PINNED flag and implementation

 drivers/gpu/drm/xe/xe_bo.c      | 66 +++++++++++++++++++++-
 drivers/gpu/drm/xe/xe_dma_buf.c | 10 +++-
 include/linux/cgroup_dmem.h     |  2 +
 include/linux/page_counter.h    |  8 +++
 include/uapi/drm/xe_drm.h       | 10 +++-
 kernel/cgroup/dmem.c            | 57 ++++++++++++++++++-
 mm/page_counter.c               | 98 ++++++++++++++++++++++++++++++---
 7 files changed, 237 insertions(+), 14 deletions(-)

-- 
2.50.0
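[Editor's note: to make the implementation notes above concrete, here is a
minimal sketch of the hierarchical pin check. page_counter_try_pin() and
the 'pinned' field are hypothetical names, and the real series compares
against the *effective* min rather than the raw value; the actual changes
to mm/page_counter.c may look quite different.]

#include <linux/page_counter.h>

/* Hypothetical: charge @nr_pages as pinned, walking up to the root. */
static bool page_counter_try_pin(struct page_counter *counter,
                                 unsigned long nr_pages)
{
        struct page_counter *c, *fail;

        /* Walk up to the root, charging 'pinned' at every level. */
        for (c = counter; c; c = c->parent) {
                unsigned long new = atomic_long_add_return(nr_pages,
                                                           &c->pinned);

                /* Reject if pinned memory would exceed this level's 'min'. */
                if (new > READ_ONCE(c->min)) {
                        fail = c;
                        goto undo;
                }
        }
        return true;

undo:
        /* Roll back the charge on every level we touched, 'fail' included. */
        for (c = counter; c != fail->parent; c = c->parent)
                atomic_long_sub(nr_pages, &c->pinned);
        return false;
}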
Hi,

On 8/19/25 13:49, Maarten Lankhorst wrote:
[...]
> Implementation details:
>
> For each cgroup up to the root cgroup, the 'min' limit is checked
> against the currently effectively pinned value. If the value would go
> above 'min', the pinning attempt is rejected.

Why do you want to reject pins in this case? What happens in desktop
usecases (e.g. PRIME buffer sharing)? AFAIU, you kind of need to be able
to pin buffers and export them to other devices for that whole thing to
work, right? If the user doesn't explicitly set a min value, wouldn't
the value being zero mean any pins will be rejected (and thus PRIME
would break)?

If your objective is to prevent pinned buffers from being evicted,
perhaps you could instead make TTM try to avoid evicting pinned buffers
and prefer unpinned buffers as long as there are unpinned buffers to
evict? As long as the total amount of pinned memory stays below min, no
pinned buffers should get evicted with that either.

Best,
Natalie
Hello Natalie,

On 2025-09-01 14:45, Natalie Vock wrote:
[...]
> If your objective is to prevent pinned buffers from being evicted,
> perhaps you could instead make TTM try to avoid evicting pinned buffers
> and prefer unpinned buffers as long as there are unpinned buffers to
> evict? As long as the total amount of pinned memory stays below min, no
> pinned buffers should get evicted with that either.

That would be setting an eviction priority. That can be done, but it
gives no guarantee that memory will not be evicted.

Kind regards,
~Maarten
Hi,

On Mon, 2025-09-01 at 14:45 +0200, Natalie Vock wrote:
[...]
> Why do you want to reject pins in this case? What happens in desktop
> usecases (e.g. PRIME buffer sharing)? AFAIU, you kind of need to be
> able to pin buffers and export them to other devices for that whole
> thing to work, right? If the user doesn't explicitly set a min value,
> wouldn't the value being zero mean any pins will be rejected (and thus
> PRIME would break)?

That's really the point. If an unprivileged malicious process is
allowed to pin arbitrary amounts of memory, that's a DoS vector.

However, drivers that allow unlimited pinning today need to take care
when implementing restrictions to avoid regressions, perhaps by adding
this behind a config option.

That said, IMO dma-buf clients should implement move_notify() whenever
possible to provide an option to avoid pinning unless necessary.

/Thomas
On 19.08.25 13:49, Maarten Lankhorst wrote:
[...]
> Pinned memory is handled slightly differently and affects the
> calculation of effective min/low values. Pinned memory is subtracted
> from both, and needs to be added back afterwards when calculating.

The term "pinning" is overloaded, and frequently we refer to
pin_user_pages() and friends.

So I'm wondering if there is an alternative term to describe what you
want to achieve.

Is it something like "unevictable"?

-- 
Cheers

David / dhildenb
Hello David,

On 2025-09-01 14:25, David Hildenbrand wrote:
[...]
> The term "pinning" is overloaded, and frequently we refer to
> pin_user_pages() and friends.
>
> So I'm wondering if there is an alternative term to describe what you
> want to achieve.
>
> Is it something like "unevictable"?

It could be required to include a call to pin_user_pages(), in case a
process wants to pin from a user's address space to the gpu. It's not
done yet, but it wouldn't surprise me if we want to include it in the
future. Functionally it's similar to mlock() and related functions.

Perhaps call it mlocked instead?

Kind regards,
~Maarten Lankhorst
Hi,

On Mon, 2025-09-01 at 20:16 +0200, Maarten Lankhorst wrote:
[...]
> It could be required to include a call to pin_user_pages(), in case a
> process wants to pin from a user's address space to the gpu.
>
> It's not done yet, but it wouldn't surprise me if we want to include
> it in the future. Functionally it's similar to mlock() and related
> functions.
>
> Perhaps call it mlocked instead?

I was under the impression that mlocked memory can be migrated to other
physical memory but not to swap, whereas pinned memory needs to remain
the exact same physical memory.

IMO "pinned" is pretty established within GPU drivers (dma-buf, TTM)
and essentially means the same as "pin" in "pin_user_pages", so
inventing a new name would probably cause even more confusion?

Thanks,
Thomas
On 01.09.25 20:21, Thomas Hellström wrote:
[...]
>> It could be required to include a call to pin_user_pages(), in case a

We'll only care about long-term pinnings (i.e., FOLL_LONGTERM).
Ordinary short-term pinning is just fine.

(see how even "pinning" is overloaded? :) )

>> process wants to pin from a user's address space to the gpu.
>>
>> It's not done yet, but it wouldn't surprise me if we want to include
>> it in the future. Functionally it's similar to mlock() and related
>> functions.

Traditionally, vfio, io_uring and rdma do exactly that: they use GUP to
longterm pin and then account that memory towards RLIMIT_MEMLOCK.

If you grep for "rlimit(RLIMIT_MEMLOCK)", you'll see what I mean.

There are known issues with that: imagine long-term pinning the same
folio through GUP with 2 interfaces (e.g., vfio, io_uring, rdma), or
within the same interface. You'd account the memory multiple times,
which is horrible. And so far there is no easy way out.

>> Perhaps call it mlocked instead?
>
> I was under the impression that mlocked memory can be migrated to
> other physical memory but not to swap, whereas pinned memory needs to
> remain the exact same physical memory.

Yes, exactly.

> IMO "pinned" is pretty established within GPU drivers (dma-buf, TTM)
> and essentially means the same as "pin" in "pin_user_pages", so
> inventing a new name would probably cause even more confusion?

If it's the same thing, absolutely. But Maarten said "It's not done
yet, but it wouldn't surprise me if we want to include it in the
future".

So how is the memory we are talking about in this series "pinned"?

-- 
Cheers

David / dhildenb
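[Editor's note: for reference, a minimal sketch of the RLIMIT_MEMLOCK
accounting pattern David describes, using the in-tree helpers
account_locked_vm() and pin_user_pages_fast(). The wrapper function
itself is hypothetical; vfio, io_uring and rdma each implement their own
variant of this, with more careful partial-pin handling.]

#include <linux/mm.h>
#include <linux/sched.h>

static long longterm_pin_and_account(unsigned long uaddr,
                                     unsigned long nr_pages,
                                     struct page **pages)
{
        long pinned;
        int ret;

        /* Charge the pages against the task's RLIMIT_MEMLOCK up front. */
        ret = account_locked_vm(current->mm, nr_pages, true);
        if (ret)
                return ret;

        /* FOLL_LONGTERM: the pin may be held indefinitely. */
        pinned = pin_user_pages_fast(uaddr, nr_pages,
                                     FOLL_WRITE | FOLL_LONGTERM, pages);
        if (pinned < 0)
                /* Undo the accounting if pinning failed outright. */
                account_locked_vm(current->mm, nr_pages, false);

        return pinned;
}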
On Mon, 2025-09-01 at 20:38 +0200, David Hildenbrand wrote:
> On 01.09.25 20:21, Thomas Hellström wrote:
[...]
>>> Perhaps call it mlocked instead?
>>
>> I was under the impression that mlocked memory can be migrated to
>> other physical memory but not to swap, whereas pinned memory needs
>> to remain the exact same physical memory.
>
> Yes, exactly.
>
>> IMO "pinned" is pretty established within GPU drivers (dma-buf, TTM)
>> and essentially means the same as "pin" in "pin_user_pages", so
>> inventing a new name would probably cause even more confusion?
>
> If it's the same thing, absolutely. But Maarten said "It's not done
> yet, but it wouldn't surprise me if we want to include it in the
> future".
>
> So how is the memory we are talking about in this series "pinned"?

Reading the cover letter from Maarten, he only talks about pinning
affecting performance, which would be similar to user-space calling
mlock(), although I doubt that moving content to other physical pages
within the same memory type will be a near-term use-case.

However, what's more important are situations where a device (like
RDMA) needs to pin because it can't handle the case where access is
interrupted and content is transferred to another physical location.

Perhaps Maarten could elaborate on whether this series is intended for
both these use-cases?

/Thomas
Hey,

On 2025-09-02 15:42, Thomas Hellström wrote:
> On Mon, 2025-09-01 at 20:38 +0200, David Hildenbrand wrote:
[...]
>> So how is the memory we are talking about in this series "pinned"?
>
> Reading the cover letter from Maarten, he only talks about pinning
> affecting performance, which would be similar to user-space calling
> mlock(), although I doubt that moving content to other physical pages
> within the same memory type will be a near-term use-case.
>
> However, what's more important are situations where a device (like
> RDMA) needs to pin because it can't handle the case where access is
> interrupted and content is transferred to another physical location.
>
> Perhaps Maarten could elaborate on whether this series is intended for
> both these use-cases?

Yeah, this is definitely for the latter case too. It's a performance
optimization for the generic case, and very nice to have for the second
case, to prevent unlimited vram pinning. With cgroups, we would be able
to limit the amount of memory used there.

Kind regards,
~Maarten
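[Editor's note: as a hedged illustration of that last point, an
administrator could reserve 'min'-protected (and thus pinnable) VRAM for
a cgroup by writing to the dmem controller's dmem.min file. The cgroup
path and the region name below are assumptions for illustration; the
exact region naming depends on how the driver registers its regions.]

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* Assumed cgroup path and dmem region name. */
        const char *path = "/sys/fs/cgroup/gpujob/dmem.min";
        const char *line = "drm/0000:03:00.0/vram0 1073741824\n";
        int fd = open(path, O_WRONLY);

        if (fd < 0) {
                perror("open dmem.min");
                return 1;
        }
        /* Reserve 1 GiB of protected VRAM for this cgroup. */
        if (write(fd, line, strlen(line)) < 0)
                perror("write dmem.min");
        close(fd);
        return 0;
}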
Hello Maarten.

On Tue, Aug 19, 2025 at 01:49:33PM +0200, Maarten Lankhorst <dev@lankhorst.se> wrote:
> Implementation details:
>
> For each cgroup up to the root cgroup, the 'min' limit is checked
> against the currently effectively pinned value. If the value would go
> above 'min', the pinning attempt is rejected.

How is pinning different from setting a 'min' limit (from a user
perspective)?

> Pinned memory is handled slightly differently and affects the
> calculation of effective min/low values. Pinned memory is subtracted
> from both, and needs to be added back afterwards when calculating.
>
> This is because as the amount of pinned memory increases, the amount
> of free min/low memory decreases for all cgroups in the hierarchy.

What is supposed to happen with pinned memory after cgroup removal?

I find the page_counter changes a little bit complex without an
understanding of the difference between min and pinned. Should this be
conceptually similar to memory.stat:unevictable? Or rather mlock(2)? So
far neither of those needed interaction with min/low values (in memcg).

Thanks,
Michal
Hey,

On 2025-08-26 16:20, Michal Koutný wrote:
> Hello Maarten.
>
> On Tue, Aug 19, 2025 at 01:49:33PM +0200, Maarten Lankhorst <dev@lankhorst.se> wrote:
>> Implementation details:
>>
>> For each cgroup up to the root cgroup, the 'min' limit is checked
>> against the currently effectively pinned value. If the value would go
>> above 'min', the pinning attempt is rejected.
>
> How is pinning different from setting a 'min' limit (from a user
> perspective)?

It's related; in fact you have to set the 'min' limit first. 'pinned'
allows you to pick /which/ memory falls under the 'min' limit.

[...]
> What is supposed to happen with pinned memory after cgroup removal?

I think for accounting purposes pinned memory stays pinned, otherwise
the idea of pinning is lost. However, when you kill all processes in
the cgroup, that should resolve itself eventually.

> I find the page_counter changes a little bit complex without an
> understanding of the difference between min and pinned. Should this be
> conceptually similar to memory.stat:unevictable? Or rather mlock(2)?
> So far neither of those needed interaction with min/low values (in
> memcg).

You could in theory implement mlockall using the 'min' values too.

The page counter changes implement the following. Let's say you have
this tree with 'min' values:

         / '5' A
 X '6' --  '5' B
         \ '5' C

Effective min without pinned pages:

         / '2' A
 X '6' --  '2' B
         \ '2' C

Now 'B' pins 3 pages. Effective min:

            / '1' A
 X '3+3p' --  '1' B (1 + 3 pinned pages makes effective min 4)
            \ '1' C

The same applies to the effective 'low' calculations.

Kind regards,
~Maarten
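[Editor's note: a compact way to read that example is the following
sketch of the adjusted distribution. The helper name and signature are
hypothetical; the real logic in mm/page_counter.c
(page_counter_calculate_protection() and friends) is more involved,
since it also scales protection by actual usage.]

static unsigned long effective_min(unsigned long parent_emin,
                                   unsigned long subtree_pinned,
                                   unsigned long my_min,
                                   unsigned long my_pinned,
                                   unsigned long siblings_min_sum)
{
        /*
         * Pinned pages in the subtree are carved out of the parent's
         * protection pool before it is distributed: X's pool of 6
         * shrinks to 3 once B pins 3 pages.
         */
        unsigned long pool = parent_emin - subtree_pinned;

        /*
         * The remaining pool is split proportionally to each child's
         * 'min': 3 * 5 / 15 = 1 each for A, B and C.
         */
        unsigned long share = pool * my_min / siblings_min_sum;

        /*
         * Each child's own pinned pages are added back on top:
         * B ends up with 1 + 3 = 4.
         */
        return share + my_pinned;
}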