[RFC v2 0/3] iommu/intel: Free empty page tables on unmaps

Pasha Tatashin posted 3 patches 1 week, 4 days ago
drivers/iommu/intel/iommu.c | 154 ++++++++++++++++++++++++++++--------
drivers/iommu/intel/iommu.h |  42 ++++++++--
drivers/iommu/iommu-pages.h |  30 +++++--
3 files changed, 180 insertions(+), 46 deletions(-)
Posted by Pasha Tatashin 1 week, 4 days ago
Changelog
================================================================
v2: Use mapcount instead of refcount
    Synchronized with IOMMU Observability changes.
================================================================

This series frees empty page tables on unmaps. It is intended to be a
low-overhead feature.

A read-write lock is used to synchronize page table updates, but most of
the time the lock is held as a reader. It is held as a writer only for a
short period of time, when unmapping a page that is bigger than the
current iova request. In all other cases the lock is taken in read mode.

page->_mapcount is used to track the number of entries in each page
table.
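
A minimal single-threaded sketch of the counting idea, assuming a table
of 512 eight-byte entries and a plain int counter in place of the
per-page counter. The names table_sketch, table_set() and table_clear()
are made up for this example and do not appear in the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define ENTRIES_PER_TABLE 512	/* assumption: 4K table, 8-byte entries */

struct table_sketch {
	uint64_t entries[ENTRIES_PER_TABLE];
	int nr_entries;		/* stands in for the per-page counter */
};

static struct table_sketch *table_alloc(void)
{
	return calloc(1, sizeof(struct table_sketch));
}

/* Install an entry; count it only on an empty -> present transition. */
static void table_set(struct table_sketch *t, unsigned int idx, uint64_t val)
{
	assert(idx < ENTRIES_PER_TABLE && val != 0);
	if (t->entries[idx] == 0)
		t->nr_entries++;
	t->entries[idx] = val;
}

/*
 * Clear an entry. If that was the last one, free the table, set *tp to
 * NULL and return true so the caller can clear the parent entry too.
 */
static bool table_clear(struct table_sketch **tp, unsigned int idx)
{
	struct table_sketch *t = *tp;

	assert(idx < ENTRIES_PER_TABLE);
	if (t->entries[idx] == 0)
		return false;

	t->entries[idx] = 0;
	if (--t->nr_entries > 0)
		return false;

	free(t);
	*tp = NULL;
	return true;
}
```

The sketch ignores concurrency on purpose; in the series that part is
covered by the lock described above.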

Microbenchmark data using iova_stress[1]:

Base:
$ ./iova_stress -s 16
dma_size:       4K iova space: 16T iommu: ~  32847M time:   36.074s

Fix:
$ ./iova_stress -s 16
dma_size:       4K iova space: 16T iommu: ~     27M time:   38.870s

The test maps and unmaps 4K pages, cycling through the IOVA space in a
tight loop.
Base uses ~32G of memory, and the test completes in 36.074s.
Fix uses ~27M of memory (0G when rounded), and the test completes in 38.870s.
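
To put the two runs side by side: with the figures reported above, the
fix is roughly 7.75% slower and uses on the order of 1200 times less
memory. The helper names below are hypothetical; the arithmetic is
simply (fix - base) / base and base / fix.

```c
#include <assert.h>

/* Relative change from base to fix, in percent (positive means slower). */
static double percent_change(double base, double fix)
{
	assert(base > 0.0);
	return (fix - base) / base * 100.0;
}

/* How many times smaller fix is than base. */
static double reduction_factor(double base, double fix)
{
	assert(fix > 0.0);
	return base / fix;
}
```

For example, percent_change(36.074, 38.870) gives the time overhead and
reduction_factor(32847.0, 27.0) the memory reduction, both taken from
the approximate values the tool printed.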

I believe the proposed fix is a good compromise between complexity and
scalability. A more scalable solution would be to use a separate
read/write lock for each page table, and use the page->private field to
store the lock itself.

However, since the iommu already has some protection (i.e. no one else
touches the iova space of the requested map/unmap), we can avoid the
extra complexity, rely on a single per page table RW lock, and stay in
reader mode most of the time.

[1] https://github.com/soleen/iova_stress

Pasha Tatashin (3):
  iommu/intel: Use page->_mapcount to count number of entries in IOMMU
  iommu/intel: synchronize page table map and unmap operations
  iommu/intel: free empty page tables on unmaps

 drivers/iommu/intel/iommu.c | 154 ++++++++++++++++++++++++++++--------
 drivers/iommu/intel/iommu.h |  42 ++++++++--
 drivers/iommu/iommu-pages.h |  30 +++++--
 3 files changed, 180 insertions(+), 46 deletions(-)

-- 
2.44.0.769.g3c40516874-goog
Re: [RFC v2 0/3] iommu/intel: Free empty page tables on unmaps
Posted by David Hildenbrand 1 week, 4 days ago
On 26.04.24 05:43, Pasha Tatashin wrote:
> Changelog
> ================================================================
> v2: Use mapcount instead of refcount
>      Synchronized with IOMMU Observability changes.
> ================================================================
> 
> This series frees empty page tables on unmaps. It intends to be a
> low overhead feature.
> 
> The read-writer lock is used to synchronize page table, but most of
> time the lock is held is reader. It is held as a writer for short
> period of time when unmapping a page that is bigger than the current
> iova request. For all other cases this lock is read-only.
> 
> page->mapcount is used in order to track number of entries at each page
> table.

I'm wondering if this will conflict with page_type at some point? We're 
already converting other page table users to ptdesc. CCing Willy.

-- 
Cheers,

David / dhildenb
Re: [RFC v2 0/3] iommu/intel: Free empty page tables on unmaps
Posted by Pasha Tatashin 1 week, 4 days ago
On Fri, Apr 26, 2024 at 2:42 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 26.04.24 05:43, Pasha Tatashin wrote:
> > Changelog
> > ================================================================
> > v2: Use mapcount instead of refcount
> >      Synchronized with IOMMU Observability changes.
> > ================================================================
> >
> > This series frees empty page tables on unmaps. It intends to be a
> > low overhead feature.
> >
> > The read-writer lock is used to synchronize page table, but most of
> > time the lock is held is reader. It is held as a writer for short
> > period of time when unmapping a page that is bigger than the current
> > iova request. For all other cases this lock is read-only.
> >
> > page->mapcount is used in order to track number of entries at each page
> > table.
>
> I'm wondering if this will conflict with page_type at some point? We're
> already converting other page table users to ptdesc. CCing Willy.

Hi David,

This contradicts the following comment in mm_types.h:
 * If your page will not be mapped to userspace, you can also use the four
 * bytes in the mapcount union, but you must call page_mapcount_reset()
 * before freeing it.

Thank you,
Pasha
Re: [RFC v2 0/3] iommu/intel: Free empty page tables on unmaps
Posted by David Hildenbrand 1 week, 4 days ago
On 26.04.24 15:49, Pasha Tatashin wrote:
> On Fri, Apr 26, 2024 at 2:42 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 26.04.24 05:43, Pasha Tatashin wrote:
>>> Changelog
>>> ================================================================
>>> v2: Use mapcount instead of refcount
>>>       Synchronized with IOMMU Observability changes.
>>> ================================================================
>>>
>>> This series frees empty page tables on unmaps. It intends to be a
>>> low overhead feature.
>>>
>>> The read-writer lock is used to synchronize page table, but most of
>>> time the lock is held is reader. It is held as a writer for short
>>> period of time when unmapping a page that is bigger than the current
>>> iova request. For all other cases this lock is read-only.
>>>
>>> page->mapcount is used in order to track number of entries at each page
>>> table.
>>
>> I'm wondering if this will conflict with page_type at some point? We're
>> already converting other page table users to ptdesc. CCing Willy.
> 
> Hi David,

Hi!

> 
> This contradicts with the following comment in mm_types.h:
>   * If your page will not be mapped to userspace, you can also use the four
>   * bytes in the mapcount union, but you must call page_mapcount_reset()
>   * before freeing it.

I think the documentation is a bit outdated, because we now have page 
types that are: "For pages that are never mapped to userspace"

which includes

#define PG_table

(we should update that comment, because we're now also using it for 
hugetlb that can be mapped to user space, which is fine.)

Right now, using page->_mapcount would likely still be fine, as long as 
you cannot end up creating a value that would resemble a type (e.g., 
PG_offline could be bad).
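
To make the "resemble a type" concern concrete, here is a hedged sketch.
It assumes the encoding used by include/linux/page-flags.h around the
time of this thread: the type word shares storage with _mapcount, starts
out as all ones, and a type is set by clearing its bit while the top
nibble stays 0xf. The SK_ constants are copies made for this example and
should be checked against the tree in use; sk_resembles_type() is a
hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; verify against include/linux/page-flags.h. */
#define SK_PAGE_TYPE_BASE	0xf0000000u
#define SK_PG_OFFLINE		0x00000100u
#define SK_PG_TABLE		0x00000200u

/*
 * True if the raw 32-bit word would be read as "type flag is set":
 * top nibble all ones and the flag's bit cleared.
 */
static bool sk_resembles_type(uint32_t raw, uint32_t flag)
{
	return (raw & (SK_PAGE_TYPE_BASE | flag)) == SK_PAGE_TYPE_BASE;
}
```

Under these assumptions, small non-negative counts such as 0..512 never
look like a type because their top nibble is zero, and neither does -1.
A counter that underflowed to -257 (0xfffffeff) would, however, look
like PG_offline.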

But staring at users of _mapcount and page_mapcount_reset() ... you'd be 
pretty much the only user of that.

mm/zsmalloc.c calls page_mapcount_reset(), and I am not completely sure 
why ... I can see it touch page->index but not page->_mapcount.


Hopefully Willy can comment.

-- 
Cheers,

David / dhildenb

Re: [RFC v2 0/3] iommu/intel: Free empty page tables on unmaps
Posted by Matthew Wilcox 1 week, 3 days ago
On Fri, Apr 26, 2024 at 04:39:05PM +0200, David Hildenbrand wrote:
> On 26.04.24 15:49, Pasha Tatashin wrote:
> > On Fri, Apr 26, 2024 at 2:42 AM David Hildenbrand <david@redhat.com> wrote:
> > > 
> > > On 26.04.24 05:43, Pasha Tatashin wrote:
> > > > Changelog
> > > > ================================================================
> > > > v2: Use mapcount instead of refcount
> > > >       Synchronized with IOMMU Observability changes.
> > > > ================================================================
> > > > 
> > > > This series frees empty page tables on unmaps. It intends to be a
> > > > low overhead feature.
> > > > 
> > > > The read-writer lock is used to synchronize page table, but most of
> > > > time the lock is held is reader. It is held as a writer for short
> > > > period of time when unmapping a page that is bigger than the current
> > > > iova request. For all other cases this lock is read-only.
> > > > 
> > > > page->mapcount is used in order to track number of entries at each page
> > > > table.
> > > 
> > > I'm wondering if this will conflict with page_type at some point? We're
> > > already converting other page table users to ptdesc. CCing Willy.
> > 
> > Hi David,
> 
> Hi!
> 
> > 
> > This contradicts with the following comment in mm_types.h:
> >   * If your page will not be mapped to userspace, you can also use the four
> >   * bytes in the mapcount union, but you must call page_mapcount_reset()
> >   * before freeing it.
> 
> I think the documentation is a bit outdated, because we now have page types
> that are: "For pages that are never mapped to userspace"
> 
> which includes
> 
> #define PG_table
> 
> (we should update that comment, because we're now also using it for hugetlb
> that can be mapped to user space, which is fine.)
> 
> Right now, using page->_mapcount would likely still be fine, as long as you
> cannot end up creating a value that would resemble a type (e.g., PG_offline
> could be bad).
> 
> But staring at users of _mapcount and page_mapcount_reset() ... you'd be
> pretty much the only user of that.
> 
> mm/zsmalloc.c calls page_mapcount_reset(), and I am not completely sure why
> ... I can see it touch page->index but not page->_mapcount.
> 
> 
> Hopefully Willy can comment.

I feel like I have to say "no" to Pasha far too often ;-(

Agreed the documentation is out of date.

I think there's a lot of space in the struct page that can be used.
These are iommu page tables, not cpu page tables, so things are a bit
different for them.  But should they be converted to use ptdesc?  Maybe!

I'd suggest putting this into the union with pt_mm and pt_frag_refcount.
I think it could even go in the union with pt_list, but I think I'd
rather see it in the pt_mm union.
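
As a rough picture of that suggestion, the sketch below places an entry
counter in the same union as pt_mm and pt_frag_refcount. It is not the
kernel's struct ptdesc: the surrounding layout is reduced to one field,
struct mm_sketch and atomic_sketch_t are stand-ins for struct mm_struct
and atomic_t, and iommu_nr_entries is a hypothetical member name.

```c
#include <assert.h>
#include <stddef.h>

struct mm_sketch;				/* stand-in for struct mm_struct */
typedef struct { int counter; } atomic_sketch_t;	/* stand-in for atomic_t */

/* Heavily reduced, illustrative layout only. */
struct ptdesc_sketch {
	unsigned long flags;
	union {
		struct mm_sketch *pt_mm;
		atomic_sketch_t pt_frag_refcount;
		atomic_sketch_t iommu_nr_entries;	/* hypothetical member */
	};
};

static void sk_entries_set(struct ptdesc_sketch *pt, int n)
{
	pt->iommu_nr_entries.counter = n;
}

static int sk_entries_get(const struct ptdesc_sketch *pt)
{
	return pt->iommu_nr_entries.counter;
}
```

Because the members share storage, the counter costs no extra space, but
it also means an IOMMU page table using it could not use pt_mm or
pt_frag_refcount at the same time.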
Re: [RFC v2 0/3] iommu/intel: Free empty page tables on unmaps
Posted by Jason Gunthorpe 1 week, 1 day ago
On Fri, Apr 26, 2024 at 08:39:14PM +0100, Matthew Wilcox wrote:

> I think there's a lot of space in the struct page that can be used.
> These are iommu page tables, not cpu page tables, so things are a bit
> different for them.  But should they be converted to use ptdesc?  Maybe!

Definitely! Someday we will need more stuff in here...

Jason