Improve alloc_migration_target_by_mpol()'s treatment of MPOL_INTERLEAVE.

Make an effort in do_mbind(), to identify the correct interleave index
for the first page to be migrated, so that it and all subsequent pages
from the same vma will be targeted to precisely their intended nodes.
Pages from following vmas will still be interleaved from the requested
nodemask, but perhaps starting from a different base.

Whether this is worth doing at all, or worth improving further, is
arguable: queue_folio_required() is right not to care about the precise
placement on interleaved nodes; but this little effort seems appropriate.
Signed-off-by: Hugh Dickins <hughd@google.com>
---
mm/mempolicy.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 46 insertions(+), 3 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a7b34b9c00ef..b01922e88548 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -430,6 +430,11 @@ static bool strictly_unmovable(unsigned long flags)
MPOL_MF_STRICT;
}
+struct migration_mpol { /* for alloc_migration_target_by_mpol() */
+ struct mempolicy *pol;
+ pgoff_t ilx;
+};
+
struct queue_pages {
struct list_head *pagelist;
unsigned long flags;
@@ -1178,8 +1183,9 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
static struct folio *alloc_migration_target_by_mpol(struct folio *src,
unsigned long private)
{
- struct mempolicy *pol = (struct mempolicy *)private;
- pgoff_t ilx = 0; /* improve on this later */
+ struct migration_mpol *mmpol = (struct migration_mpol *)private;
+ struct mempolicy *pol = mmpol->pol;
+ pgoff_t ilx = mmpol->ilx;
struct page *page;
unsigned int order;
int nid = numa_node_id();
@@ -1234,6 +1240,7 @@ static long do_mbind(unsigned long start, unsigned long len,
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma, *prev;
struct vma_iterator vmi;
+ struct migration_mpol mmpol;
struct mempolicy *new;
unsigned long end;
long err;
@@ -1314,9 +1321,45 @@ static long do_mbind(unsigned long start, unsigned long len,
new = get_task_policy(current);
mpol_get(new);
}
+ mmpol.pol = new;
+ mmpol.ilx = 0;
+
+ /*
+ * In the interleaved case, attempt to allocate on exactly the
+ * targeted nodes, for the first VMA to be migrated; for later
+ * VMAs, the nodes will still be interleaved from the targeted
+ * nodemask, but one by one may be selected differently.
+ */
+ if (new->mode == MPOL_INTERLEAVE) {
+ struct page *page;
+ unsigned int order;
+ unsigned long addr = -EFAULT;
+
+ list_for_each_entry(page, &pagelist, lru) {
+ if (!PageKsm(page))
+ break;
+ }
+ if (!list_entry_is_head(page, &pagelist, lru)) {
+ vma_iter_init(&vmi, mm, start);
+ for_each_vma_range(vmi, vma, end) {
+ addr = page_address_in_vma(page, vma);
+ if (addr != -EFAULT)
+ break;
+ }
+ }
+ if (addr != -EFAULT) {
+ order = compound_order(page);
+ /* We already know the pol, but not the ilx */
+ mpol_cond_put(get_vma_policy(vma, addr, order,
+ &mmpol.ilx));
+ /* Set base from which to increment by index */
+ mmpol.ilx -= page->index >> order;
+ }
+ }
+
nr_failed |= migrate_pages(&pagelist,
alloc_migration_target_by_mpol, NULL,
- (unsigned long)new, MIGRATE_SYNC,
+ (unsigned long)&mmpol, MIGRATE_SYNC,
MR_MEMPOLICY_MBIND, NULL);
}
--
2.35.3
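As a minimal standalone sketch (not kernel code; all values are
hypothetical, and node selection is simplified to a plain modulo), this
is the interleave-base arithmetic the patch sets up: do_mbind() subtracts
the first page's (index >> order) from its interleave index, and the
allocation side is then assumed to add each source folio's own
(index >> order) back, so the first folio lands on its intended node and
later folios from the same vma stay in step.

#include <stdio.h>

int main(void)
{
	unsigned long first_index = 35;	/* hypothetical page->index of first page */
	unsigned int order = 0;		/* hypothetical folio order */
	unsigned long first_ilx = 37;	/* index as reported by get_vma_policy() */
	unsigned int nr_nodes = 4;	/* nodes in the interleave nodemask */
	unsigned long base, index, ilx;

	/* do_mbind(): mmpol.ilx -= page->index >> order */
	base = first_ilx - (first_index >> order);

	/* per-folio target, as the allocator is assumed to compute it */
	for (index = first_index; index < first_index + 8; index++) {
		ilx = base + (index >> order);
		printf("index %lu -> interleave slot %lu\n", index, ilx % nr_nodes);
	}
	return 0;
}

(If first_ilx were smaller than the page index, the unsigned subtraction
would wrap, but the later addition wraps back again, just as with the
pgoff_t arithmetic in the patch.)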
mm-unstable commit edd33b8807a1 ("mempolicy: migration attempt to match
interleave nodes") added a second vma_iter search to do_mbind(), to
determine the interleave index to be used in the MPOL_INTERLEAVE case.

But sadly it added it just after the mmap_write_unlock(), leaving this
new VMA search unprotected: and so syzbot reports suspicious RCU usage
from lib/maple_tree.c:856.

This could be fixed with an rcu_read_lock/unlock() pair (per Liam);
but since we have been relying on the mmap_lock up to this point, it's
slightly better to extend it over the new search too, for a well-defined
result consistent with the policy this mbind() is establishing (rather
than whatever might follow once the mmap_lock is dropped).
Reported-by: syzbot+79fcba037b6df73756d3@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-mm/000000000000c05f1b0608657fde@google.com/
Fixes: edd33b8807a1 ("mempolicy: migration attempt to match interleave nodes")
Signed-off-by: Hugh Dickins <hughd@google.com>
---
mm/mempolicy.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 989293180eb6..5e472e6e0507 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1291,8 +1291,6 @@ static long do_mbind(unsigned long start, unsigned long len,
}
}
- mmap_write_unlock(mm);
-
if (!err && !list_empty(&pagelist)) {
/* Convert MPOL_DEFAULT's NULL to task or default policy */
if (!new) {
@@ -1334,7 +1332,11 @@ static long do_mbind(unsigned long start, unsigned long len,
mmpol.ilx -= page->index >> order;
}
}
+ }
+ mmap_write_unlock(mm);
+
+ if (!err && !list_empty(&pagelist)) {
nr_failed |= migrate_pages(&pagelist,
alloc_migration_target_by_mpol, NULL,
(unsigned long)&mmpol, MIGRATE_SYNC,
--
2.35.3
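To make the resulting ordering concrete, here is a small userspace
analogy (not kernel code: pthread_rwlock_t merely stands in for
mmap_lock, and the helper names are made up).  With this fix, the quick
read-only lookup runs before the lock is dropped, while the expensive
migration still runs outside it.

#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t fake_mmap_lock = PTHREAD_RWLOCK_INITIALIZER;

/* stands in for the second vma_iter search that computes mmpol.ilx */
static long lookup_interleave_base(void)
{
	return 2;	/* hypothetical value */
}

/* stands in for migrate_pages(): slow, so done after unlocking */
static void migrate_pagelist(long base)
{
	printf("migrating with interleave base %ld\n", base);
}

int main(void)
{
	long base;

	pthread_rwlock_wrlock(&fake_mmap_lock);
	/* ... update the VMAs' policy and build the pagelist ... */
	base = lookup_interleave_base();	/* moved inside the lock by this fix */
	pthread_rwlock_unlock(&fake_mmap_lock);

	migrate_pagelist(base);			/* heavy work stays outside the lock */
	return 0;
}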
* Hugh Dickins <hughd@google.com> [231024 02:50]:
> mm-unstable commit edd33b8807a1 ("mempolicy: migration attempt to match
> interleave nodes") added a second vma_iter search to do_mbind(), to
> determine the interleave index to be used in the MPOL_INTERLEAVE case.
>
> But sadly it added it just after the mmap_write_unlock(), leaving this
> new VMA search unprotected: and so syzbot reports suspicious RCU usage
> from lib/maple_tree.c:856.
>
> This could be fixed with an rcu_read_lock/unlock() pair (per Liam);
> but since we have been relying on the mmap_lock up to this point, it's
> slightly better to extend it over the new search too, for a well-defined
> result consistent with the policy this mbind() is establishing (rather
> than whatever might follow once the mmap_lock is dropped).
Would downgrading the lock work? It would avoid the potential writing
issue and should still satisfy lockdep.
>
> Reported-by: syzbot+79fcba037b6df73756d3@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/linux-mm/000000000000c05f1b0608657fde@google.com/
> Fixes: edd33b8807a1 ("mempolicy: migration attempt to match interleave nodes")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
> mm/mempolicy.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 989293180eb6..5e472e6e0507 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1291,8 +1291,6 @@ static long do_mbind(unsigned long start, unsigned long len,
> }
> }
>
> - mmap_write_unlock(mm);
> -
> if (!err && !list_empty(&pagelist)) {
> /* Convert MPOL_DEFAULT's NULL to task or default policy */
> if (!new) {
> @@ -1334,7 +1332,11 @@ static long do_mbind(unsigned long start, unsigned long len,
> mmpol.ilx -= page->index >> order;
> }
> }
> + }
>
> + mmap_write_unlock(mm);
> +
> + if (!err && !list_empty(&pagelist)) {
> nr_failed |= migrate_pages(&pagelist,
> alloc_migration_target_by_mpol, NULL,
> (unsigned long)&mmpol, MIGRATE_SYNC,
> --
> 2.35.3
>
On Tue, 24 Oct 2023, Liam R. Howlett wrote:
> * Hugh Dickins <hughd@google.com> [231024 02:50]:
> > mm-unstable commit edd33b8807a1 ("mempolicy: migration attempt to match
> > interleave nodes") added a second vma_iter search to do_mbind(), to
> > determine the interleave index to be used in the MPOL_INTERLEAVE case.
> >
> > But sadly it added it just after the mmap_write_unlock(), leaving this
> > new VMA search unprotected: and so syzbot reports suspicious RCU usage
> > from lib/maple_tree.c:856.
> >
> > This could be fixed with an rcu_read_lock/unlock() pair (per Liam);
> > but since we have been relying on the mmap_lock up to this point, it's
> > slightly better to extend it over the new search too, for a well-defined
> > result consistent with the policy this mbind() is establishing (rather
> > than whatever might follow once the mmap_lock is dropped).
>
> Would downgrading the lock work? It would avoid the potential writing
> issue and should still satisfy lockdep.
Downgrading the lock would work, but it would be a pointless complication.
The "second vma_iter search" is not a lengthy operation (normally it just
checks pgoff, start, end of the first VMA and immediately breaks out; in
the worst case it just makes that check on each VMA involved: it doesn't
get into splits or merges or pte scans); we already have the mmap_lock,
and yes, it's only needed for read during that scan, but it's not worth
playing with.  Whereas migrating an indefinite number of pages, with all
the allocating and unmapping and copying and remapping involved, really
is something we prefer not to hold mmap_lock across.

Hugh
On Tue, Oct 24, 2023 at 09:32:44AM -0700, Hugh Dickins wrote:
> On Tue, 24 Oct 2023, Liam R. Howlett wrote:
>
> > * Hugh Dickins <hughd@google.com> [231024 02:50]:
> > > mm-unstable commit edd33b8807a1 ("mempolicy: migration attempt to match
> > > interleave nodes") added a second vma_iter search to do_mbind(), to
> > > determine the interleave index to be used in the MPOL_INTERLEAVE case.
> > >
> > > But sadly it added it just after the mmap_write_unlock(), leaving this
> > > new VMA search unprotected: and so syzbot reports suspicious RCU usage
> > > from lib/maple_tree.c:856.
> > >
> > > This could be fixed with an rcu_read_lock/unlock() pair (per Liam);
> > > but since we have been relying on the mmap_lock up to this point, it's
> > > slightly better to extend it over the new search too, for a well-defined
> > > result consistent with the policy this mbind() is establishing (rather
> > > than whatever might follow once the mmap_lock is dropped).
> >
> > Would downgrading the lock work? It would avoid the potential writing
> > issue and should still satisfy lockdep.
>
> Downgrading the lock would work, but it would be a pointless complication.
I tend to agree. It's also becoming far less important these days
with the vast majority of page faults handled under the per-VMA lock.
We might be able to turn it into a mutex instead of an rwsem without
seeing a noticeable drop-off in performance. Not volunteering to try this.