[RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()

SeongJae Park posted 4 patches 11 months ago
[RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
Posted by SeongJae Park 11 months ago
Remove redundant mmap lock operations from process_madvise() by taking
the mmap lock once up front, and then doing the remaining work for all
ranges inside the loop.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/madvise.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 913936a5c353..1a796a03a4fb 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1752,9 +1752,26 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 
 	total_len = iov_iter_count(iter);
 
+	ret = madvise_lock(mm, behavior);
+	if (ret)
+		return ret;
+
 	while (iov_iter_count(iter)) {
-		ret = do_madvise(mm, (unsigned long)iter_iov_addr(iter),
-				 iter_iov_len(iter), behavior);
+		unsigned long start = (unsigned long)iter_iov_addr(iter);
+		size_t len_in = iter_iov_len(iter);
+		size_t len;
+
+		if (!is_valid_madvise(start, len_in, behavior)) {
+			ret = -EINVAL;
+			break;
+		}
+
+		len = PAGE_ALIGN(len_in);
+		if (start + len == start)
+			ret = 0;
+		else
+			ret = madvise_do_behavior(mm, start, len_in, len,
+					behavior);
 		/*
 		 * An madvise operation is attempting to restart the syscall,
 		 * but we cannot proceed as it would not be correct to repeat
@@ -1776,6 +1793,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 			break;
 		iov_iter_advance(iter, iter_iov_len(iter));
 	}
+	madvise_unlock(mm, behavior);
 
 	ret = (total_len - iov_iter_count(iter)) ? : ret;
 
-- 
2.39.5
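The core of the patch is hoisting a single lock/unlock pair out of the per-range loop. A userspace pthread analogy (not kernel code; all names here are illustrative) of the before/after shapes, with a counter to make the saving visible:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int lock_ops; /* counts lock acquisitions, for comparison */

static int do_one_range(long start, size_t len)
{
	/* caller holds 'lock'; operate on [start, start + len) */
	(void)start;
	(void)len;
	return 0;
}

/* Before: each iteration pays its own lock/unlock (n acquisitions). */
static int vector_op_lock_per_range(const long *starts, const size_t *lens, int n)
{
	int i, ret = 0;

	for (i = 0; i < n && !ret; i++) {
		pthread_mutex_lock(&lock);
		lock_ops++;
		ret = do_one_range(starts[i], lens[i]);
		pthread_mutex_unlock(&lock);
	}
	return ret;
}

/* After: one acquisition covers the whole vector. */
static int vector_op_lock_once(const long *starts, const size_t *lens, int n)
{
	int i, ret = 0;

	pthread_mutex_lock(&lock);
	lock_ops++;
	for (i = 0; i < n && !ret; i++)
		ret = do_one_range(starts[i], lens[i]);
	pthread_mutex_unlock(&lock);
	return ret;
}
```

The trade-off raised in the review below follows directly from this shape: the second variant holds the lock for the whole vector, so per-operation throughput improves at the cost of a longer single hold time.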
Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
Posted by Lorenzo Stoakes 10 months, 2 weeks ago
On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> Optimize redundant mmap lock operations from process_madvise() by
> directly doing the mmap locking first, and then the remaining works for
> all ranges in the loop.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>

I wonder if this might increase lock contention because now all of the
vector operations will hold the relevant mm lock without releasing after
each operation?

Probably it's ok given limited size of iov, but maybe in future we'd want
to set a limit on the ranges before we drop/reacquire lock?

I've tested this against my PIDFD_SELF changes [0] and the
process_madvise() invocation in the guard-pages tests, which utilises
process_madvise() in a perhaps 'unusual' way so is a good test bed for
this, and all is working fine.

Amazingly, this appears to be the only place (afaict) where
process_madvise() is actually tested...

Buuuut, I think this change is incorrect; I comment on this below. Should
be an easy fix though.

[0]: https://lore.kernel.org/all/cover.1738268370.git.lorenzo.stoakes@oracle.com/

> ---
>  mm/madvise.c | 22 ++++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 913936a5c353..1a796a03a4fb 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1752,9 +1752,26 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>
>  	total_len = iov_iter_count(iter);
>
> +	ret = madvise_lock(mm, behavior);
> +	if (ret)
> +		return ret;
> +
>  	while (iov_iter_count(iter)) {
> -		ret = do_madvise(mm, (unsigned long)iter_iov_addr(iter),
> -				 iter_iov_len(iter), behavior);
> +		unsigned long start = (unsigned long)iter_iov_addr(iter);

This is nicer than just passing in.
> +		size_t len_in = iter_iov_len(iter);

Equally so here...

> +		size_t len;
> +
> +		if (!is_valid_madvise(start, len_in, behavior)) {
> +			ret = -EINVAL;
> +			break;
> +		}
> +
> +		len = PAGE_ALIGN(len_in);
> +		if (start + len == start)
> +			ret = 0;
> +		else
> +			ret = madvise_do_behavior(mm, start, len_in, len,
> +					behavior);

Very slight duplication here (I wonder if we can somehow wrap the 'if zero
length return 0' thing?).

But it doesn't really matter, probably better to spell it out, actually.


>  		/*
>  		 * An madvise operation is attempting to restart the syscall,
>  		 * but we cannot proceed as it would not be correct to repeat
> @@ -1776,6 +1793,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>  			break;
>  		iov_iter_advance(iter, iter_iov_len(iter));
>  	}
> +	madvise_unlock(mm, behavior);
>
>  	ret = (total_len - iov_iter_count(iter)) ? : ret;

So I think this is now wrong because of the work I did recently. In this code:

		/*
		 * An madvise operation is attempting to restart the syscall,
		 * but we cannot proceed as it would not be correct to repeat
		 * the operation in aggregate, and would be surprising to the
		 * user.
		 *
		 * As we have already dropped locks, it is safe to just loop and
		 * try again. We check for fatal signals in case we need exit
		 * early anyway.
		 */
		if (ret == -ERESTARTNOINTR) {
			if (fatal_signal_pending(current)) {
				ret = -EINTR;
				break;
			}
			continue;
		}

Note that it assumes the locks have been dropped before simply trying
again, as the only way this would happen is because of a race, and we may
end up stuck in a loop if we just hold on to the lock.

So I'd suggest updating this comment and changing the code like this:

		if (ret == -ERESTARTNOINTR) {
			if (fatal_signal_pending(current)) {
				ret = -EINTR;
				break;
			}

+			/* Drop and reacquire lock to unwind race. */
+			madvise_unlock(mm, behavior);
+			madvise_lock(mm, behavior);
			continue;
		}

Which brings back the existing behaviour.
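The retry shape being suggested can be sketched in userspace terms (pthread analogy; `NEED_RETRY`, `op_needing_retry()` and the signal stub are illustrative stand-ins, not anything from the patch):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define NEED_RETRY 1 /* stands in for -ERESTARTNOINTR */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int attempts;

static int op_needing_retry(void)
{
	/* pretend the first attempt races and asks for a restart */
	return ++attempts == 1 ? NEED_RETRY : 0;
}

static bool fatal_signal(void) { return false; }

static int run_with_retry(void)
{
	int ret;

	pthread_mutex_lock(&lock);
	for (;;) {
		ret = op_needing_retry();
		if (ret != NEED_RETRY)
			break;
		if (fatal_signal()) {
			ret = -1; /* -EINTR in the kernel code */
			break;
		}
		/*
		 * Drop and reacquire the lock so the racing thread that
		 * caused the retry can make progress; without this we
		 * could spin forever while holding the lock.
		 */
		pthread_mutex_unlock(&lock);
		pthread_mutex_lock(&lock);
	}
	pthread_mutex_unlock(&lock);
	return ret;
}
```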

By the way I hate that this function swallows error codes. But that's not
your fault, and is now established user-facing behaviour so yeah. Big sigh.

>
> --
> 2.39.5
Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
Posted by SeongJae Park 10 months, 2 weeks ago
On Fri, 31 Jan 2025 16:53:17 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> > Optimize redundant mmap lock operations from process_madvise() by
> > directly doing the mmap locking first, and then the remaining works for
> > all ranges in the loop.
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
[...]
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1752,9 +1752,26 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
[...]
> >  		/*
> >  		 * An madvise operation is attempting to restart the syscall,
> >  		 * but we cannot proceed as it would not be correct to repeat
> > @@ -1776,6 +1793,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> >  			break;
> >  		iov_iter_advance(iter, iter_iov_len(iter));
> >  	}
> > +	madvise_unlock(mm, behavior);
> >
> >  	ret = (total_len - iov_iter_count(iter)) ? : ret;
> 
> So I think this is now wrong because of the work I did recently. In this code:
> 
> 		/*
> 		 * An madvise operation is attempting to restart the syscall,
> 		 * but we cannot proceed as it would not be correct to repeat
> 		 * the operation in aggregate, and would be surprising to the
> 		 * user.
> 		 *
> 		 * As we have already dropped locks, it is safe to just loop and
> 		 * try again. We check for fatal signals in case we need exit
> 		 * early anyway.
> 		 */
> 		if (ret == -ERESTARTNOINTR) {
> 			if (fatal_signal_pending(current)) {
> 				ret = -EINTR;
> 				break;
> 			}
> 			continue;
> 		}
> 
> Note that it assumes the locks have been dropped before simply trying
> again, as the only way this would happen is because of a race, and we may
> end up stuck in a loop if we just hold on to the lock.

Nice catch!

> 
> So I'd suggest updating this comment and changing the code like this:
> 
> 		if (ret == -ERESTARTNOINTR) {
> 			if (fatal_signal_pending(current)) {
> 				ret = -EINTR;
> 				break;
> 			}
> 
> +			/* Drop and reacquire lock to unwind race. */
> +			madvise_unlock(mm, behaviour);
> +			madvise_lock(mm, behaviour);
> 			continue;
> 		}
> 
> Which brings back the existing behaviour.

Thank you for this kind suggestion.  I will update the next version of this
patch in this way.

> 
> By the way I hate that this function swallows error codes. But that's not
> your fault, and is now established user-facing behaviour so yeah. Big sigh.
> 
> >
> > --
> > 2.39.5


Thanks,
SJ
Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
Posted by Davidlohr Bueso 10 months, 2 weeks ago
On Fri, 31 Jan 2025, Lorenzo Stoakes wrote:

>On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
>> Optimize redundant mmap lock operations from process_madvise() by
>> directly doing the mmap locking first, and then the remaining works for
>> all ranges in the loop.
>>
>> Signed-off-by: SeongJae Park <sj@kernel.org>
>
>I wonder if this might increase lock contention because now all of the
>vector operations will hold the relevant mm lock without releasing after
>each operation?

That was exactly my concern. While afaict the numbers presented in v1
are quite nice, this is ultimately a micro-benchmark, where no other
unrelated threads are impacted by these new hold times.

>Probably it's ok given limited size of iov, but maybe in future we'd want
>to set a limit on the ranges before we drop/reacquire lock?

imo, this should best be done in the same patch/series. Maybe extend
the benchmark to use IOV_MAX and find a sweet spot?

Thanks,
Davidlohr
Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
Posted by Liam R. Howlett 10 months, 2 weeks ago
* Davidlohr Bueso <dave@stgolabs.net> [250131 12:31]:
> On Fri, 31 Jan 2025, Lorenzo Stoakes wrote:
> 
> > On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> > > Optimize redundant mmap lock operations from process_madvise() by
> > > directly doing the mmap locking first, and then the remaining works for
> > > all ranges in the loop.
> > > 
> > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > 
> > I wonder if this might increase lock contention because now all of the
> > vector operations will hold the relevant mm lock without releasing after
> > each operation?
> 
> That was exactly my concern. While afaict the numbers presented in v1
> are quite nice, this is ultimately a micro-benchmark, where no other
> unrelated threads are impacted by these new hold times.

Indeed, I was also concerned about this scenario.

But this method does have the added advantage of keeping the vma space
in the same state as it was expected during the initial call - although
the race does still exist on looking vs acting on the data.  This would
just remove the intermediate changes.

> 
> > Probably it's ok given limited size of iov, but maybe in future we'd want
> > to set a limit on the ranges before we drop/reacquire lock?
> 
> imo, this should best be done in the same patch/series. Maybe extend
> the benchmark to use IOV_MAX and find a sweet spot?

Are you worried this is over-engineering for a problem that may never be
an issue, or is there a particular usecase you have in mind?

It is probably worth investigating, and maybe a potential usecase would
help with the targeted sweet spot?

Thanks,
Liam
Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
Posted by Shakeel Butt 10 months, 2 weeks ago
On Fri, Jan 31, 2025 at 12:47:24PM -0500, Liam R. Howlett wrote:
> * Davidlohr Bueso <dave@stgolabs.net> [250131 12:31]:
> > On Fri, 31 Jan 2025, Lorenzo Stoakes wrote:
> > 
> > > On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> > > > Optimize redundant mmap lock operations from process_madvise() by
> > > > directly doing the mmap locking first, and then the remaining works for
> > > > all ranges in the loop.
> > > > 
> > > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > > 
> > > I wonder if this might increase lock contention because now all of the
> > > vector operations will hold the relevant mm lock without releasing after
> > > each operation?
> > 
> > That was exactly my concern. While afaict the numbers presented in v1
> > are quite nice, this is ultimately a micro-benchmark, where no other
> > unrelated threads are impacted by these new hold times.
> 
> Indeed, I was also concerned about this scenario.
> 
> But this method does have the added advantage of keeping the vma space
> in the same state as it was expected during the initial call - although
> the race does still exist on looking vs acting on the data.  This would
> just remove the intermediate changes.
> 
> > 
> > > Probably it's ok given limited size of iov, but maybe in future we'd want
> > > to set a limit on the ranges before we drop/reacquire lock?
> > 
> > imo, this should best be done in the same patch/series. Maybe extend
> > the benchmark to use IOV_MAX and find a sweet spot?
> 
> Are you worried this is over-engineering for a problem that may never be
> an issue, or is there a particular usecase you have in mind?
> 
> It is probably worth investigating, and maybe a potential usecase would
> help with the targeted sweet spot?
> 
> 

Lorenzo already explained that it is not an issue at the moment. I think
this is a good discussion to have, as I think we will be expanding the
limit from UIO_FASTIOV to a higher value (maybe UIO_MAXIOV) soon in a
followup. At the moment, my gut feeling is that the batch size of regions
given to the syscall will depend on the advice parameter. For example,
for MADV_[NO]HUGEPAGE, which is a simple flag [re]set, a single write
lock and possibly a single tree traversal will be better than multiple
write lock/unlock operations even for a large batch size. Anyways, we will
need some experimental data to show that.

JFYI SJ is planning to work on two improvements: (1) single tree
traversal for all the given regions and (2) TLB flush batching for
MADV_DONTNEED[_LOCKED] and MADV_FREE.

Thanks everyone for your time and feedback.
Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
Posted by Lorenzo Stoakes 10 months, 2 weeks ago
On Fri, Jan 31, 2025 at 12:47:24PM -0500, Liam R. Howlett wrote:
> * Davidlohr Bueso <dave@stgolabs.net> [250131 12:31]:
> > On Fri, 31 Jan 2025, Lorenzo Stoakes wrote:
> >
> > > On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> > > > Optimize redundant mmap lock operations from process_madvise() by
> > > > directly doing the mmap locking first, and then the remaining works for
> > > > all ranges in the loop.
> > > >
> > > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > >
> > > I wonder if this might increase lock contention because now all of the
> > > vector operations will hold the relevant mm lock without releasing after
> > > each operation?
> >
> > That was exactly my concern. While afaict the numbers presented in v1
> > are quite nice, this is ultimately a micro-benchmark, where no other
> > unrelated threads are impacted by these new hold times.
>
> Indeed, I was also concerned about this scenario.
>
> But this method does have the added advantage of keeping the vma space
> in the same state as it was expected during the initial call - although
> the race does still exist on looking vs acting on the data.  This would
> just remove the intermediate changes.
>
> >
> > > Probably it's ok given limited size of iov, but maybe in future we'd want
> > > to set a limit on the ranges before we drop/reacquire lock?
> >
> > imo, this should best be done in the same patch/series. Maybe extend
> > the benchmark to use IOV_MAX and find a sweet spot?
>
> Are you worried this is over-engineering for a problem that may never be
> an issue, or is there a particular usecase you have in mind?
>
> It is probably worth investigating, and maybe a potential usecase would
> help with the targeted sweet spot?
>

Keep in mind process_madvise() is not limited by IOV_MAX, which can be rather
high, but rather UIO_FASTIOV, which is limited to 8 entries.

(Some have been surprised by this limitation...!)

So I think at this point scaling isn't a huge issue, I raise it because in
future we may want to increase this limit, at which point we should think about
it, which is why I sort of hand-waved it away a bit.
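The future "limit the ranges before we drop/reacquire" idea discussed in this subthread could look something like the following userspace sketch (pthread analogy; `BATCH` is a made-up tuning knob, the "sweet spot" the thread talks about, not anything in the patch):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define BATCH 8

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int lock_acquisitions;

static int do_one_range(long start, size_t len)
{
	/* caller holds 'lock' */
	(void)start;
	(void)len;
	return 0;
}

static int vector_op_batched(const long *starts, const size_t *lens, int n)
{
	int i, ret = 0;

	pthread_mutex_lock(&lock);
	lock_acquisitions++;
	for (i = 0; i < n && !ret; i++) {
		if (i && (i % BATCH) == 0) {
			/* bound hold time: let waiters in between batches */
			pthread_mutex_unlock(&lock);
			pthread_mutex_lock(&lock);
			lock_acquisitions++;
		}
		ret = do_one_range(starts[i], lens[i]);
	}
	pthread_mutex_unlock(&lock);
	return ret;
}
```

With 24 ranges and BATCH of 8 this takes 3 acquisitions instead of 24, while capping the longest hold at 8 operations; picking BATCH is exactly the benchmarking question raised above.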

> Thanks,
> Liam
>
Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
Posted by Lorenzo Stoakes 7 months ago
On Fri, Jan 31, 2025 at 05:51:45PM +0000, Lorenzo Stoakes wrote:
> On Fri, Jan 31, 2025 at 12:47:24PM -0500, Liam R. Howlett wrote:
> > * Davidlohr Bueso <dave@stgolabs.net> [250131 12:31]:
> > > On Fri, 31 Jan 2025, Lorenzo Stoakes wrote:
> > >
> > > > On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> > > > > Optimize redundant mmap lock operations from process_madvise() by
> > > > > directly doing the mmap locking first, and then the remaining works for
> > > > > all ranges in the loop.
> > > > >
> > > > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > > >
> > > > I wonder if this might increase lock contention because now all of the
> > > > vector operations will hold the relevant mm lock without releasing after
> > > > each operation?
> > >
> > > That was exactly my concern. While afaict the numbers presented in v1
> > > are quite nice, this is ultimately a micro-benchmark, where no other
> > > unrelated threads are impacted by these new hold times.
> >
> > Indeed, I was also concerned about this scenario.
> >
> > But this method does have the added advantage of keeping the vma space
> > in the same state as it was expected during the initial call - although
> > the race does still exist on looking vs acting on the data.  This would
> > just remove the intermediate changes.
> >
> > >
> > > > Probably it's ok given limited size of iov, but maybe in future we'd want
> > > > to set a limit on the ranges before we drop/reacquire lock?
> > >
> > > imo, this should best be done in the same patch/series. Maybe extend
> > > the benchmark to use IOV_MAX and find a sweet spot?
> >
> > Are you worried this is over-engineering for a problem that may never be
> > an issue, or is there a particular usecase you have in mind?
> >
> > It is probably worth investigating, and maybe a potential usecase would
> > help with the targeted sweet spot?
> >
>
> Keep in mind process_madvise() is not limited by IOV_MAX, which can be rather
> high, but rather UIO_FASTIOV, which is limited to 8 entries.
>
> (Some have been surprised by this limitation...!)

Surprised, perhaps because I was wrong about this :) Apologies for that.

SJ raised this in [0] and the non-RFC version of this series is over at [1].

[0]: https://lore.kernel.org/all/20250517162048.36347-1-sj@kernel.org/
[1]: https://lore.kernel.org/all/20250206061517.2958-1-sj@kernel.org/

We should revisit this and determine whether the drop/reacquire lock is
required, perhaps doing some experiments around heavy operations using
UIO_MAXIOV entries?

SJ - could you take a look at this please?

>
> So I think at this point scaling isn't a huge issue, I raise it because in
> future we may want to increase this limit, at which point we should think about
> it, which is why I sort of hand-waved it away a bit.

Again as I said here, I suspect _probably_ this won't be too much of an
issue - but it is absolutely one we need to address.

>
> > Thanks,
> > Liam
> >

Cheers, Lorenzo
Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
Posted by SeongJae Park 7 months ago
On Sat, 17 May 2025 20:28:49 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> On Fri, Jan 31, 2025 at 05:51:45PM +0000, Lorenzo Stoakes wrote:
> > On Fri, Jan 31, 2025 at 12:47:24PM -0500, Liam R. Howlett wrote:
> > > * Davidlohr Bueso <dave@stgolabs.net> [250131 12:31]:
> > > > On Fri, 31 Jan 2025, Lorenzo Stoakes wrote:
> > > >
> > > > > On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> > > > > > Optimize redundant mmap lock operations from process_madvise() by
> > > > > > directly doing the mmap locking first, and then the remaining works for
> > > > > > all ranges in the loop.
> > > > > >
> > > > > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > > > >
> > > > > I wonder if this might increase lock contention because now all of the
> > > > > vector operations will hold the relevant mm lock without releasing after
> > > > > each operation?
> > > >
> > > > That was exactly my concern. While afaict the numbers presented in v1
> > > > are quite nice, this is ultimately a micro-benchmark, where no other
> > > > unrelated threads are impacted by these new hold times.
> > >
> > > Indeed, I was also concerned about this scenario.
> > >
> > > But this method does have the added advantage of keeping the vma space
> > > in the same state as it was expected during the initial call - although
> > > the race does still exist on looking vs acting on the data.  This would
> > > just remove the intermediate changes.
> > >
> > > >
> > > > > Probably it's ok given limited size of iov, but maybe in future we'd want
> > > > > to set a limit on the ranges before we drop/reacquire lock?
> > > >
> > > > imo, this should best be done in the same patch/series. Maybe extend
> > > > the benchmark to use IOV_MAX and find a sweet spot?
> > >
> > > Are you worried this is over-engineering for a problem that may never be
> > > an issue, or is there a particular usecase you have in mind?
> > >
> > > It is probably worth investigating, and maybe a potential usecase would
> > > help with the targeted sweet spot?
> > >
> >
> > Keep in mind process_madvise() is not limited by IOV_MAX, which can be rather
> > high, but rather UIO_FASTIOV, which is limited to 8 entries.
> >
> > (Some have been surprised by this limitation...!)
> 
> Surprised, perhaps because I was wrong about this :) Apologies for that.
> 
> SJ raised this in [0] and the non-RFC version of this series is over at [1].
> 
> [0]: https://lore.kernel.org/all/20250517162048.36347-1-sj@kernel.org/
> [1]: https://lore.kernel.org/all/20250206061517.2958-1-sj@kernel.org/

I actually mentioned[1] that I think the real limit is UIO_MAXIOV, but still
that wouldn't be a real problem since users can tune the batch size.  Actually,
jemalloc has made a change[2] to use process_madvise() with a batch size of up
to 128.

I impatiently sent[3] the next revision without giving you enough time to
reply, though.

> 
> We should revisit this and determine whether the drop/reacquire lock is
> required, perhaps doing some experiments around heavy operations using
> UIO_MAXIOV entries?
> 
> SJ - could you take a look at this please?

We had a chance to test this against a production workload, and found no
visible regression.  The workload is not intensively calling process_madvise(),
though.  Our internal testing of kernels having this change also hasn't found
any problem so far, though the process_madvise() calls from the internal
testing are also not intensive, to my best knowledge.

So my thought about UIO_MAXIOV is the same.  I anticipate no issue (until
someone yells ;) ) and didn't find any evidence of a problem.  But, same as in
the previous discussion[1], I agree more testing would be good, while I have no
good list of benchmarks for this.  It would be nice if someone could give me
the names of suitable benchmarks.

> 
> >
> > So I think at this point scaling isn't a huge issue, I raise it because in
> > future we may want to increase this limit, at which point we should think about
> > it, which is why I sort of hand-waved it away a bit.
> 
> Again as I said here, I suspect _probably_ this won't be too much of an
> issue - but it is absolutely one we need to address.

Yes, I agree :)

[1] https://lore.kernel.org/20250204195343.16500-1-sj@kernel.org
[2] https://github.com/jemalloc/jemalloc/pull/2794/commits/c3604456d4c1f570348a
[3] https://lore.kernel.org/20250206062801.3060-1-sj@kernel.org


Thanks,
SJ

[...]
Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
Posted by SeongJae Park 10 months, 2 weeks ago
On Fri, 31 Jan 2025 17:51:45 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> On Fri, Jan 31, 2025 at 12:47:24PM -0500, Liam R. Howlett wrote:
> > * Davidlohr Bueso <dave@stgolabs.net> [250131 12:31]:
> > > On Fri, 31 Jan 2025, Lorenzo Stoakes wrote:
> > >
> > > > On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> > > > > Optimize redundant mmap lock operations from process_madvise() by
> > > > > directly doing the mmap locking first, and then the remaining works for
> > > > > all ranges in the loop.
> > > > >
> > > > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > > >
> > > > I wonder if this might increase lock contention because now all of the
> > > > vector operations will hold the relevant mm lock without releasing after
> > > > each operation?
> > >
> > > That was exactly my concern. While afaict the numbers presented in v1
> > > are quite nice, this is ultimately a micro-benchmark, where no other
> > > unrelated threads are impacted by these new hold times.
> >
> > Indeed, I was also concerned about this scenario.
> >
> > But this method does have the added advantage of keeping the vma space
> > in the same state as it was expected during the initial call - although
> > the race does still exist on looking vs acting on the data.  This would
> > just remove the intermediate changes.
> >
> > >
> > > > Probably it's ok given limited size of iov,

I think so.  Also, users could adjust the batching size for their workloads.

> > > > but maybe in future we'd want
> > > > to set a limit on the ranges before we drop/reacquire lock?
> > >
> > > imo, this should best be done in the same patch/series. Maybe extend
> > > the benchmark to use IOV_MAX and find a sweet spot?
> >
> > Are you worried this is over-engineering for a problem that may never be
> > an issue, or is there a particular usecase you have in mind?
> >
> > It is probably worth investigating, and maybe a potential usecase would
> > help with the targeted sweet spot?

I think the sweet spot may depend on the workload and the advice type, so
selection of the benchmark will be important.  Do you have a benchmark in
mind?  My humble microbenchmark[1] does only single-threaded performance
evaluation, so it may not be appropriate to be used here.  I actually do the
evaluation with batch sizes up to IOV_MAX (1024), but see no clear evidence
of a sweet spot.

> >
> 
> Keep in mind process_madvise() is not limited by IOV_MAX, which can be rather
> high, but rather UIO_FASTIOV, which is limited to 8 entries.

process_madvise() indeed has an iovec array of UIO_FASTIOV size, namely
iovstack.  But, if I understood the code correctly, iovstack is used only as a
fast path when the user requests advising fewer than UIO_FASTIOV regions.

I actually confirmed I can make the loop iterate 1024 times using my
microbenchmark[1].  My steps for the check were running the program with the
'eval_pmadv $((4*1024*1024)) $((4*1024)) $((4*1024*1024))' command and
counting the number of iterations of the vector_madvise() main loop using
printk().  Please let me know if I'm missing something.
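The stack-fast-path/heap-slow-path pattern described here can be sketched in userspace (this only loosely mirrors the iovstack/import_iovec() shape; `FASTVEC`, `struct range` and `import_ranges()` are illustrative names, not kernel APIs):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define FASTVEC 8 /* stands in for UIO_FASTIOV */

struct range { long start; size_t len; };

/*
 * Return a buffer holding 'n' ranges: the caller-provided on-stack
 * array if the vector fits (fast path), else a heap allocation (slow
 * path).  The caller frees the result only if it differs from 'stack'.
 */
static struct range *import_ranges(const struct range *src, int n,
				   struct range *stack)
{
	struct range *dst = stack;

	if (n > FASTVEC) {
		dst = malloc(n * sizeof(*dst)); /* slow path past FASTVEC */
		if (!dst)
			return NULL;
	}
	memcpy(dst, src, n * sizeof(*dst));
	return dst;
}
```

So the small on-stack array bounds only the common case that avoids an allocation, not the number of regions the syscall will accept, matching the 1024-iteration observation above.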

> 
> (Some have been surprised by this limitation...!)
> 
> So I think at this point scaling isn't a huge issue, I raise it because in
> future we may want to increase this limit, at which point we should think about
> it, which is why I sort of hand-waved it away a bit.

I personally think this may still not be a huge issue, especially given the
fact that users can test and tune the limit.  But I'd like to hear more
opinions if available.

[1] https://github.com/sjp38/eval_proc_madvise/blob/master/eval_pmadv.c


Thanks,
SJ

> 
> > Thanks,
> > Liam
> >
>
Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
Posted by SeongJae Park 10 months, 1 week ago
On Tue,  4 Feb 2025 11:53:43 -0800 SeongJae Park <sj@kernel.org> wrote:

> On Fri, 31 Jan 2025 17:51:45 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> 
> > On Fri, Jan 31, 2025 at 12:47:24PM -0500, Liam R. Howlett wrote:
> > > * Davidlohr Bueso <dave@stgolabs.net> [250131 12:31]:
> > > > On Fri, 31 Jan 2025, Lorenzo Stoakes wrote:
> > > >
> > > > > On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> > > > > > Optimize redundant mmap lock operations from process_madvise() by
> > > > > > directly doing the mmap locking first, and then the remaining works for
> > > > > > all ranges in the loop.
> > > > > >
> > > > > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > > > >
> > > > > I wonder if this might increase lock contention because now all of the
> > > > > vector operations will hold the relevant mm lock without releasing after
> > > > > each operation?
[...]
> > 
> > Keep in mind process_madvise() is not limited by IOV_MAX, which can be rather
> > high, but rather UIO_FASTIOV, which is limited to 8 entries.
> 
> process_madvise() indeed have iovec array of UIO_FASTIOV size, namely iovstack.
> But, if I understood the code correctly, iostack is used only for a fast path
> that the user requested advicing less than UIO_FASTIOV regions.
> 
> I actually confirmed I can make the loop itrate 1024 times, using my
> microbenchmark[1].  My step for the check was running the program with
> 'eval_pmadv $((4*1024*1024)) $((4*1024)) $((4*1024*1024))' command, and
> counting the number of the itration of the vector_madvise() main loop using
> printk().  Please let me know if I'm missing something.
> 
> > 
> > (Some have been surprised by this limitation...!)
> > 
> > So I think at this point scaling isn't a huge issue, I raise it because in
> > future we may want to increase this limit, at which point we should think about
> > it, which is why I sort of hand-waved it away a bit.
> 
> I personally think this may still not be a huge issue, especially given the
> fact that users can test and tune the limit.  But I'd like to hear more
> opinions if available.

Maybe I should wait for more opinions, but I just impatiently sent the next
spin of this patch series[1].  Because Davidlohr and Lorenzo promised their
tags based on the assumption of the UIO_FASTIOV limit, while I think that may
not be an effective limit, I didn't add their tags to the fourth patch of the
series.  Please let me know if you have any concerns about that, either on
this thread or on the new patch series thread[1].

[1] https://lore.kernel.org/20250206061517.2958-1-sj@kernel.org

> 
> [1] https://github.com/sjp38/eval_proc_madvise/blob/master/eval_pmadv.c


Thanks,
SJ

[...]
Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
Posted by Davidlohr Bueso 10 months, 2 weeks ago
On Fri, 31 Jan 2025, Lorenzo Stoakes wrote:

>Keep in mind process_madvise() is not limited by IOV_MAX, which can be rather
>high, but rather UIO_FASTIOV, which is limited to 8 entries.
>
>(Some have been surprised by this limitation...!)

Ah if that is the case then I'm fine with this patch as is.

>
>So I think at this point scaling isn't a huge issue, I raise it because in
>future we may want to increase this limit, at which point we should think about
>it, which is why I sort of hand-waved it away a bit.

Agreed.

SJ, feel free to add my:

Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Re: [RFC PATCH v2 4/4] mm/madvise: remove redundant mmap_lock operations from process_madvise()
Posted by Shakeel Butt 10 months, 2 weeks ago
On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> Optimize redundant mmap lock operations from process_madvise() by
> directly doing the mmap locking first, and then the remaining works for
> all ranges in the loop.
> 
> Signed-off-by: SeongJae Park <sj@kernel.org>

Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>