Optimize redundant mmap lock operations from process_madvise() by
directly doing the mmap locking first, and then the remaining work for
all ranges in the loop.
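
In rough pseudocode (helper names as in the diff below; ranges_remain()
is just illustrative shorthand), the change is:

	/* Before: each range takes and drops the mmap lock internally. */
	while (ranges_remain())
		do_madvise(mm, start, len, behavior);

	/* After: one lock/unlock pair covers every range. */
	madvise_lock(mm, behavior);
	while (ranges_remain())
		madvise_do_behavior(mm, start, len_in, len, behavior);
	madvise_unlock(mm, behavior);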
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/madvise.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 913936a5c353..1a796a03a4fb 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1752,9 +1752,26 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 
 	total_len = iov_iter_count(iter);
 
+	ret = madvise_lock(mm, behavior);
+	if (ret)
+		return ret;
+
 	while (iov_iter_count(iter)) {
-		ret = do_madvise(mm, (unsigned long)iter_iov_addr(iter),
-				iter_iov_len(iter), behavior);
+		unsigned long start = (unsigned long)iter_iov_addr(iter);
+		size_t len_in = iter_iov_len(iter);
+		size_t len;
+
+		if (!is_valid_madvise(start, len_in, behavior)) {
+			ret = -EINVAL;
+			break;
+		}
+
+		len = PAGE_ALIGN(len_in);
+		if (start + len == start)
+			ret = 0;
+		else
+			ret = madvise_do_behavior(mm, start, len_in, len,
+					behavior);
 		/*
 		 * An madvise operation is attempting to restart the syscall,
 		 * but we cannot proceed as it would not be correct to repeat
@@ -1776,6 +1793,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 			break;
 		iov_iter_advance(iter, iter_iov_len(iter));
 	}
+	madvise_unlock(mm, behavior);
 
 	ret = (total_len - iov_iter_count(iter)) ?: ret;
--
2.39.5
On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> Optimize redundant mmap lock operations from process_madvise() by
> directly doing the mmap locking first, and then the remaining work for
> all ranges in the loop.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
I wonder if this might increase lock contention because now all of the
vector operations will hold the relevant mm lock without releasing after
each operation?
Probably it's ok given limited size of iov, but maybe in future we'd want
to set a limit on the ranges before we drop/reacquire lock?
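Something along these lines is what I have in mind (entirely
hypothetical, LOCK_BATCH and its value are made up for illustration):

	/* Hypothetical: let lock waiters in every LOCK_BATCH ranges. */
	#define LOCK_BATCH 64	/* would need tuning against real workloads */

	size_t done = 0;

	while (iov_iter_count(iter)) {
		/* ... handle one range under the lock ... */

		if (++done % LOCK_BATCH == 0) {
			madvise_unlock(mm, behavior);
			ret = madvise_lock(mm, behavior);
			if (ret)
				return ret;	/* lock no longer held */
		}
		iov_iter_advance(iter, iter_iov_len(iter));
	}
	madvise_unlock(mm, behavior);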
I've tested this against my PIDFD_SELF changes [0] and the
process_madvise() invocation in the guard-pages tests which utilises
process_madvise() in a perhaps 'unusual' way so is a good test bed for
this, and all working fine.
Amazingly, this appears to be the only place (afaict) where
process_madvise() is actually tested...
Buuuut, I think this change is incorrect, I comment on this below. Should
be an easy fix though.
[0]: https://lore.kernel.org/all/cover.1738268370.git.lorenzo.stoakes@oracle.com/
> ---
> mm/madvise.c | 22 ++++++++++++++++++++--
> 1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 913936a5c353..1a796a03a4fb 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1752,9 +1752,26 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>
>  	total_len = iov_iter_count(iter);
>
> +	ret = madvise_lock(mm, behavior);
> +	if (ret)
> +		return ret;
> +
>  	while (iov_iter_count(iter)) {
> -		ret = do_madvise(mm, (unsigned long)iter_iov_addr(iter),
> -				iter_iov_len(iter), behavior);
> +		unsigned long start = (unsigned long)iter_iov_addr(iter);
This is nicer than just passing it in.
> +		size_t len_in = iter_iov_len(iter);
Equally so here...
> +		size_t len;
> +
> +		if (!is_valid_madvise(start, len_in, behavior)) {
> +			ret = -EINVAL;
> +			break;
> +		}
> +
> +		len = PAGE_ALIGN(len_in);
> +		if (start + len == start)
> +			ret = 0;
> +		else
> +			ret = madvise_do_behavior(mm, start, len_in, len,
> +					behavior);
Very slight duplication here (I wonder if we can somehow wrap the 'if zero
length return 0' thing?).
But it doesn't really matter, probably better to spell it out, actually.
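
Something like this, say (hypothetical helper, just to sketch the
wrapping idea, not something in the tree):

	/* Hypothetical: fold the zero-length no-op into the call. */
	static int madvise_do_behavior_checked(struct mm_struct *mm,
			unsigned long start, size_t len_in, int behavior)
	{
		size_t len = PAGE_ALIGN(len_in);

		/* start + len == start catches a zero-length range. */
		if (start + len == start)
			return 0;
		return madvise_do_behavior(mm, start, len_in, len, behavior);
	}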
>  		/*
>  		 * An madvise operation is attempting to restart the syscall,
>  		 * but we cannot proceed as it would not be correct to repeat
> @@ -1776,6 +1793,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>  			break;
>  		iov_iter_advance(iter, iter_iov_len(iter));
>  	}
> +	madvise_unlock(mm, behavior);
>
>  	ret = (total_len - iov_iter_count(iter)) ?: ret;
So I think this is now wrong because of the work I did recently. In this code:
	/*
	 * An madvise operation is attempting to restart the syscall,
	 * but we cannot proceed as it would not be correct to repeat
	 * the operation in aggregate, and would be surprising to the
	 * user.
	 *
	 * As we have already dropped locks, it is safe to just loop and
	 * try again. We check for fatal signals in case we need to exit
	 * early anyway.
	 */
	if (ret == -ERESTARTNOINTR) {
		if (fatal_signal_pending(current)) {
			ret = -EINTR;
			break;
		}
		continue;
	}
Note that it assumes the locks have been dropped before simply trying
again, as the only way this would happen is because of a race, and we may
end up stuck in a loop if we just hold on to the lock.
So I'd suggest updating this comment and changing the code like this:
	if (ret == -ERESTARTNOINTR) {
		if (fatal_signal_pending(current)) {
			ret = -EINTR;
			break;
		}
+		/* Drop and reacquire lock to unwind race. */
+		madvise_unlock(mm, behavior);
+		madvise_lock(mm, behavior);
		continue;
	}
Which brings back the existing behaviour.
By the way I hate that this function swallows error codes. But that's not
your fault, and is now established user-facing behaviour so yeah. Big sigh.
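
(For anyone unfamiliar with the GNU `?:' extension doing the swallowing
here: `x ?: y' evaluates to x if x is non-zero and to y otherwise, with
x evaluated only once, so any partial progress replaces the error code:)

	ret = (total_len - iov_iter_count(iter)) ?: ret;
	/*
	 * e.g. total_len == 8192, 4096 bytes already advised, and the
	 * last range failed with ret == -ENOMEM: ret becomes 4096 and
	 * the -ENOMEM is silently swallowed.
	 */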
>
> --
> 2.39.5
On Fri, 31 Jan 2025 16:53:17 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> > Optimize redundant mmap lock operations from process_madvise() by
> > directly doing the mmap locking first, and then the remaining work for
> > all ranges in the loop.
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
[...]
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1752,9 +1752,26 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
[...]
> >  		/*
> >  		 * An madvise operation is attempting to restart the syscall,
> >  		 * but we cannot proceed as it would not be correct to repeat
> > @@ -1776,6 +1793,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> >  			break;
> >  		iov_iter_advance(iter, iter_iov_len(iter));
> >  	}
> > +	madvise_unlock(mm, behavior);
> >
> >  	ret = (total_len - iov_iter_count(iter)) ?: ret;
>
> So I think this is now wrong because of the work I did recently. In this code:
>
> 	/*
> 	 * An madvise operation is attempting to restart the syscall,
> 	 * but we cannot proceed as it would not be correct to repeat
> 	 * the operation in aggregate, and would be surprising to the
> 	 * user.
> 	 *
> 	 * As we have already dropped locks, it is safe to just loop and
> 	 * try again. We check for fatal signals in case we need to exit
> 	 * early anyway.
> 	 */
> 	if (ret == -ERESTARTNOINTR) {
> 		if (fatal_signal_pending(current)) {
> 			ret = -EINTR;
> 			break;
> 		}
> 		continue;
> 	}
>
> Note that it assumes the locks have been dropped before simply trying
> again, as the only way this would happen is because of a race, and we may
> end up stuck in a loop if we just hold on to the lock.
Nice catch!
>
> So I'd suggest updating this comment and changing the code like this:
>
> 	if (ret == -ERESTARTNOINTR) {
> 		if (fatal_signal_pending(current)) {
> 			ret = -EINTR;
> 			break;
> 		}
>
> +		/* Drop and reacquire lock to unwind race. */
> +		madvise_unlock(mm, behavior);
> +		madvise_lock(mm, behavior);
> 		continue;
> 	}
>
> Which brings back the existing behaviour.
Thank you for this kind suggestion. I will update the next version of
this patch accordingly.
>
> By the way I hate that this function swallows error codes. But that's not
> your fault, and is now established user-facing behaviour so yeah. Big sigh.
>
> >
> > --
> > 2.39.5
Thanks,
SJ
On Fri, 31 Jan 2025, Lorenzo Stoakes wrote:

> On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> > Optimize redundant mmap lock operations from process_madvise() by
> > directly doing the mmap locking first, and then the remaining work for
> > all ranges in the loop.
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
>
> I wonder if this might increase lock contention because now all of the
> vector operations will hold the relevant mm lock without releasing after
> each operation?

That was exactly my concern. While afaict the numbers presented in v1
are quite nice, this is ultimately a micro-benchmark, where no other
unrelated threads are impacted by these new hold times.

> Probably it's ok given limited size of iov, but maybe in future we'd want
> to set a limit on the ranges before we drop/reacquire lock?

imo, this should best be done in the same patch/series. Maybe extend
the benchmark to use IOV_MAX and find a sweet spot?

Thanks,
Davidlohr
* Davidlohr Bueso <dave@stgolabs.net> [250131 12:31]:
> On Fri, 31 Jan 2025, Lorenzo Stoakes wrote:
>
> > On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
[...]
> > I wonder if this might increase lock contention because now all of the
> > vector operations will hold the relevant mm lock without releasing after
> > each operation?
>
> That was exactly my concern. While afaict the numbers presented in v1
> are quite nice, this is ultimately a micro-benchmark, where no other
> unrelated threads are impacted by these new hold times.

Indeed, I was also concerned about this scenario.

But this method does have the added advantage of keeping the vma space
in the same state as it was expected during the initial call - although
the race does still exist on looking vs acting on the data. This would
just remove the intermediate changes.

> > Probably it's ok given limited size of iov, but maybe in future we'd want
> > to set a limit on the ranges before we drop/reacquire lock?
>
> imo, this should best be done in the same patch/series. Maybe extend
> the benchmark to use IOV_MAX and find a sweet spot?

Are you worried this is over-engineering for a problem that may never be
an issue, or is there a particular usecase you have in mind?

It is probably worth investigating, and maybe a potential usecase would
help with the targeted sweet spot?

Thanks,
Liam
On Fri, Jan 31, 2025 at 12:47:24PM -0500, Liam R. Howlett wrote:
> * Davidlohr Bueso <dave@stgolabs.net> [250131 12:31]:
[...]
> > > Probably it's ok given limited size of iov, but maybe in future we'd want
> > > to set a limit on the ranges before we drop/reacquire lock?
> >
> > imo, this should best be done in the same patch/series. Maybe extend
> > the benchmark to use IOV_MAX and find a sweet spot?
>
> Are you worried this is over-engineering for a problem that may never be
> an issue, or is there a particular usecase you have in mind?
>
> It is probably worth investigating, and maybe a potential usecase would
> help with the targeted sweet spot?

Lorenzo already explained that it is not an issue at the moment. I
think this is a good discussion to have, as I think we will be
expanding the limit from UIO_FASTIOV to a higher value (maybe
UIO_MAXIOV) soon in a followup.

At the moment, my gut feeling is that the batch size of regions given
to the syscall will depend on the advice parameter. For example, for
MADV_[NO]HUGEPAGE, which is a simple flag [re]set, a single write lock
and possibly a single tree traversal will be better than multiple write
lock/unlock operations even for a large batch size. Anyway, we will
need some experimental data to show that.

JFYI SJ is planning to work on two improvements: (1) single tree
traversal for all the given regions and (2) TLB flush batching for
MADV_DONTNEED[_LOCKED] and MADV_FREE.

Thanks everyone for your time and feedback.
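For reference, the madvise_lock() helper this series factors out picks
the lock type per advice, roughly along these lines (paraphrased
sketch, not the exact upstream code):

	static int madvise_lock(struct mm_struct *mm, int behavior)
	{
		/*
		 * Advice that rewrites VMA flags (e.g. MADV_HUGEPAGE,
		 * MADV_MERGEABLE) needs the write lock; page-range
		 * operations such as MADV_DONTNEED get away with the
		 * read lock.
		 */
		if (madvise_need_mmap_write(behavior)) {
			if (mmap_write_lock_killable(mm))
				return -EINTR;
		} else {
			mmap_read_lock(mm);
		}
		return 0;
	}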
On Fri, Jan 31, 2025 at 12:47:24PM -0500, Liam R. Howlett wrote:
> * Davidlohr Bueso <dave@stgolabs.net> [250131 12:31]:
[...]
> > imo, this should best be done in the same patch/series. Maybe extend
> > the benchmark to use IOV_MAX and find a sweet spot?
>
> Are you worried this is over-engineering for a problem that may never be
> an issue, or is there a particular usecase you have in mind?
>
> It is probably worth investigating, and maybe a potential usecase would
> help with the targeted sweet spot?

Keep in mind process_madvise() is not limited by IOV_MAX, which can be rather
high, but rather UIO_FASTIOV, which is limited to 8 entries.

(Some have been surprised by this limitation...!)

So I think at this point scaling isn't a huge issue, I raise it because in
future we may want to increase this limit, at which point we should think about
it, which is why I sort of hand-waved it away a bit.

> Thanks,
> Liam
On Fri, Jan 31, 2025 at 05:51:45PM +0000, Lorenzo Stoakes wrote:
> On Fri, Jan 31, 2025 at 12:47:24PM -0500, Liam R. Howlett wrote:
[...]
> Keep in mind process_madvise() is not limited by IOV_MAX, which can be rather
> high, but rather UIO_FASTIOV, which is limited to 8 entries.
>
> (Some have been surprised by this limitation...!)

Surprised, perhaps because I was wrong about this :) Apologies for that.

SJ raised this in [0] and the non-RFC version of this series is over at [1].

[0]: https://lore.kernel.org/all/20250517162048.36347-1-sj@kernel.org/
[1]: https://lore.kernel.org/all/20250206061517.2958-1-sj@kernel.org/

We should revisit this and determine whether the drop/reacquire lock is
required, perhaps doing some experiments around heavy operations using
UIO_MAXIOV entries?

SJ - could you take a look at this please?

> So I think at this point scaling isn't a huge issue, I raise it because in
> future we may want to increase this limit, at which point we should think about
> it, which is why I sort of hand-waved it away a bit.

Again as I said here, I suspect _probably_ this won't be too much of an
issue - but it is absolutely one we need to address.

Cheers, Lorenzo
On Sat, 17 May 2025 20:28:49 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> On Fri, Jan 31, 2025 at 05:51:45PM +0000, Lorenzo Stoakes wrote:
[...]
> > Keep in mind process_madvise() is not limited by IOV_MAX, which can be rather
> > high, but rather UIO_FASTIOV, which is limited to 8 entries.
> >
> > (Some have been surprised by this limitation...!)
>
> Surprised, perhaps because I was wrong about this :) Apologies for that.
>
> SJ raised this in [0] and the non-RFC version of this series is over at [1].
>
> [0]: https://lore.kernel.org/all/20250517162048.36347-1-sj@kernel.org/
> [1]: https://lore.kernel.org/all/20250206061517.2958-1-sj@kernel.org/

I actually mentioned[1] that I think the real limit is UIO_MAXIOV, but
even that wouldn't be a real problem since users can tune the batching
size. jemalloc has in fact made a change[2] to use process_madvise()
with a batching size of up to 128. I impatiently sent[3] the next
revision without giving you enough time to reply, though.

> We should revisit this and determine whether the drop/reacquire lock is
> required, perhaps doing some experiments around heavy operations using
> UIO_MAXIOV entries?
>
> SJ - could you take a look at this please?

We had a chance to test this against a production workload, and found
no visible regression. The workload is not intensively calling
process_madvise(), though. Our internal testing of kernels having this
change also hasn't found any problem so far, though process_madvise()
calls from the internal testing are also not intensive, to my best
knowledge. So my thought about UIO_MAXIOV is the same.

I anticipate no issue (until someone yells ;) ) and didn't find
evidence of a problem. But, same as in the previous discussion[1], I
agree more testing would be good, while I have no good list of
benchmarks for this. It would be nice if someone could give me the
names of suitable benchmarks.

> > So I think at this point scaling isn't a huge issue, I raise it because in
> > future we may want to increase this limit, at which point we should think about
> > it, which is why I sort of hand-waved it away a bit.
>
> Again as I said here, I suspect _probably_ this won't be too much of an
> issue - but it is absolutely one we need to address.

Yes, I agree :)

[1] https://lore.kernel.org/20250204195343.16500-1-sj@kernel.org
[2] https://github.com/jemalloc/jemalloc/pull/2794/commits/c3604456d4c1f570348a
[3] https://lore.kernel.org/20250206062801.3060-1-sj@kernel.org

Thanks,
SJ
On Fri, 31 Jan 2025 17:51:45 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> On Fri, Jan 31, 2025 at 12:47:24PM -0500, Liam R. Howlett wrote:
[...]
> > > > Probably it's ok given limited size of iov,

I think so. Also, users could adjust the batching size for their
workloads.

> > > > but maybe in future we'd want
> > > > to set a limit on the ranges before we drop/reacquire lock?
> > >
> > > imo, this should best be done in the same patch/series. Maybe extend
> > > the benchmark to use IOV_MAX and find a sweet spot?
> >
> > Are you worried this is over-engineering for a problem that may never be
> > an issue, or is there a particular usecase you have in mind?
> >
> > It is probably worth investigating, and maybe a potential usecase would
> > help with the targeted sweet spot?

I think the sweet spot may depend on the workload and the advice type,
so the selection of the benchmark will be important. Do you have a
benchmark in mind? My humble microbenchmark[1] does only single-thread
usage performance evaluation, so it may not be appropriate here. I
actually did the evaluation with batching sizes up to IOV_MAX (1024),
but found no clear evidence of a sweet spot.

> Keep in mind process_madvise() is not limited by IOV_MAX, which can be rather
> high, but rather UIO_FASTIOV, which is limited to 8 entries.

process_madvise() does indeed have an iovec array of UIO_FASTIOV size,
namely iovstack. But, if I understood the code correctly, iovstack is
used only as a fast path when the user requests advising fewer than
UIO_FASTIOV regions.

I actually confirmed I can make the loop iterate 1024 times, using my
microbenchmark[1]. My steps for the check were running the program with
the 'eval_pmadv $((4*1024*1024)) $((4*1024)) $((4*1024*1024))' command,
and counting the number of iterations of the vector_madvise() main loop
using printk(). Please let me know if I'm missing something.

> (Some have been surprised by this limitation...!)
>
> So I think at this point scaling isn't a huge issue, I raise it because in
> future we may want to increase this limit, at which point we should think about
> it, which is why I sort of hand-waved it away a bit.

I personally think this may still not be a huge issue, especially given
the fact that users can test and tune the limit. But I'd like to hear
more opinions if available.

[1] https://github.com/sjp38/eval_proc_madvise/blob/master/eval_pmadv.c

Thanks,
SJ
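For context, a minimal batched process_madvise() call of the kind such
a benchmark issues looks roughly like this (sketch; glibc provides no
wrapper, so it goes through syscall(2)):

	#include <sys/syscall.h>
	#include <sys/uio.h>
	#include <unistd.h>

	/*
	 * Advise 'nr' ranges of another process in a single call --
	 * with this series, a single mmap-lock cycle.
	 */
	static long madvise_batch(int pidfd, const struct iovec *ranges,
				  size_t nr, int advice)
	{
		return syscall(SYS_process_madvise, pidfd, ranges, nr,
			       advice, 0 /* flags, must be zero */);
	}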
On Tue, 4 Feb 2025 11:53:43 -0800 SeongJae Park <sj@kernel.org> wrote:

> On Fri, 31 Jan 2025 17:51:45 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
[...]
> > Keep in mind process_madvise() is not limited by IOV_MAX, which can be rather
> > high, but rather UIO_FASTIOV, which is limited to 8 entries.
>
> process_madvise() does indeed have an iovec array of UIO_FASTIOV size,
> namely iovstack. But, if I understood the code correctly, iovstack is
> used only as a fast path when the user requests advising fewer than
> UIO_FASTIOV regions.
[...]
> I personally think this may still not be a huge issue, especially given
> the fact that users can test and tune the limit. But I'd like to hear
> more opinions if available.

Maybe I should wait longer for more opinions, but I just impatiently
sent the next spin of this patch series[1]. Because Davidlohr and
Lorenzo promised their tags based on the assumption of the UIO_FASTIOV
limit, while I think that may not be an effective limit, I didn't add
their tags on the fourth patch of the series. Please let me know if you
have any concern about that, either on this thread or on the new patch
series thread[1].

[1] https://lore.kernel.org/20250206061517.2958-1-sj@kernel.org

Thanks,
SJ
On Fri, 31 Jan 2025, Lorenzo Stoakes wrote:

> Keep in mind process_madvise() is not limited by IOV_MAX, which can be rather
> high, but rather UIO_FASTIOV, which is limited to 8 entries.
>
> (Some have been surprised by this limitation...!)

Ah if that is the case then I'm fine with this patch as is.

> So I think at this point scaling isn't a huge issue, I raise it because in
> future we may want to increase this limit, at which point we should think about
> it, which is why I sort of hand-waved it away a bit.

Agreed. SJ, feel free to add my:

Acked-by: Davidlohr Bueso <dave@stgolabs.net>
On Thu, Jan 16, 2025 at 05:30:58PM -0800, SeongJae Park wrote:
> Optimize redundant mmap lock operations from process_madvise() by
> directly doing the mmap locking first, and then the remaining work for
> all ranges in the loop.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>

Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>