RE: [PATCH 0/2] execve scalability issues, part 1

David Laight posted 2 patches 2 years, 3 months ago
RE: [PATCH 0/2] execve scalability issues, part 1
Posted by David Laight 2 years, 3 months ago
From: Jan Kara
> Sent: Wednesday, August 23, 2023 10:49 AM
....
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -737,7 +737,11 @@ struct mm_struct {
> >
> >                 unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
> >
> > -               struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > +               union {
> > +                       struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > +                       u64 *rss_stat_single;
> > +               };
> > +               bool    magic_flag_stuffed_elsewhere;

I wouldn't use a union to save a pointer - it is asking for trouble.

> >
> >                 struct linux_binfmt *binfmt;
> >
> >
> > Then for single-threaded case an area is allocated for NR_MM_COUNTERS
> > counters * 2 -- first set updated without any synchro by current
> > thread. Second set only to be modified by others and protected with
> > mm->arg_lock. The lock protects remote access to the union to begin
> > with.
> 
> arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
> with two counters is clever but I'm not 100% convinced the complexity is
> really worth it. I'm not sure the overhead of always using an atomic
> counter would really be measurable as atomic counter ops in local CPU cache
> tend to be cheap. Did you try to measure the difference?

A separate lock is worse than atomics.
(Although some 32bit arch may have issues with 64bit atomics.)
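
As a rough userspace analogue of the 32-bit caveat: C11 exposes whether 64-bit atomics are lock-free on the build target. This is only a sketch; the helper name below is hypothetical, and the kernel makes the equivalent decision through its own per-architecture atomic64 support rather than through this macro.

```c
#include <stdatomic.h>

/*
 * Hypothetical helper, not kernel code.
 * ATOMIC_LLONG_LOCK_FREE is defined by C11 <stdatomic.h>:
 *   0 = never lock-free, 1 = sometimes lock-free, 2 = always lock-free.
 * A target reporting less than 2 has to emulate 64-bit atomics, typically
 * with a lock, which is the kind of "issue" mentioned above.
 */
static int llong_atomic_lock_free_level(void)
{
	return ATOMIC_LLONG_LOCK_FREE;
}
```

A result of 2 is what one would expect on x86-64; the interesting cases are older 32-bit targets.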

I think you'll be surprised just how slow atomic ops are.
Even when present in the local cache.
(Probably because any other copies have to be invalidated.)

	David

Re: [PATCH 0/2] execve scalability issues, part 1
Posted by Mateusz Guzik 2 years, 3 months ago
On 8/23/23, David Laight <David.Laight@aculab.com> wrote:
> From: Jan Kara
>> Sent: Wednesday, August 23, 2023 10:49 AM
> ....
>> > --- a/include/linux/mm_types.h
>> > +++ b/include/linux/mm_types.h
>> > @@ -737,7 +737,11 @@ struct mm_struct {
>> >
>> >                 unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
>> >
>> > -               struct percpu_counter rss_stat[NR_MM_COUNTERS];
>> > +               union {
>> > +                       struct percpu_counter rss_stat[NR_MM_COUNTERS];
>> > +                       u64 *rss_stat_single;
>> > +               };
>> > +               bool    magic_flag_stuffed_elsewhere;
>
> I wouldn't use a union to save a pointer - it is asking for trouble.
>

I may need to abandon this bit anyway -- counter init adds counters to
a global list and I can't easily call it like that.

>> >
>> >                 struct linux_binfmt *binfmt;
>> >
>> >
>> > Then for single-threaded case an area is allocated for NR_MM_COUNTERS
>> > counters * 2 -- first set updated without any synchro by current
>> > thread. Second set only to be modified by others and protected with
>> > mm->arg_lock. The lock protects remote access to the union to begin
>> > with.
>>
>> arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
>> with two counters is clever but I'm not 100% convinced the complexity is
>> really worth it. I'm not sure the overhead of always using an atomic
>> counter would really be measurable as atomic counter ops in local CPU cache
>> tend to be cheap. Did you try to measure the difference?
>
> A separate lock is worse than atomics.
> (Although some 32bit arch may have issues with 64bit atomics.)
>

But in my proposal the separate lock is used to facilitate *NOT* using
atomics by the most common consumer -- the only thread.

The lock is only used for the transition to the multithreaded state and
for updates by remote parties (both rare compared to updates by current).
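
To make the scheme concrete, here is a minimal userspace sketch of the idea, not the actual patch. All names (rss_model, rss_add_owner and so on) are hypothetical, a pthread mutex stands in for mm->arg_lock, and plain C11 atomics stand in for whatever the multithreaded mode would really use (percpu counters in the kernel).

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_COUNTERS 4	/* stand-in for NR_MM_COUNTERS */

/* Hypothetical model of the proposal; none of these names are from the patch. */
struct rss_model {
	pthread_mutex_t lock;			/* plays the role of mm->arg_lock */
	bool multithreaded;			/* written only by the owner, under lock */
	int64_t local[NR_COUNTERS];		/* first set: owner only, no synchronization */
	int64_t remote[NR_COUNTERS];		/* second set: remote updaters, under lock */
	_Atomic int64_t shared[NR_COUNTERS];	/* used after the transition */
};

/* Fast path: the only thread updates its own set with a plain add. */
static void rss_add_owner(struct rss_model *m, int idx, int64_t delta)
{
	if (!m->multithreaded)	/* only the owner ever changes this flag */
		m->local[idx] += delta;
	else
		atomic_fetch_add(&m->shared[idx], delta);
}

/* Rare path: a remote party takes the lock to touch the second set. */
static void rss_add_remote(struct rss_model *m, int idx, int64_t delta)
{
	pthread_mutex_lock(&m->lock);
	if (!m->multithreaded)
		m->remote[idx] += delta;
	else
		atomic_fetch_add(&m->shared[idx], delta);
	pthread_mutex_unlock(&m->lock);
}

/* Rare path: the owner folds both sets into shared counters before going multithreaded. */
static void rss_go_multithreaded(struct rss_model *m)
{
	pthread_mutex_lock(&m->lock);
	for (int i = 0; i < NR_COUNTERS; i++)
		atomic_store(&m->shared[i], m->local[i] + m->remote[i]);
	m->multithreaded = true;
	pthread_mutex_unlock(&m->lock);
}

/* Owner-side read; a remote reader of local[] would need more care. */
static int64_t rss_read_owner(struct rss_model *m, int idx)
{
	int64_t sum;

	if (m->multithreaded)
		return atomic_load(&m->shared[idx]);
	pthread_mutex_lock(&m->lock);
	sum = m->local[idx] + m->remote[idx];
	pthread_mutex_unlock(&m->lock);
	return sum;
}

/* Sequential walk-through of both modes; returns the final counter value. */
static int64_t rss_model_demo(void)
{
	struct rss_model m = { .lock = PTHREAD_MUTEX_INITIALIZER };

	rss_add_owner(&m, 0, 5);	/* no lock, no atomic */
	rss_add_remote(&m, 0, -2);	/* locked update of the second set */
	assert(rss_read_owner(&m, 0) == 3);
	rss_go_multithreaded(&m);	/* fold 5 + (-2) into shared[0] */
	rss_add_owner(&m, 0, 1);	/* now an atomic add */
	return rss_read_owner(&m, 0);
}
```

The point of the split is that rss_add_owner() in single-threaded mode is a plain add to memory nobody else writes, so the lock and the atomics only show up on the rare paths.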

> I think you'll be surprised just how slow atomic ops are.
> Even when present in the local cache.
> (Probably because any other copies have to be invalidated.)
>

Agreed. They have always been super expensive on x86-64 (and continue
to be). I keep running into claims that they are not; I don't know where
that's coming from.
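
For anyone who wants to check this on their own hardware, a crude userspace sketch along the following lines compares a plain increment with an uncontended atomic one on a counter that stays in the local cache. The function and struct names are made up for illustration, the numbers depend heavily on CPU and compiler flags, and it says nothing about contended cache lines.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical result record for the comparison below. */
struct bench_result {
	uint64_t plain_ns;	/* elapsed time for plain increments */
	uint64_t atomic_ns;	/* elapsed time for atomic increments */
	uint64_t plain_val;	/* final counter values, for sanity checks */
	uint64_t atomic_val;
};

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Time 'iters' plain increments, then 'iters' atomic ones, on one thread. */
static struct bench_result bench_increments(uint64_t iters)
{
	/* volatile keeps the compiler from collapsing the loop into one add */
	volatile uint64_t plain = 0;
	_Atomic uint64_t atom = 0;
	struct bench_result r;
	uint64_t t0, t1, t2;

	t0 = now_ns();
	for (uint64_t i = 0; i < iters; i++)
		plain++;
	t1 = now_ns();
	for (uint64_t i = 0; i < iters; i++)
		atomic_fetch_add_explicit(&atom, 1, memory_order_relaxed);
	t2 = now_ns();

	r.plain_ns = t1 - t0;
	r.atomic_ns = t2 - t1;
	r.plain_val = plain;
	r.atomic_val = atomic_load(&atom);
	return r;
}
```

Calling bench_increments() with a few hundred million iterations and comparing plain_ns with atomic_ns gives a first-order idea of the per-operation cost on a given machine.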

-- 
Mateusz Guzik <mjguzik gmail.com>