lib/objpool.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
The objpool_push can only happen on local cpu node, so only the local
cpu can touch slot->tail and slot->last, which ensures the correctness
of using cmpxchg without lock prefix (using try_cmpxchg_local instead
of try_cmpxchg_acquire).
Testing with IACA found the lock version of pop/push pair costs 16.46
cycles and local-push version costs 15.63 cycles. Kretprobe throughput
is improved to 1.019 times of the lock version for x86_64 systems.
OS: Debian 10 X86_64, Linux 6.6rc6 with freelist
HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s
1T 2T 4T 8T 16T
lock: 29909085 59865637 119692073 239750369 478005250
local: 30297523 60532376 121147338 242598499 484620355
32T 48T 64T 96T 128T
lock: 957553042 1435814086 1680872925 2043126796 2165424198
local: 968526317 1454991286 1861053557 2059530343 2171732306
Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
---
lib/objpool.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/objpool.c b/lib/objpool.c
index ce0087f64400..a032701beccb 100644
--- a/lib/objpool.c
+++ b/lib/objpool.c
@@ -166,7 +166,7 @@ objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
head = READ_ONCE(slot->head);
/* fault caught: something must be wrong */
WARN_ON_ONCE(tail - head > pool->nr_objs);
- } while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1));
+ } while (!try_cmpxchg_local(&slot->tail, &tail, tail + 1));
/* now the tail position is reserved for the given obj */
WRITE_ONCE(slot->entries[tail & slot->mask], obj);
--
2.40.1
On Mon, Oct 23, 2023 at 07:24:52PM +0800, wuqiang.matt wrote: > The objpool_push can only happen on local cpu node, so only the local > cpu can touch slot->tail and slot->last, which ensures the correctness > of using cmpxchg without lock prefix (using try_cmpxchg_local instead > of try_cmpxchg_acquire). > > Testing with IACA found the lock version of pop/push pair costs 16.46 > cycles and local-push version costs 15.63 cycles. Kretprobe throughput > is improved to 1.019 times of the lock version for x86_64 systems. > > OS: Debian 10 X86_64, Linux 6.6rc6 with freelist > HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s > > 1T 2T 4T 8T 16T > lock: 29909085 59865637 119692073 239750369 478005250 > local: 30297523 60532376 121147338 242598499 484620355 > 32T 48T 64T 96T 128T > lock: 957553042 1435814086 1680872925 2043126796 2165424198 > local: 968526317 1454991286 1861053557 2059530343 2171732306 > > Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com> This patch results in lib/objpool.c:169:12: error: implicit declaration of function 'arch_cmpxchg_local' is invalid in C99 or lib/objpool.c: In function 'objpool_try_add_slot': include/linux/atomic/atomic-arch-fallback.h:384:27: error: implicit declaration of function 'arch_cmpxchg_local' for various architectures (I have seen it with arc, hexagon, and openrisc so far). As usual, my apologies for the noise if this has already been reported and/or fixed. Guenter
On 2023/10/30 01:05, Guenter Roeck wrote: > On Mon, Oct 23, 2023 at 07:24:52PM +0800, wuqiang.matt wrote: >> The objpool_push can only happen on local cpu node, so only the local >> cpu can touch slot->tail and slot->last, which ensures the correctness >> of using cmpxchg without lock prefix (using try_cmpxchg_local instead >> of try_cmpxchg_acquire). >> >> Testing with IACA found the lock version of pop/push pair costs 16.46 >> cycles and local-push version costs 15.63 cycles. Kretprobe throughput >> is improved to 1.019 times of the lock version for x86_64 systems. >> >> OS: Debian 10 X86_64, Linux 6.6rc6 with freelist >> HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s >> >> 1T 2T 4T 8T 16T >> lock: 29909085 59865637 119692073 239750369 478005250 >> local: 30297523 60532376 121147338 242598499 484620355 >> 32T 48T 64T 96T 128T >> lock: 957553042 1435814086 1680872925 2043126796 2165424198 >> local: 968526317 1454991286 1861053557 2059530343 2171732306 >> >> Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com> > > This patch results in > > lib/objpool.c:169:12: error: implicit declaration of function 'arch_cmpxchg_local' is invalid in C99 > > or > > lib/objpool.c: In function 'objpool_try_add_slot': > include/linux/atomic/atomic-arch-fallback.h:384:27: error: implicit declaration of function 'arch_cmpxchg_local' This patch was already reverted from probes/for-next by Masami Hiramatsu. Then we will rework it after the arch_cmpxchg_local issue is resolved. > for various architectures (I have seen it with arc, hexagon, and openrisc > so far). > > As usual, my apologies for the noise if this has already been reported > and/or fixed. We are working on it and the fix is in discussion. > Guenter Regards, wuqiang
On Mon, 23 Oct 2023 19:24:52 +0800 "wuqiang.matt" <wuqiang.matt@bytedance.com> wrote: > The objpool_push can only happen on local cpu node, so only the local > cpu can touch slot->tail and slot->last, which ensures the correctness > of using cmpxchg without lock prefix (using try_cmpxchg_local instead > of try_cmpxchg_acquire). > > Testing with IACA found the lock version of pop/push pair costs 16.46 > cycles and local-push version costs 15.63 cycles. Kretprobe throughput > is improved to 1.019 times of the lock version for x86_64 systems. > > OS: Debian 10 X86_64, Linux 6.6rc6 with freelist > HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s > > 1T 2T 4T 8T 16T > lock: 29909085 59865637 119692073 239750369 478005250 > local: 30297523 60532376 121147338 242598499 484620355 > 32T 48T 64T 96T 128T > lock: 957553042 1435814086 1680872925 2043126796 2165424198 > local: 968526317 1454991286 1861053557 2059530343 2171732306 > Yeah, slot->tail is only used on the local CPU. This looks good to me. Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Thanks! > Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com> > --- > lib/objpool.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/lib/objpool.c b/lib/objpool.c > index ce0087f64400..a032701beccb 100644 > --- a/lib/objpool.c > +++ b/lib/objpool.c > @@ -166,7 +166,7 @@ objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu) > head = READ_ONCE(slot->head); > /* fault caught: something must be wrong */ > WARN_ON_ONCE(tail - head > pool->nr_objs); > - } while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1)); > + } while (!try_cmpxchg_local(&slot->tail, &tail, tail + 1)); > > /* now the tail position is reserved for the given obj */ > WRITE_ONCE(slot->entries[tail & slot->mask], obj); > -- > 2.40.1 > -- Masami Hiramatsu (Google) <mhiramat@kernel.org>
On Mon, 23 Oct 2023 19:24:52 +0800 "wuqiang.matt" <wuqiang.matt@bytedance.com> wrote: > The objpool_push can only happen on local cpu node, so only the local > cpu can touch slot->tail and slot->last, which ensures the correctness > of using cmpxchg without lock prefix (using try_cmpxchg_local instead > of try_cmpxchg_acquire). > > Testing with IACA found the lock version of pop/push pair costs 16.46 > cycles and local-push version costs 15.63 cycles. Kretprobe throughput > is improved to 1.019 times of the lock version for x86_64 systems. > > OS: Debian 10 X86_64, Linux 6.6rc6 with freelist > HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s > > 1T 2T 4T 8T 16T > lock: 29909085 59865637 119692073 239750369 478005250 > local: 30297523 60532376 121147338 242598499 484620355 > 32T 48T 64T 96T 128T > lock: 957553042 1435814086 1680872925 2043126796 2165424198 > local: 968526317 1454991286 1861053557 2059530343 2171732306 > > Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com> > --- > lib/objpool.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/lib/objpool.c b/lib/objpool.c > index ce0087f64400..a032701beccb 100644 > --- a/lib/objpool.c > +++ b/lib/objpool.c > @@ -166,7 +166,7 @@ objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu) > head = READ_ONCE(slot->head); > /* fault caught: something must be wrong */ > WARN_ON_ONCE(tail - head > pool->nr_objs); > - } while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1)); > + } while (!try_cmpxchg_local(&slot->tail, &tail, tail + 1)); > > /* now the tail position is reserved for the given obj */ > WRITE_ONCE(slot->entries[tail & slot->mask], obj); I'm good with the change, but I don't like how "cpu" is passed to this function. It currently is only used in one location, which does: rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id()); Which makes this change fine. But there's nothing here to prevent someone for some reason passing another CPU to that function. If we are to make that change, I would be much more comfortable with removing "int cpu" as a parameter to objpool_try_add_slot() and adding: int cpu = raw_smp_processor_id(); Which now shows that this function *only* deals with the current CPU. -- Steve
On Mon, 23 Oct 2023 11:43:04 -0400 Steven Rostedt <rostedt@goodmis.org> wrote: > On Mon, 23 Oct 2023 19:24:52 +0800 > "wuqiang.matt" <wuqiang.matt@bytedance.com> wrote: > > > The objpool_push can only happen on local cpu node, so only the local > > cpu can touch slot->tail and slot->last, which ensures the correctness > > of using cmpxchg without lock prefix (using try_cmpxchg_local instead > > of try_cmpxchg_acquire). > > > > Testing with IACA found the lock version of pop/push pair costs 16.46 > > cycles and local-push version costs 15.63 cycles. Kretprobe throughput > > is improved to 1.019 times of the lock version for x86_64 systems. > > > > OS: Debian 10 X86_64, Linux 6.6rc6 with freelist > > HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s > > > > 1T 2T 4T 8T 16T > > lock: 29909085 59865637 119692073 239750369 478005250 > > local: 30297523 60532376 121147338 242598499 484620355 > > 32T 48T 64T 96T 128T > > lock: 957553042 1435814086 1680872925 2043126796 2165424198 > > local: 968526317 1454991286 1861053557 2059530343 2171732306 > > > > Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com> > > --- > > lib/objpool.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/lib/objpool.c b/lib/objpool.c > > index ce0087f64400..a032701beccb 100644 > > --- a/lib/objpool.c > > +++ b/lib/objpool.c > > @@ -166,7 +166,7 @@ objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu) > > head = READ_ONCE(slot->head); > > /* fault caught: something must be wrong */ > > WARN_ON_ONCE(tail - head > pool->nr_objs); > > - } while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1)); > > + } while (!try_cmpxchg_local(&slot->tail, &tail, tail + 1)); > > > > /* now the tail position is reserved for the given obj */ > > WRITE_ONCE(slot->entries[tail & slot->mask], obj); > > I'm good with the change, but I don't like how "cpu" is passed to this > function. It currently is only used in one location, which does: > > rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id()); > > Which makes this change fine. But there's nothing here to prevent someone > for some reason passing another CPU to that function. > > If we are to make that change, I would be much more comfortable with > removing "int cpu" as a parameter to objpool_try_add_slot() and adding: > > int cpu = raw_smp_processor_id(); > > Which now shows that this function *only* deals with the current CPU. Oh indeed. It used to search all CPUs to push the object, but I asked him to stop that because there should be enough space to push it in the local ring. This is a remnant of that time. Wuqiang, can you make another patch to fix it? Thank you, > > -- Steve -- Masami Hiramatsu (Google) <mhiramat@kernel.org>
On 2023/10/24 09:01, Masami Hiramatsu (Google) wrote:
> On Mon, 23 Oct 2023 11:43:04 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
>> On Mon, 23 Oct 2023 19:24:52 +0800
>> "wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:
>>
>>> The objpool_push can only happen on local cpu node, so only the local
>>> cpu can touch slot->tail and slot->last, which ensures the correctness
>>> of using cmpxchg without lock prefix (using try_cmpxchg_local instead
>>> of try_cmpxchg_acquire).
>>>
>>> Testing with IACA found the lock version of pop/push pair costs 16.46
>>> cycles and local-push version costs 15.63 cycles. Kretprobe throughput
>>> is improved to 1.019 times of the lock version for x86_64 systems.
>>>
>>> OS: Debian 10 X86_64, Linux 6.6rc6 with freelist
>>> HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s
>>>
>>> 1T 2T 4T 8T 16T
>>> lock: 29909085 59865637 119692073 239750369 478005250
>>> local: 30297523 60532376 121147338 242598499 484620355
>>> 32T 48T 64T 96T 128T
>>> lock: 957553042 1435814086 1680872925 2043126796 2165424198
>>> local: 968526317 1454991286 1861053557 2059530343 2171732306
>>>
>>> Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
>>> ---
>>> lib/objpool.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/lib/objpool.c b/lib/objpool.c
>>> index ce0087f64400..a032701beccb 100644
>>> --- a/lib/objpool.c
>>> +++ b/lib/objpool.c
>>> @@ -166,7 +166,7 @@ objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
>>> head = READ_ONCE(slot->head);
>>> /* fault caught: something must be wrong */
>>> WARN_ON_ONCE(tail - head > pool->nr_objs);
>>> - } while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1));
>>> + } while (!try_cmpxchg_local(&slot->tail, &tail, tail + 1));
>>>
>>> /* now the tail position is reserved for the given obj */
>>> WRITE_ONCE(slot->entries[tail & slot->mask], obj);
>>
>> I'm good with the change, but I don't like how "cpu" is passed to this
>> function. It currently is only used in one location, which does:
>>
>> rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id());
>>
>> Which makes this change fine. But there's nothing here to prevent someone
>> for some reason passing another CPU to that function.
>>
>> If we are to make that change, I would be much more comfortable with
>> removing "int cpu" as a parameter to objpool_try_add_slot() and adding:
>>
>> int cpu = raw_smp_processor_id();
>>
>> Which now shows that this function *only* deals with the current CPU.
>
> Oh indeed. It used to search all CPUs to push the object, but
> I asked him to stop that because there should be enough space to
> push it in the local ring. This is a remnant of that time.
Yes, good catch. Thanks for the explanation.
> Wuqiang, can you make another patch to fix it?
I'm thinking of removing the inline function objpool_try_add_slot and merging
its functionality to objpool_push, like the followings:
/* reclaim an object to object pool */
int objpool_push(void *obj, struct objpool_head *pool)
{
struct objpool_slot *slot;
uint32_t head, tail;
unsigned long flags;
/* disable local irq to avoid preemption & interruption */
raw_local_irq_save(flags);
slot = pool->cpu_slots[raw_smp_processor_id()];
/* loading tail and head as a local snapshot, tail first */
tail = READ_ONCE(slot->tail);
do {
head = READ_ONCE(slot->head);
/* fault caught: something must be wrong */
WARN_ON_ONCE(tail - head > pool->nr_objs);
} while (!try_cmpxchg_local(&slot->tail, &tail, tail + 1));
/* now the tail position is reserved for the given obj */
WRITE_ONCE(slot->entries[tail & slot->mask], obj);
/* update sequence to make this obj available for pop() */
smp_store_release(&slot->last, tail + 1);
raw_local_irq_restore(flags);
return 0;
}
I'll prepare a new patch for this improvement.
> Thank you,
>
>>
>> -- Steve
>
Thanks for your time,
wuqiang
On Tue, 24 Oct 2023 09:57:17 +0800
"wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:
> On 2023/10/24 09:01, Masami Hiramatsu (Google) wrote:
> > On Mon, 23 Oct 2023 11:43:04 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> >> On Mon, 23 Oct 2023 19:24:52 +0800
> >> "wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:
> >>
> >>> The objpool_push can only happen on local cpu node, so only the local
> >>> cpu can touch slot->tail and slot->last, which ensures the correctness
> >>> of using cmpxchg without lock prefix (using try_cmpxchg_local instead
> >>> of try_cmpxchg_acquire).
> >>>
> >>> Testing with IACA found the lock version of pop/push pair costs 16.46
> >>> cycles and local-push version costs 15.63 cycles. Kretprobe throughput
> >>> is improved to 1.019 times of the lock version for x86_64 systems.
> >>>
> >>> OS: Debian 10 X86_64, Linux 6.6rc6 with freelist
> >>> HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s
> >>>
> >>> 1T 2T 4T 8T 16T
> >>> lock: 29909085 59865637 119692073 239750369 478005250
> >>> local: 30297523 60532376 121147338 242598499 484620355
> >>> 32T 48T 64T 96T 128T
> >>> lock: 957553042 1435814086 1680872925 2043126796 2165424198
> >>> local: 968526317 1454991286 1861053557 2059530343 2171732306
> >>>
> >>> Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
> >>> ---
> >>> lib/objpool.c | 2 +-
> >>> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/lib/objpool.c b/lib/objpool.c
> >>> index ce0087f64400..a032701beccb 100644
> >>> --- a/lib/objpool.c
> >>> +++ b/lib/objpool.c
> >>> @@ -166,7 +166,7 @@ objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
> >>> head = READ_ONCE(slot->head);
> >>> /* fault caught: something must be wrong */
> >>> WARN_ON_ONCE(tail - head > pool->nr_objs);
> >>> - } while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1));
> >>> + } while (!try_cmpxchg_local(&slot->tail, &tail, tail + 1));
> >>>
> >>> /* now the tail position is reserved for the given obj */
> >>> WRITE_ONCE(slot->entries[tail & slot->mask], obj);
> >>
> >> I'm good with the change, but I don't like how "cpu" is passed to this
> >> function. It currently is only used in one location, which does:
> >>
> >> rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id());
> >>
> >> Which makes this change fine. But there's nothing here to prevent someone
> >> for some reason passing another CPU to that function.
> >>
> >> If we are to make that change, I would be much more comfortable with
> >> removing "int cpu" as a parameter to objpool_try_add_slot() and adding:
> >>
> >> int cpu = raw_smp_processor_id();
> >>
> >> Which now shows that this function *only* deals with the current CPU.
> >
> > Oh indeed. It used to search all CPUs to push the object, but
> > I asked him to stop that because there should be enough space to
> > push it in the local ring. This is a remnant of that time.
>
> Yes, good catch. Thanks for the explanation.
>
> > Wuqiang, can you make another patch to fix it?
>
> I'm thinking of removing the inline function objpool_try_add_slot and merging
> its functionality to objpool_push, like the followings:
Looks good.
>
>
> /* reclaim an object to object pool */
> int objpool_push(void *obj, struct objpool_head *pool)
> {
> struct objpool_slot *slot;
> uint32_t head, tail;
> unsigned long flags;
>
> /* disable local irq to avoid preemption & interruption */
> raw_local_irq_save(flags);
>
> slot = pool->cpu_slots[raw_smp_processor_id()];
>
> /* loading tail and head as a local snapshot, tail first */
> tail = READ_ONCE(slot->tail);
>
> do {
> head = READ_ONCE(slot->head);
> /* fault caught: something must be wrong */
> WARN_ON_ONCE(tail - head > pool->nr_objs);
> } while (!try_cmpxchg_local(&slot->tail, &tail, tail + 1));
>
> /* now the tail position is reserved for the given obj */
> WRITE_ONCE(slot->entries[tail & slot->mask], obj);
> /* update sequence to make this obj available for pop() */
> smp_store_release(&slot->last, tail + 1);
>
> raw_local_irq_restore(flags);
>
> return 0;
> }
>
> I'll prepare a new patch for this improvement.
Thanks!
>
> > Thank you,
> >
> >>
> >> -- Steve
> >
>
> Thanks for your time,
> wuqiang
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
© 2016 - 2026 Red Hat, Inc.