[v4] bitops: Optimize fns() for improved performance

[PATCH v4 1/2] lib/test_bitops: Add benchmark test for fns()

Posted by Kuan-Wei Chiu 1 year, 9 months ago

Introduce a benchmark test for the fns(). It measures the total time
taken by fns() to process 1,000,000 test data generated using
get_random_bytes() for each n in the range [0, BITS_PER_LONG).

example:
test_bitops: fns:          5876762553 ns, 64000000 iterations

Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
---

Changes in v4:
- Correct get_random_long() -> get_random_bytes() in the commit
  message.

 lib/test_bitops.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/lib/test_bitops.c b/lib/test_bitops.c
index 3b7bcbee84db..ed939f124417 100644
--- a/lib/test_bitops.c
+++ b/lib/test_bitops.c
@@ -50,6 +50,26 @@ static unsigned long order_comb_long[][2] = {
 };
 #endif
 
+static unsigned long buf[1000000];
+
+static int __init test_fns(void)
+{
+	unsigned int i, n;
+	ktime_t time;
+
+	get_random_bytes(buf, sizeof(buf));
+	time = ktime_get();
+
+	for (n = 0; n < BITS_PER_LONG; n++)
+		for (i = 0; i < 1000000; i++)
+			fns(buf[i], n);
+
+	time = ktime_get() - time;
+	pr_err("fns:  %18llu ns, %6d iterations\n", time, BITS_PER_LONG * 1000000);
+
+	return 0;
+}
+
 static int __init test_bitops_startup(void)
 {
 	int i, bit_set;
@@ -94,6 +114,8 @@ static int __init test_bitops_startup(void)
 	if (bit_set != BITOPS_LAST)
 		pr_err("ERROR: FOUND SET BIT %d\n", bit_set);
 
+	test_fns();
+
 	pr_info("Completed bitops test\n");
 
 	return 0;
-- 
2.34.1

Re: [PATCH v4 1/2] lib/test_bitops: Add benchmark test for fns()

Posted by Yury Norov 1 year, 9 months ago

On Wed, May 01, 2024 at 09:20:46PM +0800, Kuan-Wei Chiu wrote:
> Introduce a benchmark test for the fns(). It measures the total time
> taken by fns() to process 1,000,000 test data generated using
> get_random_bytes() for each n in the range [0, BITS_PER_LONG).
> 
> example:
> test_bitops: fns:          5876762553 ns, 64000000 iterations

So... 5 seconds for a test sounds too much. I see the following patch
improves it dramatically, but in general let's stay in a range of
milliseconds. On other machines it may run much slower and trigger
a stall watchdog.

> Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>

Suggested-by: Yury Norov <yury.norov@gmail.com>

> ---
> 
> Changes in v4:
> - Correct get_random_long() -> get_random_bytes() in the commit
>   message.
> 
>  lib/test_bitops.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/lib/test_bitops.c b/lib/test_bitops.c
> index 3b7bcbee84db..ed939f124417 100644
> --- a/lib/test_bitops.c
> +++ b/lib/test_bitops.c
> @@ -50,6 +50,26 @@ static unsigned long order_comb_long[][2] = {
>  };
>  #endif
>  
> +static unsigned long buf[1000000];

Can you make it __init, or allocate with kmalloc_array(), so that 64M
of memory will not last forever in the kernel?

> +static int __init test_fns(void)
> +{
> +	unsigned int i, n;
> +	ktime_t time;
> +
> +	get_random_bytes(buf, sizeof(buf));
> +	time = ktime_get();
> +
> +	for (n = 0; n < BITS_PER_LONG; n++)
> +		for (i = 0; i < 1000000; i++)
> +			fns(buf[i], n);

What concerns me here is that fns() is a in fact a const function, and
the whole loop may be eliminated. Can you make sure it's not your case
because 450x performance boost sounds a bit too much to me.

You can declare a "static volatile __used __init" variable to assign
the result of fns(), and ensure that the code is not eliminated

> +	time = ktime_get() - time;
> +	pr_err("fns:  %18llu ns, %6d iterations\n", time, BITS_PER_LONG * 1000000);

Here the number of iterations is always the same. What's the point to
print it?

Thanks,
Yury

RE: [PATCH v4 1/2] lib/test_bitops: Add benchmark test for fns()

Posted by David Laight 1 year, 9 months ago

From: Yury Norov
> Sent: 01 May 2024 17:30
> 
> On Wed, May 01, 2024 at 09:20:46PM +0800, Kuan-Wei Chiu wrote:
> > Introduce a benchmark test for the fns(). It measures the total time
> > taken by fns() to process 1,000,000 test data generated using
> > get_random_bytes() for each n in the range [0, BITS_PER_LONG).
> >
> > example:
> > test_bitops: fns:          5876762553 ns, 64000000 iterations
> 
> So... 5 seconds for a test sounds too much. I see the following patch
> improves it dramatically, but in general let's stay in a range of
> milliseconds. On other machines it may run much slower and trigger
> a stall watchdog.
> 
> > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> 
> Suggested-by: Yury Norov <yury.norov@gmail.com>
> 
> > ---
> >
> > Changes in v4:
> > - Correct get_random_long() -> get_random_bytes() in the commit
> >   message.
> >
> >  lib/test_bitops.c | 22 ++++++++++++++++++++++
> >  1 file changed, 22 insertions(+)
> >
> > diff --git a/lib/test_bitops.c b/lib/test_bitops.c
> > index 3b7bcbee84db..ed939f124417 100644
> > --- a/lib/test_bitops.c
> > +++ b/lib/test_bitops.c
> > @@ -50,6 +50,26 @@ static unsigned long order_comb_long[][2] = {
> >  };
> >  #endif
> >
> > +static unsigned long buf[1000000];
> 
> Can you make it __init, or allocate with kmalloc_array(), so that 64M
> of memory will not last forever in the kernel?
> 
> > +static int __init test_fns(void)
> > +{
> > +	unsigned int i, n;
> > +	ktime_t time;
> > +
> > +	get_random_bytes(buf, sizeof(buf));
> > +	time = ktime_get();
> > +
> > +	for (n = 0; n < BITS_PER_LONG; n++)
> > +		for (i = 0; i < 1000000; i++)
> > +			fns(buf[i], n);
> 
> What concerns me here is that fns() is a in fact a const function, and
> the whole loop may be eliminated. Can you make sure it's not your case
> because 450x performance boost sounds a bit too much to me.
> 
> You can declare a "static volatile __used __init" variable to assign
> the result of fns(), and ensure that the code is not eliminated

Yep, without 'c' this compiler to 'return 0'.

static inline unsigned long fns(unsigned long word, unsigned int n)
{
	while (word && n--)
		word &= word - 1;
	return word ? __builtin_ffs(word) : 8 * sizeof (long);
}

unsigned long buf[1000000];

volatile int c;

int  test_fns(void)
{
	unsigned int i, n;

	for (n = 0; n < 8*sizeof (long); n++)
		for (i = 0; i < 1000000; i++)
			c = fns(buf[i], n);
	return 0;
}

You are also hitting the random number generator.
It would be better to use a predictable sequence.
Then you could, for instance, add up all the fns() results
and check you get the expected value.

With a really trivial 'RNG' (like step a CRC one bit) you
could do it inside the loop and not nee a buffer at all.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Re: [PATCH v4 1/2] lib/test_bitops: Add benchmark test for fns()

Posted by Kuan-Wei Chiu 1 year, 9 months ago

On Sun, May 05, 2024 at 01:11:53PM +0000, David Laight wrote:
> From: Yury Norov
> > Sent: 01 May 2024 17:30
> > 
> > On Wed, May 01, 2024 at 09:20:46PM +0800, Kuan-Wei Chiu wrote:
> > > Introduce a benchmark test for the fns(). It measures the total time
> > > taken by fns() to process 1,000,000 test data generated using
> > > get_random_bytes() for each n in the range [0, BITS_PER_LONG).
> > >
> > > example:
> > > test_bitops: fns:          5876762553 ns, 64000000 iterations
> > 
> > So... 5 seconds for a test sounds too much. I see the following patch
> > improves it dramatically, but in general let's stay in a range of
> > milliseconds. On other machines it may run much slower and trigger
> > a stall watchdog.
> > 
> > > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> > 
> > Suggested-by: Yury Norov <yury.norov@gmail.com>
> > 
> > > ---
> > >
> > > Changes in v4:
> > > - Correct get_random_long() -> get_random_bytes() in the commit
> > >   message.
> > >
> > >  lib/test_bitops.c | 22 ++++++++++++++++++++++
> > >  1 file changed, 22 insertions(+)
> > >
> > > diff --git a/lib/test_bitops.c b/lib/test_bitops.c
> > > index 3b7bcbee84db..ed939f124417 100644
> > > --- a/lib/test_bitops.c
> > > +++ b/lib/test_bitops.c
> > > @@ -50,6 +50,26 @@ static unsigned long order_comb_long[][2] = {
> > >  };
> > >  #endif
> > >
> > > +static unsigned long buf[1000000];
> > 
> > Can you make it __init, or allocate with kmalloc_array(), so that 64M
> > of memory will not last forever in the kernel?
> > 
> > > +static int __init test_fns(void)
> > > +{
> > > +	unsigned int i, n;
> > > +	ktime_t time;
> > > +
> > > +	get_random_bytes(buf, sizeof(buf));
> > > +	time = ktime_get();
> > > +
> > > +	for (n = 0; n < BITS_PER_LONG; n++)
> > > +		for (i = 0; i < 1000000; i++)
> > > +			fns(buf[i], n);
> > 
> > What concerns me here is that fns() is a in fact a const function, and
> > the whole loop may be eliminated. Can you make sure it's not your case
> > because 450x performance boost sounds a bit too much to me.
> > 
> > You can declare a "static volatile __used __init" variable to assign
> > the result of fns(), and ensure that the code is not eliminated
> 
> Yep, without 'c' this compiler to 'return 0'.
> 
> static inline unsigned long fns(unsigned long word, unsigned int n)
> {
> 	while (word && n--)
> 		word &= word - 1;
> 	return word ? __builtin_ffs(word) : 8 * sizeof (long);
> }
> 
> unsigned long buf[1000000];
> 
> volatile int c;
> 
> int  test_fns(void)
> {
> 	unsigned int i, n;
> 
> 	for (n = 0; n < 8*sizeof (long); n++)
> 		for (i = 0; i < 1000000; i++)
> 			c = fns(buf[i], n);
> 	return 0;
> }
> 
> You are also hitting the random number generator.
> It would be better to use a predictable sequence.
> Then you could, for instance, add up all the fns() results
> and check you get the expected value.
> 
> With a really trivial 'RNG' (like step a CRC one bit) you
> could do it inside the loop and not nee a buffer at all.
> 
Hi David,

I do think that conducting correctness testing here is a good idea.
However, we are about to change the return value of fns() from return
BITS_PER_LONG to return >= BITS_PER_LONG [1][2] when the nth bit is not
found. Therefore, using a fixed input series here and checking the sum
of return values may not accurately test it. Do you know if there are
any other more suitable testing methods?

[1]: https://lore.kernel.org/all/20240502233204.2255158-3-yury.norov@gmail.com/
[2]: https://lore.kernel.org/all/20240502233204.2255158-4-yury.norov@gmail.com/

Regards,
Kuan-Wei

> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>

Re: [PATCH v4 1/2] lib/test_bitops: Add benchmark test for fns()

Posted by Yury Norov 1 year, 9 months ago

On Mon, May 06, 2024 at 02:48:14AM +0800, Kuan-Wei Chiu wrote:
> On Sun, May 05, 2024 at 01:11:53PM +0000, David Laight wrote:
> > From: Yury Norov
> > > Sent: 01 May 2024 17:30
> > > 
> > > On Wed, May 01, 2024 at 09:20:46PM +0800, Kuan-Wei Chiu wrote:
> > > > Introduce a benchmark test for the fns(). It measures the total time
> > > > taken by fns() to process 1,000,000 test data generated using
> > > > get_random_bytes() for each n in the range [0, BITS_PER_LONG).
> > > >
> > > > example:
> > > > test_bitops: fns:          5876762553 ns, 64000000 iterations
> > > 
> > > So... 5 seconds for a test sounds too much. I see the following patch
> > > improves it dramatically, but in general let's stay in a range of
> > > milliseconds. On other machines it may run much slower and trigger
> > > a stall watchdog.
> > > 
> > > > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> > > 
> > > Suggested-by: Yury Norov <yury.norov@gmail.com>
> > > 
> > > > ---
> > > >
> > > > Changes in v4:
> > > > - Correct get_random_long() -> get_random_bytes() in the commit
> > > >   message.
> > > >
> > > >  lib/test_bitops.c | 22 ++++++++++++++++++++++
> > > >  1 file changed, 22 insertions(+)
> > > >
> > > > diff --git a/lib/test_bitops.c b/lib/test_bitops.c
> > > > index 3b7bcbee84db..ed939f124417 100644
> > > > --- a/lib/test_bitops.c
> > > > +++ b/lib/test_bitops.c
> > > > @@ -50,6 +50,26 @@ static unsigned long order_comb_long[][2] = {
> > > >  };
> > > >  #endif
> > > >
> > > > +static unsigned long buf[1000000];
> > > 
> > > Can you make it __init, or allocate with kmalloc_array(), so that 64M
> > > of memory will not last forever in the kernel?
> > > 
> > > > +static int __init test_fns(void)
> > > > +{
> > > > +	unsigned int i, n;
> > > > +	ktime_t time;
> > > > +
> > > > +	get_random_bytes(buf, sizeof(buf));
> > > > +	time = ktime_get();
> > > > +
> > > > +	for (n = 0; n < BITS_PER_LONG; n++)
> > > > +		for (i = 0; i < 1000000; i++)
> > > > +			fns(buf[i], n);
> > > 
> > > What concerns me here is that fns() is a in fact a const function, and
> > > the whole loop may be eliminated. Can you make sure it's not your case
> > > because 450x performance boost sounds a bit too much to me.
> > > 
> > > You can declare a "static volatile __used __init" variable to assign
> > > the result of fns(), and ensure that the code is not eliminated
> > 
> > Yep, without 'c' this compiler to 'return 0'.
> > 
> > static inline unsigned long fns(unsigned long word, unsigned int n)
> > {
> > 	while (word && n--)
> > 		word &= word - 1;
> > 	return word ? __builtin_ffs(word) : 8 * sizeof (long);
> > }
> > 
> > unsigned long buf[1000000];
> > 
> > volatile int c;
> > 
> > int  test_fns(void)
> > {
> > 	unsigned int i, n;
> > 
> > 	for (n = 0; n < 8*sizeof (long); n++)
> > 		for (i = 0; i < 1000000; i++)
> > 			c = fns(buf[i], n);
> > 	return 0;
> > }
> > 
> > You are also hitting the random number generator.
> > It would be better to use a predictable sequence.
> > Then you could, for instance, add up all the fns() results
> > and check you get the expected value.
> > 
> > With a really trivial 'RNG' (like step a CRC one bit) you
> > could do it inside the loop and not nee a buffer at all.
> > 
> Hi David,
> 
> I do think that conducting correctness testing here is a good idea.
> However, we are about to change the return value of fns() from return
> BITS_PER_LONG to return >= BITS_PER_LONG [1][2] when the nth bit is not
> found. Therefore, using a fixed input series here and checking the sum
> of return values may not accurately test it. Do you know if there are
> any other more suitable testing methods?

Hi David,

Let's do one thing in a time. test_fns() is about performance testing,
and that's it. Kuan-Wei wrote it specifically for that.

He also chose to use random numbers to feed the fns(), and that's a
reasonable choice. I see nothing wrong in his approach with the array,
as well as what you proposed. But the test is done, and it works. If
you think it's worth testing the function with your approach, feel
free to submit your test, I'll take it just as well, and we'll have
even better coverage.

Regarding summing up all the fns() returns to check correctness - I
think it's a wrong idea. At first because current behavior when we
return BITS_PER_LONG is not a contract, just implementation detail.
And even now, on 64- and 32-bit arches fns() will sum up to different
numbers.

Second, this summing up doesn't sound reliable. Imagine a case when
the function fails one time returning a less-by-one number, and in the
following run - a greater-by-one. So the sum will be correct, and the
test will not catch the error.

We're running a test_find_nth_bit(), and to me it's enough for testing
fns(). But if you'd like, please send a more specific test.

Thanks,
Yury

RE: [PATCH v4 1/2] lib/test_bitops: Add benchmark test for fns()

Posted by David Laight 1 year, 9 months ago

...
> He also chose to use random numbers to feed the fns(), and that's a
> reasonable choice. I see nothing wrong in his approach with the array,
> as well as what you proposed. But the test is done, and it works. If
> you think it's worth testing the function with your approach, feel
> free to submit your test, I'll take it just as well, and we'll have
> even better coverage.

Accessing a very large array is going to be affected by cache line
and TLB misses.
It wont really be entirely affected by implementation of the
function itself.
This is especially relevant for something that does really apply
to large memory blocks.

OTOH you don't want to repeatedly call the version that used
a binary chop with constant data - the branch predictor
will learn the pattern instead of getting if wrong 50% of the time.

A compromise would be to repeatedly process a short random
buffer enough times to get something measurable.
Although it is perfectly possible to use a cycle counter
(but not the x86 TSC) to time single function calls and
get repeatable answers that can actually be compared to
calculated values.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)