[PATCH] blk-wbt: Speed up integer square root in rwb_arm_timer

I Hsin Cheng posted 1 patch 1 year, 10 months ago
block/blk-wbt.c     |  2 +-
lib/math/int_sqrt.c | 21 +++++++++++++++++++++
2 files changed, 22 insertions(+), 1 deletion(-)
[PATCH] blk-wbt: Speed up integer square root in rwb_arm_timer
Posted by I Hsin Cheng 1 year, 10 months ago
As the comments in rwb_arm_timer says, we should speed up the process of
integer square root with some variant versions. The implementation of
the variant integer square root "int_fast_sqrt()" is based on the
equation $lg(\sqrt{x}) = lg(x) / 2$ , where "lg" stands for logarithm
with base 2. After we take the first approximation by calculate
$2^{lg(x)/2}$ , we utilize Newton's method for three times in order to
enhance the precision.

To prove "int_fastsqrt" is indeed faster than the "int_sqrt", we take
some bench experiments to calculate the integer logarithm from 1 to 10^6
and use "perf stat --repeat 100" to measure the performance of each
function.

As the result shown, the origin version of integer square root, which is
"int_sqrt" takes 35.37 msec task-clock, 1,2181,3348 cycles, 1,6095,3665
instructions, 2551,2990 branches and causes 1,0616 branch-misses.

At the same time, the variant version of integer square root, which is
"int_fastsqrt" takes 33.96 msec task-clock, 1,1645,7487 cyclces,
5621,0086 instructions, 321,0409 branches and causes 2407 branch-misses.
We can clearly see that "int_fastsqrt" performs faster and better result
so it's indeed a faster invariant of integer square root.

The experiments runs on x86_64 GNU/Linux Architecture and the CPU is
Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
---
 block/blk-wbt.c     |  2 +-
 lib/math/int_sqrt.c | 21 +++++++++++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 0bb613139..8d25c0e55 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -407,7 +407,7 @@ static void rwb_arm_timer(struct rq_wb *rwb)
 		 * though.
 		 */
 		rwb->cur_win_nsec = div_u64(rwb->win_nsec << 4,
-					int_sqrt((rqd->scale_step + 1) << 8));
+					int_fastsqrt((rqd->scale_step + 1) << 8));
 	} else {
 		/*
 		 * For step < 0, we don't want to increase/decrease the
diff --git a/lib/math/int_sqrt.c b/lib/math/int_sqrt.c
index a8170bb91..e25ba179e 100644
--- a/lib/math/int_sqrt.c
+++ b/lib/math/int_sqrt.c
@@ -40,6 +40,27 @@ unsigned long int_sqrt(unsigned long x)
 }
 EXPORT_SYMBOL(int_sqrt);
 
+
+/**
+ * int_fastsqrt - faster invariant of int_sqrt
+ * @x: integer of which to calculate the sqrt
+ *
+ * Computes: floor(sqrt(x))
+ */
+
+unsigned long int_fastsqrt(unsigned long x)
+{
+	unsigned long y = ((31 - __builtin_clz(x | 1)) >> 1);
+	unsigned long z = 1 << y;
+
+	z = (z + (x / z)) >> 1;
+	z = (z + (x / z)) >> 1;
+	z = (z + (x / z)) >> 1;
+
+	return z;
+}
+EXPORT_SYMBOL(int_fastsqrt);
+
 #if BITS_PER_LONG < 64
 /**
  * int_sqrt64 - strongly typed int_sqrt function when minimum 64 bit input
-- 
2.34.1
Re: [PATCH] blk-wbt: Speed up integer square root in rwb_arm_timer
Posted by Bart Van Assche 1 year, 10 months ago
On 3/29/24 2:12 AM, I Hsin Cheng wrote:
> As the result shown, the origin version of integer square root, which is
> "int_sqrt" takes 35.37 msec task-clock, 1,2181,3348 cycles, 1,6095,3665
> instructions, 2551,2990 branches and causes 1,0616 branch-misses.
> 
> At the same time, the variant version of integer square root, which is
> "int_fastsqrt" takes 33.96 msec task-clock, 1,1645,7487 cyclces,
> 5621,0086 instructions, 321,0409 branches and causes 2407 branch-misses.
> We can clearly see that "int_fastsqrt" performs faster and better result
> so it's indeed a faster invariant of integer square root.

I'm not sure that a 4% performance improvement is sufficient to replace
the int_sqrt() implementation. Additionally, why to add a second 
implementation of int_sqrt() instead of replacing the int_sqrt()
implementation in lib/math/int_sqrt.c?

 > The experiments runs on x86_64 GNU/Linux Architecture and the CPU is
 > Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz.

Since int_sqrt() does not use divisions and since int_fastsqrt() uses
divisions, can all CPUs supported by the Linux kernel divide numbers as
quickly as the CPU mentioned above?

Thanks,

Bart.
Re: [PATCH] blk-wbt: Speed up integer square root in rwb_arm_timer
Posted by Jens Axboe 1 year, 10 months ago
On 3/29/24 12:15 PM, Bart Van Assche wrote:
> On 3/29/24 2:12 AM, I Hsin Cheng wrote:
>> As the result shown, the origin version of integer square root, which is
>> "int_sqrt" takes 35.37 msec task-clock, 1,2181,3348 cycles, 1,6095,3665
>> instructions, 2551,2990 branches and causes 1,0616 branch-misses.
>>
>> At the same time, the variant version of integer square root, which is
>> "int_fastsqrt" takes 33.96 msec task-clock, 1,1645,7487 cyclces,
>> 5621,0086 instructions, 321,0409 branches and causes 2407 branch-misses.
>> We can clearly see that "int_fastsqrt" performs faster and better result
>> so it's indeed a faster invariant of integer square root.
> 
> I'm not sure that a 4% performance improvement is sufficient to
> replace the int_sqrt() implementation. Additionally, why to add a
> second implementation of int_sqrt() instead of replacing the
> int_sqrt() implementation in lib/math/int_sqrt.c?

That's the real question imho - if provides the same numbers and is
faster, why have two?

I ran a quick test because I was curious, and the precision is
definitely worse. The claim that it is floor(sqrt(val)) is not true.
Trivial example:

1005117225
	sqrt()		31703.58
	int_sqrt()	30703
	int_fastsqrt()	30821

whether this matters, probably not, but then again it's hard to care
about a slow path sqrt calculation. I'd certainly err on the side of
precision for that.

-- 
Jens Axboe