Skip the 2-level search via find_next_zero_bit() when there is a free slot in the
word containing next_fd, as:
(1) next_fd indicates the lower bound for the first free fd.
(2) There is a fast path inside find_next_zero_bit() when size <= 64, which
speeds up the search.
(3) Once the fdt is expanded (the bitmap size doubles with each expansion),
it is never shrunk. The search size keeps increasing, yet few free fds are
available there.
This fast path was proposed by Mateusz Guzik <mjguzik@gmail.com> and agreed to
by Jan Kara <jack@suse.cz>; it is more generic and scalable than the previous
versions. On top of patches 1 and 2, it improves pts/blogbench-1.1.0 read by
8% and write by 4% on an Intel ICX 160-core configuration running v6.10-rc7.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/fs/file.c b/fs/file.c
index 1be2a5bcc7c4..729c07a4fc28 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -491,6 +491,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
unsigned int maxbit = maxfd / BITS_PER_LONG;
unsigned int bitbit = start / BITS_PER_LONG;
+ unsigned int bit;
+
+ /*
+ * Try to avoid looking at the second level bitmap
+ */
+ bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
+ start & (BITS_PER_LONG - 1));
+ if (bit < BITS_PER_LONG)
+ return bit + bitbit * BITS_PER_LONG;
bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
if (bitbit >= maxfd)
--
2.43.0
Hello,
kernel test robot noticed a 6.3% improvement of will-it-scale.per_thread_ops on:
commit: b8decf0015a8b1ff02cdac61c0aa54355d8e73d7 ("[PATCH v5 3/3] fs/file.c: add fast path in find_next_fd()")
url: https://github.com/intel-lab-lkp/linux/commits/Yu-Ma/fs-file-c-remove-sanity_check-and-add-likely-unlikely-in-alloc_fd/20240717-224830
base: https://git.kernel.org/cgit/linux/kernel/git/vfs/vfs.git vfs.all
patch link: https://lore.kernel.org/all/20240717145018.3972922-4-yu.ma@intel.com/
patch subject: [PATCH v5 3/3] fs/file.c: add fast path in find_next_fd()
testcase: will-it-scale
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory
parameters:
nr_task: 100%
mode: thread
test: open3
cpufreq_governor: performance
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240806/202408062152.7e5b5d6d-oliver.sang@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-13/performance/x86_64-rhel-8.3/thread/100%/debian-12-x86_64-20240206.cgz/lkp-spr-2sp4/open3/will-it-scale
commit:
5bb3423bf9 ("fs/file.c: conditionally clear full_fds")
b8decf0015 ("fs/file.c: add fast path in find_next_fd()")
5bb3423bf9f9d91e b8decf0015a8b1ff02cdac61c0a
---------------- ---------------------------
%stddev %change %stddev
\ | \
848151 +6.2% 901119 ± 2% will-it-scale.224.threads
3785 +6.3% 4022 ± 2% will-it-scale.per_thread_ops
848151 +6.2% 901119 ± 2% will-it-scale.workload
0.28 ± 4% +13.3% 0.32 ± 3% perf-stat.i.MPKI
31.31 ± 3% +2.0 33.28 perf-stat.i.cache-miss-rate%
14955855 ± 4% +13.6% 16995785 ± 4% perf-stat.i.cache-misses
49676581 +6.7% 53009444 ± 3% perf-stat.i.cache-references
43955 ± 4% -12.3% 38549 ± 4% perf-stat.i.cycles-between-cache-misses
0.28 ± 4% +13.4% 0.32 ± 4% perf-stat.overall.MPKI
29.84 ± 3% +1.9 31.78 ± 2% perf-stat.overall.cache-miss-rate%
43445 ± 4% -12.1% 38200 ± 4% perf-stat.overall.cycles-between-cache-misses
19005976 -5.4% 17972604 ± 2% perf-stat.overall.path-length
14869677 ± 4% +13.6% 16898438 ± 4% perf-stat.ps.cache-misses
49821402 +6.7% 53168235 ± 3% perf-stat.ps.cache-references
49.42 -0.1 49.34 perf-profile.calltrace.cycles-pp.alloc_fd.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
49.40 -0.1 49.32 perf-profile.calltrace.cycles-pp.file_close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
49.25 -0.1 49.18 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.file_close_fd.__x64_sys_close.do_syscall_64
49.20 -0.1 49.13 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_fd.do_sys_openat2.__x64_sys_openat
49.33 -0.1 49.26 perf-profile.calltrace.cycles-pp._raw_spin_lock.file_close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
49.28 -0.1 49.22 perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_fd.do_sys_openat2.__x64_sys_openat.do_syscall_64
50.14 +0.0 50.18 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
50.17 +0.0 50.21 perf-profile.calltrace.cycles-pp.open64
0.64 ± 5% +0.1 0.75 ± 6% perf-profile.calltrace.cycles-pp.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.62 ± 5% +0.1 0.74 ± 6% perf-profile.calltrace.cycles-pp.path_openat.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64
49.42 -0.1 49.34 perf-profile.children.cycles-pp.alloc_fd
49.40 -0.1 49.32 perf-profile.children.cycles-pp.file_close_fd
0.06 -0.0 0.05 perf-profile.children.cycles-pp.file_close_fd_locked
0.15 ± 5% +0.0 0.17 ± 4% perf-profile.children.cycles-pp.init_file
0.22 ± 3% +0.0 0.25 ± 3% perf-profile.children.cycles-pp.alloc_empty_file
0.18 ± 6% +0.0 0.22 ± 6% perf-profile.children.cycles-pp.__fput
50.14 +0.0 50.18 perf-profile.children.cycles-pp.__x64_sys_openat
50.18 +0.0 50.22 perf-profile.children.cycles-pp.open64
0.18 ± 14% +0.0 0.23 ± 7% perf-profile.children.cycles-pp.do_dentry_open
0.30 ± 8% +0.1 0.36 ± 8% perf-profile.children.cycles-pp.do_open
0.64 ± 5% +0.1 0.75 ± 6% perf-profile.children.cycles-pp.do_filp_open
0.63 ± 5% +0.1 0.75 ± 6% perf-profile.children.cycles-pp.path_openat
0.06 -0.0 0.05 perf-profile.self.cycles-pp.file_close_fd_locked
0.16 ± 2% +0.0 0.18 ± 2% perf-profile.self.cycles-pp._raw_spin_lock
0.08 ± 12% +0.0 0.10 ± 4% perf-profile.self.cycles-pp.__fput
0.05 ± 7% +0.1 0.10 ± 4% perf-profile.self.cycles-pp.alloc_fd
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Wed, Jul 17, 2024 at 4:24 PM Yu Ma <yu.ma@intel.com> wrote:
>
> Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
> word contains next_fd, as:
> (1) next_fd indicates the lower bound for the first free fd.
> (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> searching.

this is stale -- now the fast path searches up to 64 fds in the lower bitmap

> (3) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
>
> This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
> Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
> versions.

I think this paragraph is droppable. You already got an ack from Jan
below, so stating he agrees with the patch is redundant. As for me I
don't think this warrants mentioning. Just remove it, perhaps
Christian will be willing to massage it by himself to avoid another
series posting.

> And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
> 8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.
>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
[...]
--
Mateusz Guzik <mjguzik gmail.com>
On 7/20/2024 1:53 AM, Mateusz Guzik wrote:
> On Wed, Jul 17, 2024 at 4:24 PM Yu Ma <yu.ma@intel.com> wrote:
>> Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
>> word contains next_fd, as:
>> (1) next_fd indicates the lower bound for the first free fd.
>> (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
>> searching.
> this is stale -- now the fast path searches up to 64 fds in the lower bitmap

Nope, this is still valid, as the searching size of the fast path inside
of find_next_fd() is always 64, it will execute the fast path inside of
find_next_zero_bit().

>> (3) After fdt is expanded (the bitmap size doubled for each time of expansion),
>> it would never be shrunk. The search size increases but there are few open fds
>> available here.
>>
>> This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
>> Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
>> versions.
> I think this paragraph is droppable. You already got an ack from Jan
> below, so stating he agrees with the patch is redundant. As for me I
> don't think this warrants mentioning. Just remove it, perhaps
> Christian will be willing to massage it by himself to avoid another
> series posting.

The idea of fast path for the word contains next_fd is from you,
although this patch is small, I think it is reasonable to record here
out of my respect. Appreciate for your guide and comments on this patch,
I've learned a lot on the way of resolving problems :)

Regards
Yu

[...]
I think this is getting too much fluff traffic at this point, which is
partially my fault. I'm buggering off. Overall the patchset looks good,
I don't see any technical reasons to avoid merging it.

On Sat, Jul 20, 2024 at 2:57 PM Ma, Yu <yu.ma@intel.com> wrote:
>
> On 7/20/2024 1:53 AM, Mateusz Guzik wrote:
> > this is stale -- now the fast path searches up to 64 fds in the lower bitmap
>
> Nope, this is still valid, as the searching size of the fast path inside
> of find_next_fd() is always 64, it will execute the fast path inside of
> find_next_zero_bit().
>
> The idea of fast path for the word contains next_fd is from you,
> although this patch is small, I think it is reasonable to record here
> out of my respect. Appreciate for your guide and comments on this patch,
> I've learned a lot on the way of resolving problems :)
>
> Regards
> Yu
[...]
--
Mateusz Guzik <mjguzik gmail.com>