pts/blogbench-1.1.0 is a benchmark designed to replicate the load of a
real-world busy file server by multiple threads of random reads, writes,
and rewrites. When running default configuration with multiple parallel
threads, hot spin lock contention is observed from alloc_fd(),
file_closed_fd() and put_unused_fd() around file_lock.

These 3 patches are created to reduce the critical section of file_lock
in alloc_fd() and close_fd(). As a result, pts/blogbench-1.1.0 has been
improved by 32% for read and 15% for write with over 30% kernel cycles
reduced on ICX 160 cores configuration with v6.8-rc6.

Yu Ma (3):
  fs/file.c: add fast path in alloc_fd()
  fs/file.c: conditionally clear full_fds
  fs/file.c: move sanity_check from alloc_fd() to put_unused_fd()

 fs/file.c | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

--
2.43.0
pts/blogbench-1.1.0 is a benchmark designed to replicate the load of a
real-world busy file server by multiple threads of random reads, writes,
and rewrites. When running default configuration with multiple parallel
threads, hot spin lock contention is observed from alloc_fd(),
file_closed_fd() and put_unused_fd() around file_lock.

These 3 patches are created to reduce the critical section of file_lock
in alloc_fd() and close_fd(). As a result, on top of patch 1,
pts/blogbench-1.1.0 has been improved by 22% for read and 8% for write
on Intel ICX 160 cores configuration with v6.10-rc7.

v4 -> v5:
Revised patch 3 for some code style issues.

v3 -> v4:
1. Rebased the patch set to v6.10-rc7 and updated performance results.
2. Updated patch 1 to revise the order of fd checking.
3. As proposed by Mateusz Guzik <mjguzik gmail.com>, and agreed by Jan Kara
   <jack@suse.cz> and Tim Chen <tim.c.chen@linux.intel.com>, updated patch 3
   with a more generic and scalable fast path for the word contains next_fd.

v2 -> v3:
1. Rebased the patch set to latest v6.10-rc6 and updated the performance results
2. Reordered the patches as suggested by Mateusz Guzik <mjguzik gmail.com>
3. Updated the fast path from alloc_fd() to find_next_fd() as suggested by
   Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>, it is
   efficient and more concise than v2.

v1 -> v2:
1. Rebased the patch set to latest v6.10-rc4 and updated the performance results
2. Fixed the bug identified by Mateusz Guzik in patch 1 with adding rlimit
   check for fast path
3. Updated patch 3 to remove sanity_check directly per the alignment with
   maintainer

Yu Ma (3):
  fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
  fs/file.c: conditionally clear full_fds
  fs/file.c: add fast path in find_next_fd()

 fs/file.c | 46 ++++++++++++++++++++++++++--------------------
 1 file changed, 26 insertions(+), 20 deletions(-)

--
2.43.0
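For context, the kind of workload that produces this contention can be
sketched in a few lines of user-space C (a hypothetical reproducer, not the
pts/blogbench code): many threads of one process allocating and releasing
fds in a tight loop, so every open()/close() has to take the shared
files_struct's file_lock.

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 16
#define ITERS	 100000

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		int fd = open("/dev/null", O_RDONLY);	/* fd allocation under file_lock */
		if (fd >= 0)
			close(fd);			/* fd release under file_lock */
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	puts("done");
	return 0;
}

Built with "gcc -O2 -pthread", this hammers the fd allocation and release
paths in the same way the benchmark's parallel threads do.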
On Wed, 17 Jul 2024 10:50:15 -0400, Yu Ma wrote:
> pts/blogbench-1.1.0 is a benchmark designed to replicate the
> load of a real-world busy file server by multiple threads of
> random reads, writes, and rewrites. When running default configuration
> with multiple parallel threads, hot spin lock contention is observed
> from alloc_fd(), file_closed_fd() and put_unused_fd() around file_lock.
>
> These 3 patches are created to reduce the critical section of file_lock
> in alloc_fd() and close_fd(). As a result, on top of patch 1,
> pts/blogbench-1.1.0 has been improved by 22% for read and 8% for write
> on Intel ICX 160 cores configuration with v6.10-rc7.
>
> [...]
Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc
[1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
https://git.kernel.org/vfs/vfs/c/19f4a3f5712a
[2/3] fs/file.c: conditionally clear full_fds
https://git.kernel.org/vfs/vfs/c/b483266658a8
[3/3] fs/file.c: add fast path in find_next_fd()
https://git.kernel.org/vfs/vfs/c/3603a42c8c03
On Mon, Jul 22, 2024 at 05:02:04PM +0200, Christian Brauner wrote:
> Applied to the vfs.misc branch of the vfs/vfs.git tree.
> Patches in the vfs.misc branch should appear in linux-next soon.
>
> Please report any outstanding bugs that were missed during review in a
> new review to the original patch series allowing us to drop it.
>
> It's encouraged to provide Acked-bys and Reviewed-bys even though the
> patch has now been applied. If possible patch trailers will be updated.
>
> Note that commit hashes shown below are subject to change due to rebase,
> trailer updates or similar. If in doubt, please check the listed branch.
>
> tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> branch: vfs.misc
>
> [1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
>       https://git.kernel.org/vfs/vfs/c/19f4a3f5712a
> [2/3] fs/file.c: conditionally clear full_fds
>       https://git.kernel.org/vfs/vfs/c/b483266658a8
> [3/3] fs/file.c: add fast path in find_next_fd()
>       https://git.kernel.org/vfs/vfs/c/3603a42c8c03

Hmm... Something fishy's going on - those are not reachable by any branches.

As the matter of fact, none of the branches contain _anything_ recent in
fs/file.c or include/linux/fdtable.h; the only thing I see for
include/linux/file.h is (dubious, IMO) DEFINE_FREE(fput,...) - this
IS_ERR_OR_NULL in there needs a review of potential users.

I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
and I'm going to apply those 3 unless anyone objects.
> Hmm... Something fishy's going on - those are not reachable by any branches.

Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
to play around with --onto.

> I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
> and I'm going to apply those 3 unless anyone objects.

Fine since they aren't in that branch. Otherwise I generally prefer to
just merge a common branch. But that's rarely needed since Linus handles
merge conflicts anyway and doesn't mind.
On Fri, Aug 02, 2024 at 01:04:44PM +0200, Christian Brauner wrote:
> > Hmm... Something fishy's going on - those are not reachable by any branches.
>
> Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
> to play around with --onto.
>
> > I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
> > and I'm going to apply those 3 unless anyone objects.
>
> Fine since they aren't in that branch. Otherwise I generally prefer to
> just merge a common branch.

If it's going to be rebased anyway, I don't see much difference from cherry-pick,
TBH...
On Fri, Aug 02, 2024 at 03:22:48PM GMT, Al Viro wrote:
> On Fri, Aug 02, 2024 at 01:04:44PM +0200, Christian Brauner wrote:
> > > Hmm... Something fishy's going on - those are not reachable by any branches.
> >
> > Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
> > to play around with --onto.
> >
> > > I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
> > > and I'm going to apply those 3 unless anyone objects.
> >
> > Fine since they aren't in that branch. Otherwise I generally prefer to
> > just merge a common branch.
>
> If it's going to be rebased anyway, I don't see much difference from cherry-pick,
> TBH...

Yeah, but I generally don't rebase after -rc1 anymore unles there's
really annoying conflicts.
On 8/5/2024 2:56 PM, Christian Brauner wrote:
> On Fri, Aug 02, 2024 at 03:22:48PM GMT, Al Viro wrote:
>> On Fri, Aug 02, 2024 at 01:04:44PM +0200, Christian Brauner wrote:
>>>> Hmm... Something fishy's going on - those are not reachable by any branches.
>>> Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
>>> to play around with --onto.
>>>
>>>> I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
>>>> and I'm going to apply those 3 unless anyone objects.
>>> Fine since they aren't in that branch. Otherwise I generally prefer to
>>> just merge a common branch.
>> If it's going to be rebased anyway, I don't see much difference from cherry-pick,
>> TBH...
> Yeah, but I generally don't rebase after -rc1 anymore unles there's
> really annoying conflicts.

Thanks Christian and Al for your time and efforts. I'm not familiar with the
merging process, may i know about when these patches could be seen in master ?

Regards
Yu
On Mon, Aug 12, 2024 at 09:31:17AM +0800, Ma, Yu wrote:
>
> On 8/5/2024 2:56 PM, Christian Brauner wrote:
> > On Fri, Aug 02, 2024 at 03:22:48PM GMT, Al Viro wrote:
> > > On Fri, Aug 02, 2024 at 01:04:44PM +0200, Christian Brauner wrote:
> > > > > Hmm... Something fishy's going on - those are not reachable by any branches.
> > > > Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
> > > > to play around with --onto.
> > > >
> > > > > I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
> > > > > and I'm going to apply those 3 unless anyone objects.
> > > > Fine since they aren't in that branch. Otherwise I generally prefer to
> > > > just merge a common branch.
> > > If it's going to be rebased anyway, I don't see much difference from cherry-pick,
> > > TBH...
> > Yeah, but I generally don't rebase after -rc1 anymore unles there's
> > really annoying conflicts.
>
> Thanks Christian and Al for your time and efforts. I'm not familiar with the
> merging process, may i know about when these patches could be seen in master

It's in work.fdtable in my tree, will post that series tonight and add to #for-next
On Mon 12-08-24 03:40:44, Al Viro wrote:
> On Mon, Aug 12, 2024 at 09:31:17AM +0800, Ma, Yu wrote:
> >
> > On 8/5/2024 2:56 PM, Christian Brauner wrote:
> > > On Fri, Aug 02, 2024 at 03:22:48PM GMT, Al Viro wrote:
> > > > On Fri, Aug 02, 2024 at 01:04:44PM +0200, Christian Brauner wrote:
> > > > > > Hmm... Something fishy's going on - those are not reachable by any branches.
> > > > > Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
> > > > > to play around with --onto.
> > > > >
> > > > > > I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
> > > > > > and I'm going to apply those 3 unless anyone objects.
> > > > > Fine since they aren't in that branch. Otherwise I generally prefer to
> > > > > just merge a common branch.
> > > > If it's going to be rebased anyway, I don't see much difference from cherry-pick,
> > > > TBH...
> > > Yeah, but I generally don't rebase after -rc1 anymore unles there's
> > > really annoying conflicts.
> >
> > Thanks Christian and Al for your time and efforts. I'm not familiar with the
> > merging process, may i know about when these patches could be seen in master
>
> It's in work.fdtable in my tree, will post that series tonight and add to #for-next

Al, it seems you didn't push the patches to Linus during the last merge
window. Do you plan to push them during the coming one?

Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Wed, Nov 06, 2024 at 06:44:23PM +0100, Jan Kara wrote:
> On Mon 12-08-24 03:40:44, Al Viro wrote:
> > On Mon, Aug 12, 2024 at 09:31:17AM +0800, Ma, Yu wrote:
> > >
> > > On 8/5/2024 2:56 PM, Christian Brauner wrote:
> > > > On Fri, Aug 02, 2024 at 03:22:48PM GMT, Al Viro wrote:
> > > > > On Fri, Aug 02, 2024 at 01:04:44PM +0200, Christian Brauner wrote:
> > > > > > > Hmm... Something fishy's going on - those are not reachable by any branches.
> > > > > > Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
> > > > > > to play around with --onto.
> > > > > >
> > > > > > > I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
> > > > > > > and I'm going to apply those 3 unless anyone objects.
> > > > > > Fine since they aren't in that branch. Otherwise I generally prefer to
> > > > > > just merge a common branch.
> > > > > If it's going to be rebased anyway, I don't see much difference from cherry-pick,
> > > > > TBH...
> > > > Yeah, but I generally don't rebase after -rc1 anymore unles there's
> > > > really annoying conflicts.
> > >
> > > Thanks Christian and Al for your time and efforts. I'm not familiar with the
> > > merging process, may i know about when these patches could be seen in master
> >
> > It's in work.fdtable in my tree, will post that series tonight and add to #for-next
>
> Al, it seems you didn't push the patches to Linus during the last merge
> window. Do you plan to push them during the coming one?

Yes, I do.

I can merge that branch into viro/vfs.git#for-next, but that would be
redundant - note that vfs/vfs.git#work.fdtable is identical to it (same
sha1 of head) and vfs/vfs.git#vfs.file contains a merge from it and
vfs/vfs.git#vfs.all merges #vfs.file. So it's already in linux-next in
the current form and until something else gets added to the branch I see
no point.
On 8/12/2024 10:40 AM, Al Viro wrote:
> On Mon, Aug 12, 2024 at 09:31:17AM +0800, Ma, Yu wrote:
>> On 8/5/2024 2:56 PM, Christian Brauner wrote:
>>> On Fri, Aug 02, 2024 at 03:22:48PM GMT, Al Viro wrote:
>>>> On Fri, Aug 02, 2024 at 01:04:44PM +0200, Christian Brauner wrote:
>>>>>> Hmm... Something fishy's going on - those are not reachable by any branches.
>>>>> Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
>>>>> to play around with --onto.
>>>>>
>>>>>> I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
>>>>>> and I'm going to apply those 3 unless anyone objects.
>>>>> Fine since they aren't in that branch. Otherwise I generally prefer to
>>>>> just merge a common branch.
>>>> If it's going to be rebased anyway, I don't see much difference from cherry-pick,
>>>> TBH...
>>> Yeah, but I generally don't rebase after -rc1 anymore unles there's
>>> really annoying conflicts.
>> Thanks Christian and Al for your time and efforts. I'm not familiar with the
>> merging process, may i know about when these patches could be seen in master
> It's in work.fdtable in my tree, will post that series tonight and add to #for-next

Thanks Al for confirmation :)
alloc_fd() has a sanity check inside to make sure the struct file mapped to
the allocated fd is NULL. Remove this sanity check since it is already assured
by the existing zero initialization and by the NULL set when an fd is recycled.
Meanwhile, add likely/unlikely annotations and skip the expand_files() call
when no expansion is needed, to reduce the work done under file_lock.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 33 ++++++++++++++-------------------
1 file changed, 14 insertions(+), 19 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index a3b72aa64f11..e1b9d6df7941 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -515,7 +515,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
if (fd < files->next_fd)
fd = files->next_fd;
- if (fd < fdt->max_fds)
+ if (likely(fd < fdt->max_fds))
fd = find_next_fd(fdt, fd);
/*
@@ -523,19 +523,21 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
* will limit the total number of files that can be opened.
*/
error = -EMFILE;
- if (fd >= end)
+ if (unlikely(fd >= end))
goto out;
- error = expand_files(files, fd);
- if (error < 0)
- goto out;
+ if (unlikely(fd >= fdt->max_fds)) {
+ error = expand_files(files, fd);
+ if (error < 0)
+ goto out;
- /*
- * If we needed to expand the fs array we
- * might have blocked - try again.
- */
- if (error)
- goto repeat;
+ /*
+ * If we needed to expand the fs array we
+ * might have blocked - try again.
+ */
+ if (error)
+ goto repeat;
+ }
if (start <= files->next_fd)
files->next_fd = fd + 1;
@@ -546,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
else
__clear_close_on_exec(fd, fdt);
error = fd;
-#if 1
- /* Sanity check */
- if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
- printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
- rcu_assign_pointer(fdt->fd[fd], NULL);
- }
-#endif
out:
spin_unlock(&files->file_lock);
@@ -618,7 +613,7 @@ void fd_install(unsigned int fd, struct file *file)
rcu_read_unlock_sched();
spin_lock(&files->file_lock);
fdt = files_fdtable(files);
- BUG_ON(fdt->fd[fd] != NULL);
+ WARN_ON(fdt->fd[fd] != NULL);
rcu_assign_pointer(fdt->fd[fd], file);
spin_unlock(&files->file_lock);
return;
--
2.43.0
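To illustrate why the check is removable, here is a minimal user-space sketch
(hypothetical names, not the kernel code) of the invariant the commit message
relies on: a slot is reset to NULL before its bit is cleared, and new slots
start out zeroed, so whatever free bit the allocator picks must map to a NULL
pointer.

#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TABLE_SIZE 64

struct table {
	void *slot[TABLE_SIZE];	/* zero-initialized, like a fresh fd table */
	unsigned long open;	/* bit i set => slot i is in use */
};

static int alloc_slot(struct table *t, void *obj)
{
	for (int i = 0; i < TABLE_SIZE; i++) {
		if (!(t->open & (1UL << i))) {
			assert(t->slot[i] == NULL);	/* the invariant the removed check verified */
			t->open |= 1UL << i;
			t->slot[i] = obj;
			return i;
		}
	}
	return -1;
}

static void free_slot(struct table *t, int i)
{
	t->slot[i] = NULL;		/* "NULL set when recycling fd" */
	t->open &= ~(1UL << i);		/* only then is the slot marked free again */
}

int main(void)
{
	struct table t;
	int a, b;

	memset(&t, 0, sizeof(t));	/* "existing zero initialization" */
	a = alloc_slot(&t, (void *)0x1);
	free_slot(&t, a);
	b = alloc_slot(&t, (void *)0x2);
	assert(a == b && t.slot[b] == (void *)0x2);
	return 0;
}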
On Wed, Jul 17, 2024 at 10:50:16AM -0400, Yu Ma wrote:
> alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> allocated fd is NULL. Remove this sanity check since it can be assured by
> exisitng zero initilization and NULL set when recycling fd. Meanwhile, add
> likely/unlikely and expand_file() call avoidance to reduce the work under
> file_lock.
> + if (unlikely(fd >= fdt->max_fds)) {
> + error = expand_files(files, fd);
> + if (error < 0)
> + goto out;
>
> - /*
> - * If we needed to expand the fs array we
> - * might have blocked - try again.
> - */
> - if (error)
> - goto repeat;
> + /*
> + * If we needed to expand the fs array we
> + * might have blocked - try again.
> + */
> + if (error)
> + goto repeat;
With that change you can't get 0 from expand_files() here, so the
last goto should be unconditional. The only case when expand_files()
returns 0 is when it has found the descriptor already being covered
by fdt; since fdt->max_fds is stabilized by ->files_lock we are
holding here, comparison in expand_files() will give the same
result as it just had.
IOW, that goto repeat should be unconditional. The fun part here is
that this was the only caller that distinguished between 0 and 1...
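Spelled out, the tail of that branch would then collapse to something like the
following sketch (not necessarily the exact code that ended up in
work.fdtable):

	if (unlikely(fd >= fdt->max_fds)) {
		error = expand_files(files, fd);
		if (error < 0)
			goto out;
		/*
		 * We had to expand the fd array and might have
		 * blocked - try again.
		 */
		goto repeat;
	}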
On 8/15/2024 5:38 AM, Al Viro wrote:
> On Wed, Jul 17, 2024 at 10:50:16AM -0400, Yu Ma wrote:
>> alloc_fd() has a sanity check inside to make sure the struct file mapping to the
>> allocated fd is NULL. Remove this sanity check since it can be assured by
>> exisitng zero initilization and NULL set when recycling fd. Meanwhile, add
>> likely/unlikely and expand_file() call avoidance to reduce the work under
>> file_lock.
>> + if (unlikely(fd >= fdt->max_fds)) {
>> + error = expand_files(files, fd);
>> + if (error < 0)
>> + goto out;
>>
>> - /*
>> - * If we needed to expand the fs array we
>> - * might have blocked - try again.
>> - */
>> - if (error)
>> - goto repeat;
>> + /*
>> + * If we needed to expand the fs array we
>> + * might have blocked - try again.
>> + */
>> + if (error)
>> + goto repeat;
> With that change you can't get 0 from expand_files() here, so the
> last goto should be unconditional. The only case when expand_files()
> returns 0 is when it has found the descriptor already being covered
> by fdt; since fdt->max_fds is stabilized by ->files_lock we are
> holding here, comparison in expand_files() will give the same
> result as it just had.
>
> IOW, that goto repeat should be unconditional. The fun part here is
> that this was the only caller that distinguished between 0 and 1...
Yes, thanks Al, fully agree with you. The if (error) could be removed
here as the above unlikely would make sure no 0 return here. Should I
submit another version of patch set to update it, or you may help to
update it directly during merge? I'm not very familiar with the rules,
please let me know if I'd update and I'll take action soon. Thanks again
for your careful check here.
On Thu, Aug 15, 2024 at 10:49:40AM +0800, Ma, Yu wrote:

> Yes, thanks Al, fully agree with you. The if (error) could be removed here
> as the above unlikely would make sure no 0 return here. Should I submit
> another version of patch set to update it, or you may help to update it
> directly during merge? I'm not very familiar with the rules, please let me
> know if I'd update and I'll take action soon. Thanks again for your careful
> check here.

See the current work.fdtable...
On Thu, Aug 15, 2024 at 5:45 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Thu, Aug 15, 2024 at 10:49:40AM +0800, Ma, Yu wrote:
>
> > Yes, thanks Al, fully agree with you. The if (error) could be removed here
> > as the above unlikely would make sure no 0 return here. Should I submit
> > another version of patch set to update it, or you may help to update it
> > directly during merge? I'm not very familiar with the rules, please let me
> > know if I'd update and I'll take action soon. Thanks again for your careful
> > check here.
>
> See the current work.fdtable...

this patchset seems to have fallen through the cracks. anything
holding up the merge?

--
Mateusz Guzik <mjguzik gmail.com>
On Thu, Oct 31, 2024 at 08:42:18AM +0100, Mateusz Guzik wrote:
> On Thu, Aug 15, 2024 at 5:45 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Thu, Aug 15, 2024 at 10:49:40AM +0800, Ma, Yu wrote:
> >
> > > Yes, thanks Al, fully agree with you. The if (error) could be removed here
> > > as the above unlikely would make sure no 0 return here. Should I submit
> > > another version of patch set to update it, or you may help to update it
> > > directly during merge? I'm not very familiar with the rules, please let me
> > > know if I'd update and I'll take action soon. Thanks again for your careful
> > > check here.
> >
> > See the current work.fdtable...
>
> this patchset seems to have fallen through the cracks. anything
> holding up the merge?

It's in next for this merge window.
On 8/15/2024 11:45 AM, Al Viro wrote:
> On Thu, Aug 15, 2024 at 10:49:40AM +0800, Ma, Yu wrote:
>
>> Yes, thanks Al, fully agree with you. The if (error) could be removed here
>> as the above unlikely would make sure no 0 return here. Should I submit
>> another version of patch set to update it, or you may help to update it
>> directly during merge? I'm not very familiar with the rules, please let me
>> know if I'd update and I'll take action soon. Thanks again for your careful
>> check here.
> See the current work.fdtable...

Saw it, thanks :)
Hello,
kernel test robot noticed a 1.2% improvement of will-it-scale.per_process_ops on:
commit: f1139c8e66d5c618aad04a93a2378ad9586464f9 ("[PATCH v5 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()")
url: https://github.com/intel-lab-lkp/linux/commits/Yu-Ma/fs-file-c-remove-sanity_check-and-add-likely-unlikely-in-alloc_fd/20240717-224830
base: https://git.kernel.org/cgit/linux/kernel/git/vfs/vfs.git vfs.all
patch link: https://lore.kernel.org/all/20240717145018.3972922-2-yu.ma@intel.com/
patch subject: [PATCH v5 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
testcase: will-it-scale
test machine: 256 threads 2 sockets GENUINE INTEL(R) XEON(R) (Sierra Forest) with 128G memory
parameters:
nr_task: 100%
mode: process
test: dup1
cpufreq_governor: performance
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240806/202408062146.832faa23-oliver.sang@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-13/performance/x86_64-rhel-8.3/process/100%/debian-12-x86_64-20240206.cgz/lkp-srf-2sp1/dup1/will-it-scale
commit:
5f30e082ab ("Merge branch 'vfs.iomap' into vfs.all")
f1139c8e66 ("fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()")
5f30e082ab8b3431 f1139c8e66d5c618aad04a93a23
---------------- ---------------------------
%stddev %change %stddev
\ | \
377983 ± 69% +74.1% 658036 ± 17% numa-meminfo.node0.AnonPages
18.17 ± 10% -48.6% 9.33 ± 35% perf-c2c.DRAM.local
8.796e+08 +1.2% 8.903e+08 will-it-scale.256.processes
3436082 +1.2% 3477810 will-it-scale.per_process_ops
8.796e+08 +1.2% 8.903e+08 will-it-scale.workload
1.517e+11 -4.3% 1.452e+11 perf-stat.i.branch-instructions
0.03 ± 8% +0.0 0.04 ± 36% perf-stat.i.branch-miss-rate%
0.93 +3.9% 0.96 perf-stat.i.cpi
7.13e+11 -3.5% 6.88e+11 perf-stat.i.instructions
1.08 -3.4% 1.04 perf-stat.i.ipc
0.93 +3.4% 0.96 perf-stat.overall.cpi
1.08 -3.3% 1.04 perf-stat.overall.ipc
245130 -4.4% 234451 perf-stat.overall.path-length
1.512e+11 -4.3% 1.447e+11 perf-stat.ps.branch-instructions
7.106e+11 -3.5% 6.857e+11 perf-stat.ps.instructions
2.156e+14 -3.2% 2.087e+14 perf-stat.total.instructions
14.90 -0.7 14.20 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.dup
12.01 -0.7 11.32 perf-profile.calltrace.cycles-pp.__x64_sys_dup.do_syscall_64.entry_SYSCALL_64_after_hwframe.dup
16.54 -0.7 15.88 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.dup
6.44 -0.6 5.89 perf-profile.calltrace.cycles-pp.alloc_fd.__x64_sys_dup.do_syscall_64.entry_SYSCALL_64_after_hwframe.dup
2.86 -0.0 2.82 perf-profile.calltrace.cycles-pp.entry_SYSRETQ_unsafe_stack.__close
8.94 -0.0 8.90 perf-profile.calltrace.cycles-pp.filp_flush.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
7.76 -0.0 7.72 perf-profile.calltrace.cycles-pp.locks_remove_posix.filp_flush.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.58 -0.0 2.54 perf-profile.calltrace.cycles-pp.__fput_sync.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
1.11 -0.0 1.10 perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
1.33 +0.0 1.35 perf-profile.calltrace.cycles-pp.testcase
0.54 +0.0 0.56 perf-profile.calltrace.cycles-pp.x64_sys_call.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
0.79 +0.0 0.82 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_safe_stack.dup
1.33 +0.0 1.37 perf-profile.calltrace.cycles-pp.close@plt
2.73 +0.1 2.78 perf-profile.calltrace.cycles-pp._raw_spin_lock.file_close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.05 +0.1 1.11 perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.dup
4.35 +0.1 4.42 perf-profile.calltrace.cycles-pp.file_close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
22.18 +0.3 22.51 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.__close
21.50 ± 2% +1.5 23.02 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.dup
12.10 -0.7 11.39 perf-profile.children.cycles-pp.__x64_sys_dup
34.79 -0.7 34.12 perf-profile.children.cycles-pp.do_syscall_64
38.04 -0.6 37.42 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
6.48 -0.6 5.90 perf-profile.children.cycles-pp.alloc_fd
1.86 -0.5 1.41 perf-profile.children.cycles-pp.syscall_return_via_sysret
0.57 -0.1 0.47 perf-profile.children.cycles-pp.fd_install
9.11 -0.0 9.07 perf-profile.children.cycles-pp.filp_flush
7.93 -0.0 7.89 perf-profile.children.cycles-pp.locks_remove_posix
2.61 -0.0 2.58 perf-profile.children.cycles-pp.__fput_sync
1.16 +0.0 1.18 perf-profile.children.cycles-pp.x64_sys_call
0.05 +0.0 0.07 ± 13% perf-profile.children.cycles-pp.clockevents_program_event
0.51 +0.0 0.53 perf-profile.children.cycles-pp.syscall_exit_to_user_mode_prepare
2.17 +0.0 2.20 perf-profile.children.cycles-pp.syscall_exit_to_user_mode
5.72 +0.0 5.75 perf-profile.children.cycles-pp._raw_spin_lock
2.10 +0.0 2.13 perf-profile.children.cycles-pp.testcase
2.02 +0.0 2.06 perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
0.13 ± 2% +0.0 0.17 perf-profile.children.cycles-pp.dup@plt
4.38 +0.1 4.46 perf-profile.children.cycles-pp.file_close_fd
23.00 +0.1 23.11 perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
59.27 +0.5 59.73 perf-profile.children.cycles-pp.__close
28.73 +1.1 29.80 perf-profile.children.cycles-pp.entry_SYSCALL_64
1.86 -0.5 1.41 perf-profile.self.cycles-pp.syscall_return_via_sysret
2.28 -0.2 2.12 perf-profile.self.cycles-pp.alloc_fd
0.54 -0.1 0.43 perf-profile.self.cycles-pp.fd_install
7.87 -0.0 7.83 perf-profile.self.cycles-pp.locks_remove_posix
2.47 -0.0 2.44 perf-profile.self.cycles-pp.__fput_sync
1.23 +0.0 1.24 perf-profile.self.cycles-pp.file_close_fd_locked
1.09 +0.0 1.11 perf-profile.self.cycles-pp.x64_sys_call
0.51 +0.0 0.53 perf-profile.self.cycles-pp.syscall_exit_to_user_mode_prepare
1.29 +0.0 1.32 perf-profile.self.cycles-pp.testcase
5.66 +0.0 5.69 perf-profile.self.cycles-pp._raw_spin_lock
1.95 +0.0 1.99 perf-profile.self.cycles-pp.entry_SYSCALL_64_safe_stack
2.85 +0.0 2.90 perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.02 ±141% +0.0 0.06 ± 13% perf-profile.self.cycles-pp.ktime_get
0.00 +0.1 0.07 perf-profile.self.cycles-pp.dup@plt
22.93 +0.1 23.05 perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
10.11 +0.2 10.34 perf-profile.self.cycles-pp.dup
13.70 +0.3 13.98 perf-profile.self.cycles-pp.entry_SYSCALL_64
9.84 ± 3% +0.7 10.51 perf-profile.self.cycles-pp.__close
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
64 bits in open_fds are mapped to a common bit in full_fds_bits. It is very
likely that the corresponding bit in full_fds_bits has already been cleared
by a previous __clear_open_fd() operation. Check whether the bit in
full_fds_bits is set before clearing it, to avoid an unnecessary write and
the resulting cache bouncing. See commit fc90888d07b8 ("vfs: conditionally
clear close-on-exec flag") for a similar optimization.

Taking the stock kernel with patch 1 as baseline, this improves
pts/blogbench-1.1.0 read by 13% and write by 5% on Intel ICX 160 cores
configuration with v6.10-rc7.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/file.c b/fs/file.c
index e1b9d6df7941..1be2a5bcc7c4 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -268,7 +268,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
{
__clear_bit(fd, fdt->open_fds);
- __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
+ fd /= BITS_PER_LONG;
+ if (test_bit(fd, fdt->full_fds_bits))
+ __clear_bit(fd, fdt->full_fds_bits);
}
static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
--
2.43.0
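As an aside, the relationship being exploited can be sketched in user-space C
(hypothetical names and sizes, not the kernel implementation): each word of
open_fds is summarized by one bit of full_fds_bits, and the summary bit is
only worth touching when it is actually set.

#include <stdbool.h>
#include <stdio.h>

#define BITS_PER_LONG 64

static unsigned long open_fds[4];	/* covers fds 0..255 */
static unsigned long full_fds_bits[1];	/* one summary bit per open_fds word */

static bool test_bit(unsigned int nr, const unsigned long *map)
{
	return map[nr / BITS_PER_LONG] & (1UL << (nr % BITS_PER_LONG));
}

static void clear_bit(unsigned int nr, unsigned long *map)
{
	map[nr / BITS_PER_LONG] &= ~(1UL << (nr % BITS_PER_LONG));
}

static void clear_open_fd(unsigned int fd)
{
	clear_bit(fd, open_fds);
	fd /= BITS_PER_LONG;
	/*
	 * The summary bit is only set when all 64 fds of the word were in
	 * use; skipping the write in the common case avoids dirtying (and
	 * bouncing) the cache line holding full_fds_bits.
	 */
	if (test_bit(fd, full_fds_bits))
		clear_bit(fd, full_fds_bits);
}

int main(void)
{
	full_fds_bits[0] = 1UL << 2;	/* pretend fds 128..191 are all open */
	open_fds[2] = ~0UL;

	clear_open_fd(130);		/* fd 130 lives in open_fds word 2 */
	printf("full_fds_bits[0] = %#lx\n", full_fds_bits[0]);	/* now 0 */
	return 0;
}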
Skip the 2-level search via find_next_zero_bit() when there is a free slot in
the word that contains next_fd, as:
(1) next_fd indicates the lower bound for the first free fd.
(2) There is a fast path inside find_next_zero_bit() when size <= 64 to speed
up the search.
(3) After fdt is expanded (the bitmap size is doubled on each expansion), it
is never shrunk. The search size increases, but there are few open fds
available here.

This fast path was proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed
to by Jan Kara <jack@suse.cz>, as it is more generic and scalable than
previous versions. On top of patches 1 and 2, it improves pts/blogbench-1.1.0
read by 8% and write by 4% on Intel ICX 160 cores configuration with
v6.10-rc7.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/fs/file.c b/fs/file.c
index 1be2a5bcc7c4..729c07a4fc28 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -491,6 +491,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
unsigned int maxbit = maxfd / BITS_PER_LONG;
unsigned int bitbit = start / BITS_PER_LONG;
+ unsigned int bit;
+
+ /*
+ * Try to avoid looking at the second level bitmap
+ */
+ bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
+ start & (BITS_PER_LONG - 1));
+ if (bit < BITS_PER_LONG)
+ return bit + bitbit * BITS_PER_LONG;
bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
if (bitbit >= maxfd)
--
2.43.0
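To make the index arithmetic concrete, a small user-space sketch (hypothetical
values, with a naive single-word stand-in for find_next_zero_bit()) of what
the fast path computes:

#include <stdio.h>

#define BITS_PER_LONG 64

/* naive single-word stand-in for the kernel's find_next_zero_bit() */
static unsigned int find_zero_bit_in_word(unsigned long word, unsigned int offset)
{
	for (unsigned int bit = offset; bit < BITS_PER_LONG; bit++)
		if (!(word & (1UL << bit)))
			return bit;
	return BITS_PER_LONG;		/* no free bit in this word */
}

int main(void)
{
	/* words 0 and 1 full (fds 0..127), word 2 has fds 128..130 in use */
	unsigned long open_fds[4] = { ~0UL, ~0UL, 0x7UL, 0 };
	unsigned int start = 131;	/* plays the role of next_fd */
	unsigned int bitbit = start / BITS_PER_LONG;	/* word index: 2 */
	unsigned int bit;

	bit = find_zero_bit_in_word(open_fds[bitbit], start & (BITS_PER_LONG - 1));
	if (bit < BITS_PER_LONG)
		printf("first free fd: %u\n", bit + bitbit * BITS_PER_LONG);	/* 131 */
	return 0;
}

Only the single word holding next_fd is scanned; the second-level
full_fds_bits lookup is needed only when that word turns out to be full.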
Hello,
kernel test robot noticed a 6.3% improvement of will-it-scale.per_thread_ops on:
commit: b8decf0015a8b1ff02cdac61c0aa54355d8e73d7 ("[PATCH v5 3/3] fs/file.c: add fast path in find_next_fd()")
url: https://github.com/intel-lab-lkp/linux/commits/Yu-Ma/fs-file-c-remove-sanity_check-and-add-likely-unlikely-in-alloc_fd/20240717-224830
base: https://git.kernel.org/cgit/linux/kernel/git/vfs/vfs.git vfs.all
patch link: https://lore.kernel.org/all/20240717145018.3972922-4-yu.ma@intel.com/
patch subject: [PATCH v5 3/3] fs/file.c: add fast path in find_next_fd()
testcase: will-it-scale
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory
parameters:
nr_task: 100%
mode: thread
test: open3
cpufreq_governor: performance
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240806/202408062152.7e5b5d6d-oliver.sang@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-13/performance/x86_64-rhel-8.3/thread/100%/debian-12-x86_64-20240206.cgz/lkp-spr-2sp4/open3/will-it-scale
commit:
5bb3423bf9 ("fs/file.c: conditionally clear full_fds")
b8decf0015 ("fs/file.c: add fast path in find_next_fd()")
5bb3423bf9f9d91e b8decf0015a8b1ff02cdac61c0a
---------------- ---------------------------
%stddev %change %stddev
\ | \
848151 +6.2% 901119 ± 2% will-it-scale.224.threads
3785 +6.3% 4022 ± 2% will-it-scale.per_thread_ops
848151 +6.2% 901119 ± 2% will-it-scale.workload
0.28 ± 4% +13.3% 0.32 ± 3% perf-stat.i.MPKI
31.31 ± 3% +2.0 33.28 perf-stat.i.cache-miss-rate%
14955855 ± 4% +13.6% 16995785 ± 4% perf-stat.i.cache-misses
49676581 +6.7% 53009444 ± 3% perf-stat.i.cache-references
43955 ± 4% -12.3% 38549 ± 4% perf-stat.i.cycles-between-cache-misses
0.28 ± 4% +13.4% 0.32 ± 4% perf-stat.overall.MPKI
29.84 ± 3% +1.9 31.78 ± 2% perf-stat.overall.cache-miss-rate%
43445 ± 4% -12.1% 38200 ± 4% perf-stat.overall.cycles-between-cache-misses
19005976 -5.4% 17972604 ± 2% perf-stat.overall.path-length
14869677 ± 4% +13.6% 16898438 ± 4% perf-stat.ps.cache-misses
49821402 +6.7% 53168235 ± 3% perf-stat.ps.cache-references
49.42 -0.1 49.34 perf-profile.calltrace.cycles-pp.alloc_fd.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
49.40 -0.1 49.32 perf-profile.calltrace.cycles-pp.file_close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
49.25 -0.1 49.18 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.file_close_fd.__x64_sys_close.do_syscall_64
49.20 -0.1 49.13 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_fd.do_sys_openat2.__x64_sys_openat
49.33 -0.1 49.26 perf-profile.calltrace.cycles-pp._raw_spin_lock.file_close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
49.28 -0.1 49.22 perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_fd.do_sys_openat2.__x64_sys_openat.do_syscall_64
50.14 +0.0 50.18 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
50.17 +0.0 50.21 perf-profile.calltrace.cycles-pp.open64
0.64 ± 5% +0.1 0.75 ± 6% perf-profile.calltrace.cycles-pp.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.62 ± 5% +0.1 0.74 ± 6% perf-profile.calltrace.cycles-pp.path_openat.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64
49.42 -0.1 49.34 perf-profile.children.cycles-pp.alloc_fd
49.40 -0.1 49.32 perf-profile.children.cycles-pp.file_close_fd
0.06 -0.0 0.05 perf-profile.children.cycles-pp.file_close_fd_locked
0.15 ± 5% +0.0 0.17 ± 4% perf-profile.children.cycles-pp.init_file
0.22 ± 3% +0.0 0.25 ± 3% perf-profile.children.cycles-pp.alloc_empty_file
0.18 ± 6% +0.0 0.22 ± 6% perf-profile.children.cycles-pp.__fput
50.14 +0.0 50.18 perf-profile.children.cycles-pp.__x64_sys_openat
50.18 +0.0 50.22 perf-profile.children.cycles-pp.open64
0.18 ± 14% +0.0 0.23 ± 7% perf-profile.children.cycles-pp.do_dentry_open
0.30 ± 8% +0.1 0.36 ± 8% perf-profile.children.cycles-pp.do_open
0.64 ± 5% +0.1 0.75 ± 6% perf-profile.children.cycles-pp.do_filp_open
0.63 ± 5% +0.1 0.75 ± 6% perf-profile.children.cycles-pp.path_openat
0.06 -0.0 0.05 perf-profile.self.cycles-pp.file_close_fd_locked
0.16 ± 2% +0.0 0.18 ± 2% perf-profile.self.cycles-pp._raw_spin_lock
0.08 ± 12% +0.0 0.10 ± 4% perf-profile.self.cycles-pp.__fput
0.05 ± 7% +0.1 0.10 ± 4% perf-profile.self.cycles-pp.alloc_fd
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Wed, Jul 17, 2024 at 4:24 PM Yu Ma <yu.ma@intel.com> wrote:
>
> Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
> word contains next_fd, as:
> (1) next_fd indicates the lower bound for the first free fd.
> (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> searching.

this is stale -- now the fast path searches up to 64 fds in the lower bitmap

> (3) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
>
> This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
> Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
> versions.

I think this paragraph is droppable. You already got an ack from Jan
below, so stating he agrees with the patch is redundant. As for me I
don't think this warrants mentioning. Just remove it, perhaps
Christian will be willing to massage it by himself to avoid another
series posting.

> And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
> 8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.
>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
> fs/file.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/fs/file.c b/fs/file.c
> index 1be2a5bcc7c4..729c07a4fc28 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -491,6 +491,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> unsigned int maxbit = maxfd / BITS_PER_LONG;
> unsigned int bitbit = start / BITS_PER_LONG;
> + unsigned int bit;
> +
> + /*
> + * Try to avoid looking at the second level bitmap
> + */
> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> + start & (BITS_PER_LONG - 1));
> + if (bit < BITS_PER_LONG)
> + return bit + bitbit * BITS_PER_LONG;
>
> bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> if (bitbit >= maxfd)
> --
> 2.43.0
>

--
Mateusz Guzik <mjguzik gmail.com>
On 7/20/2024 1:53 AM, Mateusz Guzik wrote:
> On Wed, Jul 17, 2024 at 4:24 PM Yu Ma <yu.ma@intel.com> wrote:
>> Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
>> word contains next_fd, as:
>> (1) next_fd indicates the lower bound for the first free fd.
>> (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
>> searching.
> this is stale -- now the fast path searches up to 64 fds in the lower bitmap

Nope, this is still valid, as the searching size of the fast path inside
of find_next_fd() is always 64, it will execute the fast path inside of
find_next_zero_bit().

>
>> (3) After fdt is expanded (the bitmap size doubled for each time of expansion),
>> it would never be shrunk. The search size increases but there are few open fds
>> available here.
>>
>> This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
>> Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
>> versions.
> I think this paragraph is droppable. You already got an ack from Jan
> below, so stating he agrees with the patch is redundant. As for me I
> don't think this warrants mentioning. Just remove it, perhaps
> Christian will be willing to massage it by himself to avoid another
> series posting.

The idea of fast path for the word contains next_fd is from you,
although this patch is small, I think it is reasonable to record here
out of my respect. Appreciate for your guide and comments on this patch,
I've learned a lot on the way of resolving problems :)

Regards

Yu

>> And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
>> 8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.
>>
>> Reviewed-by: Jan Kara <jack@suse.cz>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>> ---
>> fs/file.c | 9 +++++++++
>> 1 file changed, 9 insertions(+)
>>
>> diff --git a/fs/file.c b/fs/file.c
>> index 1be2a5bcc7c4..729c07a4fc28 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -491,6 +491,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>> unsigned int bitbit = start / BITS_PER_LONG;
>> + unsigned int bit;
>> +
>> + /*
>> + * Try to avoid looking at the second level bitmap
>> + */
>> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
>> + start & (BITS_PER_LONG - 1));
>> + if (bit < BITS_PER_LONG)
>> + return bit + bitbit * BITS_PER_LONG;
>>
>> bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
>> if (bitbit >= maxfd)
>> --
>> 2.43.0
>>
>
I think this is getting too much fluff traffic at this point, which is
partially my fault. I'm buggering off.

Overall the patchset looks good, I don't see any technical reasons to
avoid merging it.

On Sat, Jul 20, 2024 at 2:57 PM Ma, Yu <yu.ma@intel.com> wrote:
>
>
> On 7/20/2024 1:53 AM, Mateusz Guzik wrote:
> > On Wed, Jul 17, 2024 at 4:24 PM Yu Ma <yu.ma@intel.com> wrote:
> >> Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
> >> word contains next_fd, as:
> >> (1) next_fd indicates the lower bound for the first free fd.
> >> (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> >> searching.
> > this is stale -- now the fast path searches up to 64 fds in the lower bitmap
>
> Nope, this is still valid, as the searching size of the fast path inside
> of find_next_fd() is always 64, it will execute the fast path inside of
> find_next_zero_bit().
>
> >
> >> (3) After fdt is expanded (the bitmap size doubled for each time of expansion),
> >> it would never be shrunk. The search size increases but there are few open fds
> >> available here.
> >>
> >> This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
> >> Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
> >> versions.
> > I think this paragraph is droppable. You already got an ack from Jan
> > below, so stating he agrees with the patch is redundant. As for me I
> > don't think this warrants mentioning. Just remove it, perhaps
> > Christian will be willing to massage it by himself to avoid another
> > series posting.
>
> The idea of fast path for the word contains next_fd is from you,
> although this patch is small, I think it is reasonable to record here
> out of my respect. Appreciate for your guide and comments on this patch,
> I've learned a lot on the way of resolving problems :)
>
> Regards
>
> Yu
>
> >> And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
> >> 8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.
> >>
> >> Reviewed-by: Jan Kara <jack@suse.cz>
> >> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> >> Signed-off-by: Yu Ma <yu.ma@intel.com>
> >> ---
> >> fs/file.c | 9 +++++++++
> >> 1 file changed, 9 insertions(+)
> >>
> >> diff --git a/fs/file.c b/fs/file.c
> >> index 1be2a5bcc7c4..729c07a4fc28 100644
> >> --- a/fs/file.c
> >> +++ b/fs/file.c
> >> @@ -491,6 +491,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> >> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> >> unsigned int maxbit = maxfd / BITS_PER_LONG;
> >> unsigned int bitbit = start / BITS_PER_LONG;
> >> + unsigned int bit;
> >> +
> >> + /*
> >> + * Try to avoid looking at the second level bitmap
> >> + */
> >> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> >> + start & (BITS_PER_LONG - 1));
> >> + if (bit < BITS_PER_LONG)
> >> + return bit + bitbit * BITS_PER_LONG;
> >>
> >> bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> >> if (bitbit >= maxfd)
> >> --
> >> 2.43.0
> >>
> >

--
Mateusz Guzik <mjguzik gmail.com>
pts/blogbench-1.1.0 is a benchmark designed to replicate the load of a
real-world busy file server by multiple threads of random reads, writes,
and rewrites. When running default configuration with multiple parallel
threads, hot spin lock contention is observed from alloc_fd(),
file_closed_fd() and put_unused_fd() around file_lock.

These 3 patches are created to reduce the critical section of file_lock
in alloc_fd() and close_fd(). As a result, on top of patch 1,
pts/blogbench-1.1.0 has been improved by 22% for read and 8% for write
on Intel ICX 160 cores configuration with v6.10-rc7.

v3 -> v4:
1. Rebased the patch set to v6.10-rc7 and updated performance results.
2. Updated patch 1 to revise the order of fd checking.
3. As proposed by Mateusz Guzik <mjguzik gmail.com>, and agreed by Jan Kara
   <jack@suse.cz> and Tim Chen <tim.c.chen@linux.intel.com>, updated patch 3
   with a more generic and scalable fast path for the word contains next_fd.

v2 -> v3:
1. Rebased the patch set to latest v6.10-rc6 and updated the performance results
2. Reordered the patches as suggested by Mateusz Guzik <mjguzik gmail.com>
3. Updated the fast path from alloc_fd() to find_next_fd() as suggested by
   Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>, it is
   efficient and more concise than v2.

v1 -> v2:
1. Rebased the patch set to latest v6.10-rc4 and updated the performance results
2. Fixed the bug identified by Mateusz Guzik in patch 1 with adding rlimit
   check for fast path
3. Updated patch 3 to remove sanity_check directly per the alignment with
   maintainer

Yu Ma (3):
  fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
  fs/file.c: conditionally clear full_fds
  fs/file.c: add fast path in find_next_fd()

 fs/file.c | 50 +++++++++++++++++++++++++++++---------------------
 1 file changed, 29 insertions(+), 21 deletions(-)

--
2.43.0
alloc_fd() has a sanity check inside to make sure the struct file mapped to
the allocated fd is NULL. Remove this sanity check since it is already assured
by the existing zero initialization and by the NULL set when an fd is recycled.
Meanwhile, add likely/unlikely annotations and skip the expand_files() call
when no expansion is needed, to reduce the work done under file_lock.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 33 ++++++++++++++-------------------
1 file changed, 14 insertions(+), 19 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index a3b72aa64f11..e1b9d6df7941 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -515,7 +515,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
if (fd < files->next_fd)
fd = files->next_fd;
- if (fd < fdt->max_fds)
+ if (likely(fd < fdt->max_fds))
fd = find_next_fd(fdt, fd);
/*
@@ -523,19 +523,21 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
* will limit the total number of files that can be opened.
*/
error = -EMFILE;
- if (fd >= end)
+ if (unlikely(fd >= end))
goto out;
- error = expand_files(files, fd);
- if (error < 0)
- goto out;
+ if (unlikely(fd >= fdt->max_fds)) {
+ error = expand_files(files, fd);
+ if (error < 0)
+ goto out;
- /*
- * If we needed to expand the fs array we
- * might have blocked - try again.
- */
- if (error)
- goto repeat;
+ /*
+ * If we needed to expand the fs array we
+ * might have blocked - try again.
+ */
+ if (error)
+ goto repeat;
+ }
if (start <= files->next_fd)
files->next_fd = fd + 1;
@@ -546,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
else
__clear_close_on_exec(fd, fdt);
error = fd;
-#if 1
- /* Sanity check */
- if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
- printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
- rcu_assign_pointer(fdt->fd[fd], NULL);
- }
-#endif
out:
spin_unlock(&files->file_lock);
@@ -618,7 +613,7 @@ void fd_install(unsigned int fd, struct file *file)
rcu_read_unlock_sched();
spin_lock(&files->file_lock);
fdt = files_fdtable(files);
- BUG_ON(fdt->fd[fd] != NULL);
+ WARN_ON(fdt->fd[fd] != NULL);
rcu_assign_pointer(fdt->fd[fd], file);
spin_unlock(&files->file_lock);
return;
--
2.43.0
On Fri 12-07-24 22:39:15, Yu Ma wrote:
> alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> allocated fd is NULL. Remove this sanity check since it can be assured by
> exisitng zero initilization and NULL set when recycling fd. Meanwhile, add
> likely/unlikely and expand_file() call avoidance to reduce the work under
> file_lock.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
Looks good to me. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/file.c | 33 ++++++++++++++-------------------
> 1 file changed, 14 insertions(+), 19 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index a3b72aa64f11..e1b9d6df7941 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -515,7 +515,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> if (fd < files->next_fd)
> fd = files->next_fd;
>
> - if (fd < fdt->max_fds)
> + if (likely(fd < fdt->max_fds))
> fd = find_next_fd(fdt, fd);
>
> /*
> @@ -523,19 +523,21 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> * will limit the total number of files that can be opened.
> */
> error = -EMFILE;
> - if (fd >= end)
> + if (unlikely(fd >= end))
> goto out;
>
> - error = expand_files(files, fd);
> - if (error < 0)
> - goto out;
> + if (unlikely(fd >= fdt->max_fds)) {
> + error = expand_files(files, fd);
> + if (error < 0)
> + goto out;
>
> - /*
> - * If we needed to expand the fs array we
> - * might have blocked - try again.
> - */
> - if (error)
> - goto repeat;
> + /*
> + * If we needed to expand the fs array we
> + * might have blocked - try again.
> + */
> + if (error)
> + goto repeat;
> + }
>
> if (start <= files->next_fd)
> files->next_fd = fd + 1;
> @@ -546,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> else
> __clear_close_on_exec(fd, fdt);
> error = fd;
> -#if 1
> - /* Sanity check */
> - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> - rcu_assign_pointer(fdt->fd[fd], NULL);
> - }
> -#endif
>
> out:
> spin_unlock(&files->file_lock);
> @@ -618,7 +613,7 @@ void fd_install(unsigned int fd, struct file *file)
> rcu_read_unlock_sched();
> spin_lock(&files->file_lock);
> fdt = files_fdtable(files);
> - BUG_ON(fdt->fd[fd] != NULL);
> + WARN_ON(fdt->fd[fd] != NULL);
> rcu_assign_pointer(fdt->fd[fd], file);
> spin_unlock(&files->file_lock);
> return;
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
64 bits in open_fds are mapped to a common bit in full_fds_bits. It is very
likely that the corresponding bit in full_fds_bits has already been cleared
by a previous __clear_open_fd() operation. Check whether the bit in
full_fds_bits is set before clearing it, to avoid an unnecessary write and
the resulting cache bouncing. See commit fc90888d07b8 ("vfs: conditionally
clear close-on-exec flag") for a similar optimization.

Taking the stock kernel with patch 1 as baseline, this improves
pts/blogbench-1.1.0 read by 13% and write by 5% on Intel ICX 160 cores
configuration with v6.10-rc7.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/file.c b/fs/file.c
index e1b9d6df7941..1be2a5bcc7c4 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -268,7 +268,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
{
__clear_bit(fd, fdt->open_fds);
- __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
+ fd /= BITS_PER_LONG;
+ if (test_bit(fd, fdt->full_fds_bits))
+ __clear_bit(fd, fdt->full_fds_bits);
}
static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
--
2.43.0
Skip the 2-level search via find_next_zero_bit() when there is a free slot in
the word that contains next_fd, as:
(1) next_fd indicates the lower bound for the first free fd.
(2) There is a fast path inside find_next_zero_bit() when size <= 64 to speed
up the search.
(3) After fdt is expanded (the bitmap size is doubled on each expansion), it
is never shrunk. The search size increases, but there are few open fds
available here.

This fast path was proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed
to by Jan Kara <jack@suse.cz>, as it is more generic and scalable than
previous versions. On top of patches 1 and 2, it improves pts/blogbench-1.1.0
read by 8% and write by 4% on Intel ICX 160 cores configuration with
v6.10-rc7.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/fs/file.c b/fs/file.c
index 1be2a5bcc7c4..a3ce6ba30c8c 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -488,9 +488,20 @@ struct files_struct init_files = {
static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
{
+ unsigned int bitbit = start / BITS_PER_LONG;
+ unsigned int bit;
+
+ /*
+ * Try to avoid looking at the second level bitmap
+ */
+ bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
+ start & (BITS_PER_LONG -1));
+ if (bit < BITS_PER_LONG) {
+ return bit + bitbit * BITS_PER_LONG;
+ }
+
unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
unsigned int maxbit = maxfd / BITS_PER_LONG;
- unsigned int bitbit = start / BITS_PER_LONG;
bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
if (bitbit >= maxfd)
--
2.43.0
On Fri 12-07-24 22:39:17, Yu Ma wrote:
> Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
> word contains next_fd, as:
> (1) next_fd indicates the lower bound for the first free fd.
> (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> searching.
> (3) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
>
> This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
> Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
> versions. And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
> 8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
Looks good. Just some code style nits below.
> diff --git a/fs/file.c b/fs/file.c
> index 1be2a5bcc7c4..a3ce6ba30c8c 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -488,9 +488,20 @@ struct files_struct init_files = {
>
> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> {
> + unsigned int bitbit = start / BITS_PER_LONG;
> + unsigned int bit;
> +
> + /*
> + * Try to avoid looking at the second level bitmap
> + */
> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> + start & (BITS_PER_LONG -1));
^^ Either
(BITS_PER_LONG-1) or (BITS_PER_LONG - 1) please. Your combination looks
particularly weird :)
> + if (bit < BITS_PER_LONG) {
> + return bit + bitbit * BITS_PER_LONG;
> + }
No need for braces around the above block.
> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> unsigned int maxbit = maxfd / BITS_PER_LONG;
We keep declarations at the beginning of the block. Usually it keeps the
code more readable and the compiler should be clever enough to perform the
loads & arithmetics only when needed.
After fixing these style nits feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 7/16/2024 7:19 PM, Jan Kara wrote:
> On Fri 12-07-24 22:39:17, Yu Ma wrote:
>> Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
>> word contains next_fd, as:
>> (1) next_fd indicates the lower bound for the first free fd.
>> (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
>> searching.
>> (3) After fdt is expanded (the bitmap size doubled for each time of expansion),
>> it would never be shrunk. The search size increases but there are few open fds
>> available here.
>>
>> This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
>> Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
>> versions. And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
>> 8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.
>>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
> Looks good. Just some code style nits below.
Copy that, thanks Honza. I'll revise it and send out an updated version soon.
>
>> diff --git a/fs/file.c b/fs/file.c
>> index 1be2a5bcc7c4..a3ce6ba30c8c 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -488,9 +488,20 @@ struct files_struct init_files = {
>>
>> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>> {
>> + unsigned int bitbit = start / BITS_PER_LONG;
>> + unsigned int bit;
>> +
>> + /*
>> + * Try to avoid looking at the second level bitmap
>> + */
>> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
>> + start & (BITS_PER_LONG -1));
> ^^ Either
> (BITS_PER_LONG-1) or (BITS_PER_LONG - 1) please. Your combination looks
> particularly weird :)
>
>> + if (bit < BITS_PER_LONG) {
>> + return bit + bitbit * BITS_PER_LONG;
>> + }
> No need for braces around the above block.
>
>> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
>> unsigned int maxbit = maxfd / BITS_PER_LONG;
> We keep declarations at the beginning of the block. Usually it keeps the
> code more readable and the compiler should be clever enough to perform the
> loads & arithmetics only when needed.
>
> After fixing these style nits feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
> Honza
Yes, I'll polish this part of the code accordingly, thanks for all the
comments here :)
pts/blogbench-1.1.0 is a benchmark designed to replicate the load of a real-world busy file server by multiple threads of random reads, writes, and rewrites. When running default configuration with multiple parallel threads, hot spin lock contention is observed from alloc_fd(), file_closed_fd() and put_unused_fd() around file_lock. These 3 patches are created to reduce the critical section of file_lock in alloc_fd() and close_fd(). As a result, on top of patch 1, pts/blogbench-1.1.0 has been improved by 28% for read and 12% for write on Intel ICX 160 cores configuration with v6.10-rc6. v2 -> v3: 1. Rebased the patch set to latest v6.10-rc6 and updated the performance results 2. Reordered the patches as suggested by Mateusz Guzik <mjguzik gmail.com> 3. Updated the fast path from alloc_fd() to find_next_fd() as suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>, it is efficient and more concise than v2. v1 -> v2: 1. Rebased the patch set to latest v6.10-rc4 and updated the performance results 2. Fixed the bug identified by Mateusz Guzik in patch 1 with adding rlimit check for fast path 3. Updated patch 3 to remove sanity_check directly per the alignment with maintainer Yu Ma (3): fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() fs/file.c: conditionally clear full_fds fs/file.c: add fast path in find_next_fd() fs/file.c | 47 ++++++++++++++++++++++++----------------------- 1 file changed, 24 insertions(+), 23 deletions(-) -- 2.43.0
alloc_fd() has a sanity check inside to make sure the struct file mapped to the
allocated fd is NULL. Remove this sanity check since it is already assured by the
existing zero initialization and by setting the slot back to NULL when an fd is
recycled. Meanwhile, add likely/unlikely annotations and avoid an unnecessary
expand_files() call to reduce the work done under file_lock.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 38 ++++++++++++++++----------------------
1 file changed, 16 insertions(+), 22 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index a3b72aa64f11..5178b246e54b 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -515,28 +515,29 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
if (fd < files->next_fd)
fd = files->next_fd;
- if (fd < fdt->max_fds)
+ if (likely(fd < fdt->max_fds))
fd = find_next_fd(fdt, fd);
+ error = -EMFILE;
+ if (unlikely(fd >= fdt->max_fds)) {
+ error = expand_files(files, fd);
+ if (error < 0)
+ goto out;
+ /*
+ * If we needed to expand the fs array we
+ * might have blocked - try again.
+ */
+ if (error)
+ goto repeat;
+ }
+
/*
* N.B. For clone tasks sharing a files structure, this test
* will limit the total number of files that can be opened.
*/
- error = -EMFILE;
- if (fd >= end)
- goto out;
-
- error = expand_files(files, fd);
- if (error < 0)
+ if (unlikely(fd >= end))
goto out;
- /*
- * If we needed to expand the fs array we
- * might have blocked - try again.
- */
- if (error)
- goto repeat;
-
if (start <= files->next_fd)
files->next_fd = fd + 1;
@@ -546,13 +547,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
else
__clear_close_on_exec(fd, fdt);
error = fd;
-#if 1
- /* Sanity check */
- if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
- printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
- rcu_assign_pointer(fdt->fd[fd], NULL);
- }
-#endif
out:
spin_unlock(&files->file_lock);
@@ -618,7 +612,7 @@ void fd_install(unsigned int fd, struct file *file)
rcu_read_unlock_sched();
spin_lock(&files->file_lock);
fdt = files_fdtable(files);
- BUG_ON(fdt->fd[fd] != NULL);
+ WARN_ON(fdt->fd[fd] != NULL);
rcu_assign_pointer(fdt->fd[fd], file);
spin_unlock(&files->file_lock);
return;
--
2.43.0
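For context on the likely()/unlikely() annotations used in the patch above: in
the kernel these macros expand to __builtin_expect() hints so the compiler can
lay out the expected branch as the fall-through path. Below is a minimal
userspace sketch of that pattern (it assumes gcc or clang for the builtin); the
macro bodies follow the familiar kernel definitions, while check_fd() and its
bounds are made-up illustrations, not kernel code.
#include <stdio.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hint that the error branch is cold so the common case stays on the
 * straight-line (fall-through) path. */
static int check_fd(int fd, int max_fds)
{
	if (unlikely(fd < 0 || fd >= max_fds))
		return -1;          /* rare error path */
	return fd;                  /* common case */
}

int main(void)
{
	printf("%d\n", check_fd(3, 64));    /* prints 3  */
	printf("%d\n", check_fd(99, 64));   /* prints -1 */
	return 0;
}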
On Wed, Jul 03, 2024 at 10:33:09AM GMT, Yu Ma wrote:
> alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> allocated fd is NULL. Remove this sanity check since it can be assured by
> exisitng zero initilization and NULL set when recycling fd. Meanwhile, add
> likely/unlikely and expand_file() call avoidance to reduce the work under
> file_lock.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
> fs/file.c | 38 ++++++++++++++++----------------------
> 1 file changed, 16 insertions(+), 22 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index a3b72aa64f11..5178b246e54b 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -515,28 +515,29 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> if (fd < files->next_fd)
> fd = files->next_fd;
>
> - if (fd < fdt->max_fds)
> + if (likely(fd < fdt->max_fds))
> fd = find_next_fd(fdt, fd);
>
> + error = -EMFILE;
> + if (unlikely(fd >= fdt->max_fds)) {
> + error = expand_files(files, fd);
> + if (error < 0)
> + goto out;
> + /*
> + * If we needed to expand the fs array we
> + * might have blocked - try again.
> + */
> + if (error)
> + goto repeat;
> + }
So this ends up removing the expand_files() above the fd >= end check
which means that you can end up expanding the files_struct even though
the requested fd is past the provided end. That seems odd. What's the
reason for that reordering?
> +
> /*
> * N.B. For clone tasks sharing a files structure, this test
> * will limit the total number of files that can be opened.
> */
> - error = -EMFILE;
> - if (fd >= end)
> - goto out;
> -
> - error = expand_files(files, fd);
> - if (error < 0)
> + if (unlikely(fd >= end))
> goto out;
>
> - /*
> - * If we needed to expand the fs array we
> - * might have blocked - try again.
> - */
> - if (error)
> - goto repeat;
> -
> if (start <= files->next_fd)
> files->next_fd = fd + 1;
>
> @@ -546,13 +547,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> else
> __clear_close_on_exec(fd, fdt);
> error = fd;
> -#if 1
> - /* Sanity check */
> - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> - rcu_assign_pointer(fdt->fd[fd], NULL);
> - }
> -#endif
>
> out:
> spin_unlock(&files->file_lock);
> @@ -618,7 +612,7 @@ void fd_install(unsigned int fd, struct file *file)
> rcu_read_unlock_sched();
> spin_lock(&files->file_lock);
> fdt = files_fdtable(files);
> - BUG_ON(fdt->fd[fd] != NULL);
> + WARN_ON(fdt->fd[fd] != NULL);
> rcu_assign_pointer(fdt->fd[fd], file);
> spin_unlock(&files->file_lock);
> return;
> --
> 2.43.0
>
On Wed 03-07-24 16:34:49, Christian Brauner wrote:
> On Wed, Jul 03, 2024 at 10:33:09AM GMT, Yu Ma wrote:
> > alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> > allocated fd is NULL. Remove this sanity check since it can be assured by
> > exisitng zero initilization and NULL set when recycling fd. Meanwhile, add
> > likely/unlikely and expand_file() call avoidance to reduce the work under
> > file_lock.
> >
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> > fs/file.c | 38 ++++++++++++++++----------------------
> > 1 file changed, 16 insertions(+), 22 deletions(-)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index a3b72aa64f11..5178b246e54b 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -515,28 +515,29 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > if (fd < files->next_fd)
> > fd = files->next_fd;
> >
> > - if (fd < fdt->max_fds)
> > + if (likely(fd < fdt->max_fds))
> > fd = find_next_fd(fdt, fd);
> >
> > + error = -EMFILE;
> > + if (unlikely(fd >= fdt->max_fds)) {
> > + error = expand_files(files, fd);
> > + if (error < 0)
> > + goto out;
> > + /*
> > + * If we needed to expand the fs array we
> > + * might have blocked - try again.
> > + */
> > + if (error)
> > + goto repeat;
> > + }
>
> So this ends up removing the expand_files() above the fd >= end check
> which means that you can end up expanding the files_struct even though
> the request fd is past the provided end. That seems odd. What's the
> reason for that reordering?
Yeah, not only that but also:
> > /*
> > * N.B. For clone tasks sharing a files structure, this test
> > * will limit the total number of files that can be opened.
> > */
> > - error = -EMFILE;
> > - if (fd >= end)
> > - goto out;
> > -
> > - error = expand_files(files, fd);
> > - if (error < 0)
> > + if (unlikely(fd >= end))
> > goto out;
We could then exit here with error set to 0 instead of -EMFILE. So this
needs a bit of work. But otherwise the patch looks good to me.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 7/4/2024 6:11 PM, Jan Kara wrote:
> On Wed 03-07-24 16:34:49, Christian Brauner wrote:
>> On Wed, Jul 03, 2024 at 10:33:09AM GMT, Yu Ma wrote:
>>> alloc_fd() has a sanity check inside to make sure the struct file mapping to the
>>> allocated fd is NULL. Remove this sanity check since it can be assured by
>>> exisitng zero initilization and NULL set when recycling fd. Meanwhile, add
>>> likely/unlikely and expand_file() call avoidance to reduce the work under
>>> file_lock.
>>>
>>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>>> ---
>>> fs/file.c | 38 ++++++++++++++++----------------------
>>> 1 file changed, 16 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/fs/file.c b/fs/file.c
>>> index a3b72aa64f11..5178b246e54b 100644
>>> --- a/fs/file.c
>>> +++ b/fs/file.c
>>> @@ -515,28 +515,29 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>>> if (fd < files->next_fd)
>>> fd = files->next_fd;
>>>
>>> - if (fd < fdt->max_fds)
>>> + if (likely(fd < fdt->max_fds))
>>> fd = find_next_fd(fdt, fd);
>>>
>>> + error = -EMFILE;
>>> + if (unlikely(fd >= fdt->max_fds)) {
>>> + error = expand_files(files, fd);
>>> + if (error < 0)
>>> + goto out;
>>> + /*
>>> + * If we needed to expand the fs array we
>>> + * might have blocked - try again.
>>> + */
>>> + if (error)
>>> + goto repeat;
>>> + }
>> So this ends up removing the expand_files() above the fd >= end check
>> which means that you can end up expanding the files_struct even though
>> the request fd is past the provided end. That seems odd. What's the
>> reason for that reordering?
> Yeah, not only that but also:
>
>>> /*
>>> * N.B. For clone tasks sharing a files structure, this test
>>> * will limit the total number of files that can be opened.
>>> */
>>> - error = -EMFILE;
>>> - if (fd >= end)
>>> - goto out;
>>> -
>>> - error = expand_files(files, fd);
>>> - if (error < 0)
>>> + if (unlikely(fd >= end))
>>> goto out;
> We could then exit here with error set to 0 instead of -EMFILE. So this
> needs a bit of work. But otherwise the patch looks good to me.
>
> Honza
Do you mean that we return 0 here is fd >=end, I'm afraid that might
broke the original design of this function. And all the callers of it
are using ret < 0 for error handling, if ret=0, that should mean the fd
allocated is 0 ...
On Thu 04-07-24 22:45:32, Ma, Yu wrote:
>
> On 7/4/2024 6:11 PM, Jan Kara wrote:
> > On Wed 03-07-24 16:34:49, Christian Brauner wrote:
> > > On Wed, Jul 03, 2024 at 10:33:09AM GMT, Yu Ma wrote:
> > > > alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> > > > allocated fd is NULL. Remove this sanity check since it can be assured by
> > > > exisitng zero initilization and NULL set when recycling fd. Meanwhile, add
> > > > likely/unlikely and expand_file() call avoidance to reduce the work under
> > > > file_lock.
> > > >
> > > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > > > ---
> > > > fs/file.c | 38 ++++++++++++++++----------------------
> > > > 1 file changed, 16 insertions(+), 22 deletions(-)
> > > >
> > > > diff --git a/fs/file.c b/fs/file.c
> > > > index a3b72aa64f11..5178b246e54b 100644
> > > > --- a/fs/file.c
> > > > +++ b/fs/file.c
> > > > @@ -515,28 +515,29 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > > > if (fd < files->next_fd)
> > > > fd = files->next_fd;
> > > > - if (fd < fdt->max_fds)
> > > > + if (likely(fd < fdt->max_fds))
> > > > fd = find_next_fd(fdt, fd);
> > > > + error = -EMFILE;
> > > > + if (unlikely(fd >= fdt->max_fds)) {
> > > > + error = expand_files(files, fd);
> > > > + if (error < 0)
> > > > + goto out;
> > > > + /*
> > > > + * If we needed to expand the fs array we
> > > > + * might have blocked - try again.
> > > > + */
> > > > + if (error)
> > > > + goto repeat;
> > > > + }
> > > So this ends up removing the expand_files() above the fd >= end check
> > > which means that you can end up expanding the files_struct even though
> > > the request fd is past the provided end. That seems odd. What's the
> > > reason for that reordering?
> > Yeah, not only that but also:
> >
> > > > /*
> > > > * N.B. For clone tasks sharing a files structure, this test
> > > > * will limit the total number of files that can be opened.
> > > > */
> > > > - error = -EMFILE;
> > > > - if (fd >= end)
> > > > - goto out;
> > > > -
> > > > - error = expand_files(files, fd);
> > > > - if (error < 0)
> > > > + if (unlikely(fd >= end))
> > > > goto out;
> > We could then exit here with error set to 0 instead of -EMFILE. So this
> > needs a bit of work. But otherwise the patch looks good to me.
> >
> > Honza
>
> Do you mean that we return 0 here is fd >=end, I'm afraid that might broke
> the original design of this function. And all the callers of it are using
> ret < 0 for error handling, if ret=0, that should mean the fd allocated is 0
> ...
What I meant is that after your changes alloc_fd() could return 0 in the
fd >= end case. It could happen if we went through expand_files() which then
returned 0. Anyway, please just fix the ordering of the checks and we should
be fine.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 7/3/2024 10:34 PM, Christian Brauner wrote:
> On Wed, Jul 03, 2024 at 10:33:09AM GMT, Yu Ma wrote:
>> alloc_fd() has a sanity check inside to make sure the struct file mapping to the
>> allocated fd is NULL. Remove this sanity check since it can be assured by
>> exisitng zero initilization and NULL set when recycling fd. Meanwhile, add
>> likely/unlikely and expand_file() call avoidance to reduce the work under
>> file_lock.
>>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>> ---
>> fs/file.c | 38 ++++++++++++++++----------------------
>> 1 file changed, 16 insertions(+), 22 deletions(-)
>>
>> diff --git a/fs/file.c b/fs/file.c
>> index a3b72aa64f11..5178b246e54b 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -515,28 +515,29 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>> if (fd < files->next_fd)
>> fd = files->next_fd;
>>
>> - if (fd < fdt->max_fds)
>> + if (likely(fd < fdt->max_fds))
>> fd = find_next_fd(fdt, fd);
>>
>> + error = -EMFILE;
>> + if (unlikely(fd >= fdt->max_fds)) {
>> + error = expand_files(files, fd);
>> + if (error < 0)
>> + goto out;
>> + /*
>> + * If we needed to expand the fs array we
>> + * might have blocked - try again.
>> + */
>> + if (error)
>> + goto repeat;
>> + }
> So this ends up removing the expand_files() above the fd >= end check
> which means that you can end up expanding the files_struct even though
> the request fd is past the provided end. That seems odd. What's the
> reason for that reordering?
Yes, you are right, thanks Christian. This incorrect reordering is left over
from earlier versions that had the fast path inside alloc_fd(). I'll restore
the original order.
>> +
>> /*
>> * N.B. For clone tasks sharing a files structure, this test
>> * will limit the total number of files that can be opened.
>> */
>> - error = -EMFILE;
>> - if (fd >= end)
>> - goto out;
>> -
>> - error = expand_files(files, fd);
>> - if (error < 0)
>> + if (unlikely(fd >= end))
>> goto out;
>>
>> - /*
>> - * If we needed to expand the fs array we
>> - * might have blocked - try again.
>> - */
>> - if (error)
>> - goto repeat;
>> -
>> if (start <= files->next_fd)
>> files->next_fd = fd + 1;
>>
>> @@ -546,13 +547,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>> else
>> __clear_close_on_exec(fd, fdt);
>> error = fd;
>> -#if 1
>> - /* Sanity check */
>> - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
>> - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
>> - rcu_assign_pointer(fdt->fd[fd], NULL);
>> - }
>> -#endif
>>
>> out:
>> spin_unlock(&files->file_lock);
>> @@ -618,7 +612,7 @@ void fd_install(unsigned int fd, struct file *file)
>> rcu_read_unlock_sched();
>> spin_lock(&files->file_lock);
>> fdt = files_fdtable(files);
>> - BUG_ON(fdt->fd[fd] != NULL);
>> + WARN_ON(fdt->fd[fd] != NULL);
>> rcu_assign_pointer(fdt->fd[fd], file);
>> spin_unlock(&files->file_lock);
>> return;
>> --
>> 2.43.0
>>
Each group of 64 bits in open_fds is mapped to a common bit in full_fds_bits. It
is very likely that the bit in full_fds_bits has already been cleared by a
previous __clear_open_fd() call. Check whether the bit in full_fds_bits is
already clear before clearing it, to avoid an unnecessary write and the
resulting cache-line bouncing. See commit fc90888d07b8
("vfs: conditionally clear close-on-exec flag") for a similar optimization.
Taking the stock kernel with patch 1 as the baseline, this improves
pts/blogbench-1.1.0 read by 13% and write by 5% on an Intel ICX 160-core
configuration with v6.10-rc6.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/file.c b/fs/file.c
index 5178b246e54b..a15317db3119 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -268,7 +268,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
{
__clear_bit(fd, fdt->open_fds);
- __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
+ fd /= BITS_PER_LONG;
+ if (test_bit(fd, fdt->full_fds_bits))
+ __clear_bit(fd, fdt->full_fds_bits);
}
static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
--
2.43.0
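The key point of the patch above is that skipping the store when the bit is
already zero keeps the cache line holding full_fds_bits clean, so concurrent
openers and closers on other CPUs are less likely to bounce it. Here is a
minimal userspace sketch of that test-before-clear pattern; the bitmap, the
helpers and the word index below are simplified assumptions and do not
reproduce the kernel's fdtable layout.
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define BITS_PER_WORD 64

static uint64_t full_bits[4];   /* stands in for fdt->full_fds_bits */

static bool test_bit(unsigned int nr, const uint64_t *map)
{
	return map[nr / BITS_PER_WORD] & (1ULL << (nr % BITS_PER_WORD));
}

static void clear_bit(unsigned int nr, uint64_t *map)
{
	map[nr / BITS_PER_WORD] &= ~(1ULL << (nr % BITS_PER_WORD));
}

/* Unconditional version: always stores, dirtying the cache line even when
 * the bit is already 0 (the common case when closing fds). */
static void clear_full_bit_always(unsigned int word)
{
	clear_bit(word, full_bits);
}

/* Conditional version, as in the patch: read first, write only if needed. */
static void clear_full_bit_if_set(unsigned int word)
{
	if (test_bit(word, full_bits))
		clear_bit(word, full_bits);
}

int main(void)
{
	full_bits[0] = 1ULL << 2;       /* pretend word 2 of open_fds is full */
	clear_full_bit_if_set(2);       /* performs the write */
	clear_full_bit_if_set(2);       /* skips the write: bit already clear */
	clear_full_bit_always(2);       /* would store again regardless */
	printf("%llu\n", (unsigned long long)full_bits[0]);   /* prints 0 */
	return 0;
}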
In most cases there is a free fd available in the lower 64 bits of the open_fds
bitmap when we look for a free fd slot. Skip the 2-level search via
find_next_zero_bit() for this common fast path.
Look directly for an open bit in the lower 64 bits of the open_fds bitmap when a
free slot is available there, as:
(1) The fd allocation algorithm always allocates fds from small to large, so the
lower bits in the open_fds bitmap are used much more frequently than the higher
bits.
(2) After fdt is expanded (the bitmap size is doubled on each expansion), it is
never shrunk. The search size increases, but there are few open fds there.
(3) find_next_zero_bit() has an internal fast path when size <= 64 to speed up
the search.
As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
move the fast path from alloc_fd() to find_next_fd(). With this, on top of
patches 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
an Intel ICX 160-core configuration with v6.10-rc6.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/fs/file.c b/fs/file.c
index a15317db3119..f25eca311f51 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -488,6 +488,11 @@ struct files_struct init_files = {
static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
{
+ unsigned int bit;
+ bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
+ if (bit < BITS_PER_LONG)
+ return bit;
+
unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
unsigned int maxbit = maxfd / BITS_PER_LONG;
unsigned int bitbit = start / BITS_PER_LONG;
--
2.43.0
On Wed, Jul 3, 2024 at 4:07 PM Yu Ma <yu.ma@intel.com> wrote:
>
> There is available fd in the lower 64 bits of open_fds bitmap for most cases
> when we look for an available fd slot. Skip 2-levels searching via
> find_next_zero_bit() for this common fast path.
>
> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> free slot is available there, as:
> (1) The fd allocation algorithm would always allocate fd from small to large.
> Lower bits in open_fds bitmap would be used much more frequently than higher
> bits.
> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> searching.
>
> As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
> update the fast path from alloc_fd() to find_next_fd(). With which, on top of
> patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
> Intel ICX 160 cores configuration with v6.10-rc6.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
> fs/file.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/fs/file.c b/fs/file.c
> index a15317db3119..f25eca311f51 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -488,6 +488,11 @@ struct files_struct init_files = {
>
> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> {
> + unsigned int bit;
> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> + if (bit < BITS_PER_LONG)
> + return bit;
> +
> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> unsigned int maxbit = maxfd / BITS_PER_LONG;
> unsigned int bitbit = start / BITS_PER_LONG;
> --
> 2.43.0
>
I had something like this in mind:
diff --git a/fs/file.c b/fs/file.c
index a3b72aa64f11..4d3307e39db7 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -489,6 +489,16 @@ static unsigned int find_next_fd(struct fdtable
*fdt, unsigned int start)
unsigned int maxfd = fdt->max_fds; /* always multiple of
BITS_PER_LONG */
unsigned int maxbit = maxfd / BITS_PER_LONG;
unsigned int bitbit = start / BITS_PER_LONG;
+ unsigned int bit;
+
+ /*
+ * Try to avoid looking at the second level map.
+ */
+ bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
+ start & (BITS_PER_LONG - 1));
+ if (bit < BITS_PER_LONG) {
+ return bit + bitbit * BITS_PER_LONG;
+ }
bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit,
bitbit) * BITS_PER_LONG;
if (bitbit >= maxfd)
Can you please test it out? I expect it to provide a tiny improvement
over your patch.
--
Mateusz Guzik <mjguzik gmail.com>
On Thu 04-07-24 19:44:10, Mateusz Guzik wrote:
> On Wed, Jul 3, 2024 at 4:07 PM Yu Ma <yu.ma@intel.com> wrote:
> >
> > There is available fd in the lower 64 bits of open_fds bitmap for most cases
> > when we look for an available fd slot. Skip 2-levels searching via
> > find_next_zero_bit() for this common fast path.
> >
> > Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> > free slot is available there, as:
> > (1) The fd allocation algorithm would always allocate fd from small to large.
> > Lower bits in open_fds bitmap would be used much more frequently than higher
> > bits.
> > (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> > it would never be shrunk. The search size increases but there are few open fds
> > available here.
> > (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> > searching.
> >
> > As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
> > update the fast path from alloc_fd() to find_next_fd(). With which, on top of
> > patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
> > Intel ICX 160 cores configuration with v6.10-rc6.
> >
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> > fs/file.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index a15317db3119..f25eca311f51 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -488,6 +488,11 @@ struct files_struct init_files = {
> >
> > static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > {
> > + unsigned int bit;
> > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > + if (bit < BITS_PER_LONG)
> > + return bit;
> > +
> > unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > unsigned int bitbit = start / BITS_PER_LONG;
> > --
> > 2.43.0
> >
>
> I had something like this in mind:
> diff --git a/fs/file.c b/fs/file.c
> index a3b72aa64f11..4d3307e39db7 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -489,6 +489,16 @@ static unsigned int find_next_fd(struct fdtable
> *fdt, unsigned int start)
> unsigned int maxfd = fdt->max_fds; /* always multiple of
> BITS_PER_LONG */
> unsigned int maxbit = maxfd / BITS_PER_LONG;
> unsigned int bitbit = start / BITS_PER_LONG;
> + unsigned int bit;
> +
> + /*
> + * Try to avoid looking at the second level map.
> + */
> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> + start & (BITS_PER_LONG - 1));
> + if (bit < BITS_PER_LONG) {
> + return bit + bitbit * BITS_PER_LONG;
> + }
Drat, you're right. I missed that Ma did not add the proper offset to
open_fds. *This* is what I meant :)
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 7/5/2024 5:55 AM, Jan Kara wrote:
> On Thu 04-07-24 19:44:10, Mateusz Guzik wrote:
>> On Wed, Jul 3, 2024 at 4:07 PM Yu Ma <yu.ma@intel.com> wrote:
>>> There is available fd in the lower 64 bits of open_fds bitmap for most cases
>>> when we look for an available fd slot. Skip 2-levels searching via
>>> find_next_zero_bit() for this common fast path.
>>>
>>> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
>>> free slot is available there, as:
>>> (1) The fd allocation algorithm would always allocate fd from small to large.
>>> Lower bits in open_fds bitmap would be used much more frequently than higher
>>> bits.
>>> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
>>> it would never be shrunk. The search size increases but there are few open fds
>>> available here.
>>> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
>>> searching.
>>>
>>> As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
>>> update the fast path from alloc_fd() to find_next_fd(). With which, on top of
>>> patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
>>> Intel ICX 160 cores configuration with v6.10-rc6.
>>>
>>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>>> ---
>>> fs/file.c | 5 +++++
>>> 1 file changed, 5 insertions(+)
>>>
>>> diff --git a/fs/file.c b/fs/file.c
>>> index a15317db3119..f25eca311f51 100644
>>> --- a/fs/file.c
>>> +++ b/fs/file.c
>>> @@ -488,6 +488,11 @@ struct files_struct init_files = {
>>>
>>> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>>> {
>>> + unsigned int bit;
>>> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
>>> + if (bit < BITS_PER_LONG)
>>> + return bit;
>>> +
>>> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
>>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>>> unsigned int bitbit = start / BITS_PER_LONG;
>>> --
>>> 2.43.0
>>>
>> I had something like this in mind:
>> diff --git a/fs/file.c b/fs/file.c
>> index a3b72aa64f11..4d3307e39db7 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -489,6 +489,16 @@ static unsigned int find_next_fd(struct fdtable
>> *fdt, unsigned int start)
>> unsigned int maxfd = fdt->max_fds; /* always multiple of
>> BITS_PER_LONG */
>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>> unsigned int bitbit = start / BITS_PER_LONG;
>> + unsigned int bit;
>> +
>> + /*
>> + * Try to avoid looking at the second level map.
>> + */
>> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
>> + start & (BITS_PER_LONG - 1));
>> + if (bit < BITS_PER_LONG) {
>> + return bit + bitbit * BITS_PER_LONG;
>> + }
> Drat, you're right. I missed that Ma did not add the proper offset to
> open_fds. *This* is what I meant :)
>
> Honza
Just tried this on v6.10-rc6. The improvement on top of patch 1 and
patch 2 is 7% for read and 3% for write, less than just checking the first word.
Per my understanding, its performance would be better if there were a high
probability of finding a free bit in the same word as next_fd, but next_fd
only represents the lowest possible free fd. If fds are opened and closed
frequently and randomly, that might not always be the case and next_fd may
be distributed randomly. For example, if fds 0-65 are occupied and fd 3 is
then returned (released), next_fd is set to 3; the next time fd 3 is
allocated, next_fd is set to 4 while the actual first free bit is 66. When
66 is allocated and fd 5 is returned, the whole process repeats.
Yu
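To make the scenario above concrete, here is a small self-contained userspace
trace of it. It is an illustration under simplified assumptions only: a 128-slot
bitmap, a hand-rolled word-local probe standing in for the proposed
find_next_zero_bit() call, and simplified next_fd bookkeeping; none of this is
the kernel code.
#include <stdio.h>
#include <stdint.h>

#define BITS_PER_WORD 64

static uint64_t open_fds[2];             /* 128 fd slots for the example */
static unsigned int next_fd;             /* lowest possibly-free fd      */

static void set_fd(unsigned int fd)
{
	open_fds[fd / BITS_PER_WORD] |= 1ULL << (fd % BITS_PER_WORD);
	if (fd == next_fd)
		next_fd = fd + 1;        /* simplified bookkeeping */
}

static void put_fd(unsigned int fd)
{
	open_fds[fd / BITS_PER_WORD] &= ~(1ULL << (fd % BITS_PER_WORD));
	if (fd < next_fd)
		next_fd = fd;
}

/* Word-local probe: first zero bit >= start within start's word, or -1. */
static int probe_word(unsigned int start)
{
	unsigned int word = start / BITS_PER_WORD;
	unsigned int b;

	for (b = start % BITS_PER_WORD; b < BITS_PER_WORD; b++)
		if (!(open_fds[word] & (1ULL << b)))
			return (int)(word * BITS_PER_WORD + b);
	return -1;                       /* word is full from 'start' on */
}

int main(void)
{
	unsigned int fd;

	for (fd = 0; fd <= 65; fd++)     /* fds 0..65 in use */
		set_fd(fd);

	put_fd(3);                       /* fd 3 released: next_fd = 3 */
	printf("probe from next_fd=%u -> %d (fast path hits)\n",
	       next_fd, probe_word(next_fd));

	set_fd(3);                       /* fd 3 allocated again: next_fd = 4 */
	printf("probe from next_fd=%u -> %d (word full, fall back)\n",
	       next_fd, probe_word(next_fd));
	return 0;
}
Run as-is, it first reports a fast-path hit at fd 3, then a miss from
next_fd = 4 because that word is full again; at that point the two-level
search has to take over and would return fd 66.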
On 7/5/2024 3:56 PM, Ma, Yu wrote:
> I had something like this in mind:
>>> diff --git a/fs/file.c b/fs/file.c
>>> index a3b72aa64f11..4d3307e39db7 100644
>>> --- a/fs/file.c
>>> +++ b/fs/file.c
>>> @@ -489,6 +489,16 @@ static unsigned int find_next_fd(struct fdtable
>>> *fdt, unsigned int start)
>>> unsigned int maxfd = fdt->max_fds; /* always multiple of
>>> BITS_PER_LONG */
>>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>>> unsigned int bitbit = start / BITS_PER_LONG;
>>> + unsigned int bit;
>>> +
>>> + /*
>>> + * Try to avoid looking at the second level map.
>>> + */
>>> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
>>> + start & (BITS_PER_LONG - 1));
>>> + if (bit < BITS_PER_LONG) {
>>> + return bit + bitbit * BITS_PER_LONG;
>>> + }
>> Drat, you're right. I missed that Ma did not add the proper offset to
>> open_fds. *This* is what I meant :)
>>
>> Honza
>
> Just tried this on v6.10-rc6, the improvement on top of patch 1 and
> patch 2 is 7% for read and 3% for write, less than just check first word.
>
> Per my understanding, its performance would be better if we can find
> free bit in the same word of next_fd with high possibility, but
> next_fd just represents the lowest possible free bit. If fds are
> open/close frequently and randomly, that might not always be the case,
> next_fd may be distributed randomly, for example, 0-65 are occupied,
> fd=3 is returned, next_fd will be set to 3, next time when 3 is
> allocated, next_fd will be set to 4, while the actual first free bit
> is 66 , when 66 is allocated, and fd=5 is returned, then the above
> process would be went through again.
>
> Yu
>
Hi Guzik, Honza,
Do we have any more comments or ideas regarding the fast path? Thanks
for your time and any feedback :)
Regards
Yu
Right, forgot to respond.
I suspect the different result is either because of mere variance
between reboots or blogbench using significantly less than 100 fds at
any given time -- I don't have an easy way to test at your scale at
the moment. You could probably test that by benching both approaches
while switching them at runtime with a static_branch. However, I don't
know if that effort is warranted atm.
So happens I'm busy with other stuff and it is not my call to either
block or let this in, so I'm buggering off.
On Tue, Jul 9, 2024 at 10:32 AM Ma, Yu <yu.ma@intel.com> wrote:
>
>
> On 7/5/2024 3:56 PM, Ma, Yu wrote:
> > I had something like this in mind:
> >>> diff --git a/fs/file.c b/fs/file.c
> >>> index a3b72aa64f11..4d3307e39db7 100644
> >>> --- a/fs/file.c
> >>> +++ b/fs/file.c
> >>> @@ -489,6 +489,16 @@ static unsigned int find_next_fd(struct fdtable
> >>> *fdt, unsigned int start)
> >>> unsigned int maxfd = fdt->max_fds; /* always multiple of
> >>> BITS_PER_LONG */
> >>> unsigned int maxbit = maxfd / BITS_PER_LONG;
> >>> unsigned int bitbit = start / BITS_PER_LONG;
> >>> + unsigned int bit;
> >>> +
> >>> + /*
> >>> + * Try to avoid looking at the second level map.
> >>> + */
> >>> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> >>> + start & (BITS_PER_LONG - 1));
> >>> + if (bit < BITS_PER_LONG) {
> >>> + return bit + bitbit * BITS_PER_LONG;
> >>> + }
> >> Drat, you're right. I missed that Ma did not add the proper offset to
> >> open_fds. *This* is what I meant :)
> >>
> >> Honza
> >
> > Just tried this on v6.10-rc6, the improvement on top of patch 1 and
> > patch 2 is 7% for read and 3% for write, less than just check first word.
> >
> > Per my understanding, its performance would be better if we can find
> > free bit in the same word of next_fd with high possibility, but
> > next_fd just represents the lowest possible free bit. If fds are
> > open/close frequently and randomly, that might not always be the case,
> > next_fd may be distributed randomly, for example, 0-65 are occupied,
> > fd=3 is returned, next_fd will be set to 3, next time when 3 is
> > allocated, next_fd will be set to 4, while the actual first free bit
> > is 66 , when 66 is allocated, and fd=5 is returned, then the above
> > process would be went through again.
> >
> > Yu
> >
> Hi Guzik, Honza,
>
> Do we have any more comment or idea regarding to the fast path? Thanks
> for your time and any feedback :)
>
>
> Regards
>
> Yu
>
--
Mateusz Guzik <mjguzik gmail.com>
On Tue, 2024-07-09 at 12:17 +0200, Mateusz Guzik wrote:
> Right, forgot to respond.
>
> I suspect the different result is either because of mere variance
> between reboots or blogbench using significantly less than 100 fds at
> any given time -- I don't have an easy way to test at your scale at
> the moment. You could probably test that by benching both approaches
> while switching them at runtime with a static_branch. However, I don't
> know if that effort is warranted atm.
>
> So happens I'm busy with other stuff and it is not my call to either
> block or let this in, so I'm buggering off.
>
> On Tue, Jul 9, 2024 at 10:32 AM Ma, Yu <yu.ma@intel.com> wrote:
> >
> >
> > On 7/5/2024 3:56 PM, Ma, Yu wrote:
> > > I had something like this in mind:
> > > > > diff --git a/fs/file.c b/fs/file.c
> > > > > index a3b72aa64f11..4d3307e39db7 100644
> > > > > --- a/fs/file.c
> > > > > +++ b/fs/file.c
> > > > > @@ -489,6 +489,16 @@ static unsigned int find_next_fd(struct fdtable
> > > > > *fdt, unsigned int start)
> > > > > unsigned int maxfd = fdt->max_fds; /* always multiple of
> > > > > BITS_PER_LONG */
> > > > > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > > > > unsigned int bitbit = start / BITS_PER_LONG;
> > > > > + unsigned int bit;
> > > > > +
> > > > > + /*
> > > > > + * Try to avoid looking at the second level map.
> > > > > + */
> > > > > + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> > > > > + start & (BITS_PER_LONG - 1));
> > > > > + if (bit < BITS_PER_LONG) {
> > > > > + return bit + bitbit * BITS_PER_LONG;
> > > > > + }
I think this approach based on a quick check of the word around next_fd is more
generic and scalable.
It just happens that for blogbench, checking only the first 64 bits allows a
quicker skip to the two-level search, whereas with this approach next_fd may be
left in a 64-bit word that actually has no open bits, so we end up doing a
useless search in find_next_zero_bit(). Perhaps we should check full_fds_bits
to make sure there are empty slots before we take the find_next_zero_bit()
fast path. Something like:
	if (!test_bit(bitbit, fdt->full_fds_bits)) {
		bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
					 start & (BITS_PER_LONG - 1));
		if (bit < BITS_PER_LONG)
			return bit + bitbit * BITS_PER_LONG;
	}
Tim
> > > > Drat, you're right. I missed that Ma did not add the proper offset to
> > > > open_fds. *This* is what I meant :)
> > > >
> > > > Honza
> > >
> > > Just tried this on v6.10-rc6, the improvement on top of patch 1 and
> > > patch 2 is 7% for read and 3% for write, less than just check first word.
> > >
> > > Per my understanding, its performance would be better if we can find
> > > free bit in the same word of next_fd with high possibility, but
> > > next_fd just represents the lowest possible free bit. If fds are
> > > open/close frequently and randomly, that might not always be the case,
> > > next_fd may be distributed randomly, for example, 0-65 are occupied,
> > > fd=3 is returned, next_fd will be set to 3, next time when 3 is
> > > allocated, next_fd will be set to 4, while the actual first free bit
> > > is 66 , when 66 is allocated, and fd=5 is returned, then the above
> > > process would be went through again.
> > >
> > > Yu
> > >
> > Hi Guzik, Honza,
> >
> > Do we have any more comment or idea regarding to the fast path? Thanks
> > for your time and any feedback :)
> >
> >
> > Regards
> >
> > Yu
> >
>
>
On 7/11/2024 7:40 AM, Tim Chen wrote:
> On Tue, 2024-07-09 at 12:17 +0200, Mateusz Guzik wrote:
>> Right, forgot to respond.
>>
>> I suspect the different result is either because of mere variance
>> between reboots or blogbench using significantly less than 100 fds at
>> any given time -- I don't have an easy way to test at your scale at
>> the moment. You could probably test that by benching both approaches
>> while switching them at runtime with a static_branch. However, I don't
>> know if that effort is warranted atm.
>>
>> So happens I'm busy with other stuff and it is not my call to either
>> block or let this in, so I'm buggering off.
>>
>> On Tue, Jul 9, 2024 at 10:32 AM Ma, Yu <yu.ma@intel.com> wrote:
>>>
>>> On 7/5/2024 3:56 PM, Ma, Yu wrote:
>>>> I had something like this in mind:
>>>>>> diff --git a/fs/file.c b/fs/file.c
>>>>>> index a3b72aa64f11..4d3307e39db7 100644
>>>>>> --- a/fs/file.c
>>>>>> +++ b/fs/file.c
>>>>>> @@ -489,6 +489,16 @@ static unsigned int find_next_fd(struct fdtable
>>>>>> *fdt, unsigned int start)
>>>>>> unsigned int maxfd = fdt->max_fds; /* always multiple of
>>>>>> BITS_PER_LONG */
>>>>>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>>>>>> unsigned int bitbit = start / BITS_PER_LONG;
>>>>>> + unsigned int bit;
>>>>>> +
>>>>>> + /*
>>>>>> + * Try to avoid looking at the second level map.
>>>>>> + */
>>>>>> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
>>>>>> + start & (BITS_PER_LONG - 1));
>>>>>> + if (bit < BITS_PER_LONG) {
>>>>>> + return bit + bitbit * BITS_PER_LONG;
>>>>>> + }
> I think this approach based on next_fd quick check is more generic and scalable.
>
> It just happen for blogbench, just checking the first 64 bit allow a quicker
> skip to the two level search where this approach, next_fd may be left
> in a 64 word that actually has no open bits and we are doing useless search
> in find_next_zero_bit(). Perhaps we should check full_fds_bits to make sure
> there are empty slots before we do
> find_next_zero_bit() fast path. Something like
>
> if (!test_bit(bitbit, fdt->full_fds_bits)) {
> bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> start & (BITS_PER_LONG - 1));
> if (bit < BITS_PER_LONG)
> return bit + bitbit * BITS_PER_LONG;
> }
> Tim
Yes, I agree that it scales better. I'll update v4 with the fast path for the
word containing next_fd and send it out for review soon.
>>>>> Drat, you're right. I missed that Ma did not add the proper offset to
>>>>> open_fds. *This* is what I meant :)
>>>>>
>>>>> Honza
>>>> Just tried this on v6.10-rc6, the improvement on top of patch 1 and
>>>> patch 2 is 7% for read and 3% for write, less than just check first word.
>>>>
>>>> Per my understanding, its performance would be better if we can find
>>>> free bit in the same word of next_fd with high possibility, but
>>>> next_fd just represents the lowest possible free bit. If fds are
>>>> open/close frequently and randomly, that might not always be the case,
>>>> next_fd may be distributed randomly, for example, 0-65 are occupied,
>>>> fd=3 is returned, next_fd will be set to 3, next time when 3 is
>>>> allocated, next_fd will be set to 4, while the actual first free bit
>>>> is 66 , when 66 is allocated, and fd=5 is returned, then the above
>>>> process would be went through again.
>>>>
>>>> Yu
>>>>
>>> Hi Guzik, Honza,
>>>
>>> Do we have any more comment or idea regarding to the fast path? Thanks
>>> for your time and any feedback :)
>>>
>>>
>>> Regards
>>>
>>> Yu
>>>
>>
On Wed 03-07-24 10:33:11, Yu Ma wrote:
> There is available fd in the lower 64 bits of open_fds bitmap for most cases
> when we look for an available fd slot. Skip 2-levels searching via
> find_next_zero_bit() for this common fast path.
>
> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> free slot is available there, as:
> (1) The fd allocation algorithm would always allocate fd from small to large.
> Lower bits in open_fds bitmap would be used much more frequently than higher
> bits.
> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> searching.
>
> As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
> update the fast path from alloc_fd() to find_next_fd(). With which, on top of
> patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
> Intel ICX 160 cores configuration with v6.10-rc6.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
Nice! The patch looks good to me. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
One style nit below:
> diff --git a/fs/file.c b/fs/file.c
> index a15317db3119..f25eca311f51 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -488,6 +488,11 @@ struct files_struct init_files = {
>
> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> {
> + unsigned int bit;
Empty line here please to separate variable declaration and code...
> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> + if (bit < BITS_PER_LONG)
> + return bit;
> +
> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> unsigned int maxbit = maxfd / BITS_PER_LONG;
> unsigned int bitbit = start / BITS_PER_LONG;
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 7/4/2024 6:03 PM, Jan Kara wrote:
> On Wed 03-07-24 10:33:11, Yu Ma wrote:
>> There is available fd in the lower 64 bits of open_fds bitmap for most cases
>> when we look for an available fd slot. Skip 2-levels searching via
>> find_next_zero_bit() for this common fast path.
>>
>> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
>> free slot is available there, as:
>> (1) The fd allocation algorithm would always allocate fd from small to large.
>> Lower bits in open_fds bitmap would be used much more frequently than higher
>> bits.
>> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
>> it would never be shrunk. The search size increases but there are few open fds
>> available here.
>> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
>> searching.
>>
>> As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
>> update the fast path from alloc_fd() to find_next_fd(). With which, on top of
>> patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
>> Intel ICX 160 cores configuration with v6.10-rc6.
>>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
> Nice! The patch looks good to me. Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
> One style nit below:
>
>> diff --git a/fs/file.c b/fs/file.c
>> index a15317db3119..f25eca311f51 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -488,6 +488,11 @@ struct files_struct init_files = {
>>
>> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>> {
>> + unsigned int bit;
> Empty line here please to separate variable declaration and code...
Thanks Honza, copy that :)
>
>> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
>> + if (bit < BITS_PER_LONG)
>> + return bit;
>> +
>> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>> unsigned int bitbit = start / BITS_PER_LONG;
> Honza
On Wed, Jul 3, 2024 at 4:07 PM Yu Ma <yu.ma@intel.com> wrote:
>
> There is available fd in the lower 64 bits of open_fds bitmap for most cases
> when we look for an available fd slot. Skip 2-levels searching via
> find_next_zero_bit() for this common fast path.
>
> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> free slot is available there, as:
> (1) The fd allocation algorithm would always allocate fd from small to large.
> Lower bits in open_fds bitmap would be used much more frequently than higher
> bits.
> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> searching.
>
> As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
> update the fast path from alloc_fd() to find_next_fd(). With which, on top of
> patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
> Intel ICX 160 cores configuration with v6.10-rc6.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
> fs/file.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/fs/file.c b/fs/file.c
> index a15317db3119..f25eca311f51 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -488,6 +488,11 @@ struct files_struct init_files = {
>
> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> {
> + unsigned int bit;
> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> + if (bit < BITS_PER_LONG)
> + return bit;
> +
The rest of the patchset looks good on cursory read.
As for this one, the suggestion was to make it work across the entire range.
Today I won't have time to write and test what we proposed, but will
probably find some time tomorrow. Perhaps Jan will do the needful(tm)
in the meantime.
That said, please stay tuned for a patch. :)
> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> unsigned int maxbit = maxfd / BITS_PER_LONG;
> unsigned int bitbit = start / BITS_PER_LONG;
> --
> 2.43.0
>
--
Mateusz Guzik <mjguzik gmail.com>
On Wed 03-07-24 16:17:01, Mateusz Guzik wrote:
> On Wed, Jul 3, 2024 at 4:07 PM Yu Ma <yu.ma@intel.com> wrote:
> >
> > There is available fd in the lower 64 bits of open_fds bitmap for most cases
> > when we look for an available fd slot. Skip 2-levels searching via
> > find_next_zero_bit() for this common fast path.
> >
> > Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> > free slot is available there, as:
> > (1) The fd allocation algorithm would always allocate fd from small to large.
> > Lower bits in open_fds bitmap would be used much more frequently than higher
> > bits.
> > (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> > it would never be shrunk. The search size increases but there are few open fds
> > available here.
> > (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> > searching.
> >
> > As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
> > update the fast path from alloc_fd() to find_next_fd(). With which, on top of
> > patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
> > Intel ICX 160 cores configuration with v6.10-rc6.
> >
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> > fs/file.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index a15317db3119..f25eca311f51 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -488,6 +488,11 @@ struct files_struct init_files = {
> >
> > static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > {
> > + unsigned int bit;
> > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > + if (bit < BITS_PER_LONG)
> > + return bit;
> > +
>
> The rest of the patchset looks good on cursory read.
>
> As for this one, the suggestion was to make it work across the entire range.
I'm not sure what exactly you mean by "make it work across the entire
range" because what Ma has implemented is exactly what I originally had in
mind - i.e., search the first word of open_fds starting from next_fd (note
that 'start' in this function is already set to max(start, next_fd)), if
that fails, go through the two level bitmap.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 7/3/2024 10:17 PM, Mateusz Guzik wrote:
> On Wed, Jul 3, 2024 at 4:07 PM Yu Ma <yu.ma@intel.com> wrote:
>> There is available fd in the lower 64 bits of open_fds bitmap for most cases
>> when we look for an available fd slot. Skip 2-levels searching via
>> find_next_zero_bit() for this common fast path.
>>
>> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
>> free slot is available there, as:
>> (1) The fd allocation algorithm would always allocate fd from small to large.
>> Lower bits in open_fds bitmap would be used much more frequently than higher
>> bits.
>> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
>> it would never be shrunk. The search size increases but there are few open fds
>> available here.
>> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
>> searching.
>>
>> As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
>> update the fast path from alloc_fd() to find_next_fd(). With which, on top of
>> patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
>> Intel ICX 160 cores configuration with v6.10-rc6.
>>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>> ---
>> fs/file.c | 5 +++++
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/fs/file.c b/fs/file.c
>> index a15317db3119..f25eca311f51 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -488,6 +488,11 @@ struct files_struct init_files = {
>>
>> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>> {
>> + unsigned int bit;
>> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
>> + if (bit < BITS_PER_LONG)
>> + return bit;
>> +
> The rest of the patchset looks good on cursory read.
>
> As for this one, the suggestion was to make it work across the entire range.
>
> Today I wont have time to write and test what we proposed, but will
> probably find some time tomorrow. Perhaps Jan will do the needful(tm)
> in the meantime.
>
> That said, please stay tuned for a patch. :)
Sure, understood, Guzik. Thanks for the quick feedback and for considering ways
to make it better and more versatile. I'll also give the previous proposal over
the entire fd range a try to double-check it.
>> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>> unsigned int bitbit = start / BITS_PER_LONG;
>> --
>> 2.43.0
>>
>
pts/blogbench-1.1.0 is a benchmark designed to replicate the load of a real-world busy file server by multiple threads of random reads, writes, and rewrites. When running default configuration with multiple parallel threads, hot spin lock contention is observed from alloc_fd(), file_closed_fd() and put_unused_fd() around file_lock. These 3 patches are created to reduce the critical section of file_lock in alloc_fd() and close_fd(). As a result, pts/blogbench-1.1.0 has been improved by 32% for read and 17% for write with over 30% kernel cycles reduced on ICX 160 cores configuration with v6.10-rc4. v1 -> v2: 1. Rebased the patch set to latest v6.10-rc4 and updated the performance results. 2. Updated patch 1 to resolve the bug identified by Mateusz Guzik with rlimit check and ensure allocated fd is greater than start. 3. Updated patch 3 to remove sanity_check directly per the alignment with maintainer Yu Ma (3): fs/file.c: add fast path in alloc_fd() fs/file.c: conditionally clear full_fds fs/file.c: remove sanity_check from alloc_fd() fs/file.c | 46 ++++++++++++++++++++++++---------------------- 1 file changed, 24 insertions(+), 22 deletions(-) base-commit: 6ba59ff4227927d3a8530fc2973b80e94b54d58f -- 2.43.0
In most cases there is a free fd available in the lower 64 bits of the open_fds
bitmap when we look for a free fd slot. Skip the 2-level search via
find_next_zero_bit() for this common fast path.
Look directly for an open bit in the lower 64 bits of the open_fds bitmap when a
free slot is available there, as:
(1) The fd allocation algorithm always allocates fds from small to large, so the
lower bits in the open_fds bitmap are used much more frequently than the higher
bits.
(2) After fdt is expanded (the bitmap size is doubled on each expansion), it is
never shrunk. The search size increases, but there are few open fds there.
(3) find_next_zero_bit() itself has an internal fast path to speed up the search
when size <= 64.
Besides, "!start" is added to the fast path condition to ensure the allocated fd
is not below start (i.e. any fd >= 0 when start == 0), given that alloc_fd() is
only called in two scenarios:
(1) Allocating a new fd (the most common usage) via get_unused_fd_flags(), which
searches for an fd starting from bit 0 in fdt (i.e. start == 0).
(2) Duplicating an fd (less common usage) via f_dupfd(), which searches for an
fd starting from old_fd's index in fdt and is only called from the fcntl
syscall.
With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved by
17% and write by 9% on an Intel ICX 160-core configuration with v6.10-rc4.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 35 +++++++++++++++++++++--------------
1 file changed, 21 insertions(+), 14 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index a3b72aa64f11..50e900a47107 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
if (fd < files->next_fd)
fd = files->next_fd;
- if (fd < fdt->max_fds)
+ error = -EMFILE;
+ if (likely(fd < fdt->max_fds)) {
+ if (~fdt->open_fds[0] && !start) {
+ fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
+ goto fastreturn;
+ }
fd = find_next_fd(fdt, fd);
+ }
+
+ if (unlikely(fd >= fdt->max_fds)) {
+ error = expand_files(files, fd);
+ if (error < 0)
+ goto out;
+ /*
+ * If we needed to expand the fs array we
+ * might have blocked - try again.
+ */
+ if (error)
+ goto repeat;
+ }
+fastreturn:
/*
* N.B. For clone tasks sharing a files structure, this test
* will limit the total number of files that can be opened.
*/
- error = -EMFILE;
- if (fd >= end)
+ if (unlikely(fd >= end))
goto out;
- error = expand_files(files, fd);
- if (error < 0)
- goto out;
-
- /*
- * If we needed to expand the fs array we
- * might have blocked - try again.
- */
- if (error)
- goto repeat;
-
if (start <= files->next_fd)
files->next_fd = fd + 1;
--
2.43.0
On Sat 22-06-24 11:49:02, Yu Ma wrote:
> There is available fd in the lower 64 bits of open_fds bitmap for most cases
> when we look for an available fd slot. Skip 2-levels searching via
> find_next_zero_bit() for this common fast path.
>
> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> free slot is available there, as:
> (1) The fd allocation algorithm would always allocate fd from small to large.
> Lower bits in open_fds bitmap would be used much more frequently than higher
> bits.
> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
> (3) find_next_zero_bit() itself has a fast path inside to speed up searching
> when size<=64.
>
> Besides, "!start" is added to fast path condition to ensure the allocated fd is
> greater than start (i.e. >=0), given alloc_fd() is only called in two scenarios:
> (1) Allocating a new fd (the most common usage scenario) via
> get_unused_fd_flags() to find fd start from bit 0 in fdt (i.e. start==0).
> (2) Duplicating a fd (less common usage) via dup_fd() to find a fd start from
> old_fd's index in fdt, which is only called by syscall fcntl.
>
> With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved
> by 17% and write by 9% on Intel ICX 160 cores configuration with v6.10-rc4.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
> fs/file.c | 35 +++++++++++++++++++++--------------
> 1 file changed, 21 insertions(+), 14 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index a3b72aa64f11..50e900a47107 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> if (fd < files->next_fd)
> fd = files->next_fd;
>
> - if (fd < fdt->max_fds)
> + error = -EMFILE;
> + if (likely(fd < fdt->max_fds)) {
> + if (~fdt->open_fds[0] && !start) {
> + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
So I don't think this is quite correct. If files->next_fd is set, we could
end up calling find_next_zero_bit() starting from quite high offset causing
a regression? Also because we don't expand in this case, we could cause access
beyond end of fdtable?
Finally, AFAIU this speeds up the lookup for cases where fd < 64 is
available at the cost of cases where the first long is full (there we
unnecessarily load open_fds[0] into cache). Did you check if the cost is
visible (e.g. by making blogbench occupy first 64 fds before starting its
load)?
Honza
> + goto fastreturn;
> + }
> fd = find_next_fd(fdt, fd);
> + }
> +
> + if (unlikely(fd >= fdt->max_fds)) {
> + error = expand_files(files, fd);
> + if (error < 0)
> + goto out;
> + /*
> + * If we needed to expand the fs array we
> + * might have blocked - try again.
> + */
> + if (error)
> + goto repeat;
> + }
>
> +fastreturn:
> /*
> * N.B. For clone tasks sharing a files structure, this test
> * will limit the total number of files that can be opened.
> */
> - error = -EMFILE;
> - if (fd >= end)
> + if (unlikely(fd >= end))
> goto out;
>
> - error = expand_files(files, fd);
> - if (error < 0)
> - goto out;
> -
> - /*
> - * If we needed to expand the fs array we
> - * might have blocked - try again.
> - */
> - if (error)
> - goto repeat;
> -
> if (start <= files->next_fd)
> files->next_fd = fd + 1;
>
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Tue 25-06-24 13:52:57, Jan Kara wrote:
> On Sat 22-06-24 11:49:02, Yu Ma wrote:
> > There is available fd in the lower 64 bits of open_fds bitmap for most cases
> > when we look for an available fd slot. Skip 2-levels searching via
> > find_next_zero_bit() for this common fast path.
> >
> > Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> > free slot is available there, as:
> > (1) The fd allocation algorithm would always allocate fd from small to large.
> > Lower bits in open_fds bitmap would be used much more frequently than higher
> > bits.
> > (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> > it would never be shrunk. The search size increases but there are few open fds
> > available here.
> > (3) find_next_zero_bit() itself has a fast path inside to speed up searching
> > when size<=64.
> >
> > Besides, "!start" is added to fast path condition to ensure the allocated fd is
> > greater than start (i.e. >=0), given alloc_fd() is only called in two scenarios:
> > (1) Allocating a new fd (the most common usage scenario) via
> > get_unused_fd_flags() to find fd start from bit 0 in fdt (i.e. start==0).
> > (2) Duplicating a fd (less common usage) via dup_fd() to find a fd start from
> > old_fd's index in fdt, which is only called by syscall fcntl.
> >
> > With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved
> > by 17% and write by 9% on Intel ICX 160 cores configuration with v6.10-rc4.
> >
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> > fs/file.c | 35 +++++++++++++++++++++--------------
> > 1 file changed, 21 insertions(+), 14 deletions(-)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index a3b72aa64f11..50e900a47107 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > if (fd < files->next_fd)
> > fd = files->next_fd;
> >
> > - if (fd < fdt->max_fds)
> > + error = -EMFILE;
> > + if (likely(fd < fdt->max_fds)) {
> > + if (~fdt->open_fds[0] && !start) {
> > + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
>
> So I don't think this is quite correct. If files->next_fd is set, we could
> end up calling find_next_zero_bit() starting from quite high offset causing
> a regression? Also because we don't expand in this case, we could cause access
> beyond end of fdtable?
OK, I've misunderstood the next_fd logic. next_fd is the lowest 0-bit in
the open_fds bitmap so if next_fd is big, the ~fdt->open_fds[0] should
be false. As such the above condition could be rewritten as:
if (!start && files->next_fd < BITS_PER_LONG)
to avoid loading the first bitmap long if we know it is full? Or we could
maybe go as far as:
if (!start && fd < BITS_PER_LONG && !test_bit(fd, fdt->open_fds))
goto fastreturn;
because AFAIU this should work in exactly the same cases as your code?
Honza
> > + goto fastreturn;
> > + }
> > fd = find_next_fd(fdt, fd);
> > + }
> > +
> > + if (unlikely(fd >= fdt->max_fds)) {
> > + error = expand_files(files, fd);
> > + if (error < 0)
> > + goto out;
> > + /*
> > + * If we needed to expand the fs array we
> > + * might have blocked - try again.
> > + */
> > + if (error)
> > + goto repeat;
> > + }
> >
> > +fastreturn:
> > /*
> > * N.B. For clone tasks sharing a files structure, this test
> > * will limit the total number of files that can be opened.
> > */
> > - error = -EMFILE;
> > - if (fd >= end)
> > + if (unlikely(fd >= end))
> > goto out;
> >
> > - error = expand_files(files, fd);
> > - if (error < 0)
> > - goto out;
> > -
> > - /*
> > - * If we needed to expand the fs array we
> > - * might have blocked - try again.
> > - */
> > - if (error)
> > - goto repeat;
> > -
> > if (start <= files->next_fd)
> > files->next_fd = fd + 1;
> >
> > --
> > 2.43.0
> >
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 6/25/2024 8:53 PM, Jan Kara wrote:
> On Tue 25-06-24 13:52:57, Jan Kara wrote:
>> On Sat 22-06-24 11:49:02, Yu Ma wrote:
>>> There is available fd in the lower 64 bits of open_fds bitmap for most cases
>>> when we look for an available fd slot. Skip 2-levels searching via
>>> find_next_zero_bit() for this common fast path.
>>>
>>> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
>>> free slot is available there, as:
>>> (1) The fd allocation algorithm would always allocate fd from small to large.
>>> Lower bits in open_fds bitmap would be used much more frequently than higher
>>> bits.
>>> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
>>> it would never be shrunk. The search size increases but there are few open fds
>>> available here.
>>> (3) find_next_zero_bit() itself has a fast path inside to speed up searching
>>> when size<=64.
>>>
>>> Besides, "!start" is added to fast path condition to ensure the allocated fd is
>>> greater than start (i.e. >=0), given alloc_fd() is only called in two scenarios:
>>> (1) Allocating a new fd (the most common usage scenario) via
>>> get_unused_fd_flags() to find fd start from bit 0 in fdt (i.e. start==0).
>>> (2) Duplicating a fd (less common usage) via dup_fd() to find a fd start from
>>> old_fd's index in fdt, which is only called by syscall fcntl.
>>>
>>> With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved
>>> by 17% and write by 9% on Intel ICX 160 cores configuration with v6.10-rc4.
>>>
>>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>>> ---
>>> fs/file.c | 35 +++++++++++++++++++++--------------
>>> 1 file changed, 21 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/fs/file.c b/fs/file.c
>>> index a3b72aa64f11..50e900a47107 100644
>>> --- a/fs/file.c
>>> +++ b/fs/file.c
>>> @@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>>> if (fd < files->next_fd)
>>> fd = files->next_fd;
>>>
>>> - if (fd < fdt->max_fds)
>>> + error = -EMFILE;
>>> + if (likely(fd < fdt->max_fds)) {
>>> + if (~fdt->open_fds[0] && !start) {
>>> + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
>> So I don't think this is quite correct. If files->next_fd is set, we could
>> end up calling find_next_zero_bit() starting from quite high offset causing
>> a regression? Also because we don't expand in this case, we could cause access
>> beyond end of fdtable?
> OK, I've misunderstood the next_fd logic. next_fd is the lowest 0-bit in
> the open_fds bitmap so if next_fd is big, the ~fdt->open_fds[0] should
> be false. As such the above condition could be rewritten as:
>
> if (!start && files->next_fd < BITS_PER_LONG)
>
> to avoid loading the first bitmap long if we know it is full? Or we could
> maybe go as far as:
>
> if (!start && fd < BITS_PER_LONG && !test_bit(fd, fdt->open_fds))
> goto fastreturn;
>
> because AFAIU this should work in exactly the same cases as your code?
>
> Honza
Thanks Honza for the good concern and suggestions here, while both above
conditions are not enough to ensure that there is available fd in the
first 64 bits of open_fds. As next_fd just means there is no available
fd before next_fd, just imagine that fd from 0 to 66 are already
occupied, now fd=3 is returned back, then next_fd would be set as 3 per
fd recycling logic (i.e. in __put_unused_fd()), next time when
alloc_fd() being called, it would return fd=3 to the caller and set
next_fd=4. Then next time when alloc_fd() being called again,
next_fd==4, but actually it's already been occupied. So
find_next_zero_bit() is needed to find the real 0 bit anyway. The
conditions should either be like it is in patch or if (!start &&
!test_bit(0, fdt->full_fds_bits)), the latter should also have the
bitmap loading cost, but another point is that a bit in full_fds_bits
represents 64 bits in open_fds, no matter fd >64 or not, full_fds_bits
should be loaded any way, maybe we can modify the condition to use
full_fds_bits ?
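The recycling logic referred to above is essentially this (a sketch of
__put_unused_fd() in fs/file.c; the real code may differ in detail):

	static void __put_unused_fd(struct files_struct *files, unsigned int fd)
	{
		struct fdtable *fdt = files_fdtable(files);

		/* Mark the slot free again (open_fds and the full_fds_bits word). */
		__clear_open_fd(fd, fdt);
		if (fd < files->next_fd)
			files->next_fd = fd;	/* next_fd is only a lower bound */
	}

So after fds 0-66 are opened and fd 3 is closed, next_fd becomes 3; once fd 3 is
handed out again, next_fd is bumped to 4 even though 4 is still in use, which is
why the bitmap search is still needed.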
>>> + goto fastreturn;
>>> + }
>>> fd = find_next_fd(fdt, fd);
>>> + }
>>> +
>>> + if (unlikely(fd >= fdt->max_fds)) {
>>> + error = expand_files(files, fd);
>>> + if (error < 0)
>>> + goto out;
>>> + /*
>>> + * If we needed to expand the fs array we
>>> + * might have blocked - try again.
>>> + */
>>> + if (error)
>>> + goto repeat;
>>> + }
>>>
>>> +fastreturn:
>>> /*
>>> * N.B. For clone tasks sharing a files structure, this test
>>> * will limit the total number of files that can be opened.
>>> */
>>> - error = -EMFILE;
>>> - if (fd >= end)
>>> + if (unlikely(fd >= end))
>>> goto out;
>>>
>>> - error = expand_files(files, fd);
>>> - if (error < 0)
>>> - goto out;
>>> -
>>> - /*
>>> - * If we needed to expand the fs array we
>>> - * might have blocked - try again.
>>> - */
>>> - if (error)
>>> - goto repeat;
>>> -
>>> if (start <= files->next_fd)
>>> files->next_fd = fd + 1;
>>>
>>> --
>>> 2.43.0
>>>
>> --
>> Jan Kara <jack@suse.com>
>> SUSE Labs, CR
>>
On Tue 25-06-24 23:33:34, Ma, Yu wrote:
> On 6/25/2024 8:53 PM, Jan Kara wrote:
> > On Tue 25-06-24 13:52:57, Jan Kara wrote:
> > > On Sat 22-06-24 11:49:02, Yu Ma wrote:
> > > > There is available fd in the lower 64 bits of open_fds bitmap for most cases
> > > > when we look for an available fd slot. Skip 2-levels searching via
> > > > find_next_zero_bit() for this common fast path.
> > > >
> > > > Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> > > > free slot is available there, as:
> > > > (1) The fd allocation algorithm would always allocate fd from small to large.
> > > > Lower bits in open_fds bitmap would be used much more frequently than higher
> > > > bits.
> > > > (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> > > > it would never be shrunk. The search size increases but there are few open fds
> > > > available here.
> > > > (3) find_next_zero_bit() itself has a fast path inside to speed up searching
> > > > when size<=64.
> > > >
> > > > Besides, "!start" is added to fast path condition to ensure the allocated fd is
> > > > greater than start (i.e. >=0), given alloc_fd() is only called in two scenarios:
> > > > (1) Allocating a new fd (the most common usage scenario) via
> > > > get_unused_fd_flags() to find fd start from bit 0 in fdt (i.e. start==0).
> > > > (2) Duplicating a fd (less common usage) via dup_fd() to find a fd start from
> > > > old_fd's index in fdt, which is only called by syscall fcntl.
> > > >
> > > > With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved
> > > > by 17% and write by 9% on Intel ICX 160 cores configuration with v6.10-rc4.
> > > >
> > > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > > > ---
> > > > fs/file.c | 35 +++++++++++++++++++++--------------
> > > > 1 file changed, 21 insertions(+), 14 deletions(-)
> > > >
> > > > diff --git a/fs/file.c b/fs/file.c
> > > > index a3b72aa64f11..50e900a47107 100644
> > > > --- a/fs/file.c
> > > > +++ b/fs/file.c
> > > > @@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > > > if (fd < files->next_fd)
> > > > fd = files->next_fd;
> > > > - if (fd < fdt->max_fds)
> > > > + error = -EMFILE;
> > > > + if (likely(fd < fdt->max_fds)) {
> > > > + if (~fdt->open_fds[0] && !start) {
> > > > + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
> > > So I don't think this is quite correct. If files->next_fd is set, we could
> > > end up calling find_next_zero_bit() starting from quite high offset causing
> > > a regression? Also because we don't expand in this case, we could cause access
> > > beyond end of fdtable?
> > OK, I've misunderstood the next_fd logic. next_fd is the lowest 0-bit in
> > the open_fds bitmap so if next_fd is big, the ~fdt->open_fds[0] should
> > be false. As such the above condition could be rewritten as:
> >
> > if (!start && files->next_fd < BITS_PER_LONG)
> >
> > to avoid loading the first bitmap long if we know it is full? Or we could
> > maybe go as far as:
> >
> > if (!start && fd < BITS_PER_LONG && !test_bit(fd, fdt->open_fds))
> > goto fastreturn;
> >
> > because AFAIU this should work in exactly the same cases as your code?
> >
> > Honza
>
> Thanks Honza for the good concern and suggestions here, while both above
> conditions are not enough to ensure that there is available fd in the first
> 64 bits of open_fds. As next_fd just means there is no available fd before
> next_fd, just imagine that fd from 0 to 66 are already occupied, now fd=3 is
> returned back, then next_fd would be set as 3 per fd recycling logic (i.e.
> in __put_unused_fd()), next time when alloc_fd() being called, it would
> return fd=3 to the caller and set next_fd=4. Then next time when alloc_fd()
> being called again, next_fd==4, but actually it's already been occupied. So
> find_next_zero_bit() is needed to find the real 0 bit anyway.
Indeed, thanks for correcting me! next_fd is just a lower bound for the
first free fd.
> The conditions
> should either be like it is in patch or if (!start && !test_bit(0,
> fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> but another point is that a bit in full_fds_bits represents 64 bits in
> open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> maybe we can modify the condition to use full_fds_bits ?
So maybe I'm wrong but I think the biggest benefit of your code compared to
plain find_next_fd() is exactly in that we don't have to load full_fds_bits
into cache. So I'm afraid that using full_fds_bits in the condition would
destroy your performance gains. Thinking about this with a fresh head how
about implementing your optimization like:
--- a/fs/file.c
+++ b/fs/file.c
@@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
unsigned int maxbit = maxfd / BITS_PER_LONG;
unsigned int bitbit = start / BITS_PER_LONG;
+ /*
+ * Optimistically search the first long of the open_fds bitmap. It
+ * saves us from loading full_fds_bits into cache in the common case
+ * and because BITS_PER_LONG > start >= files->next_fd, we have quite
+ * a good chance there's a bit free in there.
+ */
+ if (start < BITS_PER_LONG) {
+ unsigned int bit;
+
+ bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
+ if (bit < BITS_PER_LONG)
+ return bit;
+ }
+
bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
if (bitbit >= maxfd)
return maxfd;
Plus your optimizations with likely / unlikely. This way the code flow in
alloc_fd() stays more readable, we avoid loading the first open_fds long
into cache if it is full, and we should get all the performance benefits?
Honza
> > > > + goto fastreturn;
> > > > + }
> > > > fd = find_next_fd(fdt, fd);
> > > > + }
> > > > +
> > > > + if (unlikely(fd >= fdt->max_fds)) {
> > > > + error = expand_files(files, fd);
> > > > + if (error < 0)
> > > > + goto out;
> > > > + /*
> > > > + * If we needed to expand the fs array we
> > > > + * might have blocked - try again.
> > > > + */
> > > > + if (error)
> > > > + goto repeat;
> > > > + }
> > > > +fastreturn:
> > > > /*
> > > > * N.B. For clone tasks sharing a files structure, this test
> > > > * will limit the total number of files that can be opened.
> > > > */
> > > > - error = -EMFILE;
> > > > - if (fd >= end)
> > > > + if (unlikely(fd >= end))
> > > > goto out;
> > > > - error = expand_files(files, fd);
> > > > - if (error < 0)
> > > > - goto out;
> > > > -
> > > > - /*
> > > > - * If we needed to expand the fs array we
> > > > - * might have blocked - try again.
> > > > - */
> > > > - if (error)
> > > > - goto repeat;
> > > > -
> > > > if (start <= files->next_fd)
> > > > files->next_fd = fd + 1;
> > > > --
> > > > 2.43.0
> > > >
> > > --
> > > Jan Kara <jack@suse.com>
> > > SUSE Labs, CR
> > >
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@suse.cz> wrote:
> So maybe I'm wrong but I think the biggest benefit of your code compared to
> plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> into cache. So I'm afraid that using full_fds_bits in the condition would
> destroy your performance gains. Thinking about this with a fresh head how
> about putting implementing your optimization like:
>
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> unsigned int maxbit = maxfd / BITS_PER_LONG;
> unsigned int bitbit = start / BITS_PER_LONG;
>
> + /*
> + * Optimistically search the first long of the open_fds bitmap. It
> + * saves us from loading full_fds_bits into cache in the common case
> + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> + * a good chance there's a bit free in there.
> + */
> + if (start < BITS_PER_LONG) {
> + unsigned int bit;
> +
> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> + if (bit < BITS_PER_LONG)
> + return bit;
> + }
> +
> bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> if (bitbit >= maxfd)
> return maxfd;
>
> Plus your optimizations with likely / unlikely. This way the code flow in
> alloc_fd() stays more readable, we avoid loading the first open_fds long
> into cache if it is full, and we should get all the performance benefits?
>
Huh.
So when I read the patch previously I assumed this is testing the bit
word for the map containing next_fd (whatever it is), avoiding looking
at the higher level bitmap and inlining the op (instead of calling the
fully fledged func for bit scans).
I did not mentally register this is in fact only checking for the
beginning of the range of the entire thing. So apologies from my end
as based on my feedback some work was done and I'm going to ask to
further redo it.
blogbench spawns 100 or so workers, say total fd count hovers just
above 100. Say this lines up with about half or more of the cases in practice
for that benchmark.
Even so, that's a benchmark-specific optimization. A busy web server
can have literally tens of thousands of fds open (and this is a pretty
mundane case), making the 0-63 range not particularly interesting.
That aside I think the patchset is in the wrong order -- first patch
tries to not look at the higher level bitmap, while second reduces
stores made there. This makes it quite unclear how much it is worth to
reduce looking there if atomics are conditional.
So here is what I propose in terms of the patches:
1. NULL check removal, sprinkling of likely/unlikely and expand_files
call avoidance; no measurements done vs stock kernel for some effort
saving, just denote in the commit message there is less work under the
lock and treat it as baseline
2. conditional higher level bitmap clear as submitted; benchmarked against 1
3. open_fds check within the range containing fd, avoiding higher
level bitmap if a free slot is found. this should not result in any
func calls if successful; benchmarked against the above
Optionally the bitmap routines can grow variants which always inline
and are used here. If so that would probably land between 1 and 2 on
the list.
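One hypothetical shape for such an always-inline variant (illustration only;
this helper does not exist in the tree and the name is made up):

	/* Scan a single open_fds word for a free bit at or above 'offset';
	 * the caller guarantees offset < BITS_PER_LONG. */
	static __always_inline unsigned int
	scan_word_for_free_bit(unsigned long word, unsigned int offset)
	{
		word |= (1UL << offset) - 1;	/* treat bits below 'offset' as used */
		return word == ~0UL ? BITS_PER_LONG : ffz(word);
	}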
You noted you know about blogbench bugs and have them fixed. Would be
good to post a link to a pull request or some other spot for a
reference.
It'll be best if the vfs folk comment on what they want here.
--
Mateusz Guzik <mjguzik gmail.com>
On Wed, Jun 26, 2024 at 09:13:07PM GMT, Mateusz Guzik wrote:
> On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@suse.cz> wrote:
> > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > into cache. So I'm afraid that using full_fds_bits in the condition would
> > destroy your performance gains. Thinking about this with a fresh head how
> > about putting implementing your optimization like:
> >
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > unsigned int bitbit = start / BITS_PER_LONG;
> >
> > + /*
> > + * Optimistically search the first long of the open_fds bitmap. It
> > + * saves us from loading full_fds_bits into cache in the common case
> > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > + * a good chance there's a bit free in there.
> > + */
> > + if (start < BITS_PER_LONG) {
> > + unsigned int bit;
> > +
> > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > + if (bit < BITS_PER_LONG)
> > + return bit;
> > + }
> > +
> > bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> > if (bitbit >= maxfd)
> > return maxfd;
> >
> > Plus your optimizations with likely / unlikely. This way the code flow in
> > alloc_fd() stays more readable, we avoid loading the first open_fds long
> > into cache if it is full, and we should get all the performance benefits?
> >
>
> Huh.
>
> So when I read the patch previously I assumed this is testing the bit
> word for the map containing next_fd (whatever it is), avoiding looking
> at the higher level bitmap and inlining the op (instead of calling the
> fully fledged func for bit scans).
>
> I did not mentally register this is in fact only checking for the
> beginning of the range of the entire thing. So apologies from my end
> as based on my feedback some work was done and I'm going to ask to
> further redo it.
>
> blogbench spawns 100 or so workers, say total fd count hovers just
> above 100. say this lines up with about half of more cases in practice
> for that benchmark.
>
> Even so, that's a benchmark-specific optimization. A busy web server
> can have literally tens of thousands of fds open (and this is a pretty
> mundane case), making the 0-63 range not particularly interesting.
>
> That aside I think the patchset is in the wrong order -- first patch
> tries to not look at the higher level bitmap, while second reduces
> stores made there. This makes it quite unclear how much is it worth to
> reduce looking there if atomics are conditional.
>
> So here is what I propose in terms of the patches:
> 1. NULL check removal, sprinkling of likely/unlikely and expand_files
> call avoidance; no measurements done vs stock kernel for some effort
> saving, just denote in the commit message there is less work under the
> lock and treat it as baseline
> 2. conditional higher level bitmap clear as submitted; benchmarked against 1
> 3. open_fds check within the range containing fd, avoiding higher
> level bitmap if a free slot is found. this should not result in any
> func calls if successful; benchmarked against the above
>
> Optionally the bitmap routines can grow variants which always inline
> and are used here. If so that would probably land between 1 and 2 on
> the list.
>
> You noted you know about blogbench bugs and have them fixed. Would be
> good to post a link to a pull request or some other spot for a
> reference.
>
> I'll be best if the vfs folk comment on what they want here.
Optimizing only the < BITS_PER_LONG case seems less desirable than making it
work for an arbitrary next_fd. Imho, it'll also be easier to follow if
everything follows the same logic.
On 6/27/2024 11:33 PM, Christian Brauner wrote:
> On Wed, Jun 26, 2024 at 09:13:07PM GMT, Mateusz Guzik wrote:
>> On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@suse.cz> wrote:
>>> So maybe I'm wrong but I think the biggest benefit of your code compared to
>>> plain find_next_fd() is exactly in that we don't have to load full_fds_bits
>>> into cache. So I'm afraid that using full_fds_bits in the condition would
>>> destroy your performance gains. Thinking about this with a fresh head how
>>> about putting implementing your optimization like:
>>>
>>> --- a/fs/file.c
>>> +++ b/fs/file.c
>>> @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>>> unsigned int bitbit = start / BITS_PER_LONG;
>>>
>>> + /*
>>> + * Optimistically search the first long of the open_fds bitmap. It
>>> + * saves us from loading full_fds_bits into cache in the common case
>>> + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
>>> + * a good chance there's a bit free in there.
>>> + */
>>> + if (start < BITS_PER_LONG) {
>>> + unsigned int bit;
>>> +
>>> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
>>> + if (bit < BITS_PER_LONG)
>>> + return bit;
>>> + }
>>> +
>>> bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
>>> if (bitbit >= maxfd)
>>> return maxfd;
>>>
>>> Plus your optimizations with likely / unlikely. This way the code flow in
>>> alloc_fd() stays more readable, we avoid loading the first open_fds long
>>> into cache if it is full, and we should get all the performance benefits?
>>>
>> Huh.
>>
>> So when I read the patch previously I assumed this is testing the bit
>> word for the map containing next_fd (whatever it is), avoiding looking
>> at the higher level bitmap and inlining the op (instead of calling the
>> fully fledged func for bit scans).
>>
>> I did not mentally register this is in fact only checking for the
>> beginning of the range of the entire thing. So apologies from my end
>> as based on my feedback some work was done and I'm going to ask to
>> further redo it.
>>
>> blogbench spawns 100 or so workers, say total fd count hovers just
>> above 100. say this lines up with about half of more cases in practice
>> for that benchmark.
>>
>> Even so, that's a benchmark-specific optimization. A busy web server
>> can have literally tens of thousands of fds open (and this is a pretty
>> mundane case), making the 0-63 range not particularly interesting.
>>
>> That aside I think the patchset is in the wrong order -- first patch
>> tries to not look at the higher level bitmap, while second reduces
>> stores made there. This makes it quite unclear how much is it worth to
>> reduce looking there if atomics are conditional.
>>
>> So here is what I propose in terms of the patches:
>> 1. NULL check removal, sprinkling of likely/unlikely and expand_files
>> call avoidance; no measurements done vs stock kernel for some effort
>> saving, just denote in the commit message there is less work under the
>> lock and treat it as baseline
>> 2. conditional higher level bitmap clear as submitted; benchmarked against 1
>> 3. open_fds check within the range containing fd, avoiding higher
>> level bitmap if a free slot is found. this should not result in any
>> func calls if successful; benchmarked against the above
>>
>> Optionally the bitmap routines can grow variants which always inline
>> and are used here. If so that would probably land between 1 and 2 on
>> the list.
>>
>> You noted you know about blogbench bugs and have them fixed. Would be
>> good to post a link to a pull request or some other spot for a
>> reference.
>>
>> I'll be best if the vfs folk comment on what they want here.
> Optimizing only the < BIT_PER_LONG seems less desirable then making it
> work for arbitrary next_fd. Imho, it'll also be easier to follow if
> everything follows the same logic.
Sorry that this message is a bit long, and thanks for your time to review.
First, sincere thanks to all for the lively discussion and kind suggestions
to make the patch set better. At least we have already reached agreement on
removing sanity_check and adding the conditional clear (p.s. I'll revise the
BUG_ON to a WARN_ON in fd_install() as agreed). I fully agree with Guzik's
suggestion to reorder the patches. As the remaining focus of the discussion
is the fast path, I suggest we submit patches 1 & 2 (after reordering) for
upstreaming first (with data remeasured on the latest kernel version
accordingly), and then focus the discussion on the fast path.
For this fast path idea, here is a summary of why I still think it is
valuable, for further discussion:
1. The original intention of the fast path is to reduce function calls and
avoid unnecessary loads/stores on the members sharing the same cacheline
(such as file_lock, next_fd and the 3 bitmaps; BTW, we tried adding
__cacheline_aligned_in_smp for next_fd and the fd array, with no improvement
observed), especially since all these operations are inside the critical
section of file_lock.
2. For the fast path implementation, the essential and simple point is to
directly return an available bit if there is a free bit in [0-63]. I'd
emphasize that it does not only help the case of a low number of open fds
(even though that is the majority case on a system, as Honza agreed), but
also the cases where lots of fds are opened and closed frequently by
short-lived tasks (as per the algorithm, lower bits are preferred for
allocation after being recycled). This covers not only blogbench, a
synthetic benchmark, but also the realistic scenario claimed in
f3f86e33dc3d ("vfs: Fix pathological performance case for __alloc_fd()"),
which literally introduced the 2-level bitmap searching algorithm to vfs as
we see it now. We may ask Eric for help to see whether it's possible to get
some data on it. Besides, for workloads where lots of fds are allocated and
held for a long time, the lock contention would be much lower than in the
scenarios we're talking about here, so the impact of the change would also
be much smaller.
3. Now about the extra cost of the fast path as implemented in the patch we
submitted. To be honest, before the fast path we first noticed that
alloc_fd() is only called in two scenarios, as I mentioned in the commit:
(1) called by __get_unused_fd_flags() to find an available fd starting from
bit 0, which is the most common usage. That means start == 0 for alloc_fd(),
and with this premise the alloc_fd() logic can be simpler: the two branches
comparing against next_fd can be dropped. (2) duplicating an fd via dup_fd()
to find an fd starting from old_fd, which means "start" is not zero, but it
is only called by the fcntl syscall. The first impression that came to mind
was: why sacrifice the performance of the vast majority of cases for this
specific dup case? So we revised __get_unused_fd_flags() to not call
alloc_fd() directly, but to use the same logic as alloc_fd() with the
branches related to "start" omitted. Based on that version we then found the
fast path could be more efficient than the stock 2-level bitmap search,
based on the considerations stated in item 2 and in the commit message of
this thread. Leaving the benefit aside, the extra cost is one conditional
branch, but with the two branches related to "start" removed it is still
profitable, not to mention that the cost can be alleviated by the branch
predictor. However, with that draft version the code of
__get_unused_fd_flags() duplicated a lot of alloc_fd(), so we changed to the
current version for more concise code. What I want to say is that there is
room to make it faster at a cost lower than stock. As for using open_fds[0]
in the fast path condition, we think it's OK since all the bitmaps are
almost on the same cacheline, and we finally need to write open_fds to
allocate the bit anyway.
4. Based on the patch order suggested by Guzik, we've re-measured the data
on the latest kernel 6.10-rc5: removing sanity_check and adding
likely/unlikely gives a 6% gain for read and 2% for write. Combined with the
conditional clear, it gives a 14% gain for read and 8% for write. With the
fast path it might add another ~15% gain for read (we have not re-measured
this one yet due to time, and will make it up soon).
On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote:
> 2. For fast path implementation, the essential and simple point is to
> directly return an available bit if there is free bit in [0-63]. I'd
> emphasize that it does not only improve low number of open fds (even it
> is the majority case on system as Honza agreed), but also improve the
> cases that lots of fds open/close frequently with short task (as per the
> algorithm, lower bits will be prioritized to allocate after being
> recycled). Not only blogbench, a synthetic benchmark, but also the
> realistic scenario as claimed in f3f86e33dc3d("vfs: Fix pathological
> performance case for __alloc_fd()"), which literally introduced this
> 2-levels bitmap searching algorithm to vfs as we see now.
I don't understand how using next_fd instead is supposed to be inferior.
Maybe I should clarify that by API contract the kernel must return the
lowest free fd it can find. To that end it maintains the next_fd field
as a hint to hopefully avoid some of the search work.
In the stock kernel the first thing done in alloc_fd is setting it as
a starting point:
fdt = files_fdtable(files);
fd = start;
if (fd < files->next_fd)
fd = files->next_fd;
that is all the calls which come here with 0 start their search from
next_fd position.
Suppose you implemented the patch as suggested by me and next_fd fits
the range of 0-63. Then you get the benefit of lower level bitmap
check just like in the patch you submitted, but without having to
first branch on whether you happen to be in that range.
Suppose next_fd is somewhere higher up, say 80. With your general
approach the optimization won't be done whatsoever or it will be
attempted at the 0-63 range when it is an invariant it finds no free
fds.
With what I'm suggesting the general idea of taking a peek at the
lower level bitmap can be applied across the entire fd space. Some
manual mucking will be needed to make sure this never pulls more than
one cacheline, easiest way out I see would be to align next_fd to
BITS_PER_LONG for the bitmap search purposes.
Outside of the scope of this patchset, but definitely worth
considering, is an observation that this still pulls an entire
cacheline worth of a bitmap (assuming it grew). If one could assume
that the size is always a multiple of 64 bytes (which it is after
first expansion) the 64 byte scan could be entirely inlined -- there
is quite a bit of fd fields in this range we may as well scan in hopes
of avoiding looking at the higher level bitmap, after all we already
paid for fetching it. This would take the optimization to its logical
conclusion.
Perhaps it would be ok to special-case the lower bitmap to start with
64 bytes so that there would be no need to branch on it.
On 6/28/2024 3:59 AM, Mateusz Guzik wrote:
> On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote:
>> 2. For fast path implementation, the essential and simple point is to
>> directly return an available bit if there is free bit in [0-63]. I'd
>> emphasize that it does not only improve low number of open fds (even it
>> is the majority case on system as Honza agreed), but also improve the
>> cases that lots of fds open/close frequently with short task (as per the
>> algorithm, lower bits will be prioritized to allocate after being
>> recycled). Not only blogbench, a synthetic benchmark, but also the
>> realistic scenario as claimed in f3f86e33dc3d("vfs: Fix pathological
>> performance case for __alloc_fd()"), which literally introduced this
>> 2-levels bitmap searching algorithm to vfs as we see now.
> I don't understand how using next_fd instead is supposed to be inferior.
>
> Maybe I should clarify that by API contract the kernel must return the
> lowest free fd it can find. To that end it maintains the next_fd field
> as a hint to hopefully avoid some of the search work.
>
> In the stock kernel the first thing done in alloc_fd is setting it as
> a starting point:
> fdt = files_fdtable(files);
> fd = start;
> if (fd < files->next_fd)
> fd = files->next_fd;
>
> that is all the calls which come here with 0 start their search from
> next_fd position.
>
> Suppose you implemented the patch as suggested by me and next_fd fits
> the range of 0-63. Then you get the benefit of lower level bitmap
> check just like in the patch you submitted, but without having to
> first branch on whether you happen to be in that range.
>
> Suppose next_fd is somewhere higher up, say 80. With your general
> approach the optimization wont be done whatsoever or it will be
> attempted at the 0-63 range when it is an invariant it finds no free
> fds.
>
> With what I'm suggesting the general idea of taking a peek at the
> lower level bitmap can be applied across the entire fd space. Some
> manual mucking will be needed to make sure this never pulls more than
> one cacheline, easiest way out I see would be to align next_fd to
> BITS_PER_LONG for the bitmap search purposes.
Some misunderstanding here, Guzik; I thought your previous feedback meant
you felt the fast path was not worthwhile, so that whole message was just to
say we still think the original idea is reasonable. Back to the point here:
implementing it in find_next_fd() by searching the word containing next_fd
makes sense and is OK to me (roughly along the lines of the sketch below).
It's efficient, concise and should bring us the expected benefits. I'll
re-measure the data for reference based on the code proposed by you and
Honza.
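A sketch of that shape (assuming max_fds is always a multiple of BITS_PER_LONG,
as the fdtable sizing arranges; this is not the final posted patch):

	static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
	{
		unsigned int maxfd = fdt->max_fds;
		unsigned int maxbit = maxfd / BITS_PER_LONG;
		unsigned int bitbit = start / BITS_PER_LONG;
		unsigned int bit;

		/* Peek at the open_fds word containing 'start' first, to avoid
		 * touching the second-level full_fds_bits bitmap in the common case. */
		bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
					 start & (BITS_PER_LONG - 1));
		if (bit < BITS_PER_LONG)
			return bit + bitbit * BITS_PER_LONG;

		/* Fall back to the stock two-level search. */
		bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
		if (bitbit >= maxfd)
			return maxfd;
		if (bitbit > start)
			start = bitbit;
		return find_next_zero_bit(fdt->open_fds, maxfd, start);
	}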
> Outside of the scope of this patchset, but definitely worth
> considering, is an observation that this still pulls an entire
> cacheline worth of a bitmap (assuming it grew). If one could assume
> that the size is always a multiply of 64 bytes (which it is after
> first expansion) the 64 byte scan could be entirely inlined -- there
> is quite a bit of fd fields in this range we may as well scan in hopes
> of avoiding looking at the higher level bitmap, after all we already
> paid for fetching it. This would take the optimization to its logical
> conclusion.
>
> Perhaps it would be ok to special-case the lower bitmap to start with
> 64 bytes so that there would be no need to branch on it.
On Thu 27-06-24 21:59:12, Mateusz Guzik wrote:
> On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote:
> > 2. For fast path implementation, the essential and simple point is to
> > directly return an available bit if there is free bit in [0-63]. I'd
> > emphasize that it does not only improve low number of open fds (even it
> > is the majority case on system as Honza agreed), but also improve the
> > cases that lots of fds open/close frequently with short task (as per the
> > algorithm, lower bits will be prioritized to allocate after being
> > recycled). Not only blogbench, a synthetic benchmark, but also the
> > realistic scenario as claimed in f3f86e33dc3d("vfs: Fix pathological
> > performance case for __alloc_fd()"), which literally introduced this
> > 2-levels bitmap searching algorithm to vfs as we see now.
>
> I don't understand how using next_fd instead is supposed to be inferior.
>
> Maybe I should clarify that by API contract the kernel must return the
> lowest free fd it can find. To that end it maintains the next_fd field
> as a hint to hopefully avoid some of the search work.
>
> In the stock kernel the first thing done in alloc_fd is setting it as
> a starting point:
> fdt = files_fdtable(files);
> fd = start;
> if (fd < files->next_fd)
> fd = files->next_fd;
>
> that is all the calls which come here with 0 start their search from
> next_fd position.
Yup.
> Suppose you implemented the patch as suggested by me and next_fd fits
> the range of 0-63. Then you get the benefit of lower level bitmap
> check just like in the patch you submitted, but without having to
> first branch on whether you happen to be in that range.
>
> Suppose next_fd is somewhere higher up, say 80. With your general
> approach the optimization wont be done whatsoever or it will be
> attempted at the 0-63 range when it is an invariant it finds no free
> fds.
>
> With what I'm suggesting the general idea of taking a peek at the
> lower level bitmap can be applied across the entire fd space. Some
> manual mucking will be needed to make sure this never pulls more than
> one cacheline, easiest way out I see would be to align next_fd to
> BITS_PER_LONG for the bitmap search purposes.
Well, all you need to do is to call:
bit = find_next_zero_bit(fdt->open_fds[start / BITS_PER_LONG],
BITS_PER_LONG, start & (BITS_PER_LONG-1));
if (bit < BITS_PER_LONG)
return bit + (start & ~(BITS_PER_LONG - 1));
in find_next_fd(). Not sure if this is what you meant by aligning next_fd
to BITS_PER_LONG...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 6/28/2024 5:12 PM, Jan Kara wrote:
> On Thu 27-06-24 21:59:12, Mateusz Guzik wrote:
>> On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote:
>>> 2. For fast path implementation, the essential and simple point is to
>>> directly return an available bit if there is free bit in [0-63]. I'd
>>> emphasize that it does not only improve low number of open fds (even it
>>> is the majority case on system as Honza agreed), but also improve the
>>> cases that lots of fds open/close frequently with short task (as per the
>>> algorithm, lower bits will be prioritized to allocate after being
>>> recycled). Not only blogbench, a synthetic benchmark, but also the
>>> realistic scenario as claimed in f3f86e33dc3d("vfs: Fix pathological
>>> performance case for __alloc_fd()"), which literally introduced this
>>> 2-levels bitmap searching algorithm to vfs as we see now.
>> I don't understand how using next_fd instead is supposed to be inferior.
>>
>> Maybe I should clarify that by API contract the kernel must return the
>> lowest free fd it can find. To that end it maintains the next_fd field
>> as a hint to hopefully avoid some of the search work.
>>
>> In the stock kernel the first thing done in alloc_fd is setting it as
>> a starting point:
>> fdt = files_fdtable(files);
>> fd = start;
>> if (fd < files->next_fd)
>> fd = files->next_fd;
>>
>> that is all the calls which come here with 0 start their search from
>> next_fd position.
> Yup.
>
>> Suppose you implemented the patch as suggested by me and next_fd fits
>> the range of 0-63. Then you get the benefit of lower level bitmap
>> check just like in the patch you submitted, but without having to
>> first branch on whether you happen to be in that range.
>>
>> Suppose next_fd is somewhere higher up, say 80. With your general
>> approach the optimization wont be done whatsoever or it will be
>> attempted at the 0-63 range when it is an invariant it finds no free
>> fds.
>>
>> With what I'm suggesting the general idea of taking a peek at the
>> lower level bitmap can be applied across the entire fd space. Some
>> manual mucking will be needed to make sure this never pulls more than
>> one cacheline, easiest way out I see would be to align next_fd to
>> BITS_PER_LONG for the bitmap search purposes.
> Well, all you need to do is to call:
>
> bit = find_next_zero_bit(fdt->open_fds[start / BITS_PER_LONG],
> BITS_PER_LONG, start & (BITS_PER_LONG-1));
> if (bit < BITS_PER_LONG)
> return bit + (start & ~(BITS_PER_LONG - 1));
>
>
> in find_next_fd(). Not sure if this is what you meant by aligning next_fd
> to BITS_PER_LONG...
>
> Honza
So the idea here is to take a peek at the word containing next_fd and return
directly if a free bit is found there. Not sure about the efficiency if fds
are opened/closed frequently and next_fd is distributed very randomly. Just
gave this part of the code a quick try; it seems the kernel failed to boot
up 😳, kind of unexpected ...
How about you post a working version of the patchset, even if it only
looks at 0-63 and we are going to massage it afterwards.
On Sat, Jun 29, 2024 at 5:41 PM Ma, Yu <yu.ma@intel.com> wrote:
>
>
> On 6/28/2024 5:12 PM, Jan Kara wrote:
> > On Thu 27-06-24 21:59:12, Mateusz Guzik wrote:
> >> On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote:
> >>> 2. For fast path implementation, the essential and simple point is to
> >>> directly return an available bit if there is free bit in [0-63]. I'd
> >>> emphasize that it does not only improve low number of open fds (even it
> >>> is the majority case on system as Honza agreed), but also improve the
> >>> cases that lots of fds open/close frequently with short task (as per the
> >>> algorithm, lower bits will be prioritized to allocate after being
> >>> recycled). Not only blogbench, a synthetic benchmark, but also the
> >>> realistic scenario as claimed in f3f86e33dc3d("vfs: Fix pathological
> >>> performance case for __alloc_fd()"), which literally introduced this
> >>> 2-levels bitmap searching algorithm to vfs as we see now.
> >> I don't understand how using next_fd instead is supposed to be inferior.
> >>
> >> Maybe I should clarify that by API contract the kernel must return the
> >> lowest free fd it can find. To that end it maintains the next_fd field
> >> as a hint to hopefully avoid some of the search work.
> >>
> >> In the stock kernel the first thing done in alloc_fd is setting it as
> >> a starting point:
> >> fdt = files_fdtable(files);
> >> fd = start;
> >> if (fd < files->next_fd)
> >> fd = files->next_fd;
> >>
> >> that is all the calls which come here with 0 start their search from
> >> next_fd position.
> > Yup.
> >
> >> Suppose you implemented the patch as suggested by me and next_fd fits
> >> the range of 0-63. Then you get the benefit of lower level bitmap
> >> check just like in the patch you submitted, but without having to
> >> first branch on whether you happen to be in that range.
> >>
> >> Suppose next_fd is somewhere higher up, say 80. With your general
> >> approach the optimization wont be done whatsoever or it will be
> >> attempted at the 0-63 range when it is an invariant it finds no free
> >> fds.
> >>
> >> With what I'm suggesting the general idea of taking a peek at the
> >> lower level bitmap can be applied across the entire fd space. Some
> >> manual mucking will be needed to make sure this never pulls more than
> >> one cacheline, easiest way out I see would be to align next_fd to
> >> BITS_PER_LONG for the bitmap search purposes.
> > Well, all you need to do is to call:
> >
> > bit = find_next_zero_bit(fdt->open_fds[start / BITS_PER_LONG],
> > BITS_PER_LONG, start & (BITS_PER_LONG-1));
> > if (bit < BITS_PER_LONG)
> > return bit + (start & ~(BITS_PER_LONG - 1));
> >
> >
> > in find_next_fd(). Not sure if this is what you meant by aligning next_fd
> > to BITS_PER_LONG...
> >
> > Honza
>
> So the idea here is to take a peek at the word contains next_fd, return
> directly if get free bit there. Not sure about the efficiency here if
> open/close fd frequently and next_fd is distributed very randomly. Just
> give a quick try for this part of code, seems kernel failed to boot
> up😳, kind of out of expectation ...
>
--
Mateusz Guzik <mjguzik gmail.com>
On Wed 26-06-24 21:13:07, Mateusz Guzik wrote:
> On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@suse.cz> wrote:
> > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > into cache. So I'm afraid that using full_fds_bits in the condition would
> > destroy your performance gains. Thinking about this with a fresh head how
> > about putting implementing your optimization like:
> >
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > unsigned int bitbit = start / BITS_PER_LONG;
> >
> > + /*
> > + * Optimistically search the first long of the open_fds bitmap. It
> > + * saves us from loading full_fds_bits into cache in the common case
> > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > + * a good chance there's a bit free in there.
> > + */
> > + if (start < BITS_PER_LONG) {
> > + unsigned int bit;
> > +
> > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > + if (bit < BITS_PER_LONG)
> > + return bit;
> > + }
> > +
> > bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> > if (bitbit >= maxfd)
> > return maxfd;
> >
> > Plus your optimizations with likely / unlikely. This way the code flow in
> > alloc_fd() stays more readable, we avoid loading the first open_fds long
> > into cache if it is full, and we should get all the performance benefits?
> >
>
> Huh.
>
> So when I read the patch previously I assumed this is testing the bit
> word for the map containing next_fd (whatever it is), avoiding looking
> at the higher level bitmap and inlining the op (instead of calling the
> fully fledged func for bit scans).
>
> I did not mentally register this is in fact only checking for the
> beginning of the range of the entire thing. So apologies from my end
> as based on my feedback some work was done and I'm going to ask to
> further redo it.
Well, just confirming the fact that the way the code was written was
somewhat confusing ;)
> blogbench spawns 100 or so workers, say total fd count hovers just
> above 100. say this lines up with about half of more cases in practice
> for that benchmark.
>
> Even so, that's a benchmark-specific optimization. A busy web server
> can have literally tens of thousands of fds open (and this is a pretty
> mundane case), making the 0-63 range not particularly interesting.
I agree this optimization only helps processes with a low number of open fds.
On the other hand that is usually the majority of the processes on the
system so the optimization makes sense to me. That being said your idea of
searching the word with next_fd makes sense as well...
> That aside I think the patchset is in the wrong order -- first patch
> tries to not look at the higher level bitmap, while second reduces
> stores made there. This makes it quite unclear how much is it worth to
> reduce looking there if atomics are conditional.
>
> So here is what I propose in terms of the patches:
> 1. NULL check removal, sprinkling of likely/unlikely and expand_files
> call avoidance; no measurements done vs stock kernel for some effort
> saving, just denote in the commit message there is less work under the
> lock and treat it as baseline
> 2. conditional higher level bitmap clear as submitted; benchmarked against 1
> 3. open_fds check within the range containing fd, avoiding higher
> level bitmap if a free slot is found. this should not result in any
> func calls if successful; benchmarked against the above
Yeah, I guess this ordering is the most obvious -> the least obvious so it
makes sense to me.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote:
>
>
> Indeed, thanks for correcting me! next_fd is just a lower bound for the
> first free fd.
>
> > The conditions
> > should either be like it is in patch or if (!start && !test_bit(0,
> > fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> > but another point is that a bit in full_fds_bits represents 64 bits in
> > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> > maybe we can modify the condition to use full_fds_bits ?
>
> So maybe I'm wrong but I think the biggest benefit of your code compared to
> plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> into cache. So I'm afraid that using full_fds_bits in the condition would
> destroy your performance gains. Thinking about this with a fresh head how
> about putting implementing your optimization like:
>
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> unsigned int maxbit = maxfd / BITS_PER_LONG;
> unsigned int bitbit = start / BITS_PER_LONG;
>
> + /*
> + * Optimistically search the first long of the open_fds bitmap. It
> + * saves us from loading full_fds_bits into cache in the common case
> + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> + * a good chance there's a bit free in there.
> + */
> + if (start < BITS_PER_LONG) {
> + unsigned int bit;
> +
> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
Say start is 31 (< BITS_PER_LONG)
bit found here could be 32 and greater than start. Do we care if we return bit > start?
Tim
> + if (bit < BITS_PER_LONG)
> + return bit;
> + }
> +
> bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> if (bitbit >= maxfd)
> return maxfd;
>
> Plus your optimizations with likely / unlikely. This way the code flow in
> alloc_fd() stays more readable, we avoid loading the first open_fds long
> into cache if it is full, and we should get all the performance benefits?
>
> Honza
>
>
> > > > > + goto fastreturn;
> > > > > + }
> > > > > fd = find_next_fd(fdt, fd);
> > > > > + }
> > > > > +
> > > > > + if (unlikely(fd >= fdt->max_fds)) {
> > > > > + error = expand_files(files, fd);
> > > > > + if (error < 0)
> > > > > + goto out;
> > > > > + /*
> > > > > + * If we needed to expand the fs array we
> > > > > + * might have blocked - try again.
> > > > > + */
> > > > > + if (error)
> > > > > + goto repeat;
> > > > > + }
> > > > > +fastreturn:
> > > > > /*
> > > > > * N.B. For clone tasks sharing a files structure, this test
> > > > > * will limit the total number of files that can be opened.
> > > > > */
> > > > > - error = -EMFILE;
> > > > > - if (fd >= end)
> > > > > + if (unlikely(fd >= end))
> > > > > goto out;
> > > > > - error = expand_files(files, fd);
> > > > > - if (error < 0)
> > > > > - goto out;
> > > > > -
> > > > > - /*
> > > > > - * If we needed to expand the fs array we
> > > > > - * might have blocked - try again.
> > > > > - */
> > > > > - if (error)
> > > > > - goto repeat;
> > > > > -
> > > > > if (start <= files->next_fd)
> > > > > files->next_fd = fd + 1;
> > > > > --
> > > > > 2.43.0
> > > > >
> > > > --
> > > > Jan Kara <jack@suse.com>
> > > > SUSE Labs, CR
> > > >
> >
On Wed, 2024-06-26 at 09:43 -0700, Tim Chen wrote:
> On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote:
> >
> >
> > Indeed, thanks for correcting me! next_fd is just a lower bound for the
> > first free fd.
> >
> > > The conditions
> > > should either be like it is in patch or if (!start && !test_bit(0,
> > > fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> > > but another point is that a bit in full_fds_bits represents 64 bits in
> > > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> > > maybe we can modify the condition to use full_fds_bits ?
> >
> > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > into cache. So I'm afraid that using full_fds_bits in the condition would
> > destroy your performance gains. Thinking about this with a fresh head how
> > about implementing your optimization like:
> >
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > unsigned int bitbit = start / BITS_PER_LONG;
> >
> > + /*
> > + * Optimistically search the first long of the open_fds bitmap. It
> > + * saves us from loading full_fds_bits into cache in the common case
> > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > + * a good chance there's a bit free in there.
> > + */
> > + if (start < BITS_PER_LONG) {
> > + unsigned int bit;
> > +
> > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
>
> Say start is 31 (< BITS_PER_LONG)
> bit found here could be 32 and greater than start. Do we care if we return bit > start?
Sorry, I mean to say that we could find a bit like 30 that is less than start instead
of the other way round.
Tim
>
On Wed 26-06-24 09:52:50, Tim Chen wrote:
> On Wed, 2024-06-26 at 09:43 -0700, Tim Chen wrote:
> > On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote:
> > >
> > >
> > > Indeed, thanks for correcting me! next_fd is just a lower bound for the
> > > first free fd.
> > >
> > > > The conditions
> > > > should either be like it is in patch or if (!start && !test_bit(0,
> > > > fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> > > > but another point is that a bit in full_fds_bits represents 64 bits in
> > > > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> > > > maybe we can modify the condition to use full_fds_bits ?
> > >
> > > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > > into cache. So I'm afraid that using full_fds_bits in the condition would
> > > destroy your performance gains. Thinking about this with a fresh head how
> > > about implementing your optimization like:
> > >
> > > --- a/fs/file.c
> > > +++ b/fs/file.c
> > > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > > unsigned int bitbit = start / BITS_PER_LONG;
> > >
> > > + /*
> > > + * Optimistically search the first long of the open_fds bitmap. It
> > > + * saves us from loading full_fds_bits into cache in the common case
> > > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > > + * a good chance there's a bit free in there.
> > > + */
> > > + if (start < BITS_PER_LONG) {
> > > + unsigned int bit;
> > > +
> > > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> >
> > Say start is 31 (< BITS_PER_LONG)
> > bit found here could be 32 and greater than start. Do we care if we return bit > start?
>
> Sorry, I mean to say that we could find a bit like 30 that is less than
> start instead of the other way round.
Well, I propose calling find_next_zero_bit() with offset set to 'start' so
it cannot possibly happen that the returned bit number is smaller than
start... But maybe I'm missing something?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
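For readers less familiar with the bitops API: find_next_zero_bit(addr, size, offset) scans upward from offset and returns the first clear bit at or above offset, or size if the range [offset, size) is fully set, which is why the result can never be below the start value Jan passes in. Below is a tiny userspace analog, purely for illustration; the helper name next_zero_bit() and the bitmap contents are made up here and this is not kernel code:

/*
 * Userspace analog of find_next_zero_bit(): scan upward from 'offset' and
 * return the first clear bit, or 'size' if none exists in [offset, size).
 */
#include <stdio.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

static unsigned long next_zero_bit(const unsigned long *addr,
				   unsigned long size, unsigned long offset)
{
	for (; offset < size; offset++)
		if (!(addr[offset / BITS_PER_LONG] &
		      (1UL << (offset % BITS_PER_LONG))))
			return offset;
	return size;
}

int main(void)
{
	unsigned long open_fds[1] = { 0xffffffffUL };	/* fds 0-31 busy */

	/* offset 31: the first free bit at or above 31 is 32, never below. */
	printf("%lu\n", next_zero_bit(open_fds, BITS_PER_LONG, 31));
	return 0;
}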
On Thu, 2024-06-27 at 14:09 +0200, Jan Kara wrote:
> On Wed 26-06-24 09:52:50, Tim Chen wrote:
> > On Wed, 2024-06-26 at 09:43 -0700, Tim Chen wrote:
> > > On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote:
> > > >
> > > >
> > > > Indeed, thanks for correcting me! next_fd is just a lower bound for the
> > > > first free fd.
> > > >
> > > > > The conditions
> > > > > should either be like it is in patch or if (!start && !test_bit(0,
> > > > > fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> > > > > but another point is that a bit in full_fds_bits represents 64 bits in
> > > > > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> > > > > maybe we can modify the condition to use full_fds_bits ?
> > > >
> > > > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > > > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > > > into cache. So I'm afraid that using full_fds_bits in the condition would
> > > > destroy your performance gains. Thinking about this with a fresh head how
> > > > about implementing your optimization like:
> > > >
> > > > --- a/fs/file.c
> > > > +++ b/fs/file.c
> > > > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > > > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > > > unsigned int bitbit = start / BITS_PER_LONG;
> > > >
> > > > + /*
> > > > + * Optimistically search the first long of the open_fds bitmap. It
> > > > + * saves us from loading full_fds_bits into cache in the common case
> > > > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > > > + * a good chance there's a bit free in there.
> > > > + */
> > > > + if (start < BITS_PER_LONG) {
> > > > + unsigned int bit;
> > > > +
> > > > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > >
> > > Say start is 31 (< BITS_PER_LONG)
> > > bit found here could be 32 and greater than start. Do we care if we return bit > start?
> >
> > Sorry, I mean to say that we could find a bit like 30 that is less than
> > start instead of the other way round.
>
> Well, I propose calling find_next_zero_bit() with offset set to 'start' so
> it cannot possibly happen that the returned bit number is smaller than
> start... But maybe I'm missing something?
Yes, you're right. The search begins at the start bit, so it is okay.
Tim
>
> Honza
On Thu, Jun 27, 2024 at 2:09 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 26-06-24 09:52:50, Tim Chen wrote:
> > On Wed, 2024-06-26 at 09:43 -0700, Tim Chen wrote:
> > > On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote:
> > > >
> > > >
> > > > Indeed, thanks for correcting me! next_fd is just a lower bound for the
> > > > first free fd.
> > > >
> > > > > The conditions
> > > > > should either be like it is in patch or if (!start && !test_bit(0,
> > > > > fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> > > > > but another point is that a bit in full_fds_bits represents 64 bits in
> > > > > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> > > > > maybe we can modify the condition to use full_fds_bits ?
> > > >
> > > > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > > > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > > > into cache. So I'm afraid that using full_fds_bits in the condition would
> > > > destroy your performance gains. Thinking about this with a fresh head how
> > > > about implementing your optimization like:
> > > >
> > > > --- a/fs/file.c
> > > > +++ b/fs/file.c
> > > > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > > > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > > > unsigned int bitbit = start / BITS_PER_LONG;
> > > >
> > > > + /*
> > > > + * Optimistically search the first long of the open_fds bitmap. It
> > > > + * saves us from loading full_fds_bits into cache in the common case
> > > > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > > > + * a good chance there's a bit free in there.
> > > > + */
> > > > + if (start < BITS_PER_LONG) {
> > > > + unsigned int bit;
> > > > +
> > > > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > >
> > > Say start is 31 (< BITS_PER_LONG)
> > > bit found here could be 32 and greater than start. Do we care if we return bit > start?
> >
> > Sorry, I mean to say that we could find a bit like 30 that is less than
> > start instead of the other way round.
>
> Well, I propose calling find_next_zero_bit() with offset set to 'start' so
> it cannot possibly happen that the returned bit number is smaller than
> start... But maybe I'm missing something?
>
You gate it with " if (start < BITS_PER_LONG)" which only covers the
small initial range, while I'm arguing this should work for any fd.
--
Mateusz Guzik <mjguzik gmail.com>
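One possible shape of that more general fast path, sketched here only to illustrate the direction under discussion rather than the code that was eventually posted: probe the open_fds word containing 'start' before consulting full_fds_bits, so the shortcut applies wherever next_fd happens to sit. The sketch assumes, as in alloc_fd(), that find_next_fd() is only called with start below fdt->max_fds:

static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
{
	unsigned int maxfd = fdt->max_fds;
	unsigned int maxbit = maxfd / BITS_PER_LONG;
	unsigned int bitbit = start / BITS_PER_LONG;
	unsigned int bit;

	/*
	 * Optimistically search only the open_fds word containing 'start';
	 * if it has a free slot, full_fds_bits is never touched at all.
	 */
	bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
				 start & (BITS_PER_LONG - 1));
	if (bit < BITS_PER_LONG)
		return bit + bitbit * BITS_PER_LONG;

	/* Slow path: use the summary bitmap to skip over full words. */
	bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
	if (bitbit >= maxfd)
		return maxfd;
	return find_next_zero_bit(fdt->open_fds, maxfd, bitbit);
}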
64 bits in open_fds are mapped to a common bit in full_fds_bits. It is very
likely that the corresponding bit in full_fds_bits has already been cleared by
an earlier __clear_open_fd() call. Check whether the bit in full_fds_bits is
already clear before clearing it, to avoid an unnecessary write and the cache
bouncing it causes. See commit fc90888d07b8 ("vfs: conditionally clear
close-on-exec flag") for a similar optimization. Together with patch 1, this
improves pts/blogbench-1.1.0 read by 27% and write by 14% on an Intel ICX
160-core configuration with v6.10-rc4.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/file.c b/fs/file.c
index 50e900a47107..b4d25f6d4c19 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -268,7 +268,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
{
__clear_bit(fd, fdt->open_fds);
- __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
+ fd /= BITS_PER_LONG;
+ if (test_bit(fd, fdt->full_fds_bits))
+ __clear_bit(fd, fdt->full_fds_bits);
}
static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
--
2.43.0
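To put numbers on the mapping described above (an illustration added here, not part of the patch): each bit of full_fds_bits summarizes one word of open_fds, so a single unsigned long of full_fds_bits covers 64 * 64 = 4096 fds, and closing fd 130 touches open_fds word 2 and, only when that word was marked full, bit 2 of full_fds_bits. The hypothetical helper below just restates that arithmetic; __clear_bit() itself is a plain read-modify-write, which is why skipping it when the bit is already clear keeps the common case read-only:

/* Hypothetical helper, not in fs/file.c: does the word covering 'fd'
 * currently claim to be full?  E.g. fd 130 -> bit 130 / 64 == 2. */
static inline bool covering_word_marked_full(unsigned int fd,
					     const struct fdtable *fdt)
{
	return test_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
}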
On Sat 22-06-24 11:49:03, Yu Ma wrote:
> 64 bits in open_fds are mapped to a common bit in full_fds_bits. It is very
> likely that the corresponding bit in full_fds_bits has already been cleared by
> an earlier __clear_open_fd() call. Check whether the bit in full_fds_bits is
> already clear before clearing it, to avoid an unnecessary write and the cache
> bouncing it causes. See commit fc90888d07b8 ("vfs: conditionally clear
> close-on-exec flag") for a similar optimization. Together with patch 1, this
> improves pts/blogbench-1.1.0 read by 27% and write by 14% on an Intel ICX
> 160-core configuration with v6.10-rc4.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
Nice. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/file.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index 50e900a47107..b4d25f6d4c19 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -268,7 +268,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
> static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
> {
> __clear_bit(fd, fdt->open_fds);
> - __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
> + fd /= BITS_PER_LONG;
> + if (test_bit(fd, fdt->full_fds_bits))
> + __clear_bit(fd, fdt->full_fds_bits);
> }
>
> static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 6/25/2024 7:54 PM, Jan Kara wrote:
> On Sat 22-06-24 11:49:03, Yu Ma wrote:
>> 64 bits in open_fds are mapped to a common bit in full_fds_bits. It is very
>> likely that the corresponding bit in full_fds_bits has already been cleared by
>> an earlier __clear_open_fd() call. Check whether the bit in full_fds_bits is
>> already clear before clearing it, to avoid an unnecessary write and the cache
>> bouncing it causes. See commit fc90888d07b8 ("vfs: conditionally clear
>> close-on-exec flag") for a similar optimization. Together with patch 1, this
>> improves pts/blogbench-1.1.0 read by 27% and write by 14% on an Intel ICX
>> 160-core configuration with v6.10-rc4.
>>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
> Nice. Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
> Honza
Copy that, thanks Honza :)
>> ---
>> fs/file.c | 4 +++-
>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/file.c b/fs/file.c
>> index 50e900a47107..b4d25f6d4c19 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -268,7 +268,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
>> static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
>> {
>> __clear_bit(fd, fdt->open_fds);
>> - __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
>> + fd /= BITS_PER_LONG;
>> + if (test_bit(fd, fdt->full_fds_bits))
>> + __clear_bit(fd, fdt->full_fds_bits);
>> }
>>
>> static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
>> --
>> 2.43.0
>>
alloc_fd() has a sanity check inside to make sure the struct file mapping to the
allocated fd is NULL. Remove this sanity check since it can be assured by
exisitng zero initilization and NULL set when recycling fd.
Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
32%, write improved by 17% on Intel ICX 160 cores configuration with v6.10-rc4.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index b4d25f6d4c19..1153b0b7ba3d 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -555,13 +555,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
else
__clear_close_on_exec(fd, fdt);
error = fd;
-#if 1
- /* Sanity check */
- if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
- printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
- rcu_assign_pointer(fdt->fd[fd], NULL);
- }
-#endif
out:
spin_unlock(&files->file_lock);
--
2.43.0
On Sat 22-06-24 11:49:04, Yu Ma wrote:
> alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> allocated fd is NULL. Remove this sanity check since it can be assured by
> exisitng zero initilization and NULL set when recycling fd.
^^^ existing ^^^ initialization
Well, since this is a sanity check, it is expected it never hits. Yet
searching the web shows it has hit a few times in the past :). So would
wrapping this with unlikely() give a similar performance gain while keeping
debugability? If unlikely() does not help, I agree we can remove this since
fd_install() actually has the same check:
BUG_ON(fdt->fd[fd] != NULL);
and there we need the cacheline anyway so performance impact is minimal.
Now, this condition in alloc_fd() is nice that it does not take the kernel
down so perhaps we could change the BUG_ON to WARN() dumping similar kind
of info as alloc_fd()?
Honza
> Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> 32%, write improved by 17% on Intel ICX 160 cores configuration with v6.10-rc4.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
> fs/file.c | 7 -------
> 1 file changed, 7 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index b4d25f6d4c19..1153b0b7ba3d 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -555,13 +555,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> else
> __clear_close_on_exec(fd, fdt);
> error = fd;
> -#if 1
> - /* Sanity check */
> - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> - rcu_assign_pointer(fdt->fd[fd], NULL);
> - }
> -#endif
>
> out:
> spin_unlock(&files->file_lock);
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Tue, Jun 25, 2024 at 2:08 PM Jan Kara <jack@suse.cz> wrote:
>
> On Sat 22-06-24 11:49:04, Yu Ma wrote:
> > alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> > allocated fd is NULL. Remove this sanity check since it can be assured by
> > exisitng zero initilization and NULL set when recycling fd.
> ^^^ existing ^^^ initialization
>
> Well, since this is a sanity check, it is expected it never hits. Yet
> searching the web shows it has hit a few times in the past :). So would
> wrapping this with unlikely() give a similar performance gain while keeping
> debugability? If unlikely() does not help, I agree we can remove this since
> fd_install() actually has the same check:
>
> BUG_ON(fdt->fd[fd] != NULL);
>
> and there we need the cacheline anyway so performance impact is minimal.
> Now, this condition in alloc_fd() is nice that it does not take the kernel
> down so perhaps we could change the BUG_ON to WARN() dumping similar kind
> of info as alloc_fd()?
>
Christian suggested just removing it.
To my understanding the problem is not the branch per se, but the
cacheline bounce of the fd array induced by reading the status.
Note the thing also nullifies the pointer, kind of defeating the
BUG_ON in fd_install.
I'm guessing it's not going to hurt to branch on it after releasing
the lock and forego nullifying, more or less:
diff --git a/fs/file.c b/fs/file.c
index a3b72aa64f11..d22b867db246 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -524,11 +524,11 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 	 */
 	error = -EMFILE;
 	if (fd >= end)
-		goto out;
+		goto out_locked;
 	error = expand_files(files, fd);
 	if (error < 0)
-		goto out;
+		goto out_locked;
 	/*
 	 * If we needed to expand the fs array we
@@ -546,15 +546,15 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 	else
 		__clear_close_on_exec(fd, fdt);
 	error = fd;
-#if 1
-	/* Sanity check */
-	if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
+	spin_unlock(&files->file_lock);
+
+	if (unlikely(rcu_access_pointer(fdt->fd[fd]) != NULL)) {
 		printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
-		rcu_assign_pointer(fdt->fd[fd], NULL);
 	}
-#endif
-out:
+	return error;
+
+out_locked:
 	spin_unlock(&files->file_lock);
 	return error;
 }
> Honza
>
> > Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> > 32%, write improved by 17% on Intel ICX 160 cores configuration with v6.10-rc4.
> >
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> > fs/file.c | 7 -------
> > 1 file changed, 7 deletions(-)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index b4d25f6d4c19..1153b0b7ba3d 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -555,13 +555,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > else
> > __clear_close_on_exec(fd, fdt);
> > error = fd;
> > -#if 1
> > - /* Sanity check */
> > - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> > - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> > - rcu_assign_pointer(fdt->fd[fd], NULL);
> > - }
> > -#endif
> >
> > out:
> > spin_unlock(&files->file_lock);
> > --
> > 2.43.0
> >
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
--
Mateusz Guzik <mjguzik gmail.com>
On Tue, Jun 25, 2024 at 3:09 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Tue, Jun 25, 2024 at 2:08 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Sat 22-06-24 11:49:04, Yu Ma wrote:
> > > alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> > > allocated fd is NULL. Remove this sanity check since it can be assured by
> > > exisitng zero initilization and NULL set when recycling fd.
> > ^^^ existing ^^^ initialization
> >
> > Well, since this is a sanity check, it is expected it never hits. Yet
> > searching the web shows it has hit a few times in the past :). So would
> > wrapping this with unlikely() give a similar performance gain while keeping
> > debugability? If unlikely() does not help, I agree we can remove this since
> > fd_install() actually has the same check:
> >
> > BUG_ON(fdt->fd[fd] != NULL);
> >
> > and there we need the cacheline anyway so performance impact is minimal.
> > Now, this condition in alloc_fd() is nice that it does not take the kernel
> > down so perhaps we could change the BUG_ON to WARN() dumping similar kind
> > of info as alloc_fd()?
> >
>
> Christian suggested just removing it.
>
> To my understanding the problem is not the branch per se, but the
> cacheline bounce of the fd array induced by reading the status.
>
> Note the thing also nullifies the pointer, kind of defeating the
> BUG_ON in fd_install.
>
> I'm guessing it's not going to hurt to branch on it after releasing
> the lock and forego nullifying, more or less:
> diff --git a/fs/file.c b/fs/file.c
> index a3b72aa64f11..d22b867db246 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -524,11 +524,11 @@ static int alloc_fd(unsigned start, unsigned
> end, unsigned flags)
> */
> error = -EMFILE;
> if (fd >= end)
> - goto out;
> + goto out_locked;
>
> error = expand_files(files, fd);
> if (error < 0)
> - goto out;
> + goto out_locked;
>
> /*
> * If we needed to expand the fs array we
> @@ -546,15 +546,15 @@ static int alloc_fd(unsigned start, unsigned
> end, unsigned flags)
> else
> __clear_close_on_exec(fd, fdt);
> error = fd;
> -#if 1
> - /* Sanity check */
> - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> + spin_unlock(&files->file_lock);
> +
> + if (unlikely(rcu_access_pointer(fdt->fd[fd]) != NULL)) {
> printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> - rcu_assign_pointer(fdt->fd[fd], NULL);
> }
> -#endif
Now that I sent it, it is of course not safe to deref without
protection from either rcu or the lock, so this would have to be
wrapped with rcu_read_lock, which makes it even less appealing.
Whacking the thing as in the submitted patch seems like the best way
forward here. :)
>
> -out:
> + return error;
> +
> +out_locked:
> spin_unlock(&files->file_lock);
> return error;
> }
>
>
>
> > Honza
> >
> > > Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> > > 32%, write improved by 17% on Intel ICX 160 cores configuration with v6.10-rc4.
> > >
> > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > > ---
> > > fs/file.c | 7 -------
> > > 1 file changed, 7 deletions(-)
> > >
> > > diff --git a/fs/file.c b/fs/file.c
> > > index b4d25f6d4c19..1153b0b7ba3d 100644
> > > --- a/fs/file.c
> > > +++ b/fs/file.c
> > > @@ -555,13 +555,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > > else
> > > __clear_close_on_exec(fd, fdt);
> > > error = fd;
> > > -#if 1
> > > - /* Sanity check */
> > > - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> > > - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> > > - rcu_assign_pointer(fdt->fd[fd], NULL);
> > > - }
> > > -#endif
> > >
> > > out:
> > > spin_unlock(&files->file_lock);
> > > --
> > > 2.43.0
> > >
> > --
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
>
>
>
> --
> Mateusz Guzik <mjguzik gmail.com>
--
Mateusz Guzik <mjguzik gmail.com>
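For completeness, the rough shape the deferred check would have to take (a sketch only, never posted as a patch): once file_lock is dropped, the fdtable has to be re-looked-up under rcu_read_lock(), which is exactly what makes the variant above less appealing:

	/* Sketch of the tail of alloc_fd() with the check moved past unlock. */
	error = fd;
	spin_unlock(&files->file_lock);

	/*
	 * With file_lock dropped the fdtable may be resized concurrently,
	 * so the sanity check needs its own RCU section and a fresh lookup.
	 */
	rcu_read_lock();
	fdt = files_fdtable(files);
	if (unlikely(rcu_access_pointer(fdt->fd[fd]) != NULL))
		printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
	rcu_read_unlock();

	return error;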
On Tue 25-06-24 15:11:23, Mateusz Guzik wrote:
> On Tue, Jun 25, 2024 at 3:09 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > On Tue, Jun 25, 2024 at 2:08 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Sat 22-06-24 11:49:04, Yu Ma wrote:
> > > > alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> > > > allocated fd is NULL. Remove this sanity check since it can be assured by
> > > > exisitng zero initilization and NULL set when recycling fd.
> > > ^^^ existing ^^^ initialization
> > >
> > > Well, since this is a sanity check, it is expected it never hits. Yet
> > > searching the web shows it has hit a few times in the past :). So would
> > > wrapping this with unlikely() give a similar performance gain while keeping
> > > debugability? If unlikely() does not help, I agree we can remove this since
> > > fd_install() actually has the same check:
> > >
> > > BUG_ON(fdt->fd[fd] != NULL);
> > >
> > > and there we need the cacheline anyway so performance impact is minimal.
> > > Now, this condition in alloc_fd() is nice that it does not take the kernel
> > > down so perhaps we could change the BUG_ON to WARN() dumping similar kind
> > > of info as alloc_fd()?
> > >
> >
> > Christian suggested just removing it.
> >
> > To my understanding the problem is not the branch per se, but the
> > cacheline bounce of the fd array induced by reading the status.
> >
> > Note the thing also nullifies the pointer, kind of defeating the
> > BUG_ON in fd_install.
> >
> > I'm guessing it's not going to hurt to branch on it after releasing
> > the lock and forego nullifying, more or less:
> > diff --git a/fs/file.c b/fs/file.c
> > index a3b72aa64f11..d22b867db246 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -524,11 +524,11 @@ static int alloc_fd(unsigned start, unsigned
> > end, unsigned flags)
> > */
> > error = -EMFILE;
> > if (fd >= end)
> > - goto out;
> > + goto out_locked;
> >
> > error = expand_files(files, fd);
> > if (error < 0)
> > - goto out;
> > + goto out_locked;
> >
> > /*
> > * If we needed to expand the fs array we
> > @@ -546,15 +546,15 @@ static int alloc_fd(unsigned start, unsigned
> > end, unsigned flags)
> > else
> > __clear_close_on_exec(fd, fdt);
> > error = fd;
> > -#if 1
> > - /* Sanity check */
> > - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> > + spin_unlock(&files->file_lock);
> > +
> > + if (unlikely(rcu_access_pointer(fdt->fd[fd]) != NULL)) {
> > printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> > - rcu_assign_pointer(fdt->fd[fd], NULL);
> > }
> > -#endif
>
> Now that I sent it, it is of course not safe to deref without
> protection from either rcu or the lock, so this would have to be
> wrapped with rcu_read_lock, which makes it even less appealing.
>
> Whacking the thing as in the submitted patch seems like the best way
> forward here. :)
Yeah, as I wrote, I'm fine removing it, in particular if Christian is of
the same opinion. I was more musing about whether we should make the check
in fd_install() less aggressive since it is now more likely to trigger...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Tue, Jun 25, 2024 at 03:30:31PM GMT, Jan Kara wrote:
> On Tue 25-06-24 15:11:23, Mateusz Guzik wrote:
> > On Tue, Jun 25, 2024 at 3:09 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >
> > > On Tue, Jun 25, 2024 at 2:08 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Sat 22-06-24 11:49:04, Yu Ma wrote:
> > > > > alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> > > > > allocated fd is NULL. Remove this sanity check since it can be assured by
> > > > > exisitng zero initilization and NULL set when recycling fd.
> > > > ^^^ existing ^^^ initialization
> > > >
> > > > Well, since this is a sanity check, it is expected it never hits. Yet
> > > > searching the web shows it has hit a few times in the past :). So would
> > > > wrapping this with unlikely() give a similar performance gain while keeping
> > > > debugability? If unlikely() does not help, I agree we can remove this since
> > > > fd_install() actually has the same check:
> > > >
> > > > BUG_ON(fdt->fd[fd] != NULL);
> > > >
> > > > and there we need the cacheline anyway so performance impact is minimal.
> > > > Now, this condition in alloc_fd() is nice that it does not take the kernel
> > > > down so perhaps we could change the BUG_ON to WARN() dumping similar kind
> > > > of info as alloc_fd()?
> > > >
> > >
> > > Christian suggested just removing it.
> > >
> > > To my understanding the problem is not the branch per se, but the
> > > cacheline bounce of the fd array induced by reading the status.
> > >
> > > Note the thing also nullifies the pointer, kind of defeating the
> > > BUG_ON in fd_install.
> > >
> > > I'm guessing it's not going to hurt to branch on it after releasing
> > > the lock and forego nullifying, more or less:
> > > diff --git a/fs/file.c b/fs/file.c
> > > index a3b72aa64f11..d22b867db246 100644
> > > --- a/fs/file.c
> > > +++ b/fs/file.c
> > > @@ -524,11 +524,11 @@ static int alloc_fd(unsigned start, unsigned
> > > end, unsigned flags)
> > > */
> > > error = -EMFILE;
> > > if (fd >= end)
> > > - goto out;
> > > + goto out_locked;
> > >
> > > error = expand_files(files, fd);
> > > if (error < 0)
> > > - goto out;
> > > + goto out_locked;
> > >
> > > /*
> > > * If we needed to expand the fs array we
> > > @@ -546,15 +546,15 @@ static int alloc_fd(unsigned start, unsigned
> > > end, unsigned flags)
> > > else
> > > __clear_close_on_exec(fd, fdt);
> > > error = fd;
> > > -#if 1
> > > - /* Sanity check */
> > > - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> > > + spin_unlock(&files->file_lock);
> > > +
> > > + if (unlikely(rcu_access_pointer(fdt->fd[fd]) != NULL)) {
> > > printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> > > - rcu_assign_pointer(fdt->fd[fd], NULL);
> > > }
> > > -#endif
> >
> > Now that I sent it, it is of course not safe to deref without
> > protection from either rcu or the lock, so this would have to be
> > wrapped with rcu_read_lock, which makes it even less appealing.
> >
> > Whacking the thing as in the submitted patch seems like the best way
> > forward here. :)
>
> Yeah, as I wrote, I'm fine removing it, in particular if Christian is of
> the same opinion. I was more musing about whether we should make the check
> in fd_install() less aggressive since it is now more likely to trigger...
We could change it to WARN_ON() and then people can get BUG_ON()
behavior when they turn WARN into BUG, which apparently is a thing that
we support.
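A minimal sketch of that suggestion in fd_install(), assuming the check stays where the existing BUG_ON() sits (illustration only, not a submitted patch):

	/* Warn on a non-NULL slot instead of taking the machine down; with
	 * panic_on_warn set this still escalates to a crash. */
	WARN_ON(rcu_access_pointer(fdt->fd[fd]) != NULL);
	rcu_assign_pointer(fdt->fd[fd], file);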