pts/blogbench-1.1.0 is a benchmark designed to replicate the load of a
real-world busy file server by multiple threads of random reads, writes,
and rewrites. When running default configuration with multiple parallel
threads, hot spin lock contention is observed from alloc_fd(),
file_closed_fd() and put_unused_fd() around file_lock.

These 3 patches are created to reduce the critical section of file_lock
in alloc_fd() and close_fd(). As a result, pts/blogbench-1.1.0 has been
improved by 32% for read and 15% for write with over 30% kernel cycles
reduced on ICX 160 cores configuration with v6.8-rc6.

Yu Ma (3):
  fs/file.c: add fast path in alloc_fd()
  fs/file.c: conditionally clear full_fds
  fs/file.c: move sanity_check from alloc_fd() to put_unused_fd()

 fs/file.c | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

--
2.43.0
pts/blogbench-1.1.0 is a benchmark designed to replicate the load of a
real-world busy file server by multiple threads of random reads, writes,
and rewrites. When running default configuration with multiple parallel
threads, hot spin lock contention is observed from alloc_fd(),
file_closed_fd() and put_unused_fd() around file_lock.

These 3 patches are created to reduce the critical section of file_lock
in alloc_fd() and close_fd(). As a result, on top of patch 1,
pts/blogbench-1.1.0 has been improved by 22% for read and 8% for write
on Intel ICX 160 cores configuration with v6.10-rc7.

v4 -> v5:
Revised patch 3 for some code style issues.

v3 -> v4:
1. Rebased the patch set to v6.10-rc7 and updated performance results.
2. Updated patch 1 to revise the order of fd checking.
3. As proposed by Mateusz Guzik <mjguzik gmail.com>, and agreed by Jan Kara
   <jack@suse.cz> and Tim Chen <tim.c.chen@linux.intel.com>, updated patch 3
   with a more generic and scalable fast path for the word contains next_fd.

v2 -> v3:
1. Rebased the patch set to latest v6.10-rc6 and updated the performance results
2. Reordered the patches as suggested by Mateusz Guzik <mjguzik gmail.com>
3. Updated the fast path from alloc_fd() to find_next_fd() as suggested by
   Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>, it is
   efficient and more concise than v2.

v1 -> v2:
1. Rebased the patch set to latest v6.10-rc4 and updated the performance results
2. Fixed the bug identified by Mateusz Guzik in patch 1 with adding rlimit
   check for fast path
3. Updated patch 3 to remove sanity_check directly per the alignment with
   maintainer

Yu Ma (3):
  fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
  fs/file.c: conditionally clear full_fds
  fs/file.c: add fast path in find_next_fd()

 fs/file.c | 46 ++++++++++++++++++++++++++--------------------
 1 file changed, 26 insertions(+), 20 deletions(-)

--
2.43.0
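For context, the kind of workload that produces this contention can be
sketched in a few lines of user-space C (a hypothetical reproducer, not the
pts/blogbench code): many threads of one process allocating and releasing
fds in a tight loop, so every open()/close() has to take the shared
files_struct's file_lock.

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 16
#define ITERS	 100000

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		int fd = open("/dev/null", O_RDONLY);	/* fd allocation under file_lock */
		if (fd >= 0)
			close(fd);			/* fd release under file_lock */
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	puts("done");
	return 0;
}

Built with "gcc -O2 -pthread", this hammers the fd allocation and release
paths in the same way the benchmark's parallel threads do.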
On Wed, 17 Jul 2024 10:50:15 -0400, Yu Ma wrote:
> pts/blogbench-1.1.0 is a benchmark designed to replicate the
> load of a real-world busy file server by multiple threads of
> random reads, writes, and rewrites. When running default configuration
> with multiple parallel threads, hot spin lock contention is observed
> from alloc_fd(), file_closed_fd() and put_unused_fd() around file_lock.
>
> These 3 patches are created to reduce the critical section of file_lock
> in alloc_fd() and close_fd(). As a result, on top of patch 1,
> pts/blogbench-1.1.0 has been improved by 22% for read and 8% for write
> on Intel ICX 160 cores configuration with v6.10-rc7.
>
> [...]
Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc
[1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
https://git.kernel.org/vfs/vfs/c/19f4a3f5712a
[2/3] fs/file.c: conditionally clear full_fds
https://git.kernel.org/vfs/vfs/c/b483266658a8
[3/3] fs/file.c: add fast path in find_next_fd()
https://git.kernel.org/vfs/vfs/c/3603a42c8c03
On Mon, Jul 22, 2024 at 05:02:04PM +0200, Christian Brauner wrote:
> Applied to the vfs.misc branch of the vfs/vfs.git tree.
> Patches in the vfs.misc branch should appear in linux-next soon.
>
> Please report any outstanding bugs that were missed during review in a
> new review to the original patch series allowing us to drop it.
>
> It's encouraged to provide Acked-bys and Reviewed-bys even though the
> patch has now been applied. If possible patch trailers will be updated.
>
> Note that commit hashes shown below are subject to change due to rebase,
> trailer updates or similar. If in doubt, please check the listed branch.
>
> tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> branch: vfs.misc
>
> [1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
>       https://git.kernel.org/vfs/vfs/c/19f4a3f5712a
> [2/3] fs/file.c: conditionally clear full_fds
>       https://git.kernel.org/vfs/vfs/c/b483266658a8
> [3/3] fs/file.c: add fast path in find_next_fd()
>       https://git.kernel.org/vfs/vfs/c/3603a42c8c03

Hmm... Something fishy's going on - those are not reachable by any branches.

As the matter of fact, none of the branches contain _anything_ recent in
fs/file.c or include/linux/fdtable.h; the only thing I see for
include/linux/file.h is (dubious, IMO) DEFINE_FREE(fput,...) - this
IS_ERR_OR_NULL in there needs a review of potential users.

I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
and I'm going to apply those 3 unless anyone objects.
> Hmm... Something fishy's going on - those are not reachable by any branches.

Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
to play around with --onto.

> I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
> and I'm going to apply those 3 unless anyone objects.

Fine since they aren't in that branch. Otherwise I generally prefer to
just merge a common branch. But that's rarely needed since Linus handles
merge conflicts anyway and doesn't mind.
On Fri, Aug 02, 2024 at 01:04:44PM +0200, Christian Brauner wrote:
> > Hmm... Something fishy's going on - those are not reachable by any branches.
>
> Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
> to play around with --onto.
>
> > I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
> > and I'm going to apply those 3 unless anyone objects.
>
> Fine since they aren't in that branch. Otherwise I generally prefer to
> just merge a common branch.

If it's going to be rebased anyway, I don't see much difference from cherry-pick,
TBH...
On Fri, Aug 02, 2024 at 03:22:48PM GMT, Al Viro wrote:
> On Fri, Aug 02, 2024 at 01:04:44PM +0200, Christian Brauner wrote:
> > > Hmm... Something fishy's going on - those are not reachable by any branches.
> >
> > Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
> > to play around with --onto.
> >
> > > I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
> > > and I'm going to apply those 3 unless anyone objects.
> >
> > Fine since they aren't in that branch. Otherwise I generally prefer to
> > just merge a common branch.
>
> If it's going to be rebased anyway, I don't see much difference from cherry-pick,
> TBH...

Yeah, but I generally don't rebase after -rc1 anymore unles there's
really annoying conflicts.
On 8/5/2024 2:56 PM, Christian Brauner wrote:
> On Fri, Aug 02, 2024 at 03:22:48PM GMT, Al Viro wrote:
>> On Fri, Aug 02, 2024 at 01:04:44PM +0200, Christian Brauner wrote:
>>>> Hmm... Something fishy's going on - those are not reachable by any branches.
>>> Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
>>> to play around with --onto.
>>>
>>>> I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
>>>> and I'm going to apply those 3 unless anyone objects.
>>> Fine since they aren't in that branch. Otherwise I generally prefer to
>>> just merge a common branch.
>> If it's going to be rebased anyway, I don't see much difference from cherry-pick,
>> TBH...
> Yeah, but I generally don't rebase after -rc1 anymore unles there's
> really annoying conflicts.

Thanks Christian and Al for your time and efforts. I'm not familiar with the
merging process, may i know about when these patches could be seen in master ?

Regards
Yu
On Mon, Aug 12, 2024 at 09:31:17AM +0800, Ma, Yu wrote:
>
> On 8/5/2024 2:56 PM, Christian Brauner wrote:
> > On Fri, Aug 02, 2024 at 03:22:48PM GMT, Al Viro wrote:
> > > On Fri, Aug 02, 2024 at 01:04:44PM +0200, Christian Brauner wrote:
> > > > > Hmm... Something fishy's going on - those are not reachable by any branches.
> > > > Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
> > > > to play around with --onto.
> > > >
> > > > > I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
> > > > > and I'm going to apply those 3 unless anyone objects.
> > > > Fine since they aren't in that branch. Otherwise I generally prefer to
> > > > just merge a common branch.
> > > If it's going to be rebased anyway, I don't see much difference from cherry-pick,
> > > TBH...
> > Yeah, but I generally don't rebase after -rc1 anymore unles there's
> > really annoying conflicts.
>
> Thanks Christian and Al for your time and efforts. I'm not familiar with the
> merging process, may i know about when these patches could be seen in master

It's in work.fdtable in my tree, will post that series tonight and add to #for-next
On Mon 12-08-24 03:40:44, Al Viro wrote:
> On Mon, Aug 12, 2024 at 09:31:17AM +0800, Ma, Yu wrote:
> >
> > On 8/5/2024 2:56 PM, Christian Brauner wrote:
> > > On Fri, Aug 02, 2024 at 03:22:48PM GMT, Al Viro wrote:
> > > > On Fri, Aug 02, 2024 at 01:04:44PM +0200, Christian Brauner wrote:
> > > > > > Hmm... Something fishy's going on - those are not reachable by any branches.
> > > > > Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
> > > > > to play around with --onto.
> > > > >
> > > > > > I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
> > > > > > and I'm going to apply those 3 unless anyone objects.
> > > > > Fine since they aren't in that branch. Otherwise I generally prefer to
> > > > > just merge a common branch.
> > > > If it's going to be rebased anyway, I don't see much difference from cherry-pick,
> > > > TBH...
> > > Yeah, but I generally don't rebase after -rc1 anymore unles there's
> > > really annoying conflicts.
> >
> > Thanks Christian and Al for your time and efforts. I'm not familiar with the
> > merging process, may i know about when these patches could be seen in master
>
> It's in work.fdtable in my tree, will post that series tonight and add to #for-next

Al, it seems you didn't push the patches to Linus during the last merge
window. Do you plan to push them during the coming one?

Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Wed, Nov 06, 2024 at 06:44:23PM +0100, Jan Kara wrote:
> On Mon 12-08-24 03:40:44, Al Viro wrote:
> > On Mon, Aug 12, 2024 at 09:31:17AM +0800, Ma, Yu wrote:
> > >
> > > On 8/5/2024 2:56 PM, Christian Brauner wrote:
> > > > On Fri, Aug 02, 2024 at 03:22:48PM GMT, Al Viro wrote:
> > > > > On Fri, Aug 02, 2024 at 01:04:44PM +0200, Christian Brauner wrote:
> > > > > > > Hmm... Something fishy's going on - those are not reachable by any branches.
> > > > > > Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
> > > > > > to play around with --onto.
> > > > > >
> > > > > > > I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
> > > > > > > and I'm going to apply those 3 unless anyone objects.
> > > > > > Fine since they aren't in that branch. Otherwise I generally prefer to
> > > > > > just merge a common branch.
> > > > > If it's going to be rebased anyway, I don't see much difference from cherry-pick,
> > > > > TBH...
> > > > Yeah, but I generally don't rebase after -rc1 anymore unles there's
> > > > really annoying conflicts.
> > >
> > > Thanks Christian and Al for your time and efforts. I'm not familiar with the
> > > merging process, may i know about when these patches could be seen in master
> >
> > It's in work.fdtable in my tree, will post that series tonight and add to #for-next
>
> Al, it seems you didn't push the patches to Linus during the last merge
> window. Do you plan to push them during the coming one?

Yes, I do.

I can merge that branch into viro/vfs.git#for-next, but that would be
redundant - note that vfs/vfs.git#work.fdtable is identical to it (same
sha1 of head) and vfs/vfs.git#vfs.file contains a merge from it and
vfs/vfs.git#vfs.all merges #vfs.file. So it's already in linux-next in
the current form and until something else gets added to the branch I see
no point.
On 8/12/2024 10:40 AM, Al Viro wrote:
> On Mon, Aug 12, 2024 at 09:31:17AM +0800, Ma, Yu wrote:
>> On 8/5/2024 2:56 PM, Christian Brauner wrote:
>>> On Fri, Aug 02, 2024 at 03:22:48PM GMT, Al Viro wrote:
>>>> On Fri, Aug 02, 2024 at 01:04:44PM +0200, Christian Brauner wrote:
>>>>>> Hmm... Something fishy's going on - those are not reachable by any branches.
>>>>> Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
>>>>> to play around with --onto.
>>>>>
>>>>>> I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
>>>>>> and I'm going to apply those 3 unless anyone objects.
>>>>> Fine since they aren't in that branch. Otherwise I generally prefer to
>>>>> just merge a common branch.
>>>> If it's going to be rebased anyway, I don't see much difference from cherry-pick,
>>>> TBH...
>>> Yeah, but I generally don't rebase after -rc1 anymore unles there's
>>> really annoying conflicts.
>> Thanks Christian and Al for your time and efforts. I'm not familiar with the
>> merging process, may i know about when these patches could be seen in master
> It's in work.fdtable in my tree, will post that series tonight and add to #for-next

Thanks Al for confirmation :)
alloc_fd() has a sanity check inside to make sure the struct file mapped to
the allocated fd is NULL. Remove this sanity check since it is already assured
by the existing zero initialization and by the NULL set when an fd is recycled.
Meanwhile, add likely/unlikely annotations and skip the expand_files() call
when no expansion is needed, to reduce the work done under file_lock.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 33 ++++++++++++++-------------------
1 file changed, 14 insertions(+), 19 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index a3b72aa64f11..e1b9d6df7941 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -515,7 +515,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
if (fd < files->next_fd)
fd = files->next_fd;
- if (fd < fdt->max_fds)
+ if (likely(fd < fdt->max_fds))
fd = find_next_fd(fdt, fd);
/*
@@ -523,19 +523,21 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
* will limit the total number of files that can be opened.
*/
error = -EMFILE;
- if (fd >= end)
+ if (unlikely(fd >= end))
goto out;
- error = expand_files(files, fd);
- if (error < 0)
- goto out;
+ if (unlikely(fd >= fdt->max_fds)) {
+ error = expand_files(files, fd);
+ if (error < 0)
+ goto out;
- /*
- * If we needed to expand the fs array we
- * might have blocked - try again.
- */
- if (error)
- goto repeat;
+ /*
+ * If we needed to expand the fs array we
+ * might have blocked - try again.
+ */
+ if (error)
+ goto repeat;
+ }
if (start <= files->next_fd)
files->next_fd = fd + 1;
@@ -546,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
else
__clear_close_on_exec(fd, fdt);
error = fd;
-#if 1
- /* Sanity check */
- if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
- printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
- rcu_assign_pointer(fdt->fd[fd], NULL);
- }
-#endif
out:
spin_unlock(&files->file_lock);
@@ -618,7 +613,7 @@ void fd_install(unsigned int fd, struct file *file)
rcu_read_unlock_sched();
spin_lock(&files->file_lock);
fdt = files_fdtable(files);
- BUG_ON(fdt->fd[fd] != NULL);
+ WARN_ON(fdt->fd[fd] != NULL);
rcu_assign_pointer(fdt->fd[fd], file);
spin_unlock(&files->file_lock);
return;
--
2.43.0
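To illustrate why the check is removable, here is a minimal user-space sketch
(hypothetical names, not the kernel code) of the invariant the commit message
relies on: a slot is reset to NULL before its bit is cleared, and new slots
start out zeroed, so whatever free bit the allocator picks must map to a NULL
pointer.

#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TABLE_SIZE 64

struct table {
	void *slot[TABLE_SIZE];	/* zero-initialized, like a fresh fd table */
	unsigned long open;	/* bit i set => slot i is in use */
};

static int alloc_slot(struct table *t, void *obj)
{
	for (int i = 0; i < TABLE_SIZE; i++) {
		if (!(t->open & (1UL << i))) {
			assert(t->slot[i] == NULL);	/* the invariant the removed check verified */
			t->open |= 1UL << i;
			t->slot[i] = obj;
			return i;
		}
	}
	return -1;
}

static void free_slot(struct table *t, int i)
{
	t->slot[i] = NULL;		/* "NULL set when recycling fd" */
	t->open &= ~(1UL << i);		/* only then is the slot marked free again */
}

int main(void)
{
	struct table t;
	int a, b;

	memset(&t, 0, sizeof(t));	/* "existing zero initialization" */
	a = alloc_slot(&t, (void *)0x1);
	free_slot(&t, a);
	b = alloc_slot(&t, (void *)0x2);
	assert(a == b && t.slot[b] == (void *)0x2);
	return 0;
}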
On Wed, Jul 17, 2024 at 10:50:16AM -0400, Yu Ma wrote:
> alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> allocated fd is NULL. Remove this sanity check since it can be assured by
> exisitng zero initilization and NULL set when recycling fd. Meanwhile, add
> likely/unlikely and expand_file() call avoidance to reduce the work under
> file_lock.
> + if (unlikely(fd >= fdt->max_fds)) {
> + error = expand_files(files, fd);
> + if (error < 0)
> + goto out;
>
> - /*
> - * If we needed to expand the fs array we
> - * might have blocked - try again.
> - */
> - if (error)
> - goto repeat;
> + /*
> + * If we needed to expand the fs array we
> + * might have blocked - try again.
> + */
> + if (error)
> + goto repeat;
With that change you can't get 0 from expand_files() here, so the
last goto should be unconditional. The only case when expand_files()
returns 0 is when it has found the descriptor already being covered
by fdt; since fdt->max_fds is stabilized by ->files_lock we are
holding here, comparison in expand_files() will give the same
result as it just had.
IOW, that goto repeat should be unconditional. The fun part here is
that this was the only caller that distinguished between 0 and 1...
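Spelled out, the tail of that branch would then collapse to something like the
following sketch (not necessarily the exact code that ended up in
work.fdtable):

	if (unlikely(fd >= fdt->max_fds)) {
		error = expand_files(files, fd);
		if (error < 0)
			goto out;
		/*
		 * We had to expand the fd array and might have
		 * blocked - try again.
		 */
		goto repeat;
	}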
On 8/15/2024 5:38 AM, Al Viro wrote:
> On Wed, Jul 17, 2024 at 10:50:16AM -0400, Yu Ma wrote:
>> alloc_fd() has a sanity check inside to make sure the struct file mapping to the
>> allocated fd is NULL. Remove this sanity check since it can be assured by
>> exisitng zero initilization and NULL set when recycling fd. Meanwhile, add
>> likely/unlikely and expand_file() call avoidance to reduce the work under
>> file_lock.
>> + if (unlikely(fd >= fdt->max_fds)) {
>> + error = expand_files(files, fd);
>> + if (error < 0)
>> + goto out;
>>
>> - /*
>> - * If we needed to expand the fs array we
>> - * might have blocked - try again.
>> - */
>> - if (error)
>> - goto repeat;
>> + /*
>> + * If we needed to expand the fs array we
>> + * might have blocked - try again.
>> + */
>> + if (error)
>> + goto repeat;
> With that change you can't get 0 from expand_files() here, so the
> last goto should be unconditional. The only case when expand_files()
> returns 0 is when it has found the descriptor already being covered
> by fdt; since fdt->max_fds is stabilized by ->files_lock we are
> holding here, comparison in expand_files() will give the same
> result as it just had.
>
> IOW, that goto repeat should be unconditional. The fun part here is
> that this was the only caller that distinguished between 0 and 1...
Yes, thanks Al, fully agree with you. The if (error) could be removed
here as the above unlikely would make sure no 0 return here. Should I
submit another version of patch set to update it, or you may help to
update it directly during merge? I'm not very familiar with the rules,
please let me know if I'd update and I'll take action soon. Thanks again
for your careful check here.
On Thu, Aug 15, 2024 at 10:49:40AM +0800, Ma, Yu wrote:

> Yes, thanks Al, fully agree with you. The if (error) could be removed here
> as the above unlikely would make sure no 0 return here. Should I submit
> another version of patch set to update it, or you may help to update it
> directly during merge? I'm not very familiar with the rules, please let me
> know if I'd update and I'll take action soon. Thanks again for your careful
> check here.

See the current work.fdtable...
On Thu, Aug 15, 2024 at 5:45 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Thu, Aug 15, 2024 at 10:49:40AM +0800, Ma, Yu wrote:
>
> > Yes, thanks Al, fully agree with you. The if (error) could be removed here
> > as the above unlikely would make sure no 0 return here. Should I submit
> > another version of patch set to update it, or you may help to update it
> > directly during merge? I'm not very familiar with the rules, please let me
> > know if I'd update and I'll take action soon. Thanks again for your careful
> > check here.
>
> See the current work.fdtable...

this patchset seems to have fallen through the cracks. anything
holding up the merge?

--
Mateusz Guzik <mjguzik gmail.com>
On Thu, Oct 31, 2024 at 08:42:18AM +0100, Mateusz Guzik wrote:
> On Thu, Aug 15, 2024 at 5:45 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Thu, Aug 15, 2024 at 10:49:40AM +0800, Ma, Yu wrote:
> >
> > > Yes, thanks Al, fully agree with you. The if (error) could be removed here
> > > as the above unlikely would make sure no 0 return here. Should I submit
> > > another version of patch set to update it, or you may help to update it
> > > directly during merge? I'm not very familiar with the rules, please let me
> > > know if I'd update and I'll take action soon. Thanks again for your careful
> > > check here.
> >
> > See the current work.fdtable...
>
> this patchset seems to have fallen through the cracks. anything
> holding up the merge?

It's in next for this merge window.
On 8/15/2024 11:45 AM, Al Viro wrote:
> On Thu, Aug 15, 2024 at 10:49:40AM +0800, Ma, Yu wrote:
>
>> Yes, thanks Al, fully agree with you. The if (error) could be removed here
>> as the above unlikely would make sure no 0 return here. Should I submit
>> another version of patch set to update it, or you may help to update it
>> directly during merge? I'm not very familiar with the rules, please let me
>> know if I'd update and I'll take action soon. Thanks again for your careful
>> check here.
> See the current work.fdtable...

Saw it, thanks :)
Hello,
kernel test robot noticed a 1.2% improvement of will-it-scale.per_process_ops on:
commit: f1139c8e66d5c618aad04a93a2378ad9586464f9 ("[PATCH v5 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()")
url: https://github.com/intel-lab-lkp/linux/commits/Yu-Ma/fs-file-c-remove-sanity_check-and-add-likely-unlikely-in-alloc_fd/20240717-224830
base: https://git.kernel.org/cgit/linux/kernel/git/vfs/vfs.git vfs.all
patch link: https://lore.kernel.org/all/20240717145018.3972922-2-yu.ma@intel.com/
patch subject: [PATCH v5 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
testcase: will-it-scale
test machine: 256 threads 2 sockets GENUINE INTEL(R) XEON(R) (Sierra Forest) with 128G memory
parameters:
nr_task: 100%
mode: process
test: dup1
cpufreq_governor: performance
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240806/202408062146.832faa23-oliver.sang@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-13/performance/x86_64-rhel-8.3/process/100%/debian-12-x86_64-20240206.cgz/lkp-srf-2sp1/dup1/will-it-scale
commit:
5f30e082ab ("Merge branch 'vfs.iomap' into vfs.all")
f1139c8e66 ("fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()")
5f30e082ab8b3431 f1139c8e66d5c618aad04a93a23
---------------- ---------------------------
%stddev %change %stddev
\ | \
377983 ± 69% +74.1% 658036 ± 17% numa-meminfo.node0.AnonPages
18.17 ± 10% -48.6% 9.33 ± 35% perf-c2c.DRAM.local
8.796e+08 +1.2% 8.903e+08 will-it-scale.256.processes
3436082 +1.2% 3477810 will-it-scale.per_process_ops
8.796e+08 +1.2% 8.903e+08 will-it-scale.workload
1.517e+11 -4.3% 1.452e+11 perf-stat.i.branch-instructions
0.03 ± 8% +0.0 0.04 ± 36% perf-stat.i.branch-miss-rate%
0.93 +3.9% 0.96 perf-stat.i.cpi
7.13e+11 -3.5% 6.88e+11 perf-stat.i.instructions
1.08 -3.4% 1.04 perf-stat.i.ipc
0.93 +3.4% 0.96 perf-stat.overall.cpi
1.08 -3.3% 1.04 perf-stat.overall.ipc
245130 -4.4% 234451 perf-stat.overall.path-length
1.512e+11 -4.3% 1.447e+11 perf-stat.ps.branch-instructions
7.106e+11 -3.5% 6.857e+11 perf-stat.ps.instructions
2.156e+14 -3.2% 2.087e+14 perf-stat.total.instructions
14.90 -0.7 14.20 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.dup
12.01 -0.7 11.32 perf-profile.calltrace.cycles-pp.__x64_sys_dup.do_syscall_64.entry_SYSCALL_64_after_hwframe.dup
16.54 -0.7 15.88 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.dup
6.44 -0.6 5.89 perf-profile.calltrace.cycles-pp.alloc_fd.__x64_sys_dup.do_syscall_64.entry_SYSCALL_64_after_hwframe.dup
2.86 -0.0 2.82 perf-profile.calltrace.cycles-pp.entry_SYSRETQ_unsafe_stack.__close
8.94 -0.0 8.90 perf-profile.calltrace.cycles-pp.filp_flush.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
7.76 -0.0 7.72 perf-profile.calltrace.cycles-pp.locks_remove_posix.filp_flush.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.58 -0.0 2.54 perf-profile.calltrace.cycles-pp.__fput_sync.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
1.11 -0.0 1.10 perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
1.33 +0.0 1.35 perf-profile.calltrace.cycles-pp.testcase
0.54 +0.0 0.56 perf-profile.calltrace.cycles-pp.x64_sys_call.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
0.79 +0.0 0.82 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_safe_stack.dup
1.33 +0.0 1.37 perf-profile.calltrace.cycles-pp.close@plt
2.73 +0.1 2.78 perf-profile.calltrace.cycles-pp._raw_spin_lock.file_close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.05 +0.1 1.11 perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.dup
4.35 +0.1 4.42 perf-profile.calltrace.cycles-pp.file_close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
22.18 +0.3 22.51 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.__close
21.50 ± 2% +1.5 23.02 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.dup
12.10 -0.7 11.39 perf-profile.children.cycles-pp.__x64_sys_dup
34.79 -0.7 34.12 perf-profile.children.cycles-pp.do_syscall_64
38.04 -0.6 37.42 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
6.48 -0.6 5.90 perf-profile.children.cycles-pp.alloc_fd
1.86 -0.5 1.41 perf-profile.children.cycles-pp.syscall_return_via_sysret
0.57 -0.1 0.47 perf-profile.children.cycles-pp.fd_install
9.11 -0.0 9.07 perf-profile.children.cycles-pp.filp_flush
7.93 -0.0 7.89 perf-profile.children.cycles-pp.locks_remove_posix
2.61 -0.0 2.58 perf-profile.children.cycles-pp.__fput_sync
1.16 +0.0 1.18 perf-profile.children.cycles-pp.x64_sys_call
0.05 +0.0 0.07 ± 13% perf-profile.children.cycles-pp.clockevents_program_event
0.51 +0.0 0.53 perf-profile.children.cycles-pp.syscall_exit_to_user_mode_prepare
2.17 +0.0 2.20 perf-profile.children.cycles-pp.syscall_exit_to_user_mode
5.72 +0.0 5.75 perf-profile.children.cycles-pp._raw_spin_lock
2.10 +0.0 2.13 perf-profile.children.cycles-pp.testcase
2.02 +0.0 2.06 perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
0.13 ± 2% +0.0 0.17 perf-profile.children.cycles-pp.dup@plt
4.38 +0.1 4.46 perf-profile.children.cycles-pp.file_close_fd
23.00 +0.1 23.11 perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
59.27 +0.5 59.73 perf-profile.children.cycles-pp.__close
28.73 +1.1 29.80 perf-profile.children.cycles-pp.entry_SYSCALL_64
1.86 -0.5 1.41 perf-profile.self.cycles-pp.syscall_return_via_sysret
2.28 -0.2 2.12 perf-profile.self.cycles-pp.alloc_fd
0.54 -0.1 0.43 perf-profile.self.cycles-pp.fd_install
7.87 -0.0 7.83 perf-profile.self.cycles-pp.locks_remove_posix
2.47 -0.0 2.44 perf-profile.self.cycles-pp.__fput_sync
1.23 +0.0 1.24 perf-profile.self.cycles-pp.file_close_fd_locked
1.09 +0.0 1.11 perf-profile.self.cycles-pp.x64_sys_call
0.51 +0.0 0.53 perf-profile.self.cycles-pp.syscall_exit_to_user_mode_prepare
1.29 +0.0 1.32 perf-profile.self.cycles-pp.testcase
5.66 +0.0 5.69 perf-profile.self.cycles-pp._raw_spin_lock
1.95 +0.0 1.99 perf-profile.self.cycles-pp.entry_SYSCALL_64_safe_stack
2.85 +0.0 2.90 perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.02 ±141% +0.0 0.06 ± 13% perf-profile.self.cycles-pp.ktime_get
0.00 +0.1 0.07 perf-profile.self.cycles-pp.dup@plt
22.93 +0.1 23.05 perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
10.11 +0.2 10.34 perf-profile.self.cycles-pp.dup
13.70 +0.3 13.98 perf-profile.self.cycles-pp.entry_SYSCALL_64
9.84 ± 3% +0.7 10.51 perf-profile.self.cycles-pp.__close
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
64 bits in open_fds are mapped to a common bit in full_fds_bits. It is very
likely that the corresponding bit in full_fds_bits has already been cleared
by a previous __clear_open_fd() operation. Check whether the bit in
full_fds_bits is set before clearing it, to avoid an unnecessary write and
the resulting cache bouncing. See commit fc90888d07b8 ("vfs: conditionally
clear close-on-exec flag") for a similar optimization.

Taking the stock kernel with patch 1 as baseline, this improves
pts/blogbench-1.1.0 read by 13% and write by 5% on Intel ICX 160 cores
configuration with v6.10-rc7.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/file.c b/fs/file.c
index e1b9d6df7941..1be2a5bcc7c4 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -268,7 +268,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
{
__clear_bit(fd, fdt->open_fds);
- __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
+ fd /= BITS_PER_LONG;
+ if (test_bit(fd, fdt->full_fds_bits))
+ __clear_bit(fd, fdt->full_fds_bits);
}
static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
--
2.43.0
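As an aside, the relationship being exploited can be sketched in user-space C
(hypothetical names and sizes, not the kernel implementation): each word of
open_fds is summarized by one bit of full_fds_bits, and the summary bit is
only worth touching when it is actually set.

#include <stdbool.h>
#include <stdio.h>

#define BITS_PER_LONG 64

static unsigned long open_fds[4];	/* covers fds 0..255 */
static unsigned long full_fds_bits[1];	/* one summary bit per open_fds word */

static bool test_bit(unsigned int nr, const unsigned long *map)
{
	return map[nr / BITS_PER_LONG] & (1UL << (nr % BITS_PER_LONG));
}

static void clear_bit(unsigned int nr, unsigned long *map)
{
	map[nr / BITS_PER_LONG] &= ~(1UL << (nr % BITS_PER_LONG));
}

static void clear_open_fd(unsigned int fd)
{
	clear_bit(fd, open_fds);
	fd /= BITS_PER_LONG;
	/*
	 * The summary bit is only set when all 64 fds of the word were in
	 * use; skipping the write in the common case avoids dirtying (and
	 * bouncing) the cache line holding full_fds_bits.
	 */
	if (test_bit(fd, full_fds_bits))
		clear_bit(fd, full_fds_bits);
}

int main(void)
{
	full_fds_bits[0] = 1UL << 2;	/* pretend fds 128..191 are all open */
	open_fds[2] = ~0UL;

	clear_open_fd(130);		/* fd 130 lives in open_fds word 2 */
	printf("full_fds_bits[0] = %#lx\n", full_fds_bits[0]);	/* now 0 */
	return 0;
}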
Skip the 2-level search via find_next_zero_bit() when there is a free slot in
the word that contains next_fd, as:
(1) next_fd indicates the lower bound for the first free fd.
(2) There is a fast path inside find_next_zero_bit() when size <= 64 to speed
up the search.
(3) After fdt is expanded (the bitmap size is doubled on each expansion), it
is never shrunk. The search size increases, but there are few open fds
available here.

This fast path was proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed
to by Jan Kara <jack@suse.cz>, as it is more generic and scalable than
previous versions. On top of patches 1 and 2, it improves pts/blogbench-1.1.0
read by 8% and write by 4% on Intel ICX 160 cores configuration with
v6.10-rc7.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/fs/file.c b/fs/file.c
index 1be2a5bcc7c4..729c07a4fc28 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -491,6 +491,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
unsigned int maxbit = maxfd / BITS_PER_LONG;
unsigned int bitbit = start / BITS_PER_LONG;
+ unsigned int bit;
+
+ /*
+ * Try to avoid looking at the second level bitmap
+ */
+ bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
+ start & (BITS_PER_LONG - 1));
+ if (bit < BITS_PER_LONG)
+ return bit + bitbit * BITS_PER_LONG;
bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
if (bitbit >= maxfd)
--
2.43.0
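To make the index arithmetic concrete, a small user-space sketch (hypothetical
values, with a naive single-word stand-in for find_next_zero_bit()) of what
the fast path computes:

#include <stdio.h>

#define BITS_PER_LONG 64

/* naive single-word stand-in for the kernel's find_next_zero_bit() */
static unsigned int find_zero_bit_in_word(unsigned long word, unsigned int offset)
{
	for (unsigned int bit = offset; bit < BITS_PER_LONG; bit++)
		if (!(word & (1UL << bit)))
			return bit;
	return BITS_PER_LONG;		/* no free bit in this word */
}

int main(void)
{
	/* words 0 and 1 full (fds 0..127), word 2 has fds 128..130 in use */
	unsigned long open_fds[4] = { ~0UL, ~0UL, 0x7UL, 0 };
	unsigned int start = 131;	/* plays the role of next_fd */
	unsigned int bitbit = start / BITS_PER_LONG;	/* word index: 2 */
	unsigned int bit;

	bit = find_zero_bit_in_word(open_fds[bitbit], start & (BITS_PER_LONG - 1));
	if (bit < BITS_PER_LONG)
		printf("first free fd: %u\n", bit + bitbit * BITS_PER_LONG);	/* 131 */
	return 0;
}

Only the single word holding next_fd is scanned; the second-level
full_fds_bits lookup is needed only when that word turns out to be full.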
Hello,
kernel test robot noticed a 6.3% improvement of will-it-scale.per_thread_ops on:
commit: b8decf0015a8b1ff02cdac61c0aa54355d8e73d7 ("[PATCH v5 3/3] fs/file.c: add fast path in find_next_fd()")
url: https://github.com/intel-lab-lkp/linux/commits/Yu-Ma/fs-file-c-remove-sanity_check-and-add-likely-unlikely-in-alloc_fd/20240717-224830
base: https://git.kernel.org/cgit/linux/kernel/git/vfs/vfs.git vfs.all
patch link: https://lore.kernel.org/all/20240717145018.3972922-4-yu.ma@intel.com/
patch subject: [PATCH v5 3/3] fs/file.c: add fast path in find_next_fd()
testcase: will-it-scale
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory
parameters:
nr_task: 100%
mode: thread
test: open3
cpufreq_governor: performance
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240806/202408062152.7e5b5d6d-oliver.sang@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-13/performance/x86_64-rhel-8.3/thread/100%/debian-12-x86_64-20240206.cgz/lkp-spr-2sp4/open3/will-it-scale
commit:
5bb3423bf9 ("fs/file.c: conditionally clear full_fds")
b8decf0015 ("fs/file.c: add fast path in find_next_fd()")
5bb3423bf9f9d91e b8decf0015a8b1ff02cdac61c0a
---------------- ---------------------------
%stddev %change %stddev
\ | \
848151 +6.2% 901119 ± 2% will-it-scale.224.threads
3785 +6.3% 4022 ± 2% will-it-scale.per_thread_ops
848151 +6.2% 901119 ± 2% will-it-scale.workload
0.28 ± 4% +13.3% 0.32 ± 3% perf-stat.i.MPKI
31.31 ± 3% +2.0 33.28 perf-stat.i.cache-miss-rate%
14955855 ± 4% +13.6% 16995785 ± 4% perf-stat.i.cache-misses
49676581 +6.7% 53009444 ± 3% perf-stat.i.cache-references
43955 ± 4% -12.3% 38549 ± 4% perf-stat.i.cycles-between-cache-misses
0.28 ± 4% +13.4% 0.32 ± 4% perf-stat.overall.MPKI
29.84 ± 3% +1.9 31.78 ± 2% perf-stat.overall.cache-miss-rate%
43445 ± 4% -12.1% 38200 ± 4% perf-stat.overall.cycles-between-cache-misses
19005976 -5.4% 17972604 ± 2% perf-stat.overall.path-length
14869677 ± 4% +13.6% 16898438 ± 4% perf-stat.ps.cache-misses
49821402 +6.7% 53168235 ± 3% perf-stat.ps.cache-references
49.42 -0.1 49.34 perf-profile.calltrace.cycles-pp.alloc_fd.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
49.40 -0.1 49.32 perf-profile.calltrace.cycles-pp.file_close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
49.25 -0.1 49.18 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.file_close_fd.__x64_sys_close.do_syscall_64
49.20 -0.1 49.13 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_fd.do_sys_openat2.__x64_sys_openat
49.33 -0.1 49.26 perf-profile.calltrace.cycles-pp._raw_spin_lock.file_close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
49.28 -0.1 49.22 perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_fd.do_sys_openat2.__x64_sys_openat.do_syscall_64
50.14 +0.0 50.18 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
50.17 +0.0 50.21 perf-profile.calltrace.cycles-pp.open64
0.64 ± 5% +0.1 0.75 ± 6% perf-profile.calltrace.cycles-pp.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.62 ± 5% +0.1 0.74 ± 6% perf-profile.calltrace.cycles-pp.path_openat.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64
49.42 -0.1 49.34 perf-profile.children.cycles-pp.alloc_fd
49.40 -0.1 49.32 perf-profile.children.cycles-pp.file_close_fd
0.06 -0.0 0.05 perf-profile.children.cycles-pp.file_close_fd_locked
0.15 ± 5% +0.0 0.17 ± 4% perf-profile.children.cycles-pp.init_file
0.22 ± 3% +0.0 0.25 ± 3% perf-profile.children.cycles-pp.alloc_empty_file
0.18 ± 6% +0.0 0.22 ± 6% perf-profile.children.cycles-pp.__fput
50.14 +0.0 50.18 perf-profile.children.cycles-pp.__x64_sys_openat
50.18 +0.0 50.22 perf-profile.children.cycles-pp.open64
0.18 ± 14% +0.0 0.23 ± 7% perf-profile.children.cycles-pp.do_dentry_open
0.30 ± 8% +0.1 0.36 ± 8% perf-profile.children.cycles-pp.do_open
0.64 ± 5% +0.1 0.75 ± 6% perf-profile.children.cycles-pp.do_filp_open
0.63 ± 5% +0.1 0.75 ± 6% perf-profile.children.cycles-pp.path_openat
0.06 -0.0 0.05 perf-profile.self.cycles-pp.file_close_fd_locked
0.16 ± 2% +0.0 0.18 ± 2% perf-profile.self.cycles-pp._raw_spin_lock
0.08 ± 12% +0.0 0.10 ± 4% perf-profile.self.cycles-pp.__fput
0.05 ± 7% +0.1 0.10 ± 4% perf-profile.self.cycles-pp.alloc_fd
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Wed, Jul 17, 2024 at 4:24 PM Yu Ma <yu.ma@intel.com> wrote:
>
> Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
> word contains next_fd, as:
> (1) next_fd indicates the lower bound for the first free fd.
> (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> searching.

this is stale -- now the fast path searches up to 64 fds in the lower bitmap

> (3) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
>
> This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
> Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
> versions.

I think this paragraph is droppable. You already got an ack from Jan
below, so stating he agrees with the patch is redundant. As for me I
don't think this warrants mentioning. Just remove it, perhaps
Christian will be willing to massage it by himself to avoid another
series posting.

> And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
> 8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.
>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
> fs/file.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/fs/file.c b/fs/file.c
> index 1be2a5bcc7c4..729c07a4fc28 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -491,6 +491,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> unsigned int maxbit = maxfd / BITS_PER_LONG;
> unsigned int bitbit = start / BITS_PER_LONG;
> + unsigned int bit;
> +
> + /*
> + * Try to avoid looking at the second level bitmap
> + */
> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> + start & (BITS_PER_LONG - 1));
> + if (bit < BITS_PER_LONG)
> + return bit + bitbit * BITS_PER_LONG;
>
> bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> if (bitbit >= maxfd)
> --
> 2.43.0
>

--
Mateusz Guzik <mjguzik gmail.com>
On 7/20/2024 1:53 AM, Mateusz Guzik wrote:
> On Wed, Jul 17, 2024 at 4:24 PM Yu Ma <yu.ma@intel.com> wrote:
>> Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
>> word contains next_fd, as:
>> (1) next_fd indicates the lower bound for the first free fd.
>> (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
>> searching.
> this is stale -- now the fast path searches up to 64 fds in the lower bitmap

Nope, this is still valid, as the searching size of the fast path inside
of find_next_fd() is always 64, it will execute the fast path inside of
find_next_zero_bit().

>
>> (3) After fdt is expanded (the bitmap size doubled for each time of expansion),
>> it would never be shrunk. The search size increases but there are few open fds
>> available here.
>>
>> This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
>> Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
>> versions.
> I think this paragraph is droppable. You already got an ack from Jan
> below, so stating he agrees with the patch is redundant. As for me I
> don't think this warrants mentioning. Just remove it, perhaps
> Christian will be willing to massage it by himself to avoid another
> series posting.

The idea of fast path for the word contains next_fd is from you,
although this patch is small, I think it is reasonable to record here
out of my respect. Appreciate for your guide and comments on this patch,
I've learned a lot on the way of resolving problems :)

Regards

Yu

>> And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
>> 8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.
>>
>> Reviewed-by: Jan Kara <jack@suse.cz>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>> ---
>> fs/file.c | 9 +++++++++
>> 1 file changed, 9 insertions(+)
>>
>> diff --git a/fs/file.c b/fs/file.c
>> index 1be2a5bcc7c4..729c07a4fc28 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -491,6 +491,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>> unsigned int bitbit = start / BITS_PER_LONG;
>> + unsigned int bit;
>> +
>> + /*
>> + * Try to avoid looking at the second level bitmap
>> + */
>> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
>> + start & (BITS_PER_LONG - 1));
>> + if (bit < BITS_PER_LONG)
>> + return bit + bitbit * BITS_PER_LONG;
>>
>> bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
>> if (bitbit >= maxfd)
>> --
>> 2.43.0
>>
>
I think this is getting too much fluff traffic at this point, which is
partially my fault. I'm buggering off.

Overall the patchset looks good, I don't see any technical reasons to
avoid merging it.

On Sat, Jul 20, 2024 at 2:57 PM Ma, Yu <yu.ma@intel.com> wrote:
>
>
> On 7/20/2024 1:53 AM, Mateusz Guzik wrote:
> > On Wed, Jul 17, 2024 at 4:24 PM Yu Ma <yu.ma@intel.com> wrote:
> >> Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
> >> word contains next_fd, as:
> >> (1) next_fd indicates the lower bound for the first free fd.
> >> (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> >> searching.
> > this is stale -- now the fast path searches up to 64 fds in the lower bitmap
>
> Nope, this is still valid, as the searching size of the fast path inside
> of find_next_fd() is always 64, it will execute the fast path inside of
> find_next_zero_bit().
>
> >
> >> (3) After fdt is expanded (the bitmap size doubled for each time of expansion),
> >> it would never be shrunk. The search size increases but there are few open fds
> >> available here.
> >>
> >> This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
> >> Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
> >> versions.
> > I think this paragraph is droppable. You already got an ack from Jan
> > below, so stating he agrees with the patch is redundant. As for me I
> > don't think this warrants mentioning. Just remove it, perhaps
> > Christian will be willing to massage it by himself to avoid another
> > series posting.
>
> The idea of fast path for the word contains next_fd is from you,
> although this patch is small, I think it is reasonable to record here
> out of my respect. Appreciate for your guide and comments on this patch,
> I've learned a lot on the way of resolving problems :)
>
> Regards
>
> Yu
>
> >> And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
> >> 8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.
> >>
> >> Reviewed-by: Jan Kara <jack@suse.cz>
> >> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> >> Signed-off-by: Yu Ma <yu.ma@intel.com>
> >> ---
> >> fs/file.c | 9 +++++++++
> >> 1 file changed, 9 insertions(+)
> >>
> >> diff --git a/fs/file.c b/fs/file.c
> >> index 1be2a5bcc7c4..729c07a4fc28 100644
> >> --- a/fs/file.c
> >> +++ b/fs/file.c
> >> @@ -491,6 +491,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> >> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> >> unsigned int maxbit = maxfd / BITS_PER_LONG;
> >> unsigned int bitbit = start / BITS_PER_LONG;
> >> + unsigned int bit;
> >> +
> >> + /*
> >> + * Try to avoid looking at the second level bitmap
> >> + */
> >> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> >> + start & (BITS_PER_LONG - 1));
> >> + if (bit < BITS_PER_LONG)
> >> + return bit + bitbit * BITS_PER_LONG;
> >>
> >> bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> >> if (bitbit >= maxfd)
> >> --
> >> 2.43.0
> >>
> >

--
Mateusz Guzik <mjguzik gmail.com>
pts/blogbench-1.1.0 is a benchmark designed to replicate the load of a
real-world busy file server by multiple threads of random reads, writes,
and rewrites. When running default configuration with multiple parallel
threads, hot spin lock contention is observed from alloc_fd(),
file_closed_fd() and put_unused_fd() around file_lock.

These 3 patches are created to reduce the critical section of file_lock
in alloc_fd() and close_fd(). As a result, on top of patch 1,
pts/blogbench-1.1.0 has been improved by 22% for read and 8% for write
on Intel ICX 160 cores configuration with v6.10-rc7.

v3 -> v4:
1. Rebased the patch set to v6.10-rc7 and updated performance results.
2. Updated patch 1 to revise the order of fd checking.
3. As proposed by Mateusz Guzik <mjguzik gmail.com>, and agreed by Jan Kara
   <jack@suse.cz> and Tim Chen <tim.c.chen@linux.intel.com>, updated patch 3
   with a more generic and scalable fast path for the word contains next_fd.

v2 -> v3:
1. Rebased the patch set to latest v6.10-rc6 and updated the performance results
2. Reordered the patches as suggested by Mateusz Guzik <mjguzik gmail.com>
3. Updated the fast path from alloc_fd() to find_next_fd() as suggested by
   Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>, it is
   efficient and more concise than v2.

v1 -> v2:
1. Rebased the patch set to latest v6.10-rc4 and updated the performance results
2. Fixed the bug identified by Mateusz Guzik in patch 1 with adding rlimit
   check for fast path
3. Updated patch 3 to remove sanity_check directly per the alignment with
   maintainer

Yu Ma (3):
  fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
  fs/file.c: conditionally clear full_fds
  fs/file.c: add fast path in find_next_fd()

 fs/file.c | 50 +++++++++++++++++++++++++++++---------------------
 1 file changed, 29 insertions(+), 21 deletions(-)

--
2.43.0
alloc_fd() has a sanity check inside to make sure the struct file mapped to
the allocated fd is NULL. Remove this sanity check since it is already assured
by the existing zero initialization and by the NULL set when an fd is recycled.
Meanwhile, add likely/unlikely annotations and skip the expand_files() call
when no expansion is needed, to reduce the work done under file_lock.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 33 ++++++++++++++-------------------
1 file changed, 14 insertions(+), 19 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index a3b72aa64f11..e1b9d6df7941 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -515,7 +515,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
if (fd < files->next_fd)
fd = files->next_fd;
- if (fd < fdt->max_fds)
+ if (likely(fd < fdt->max_fds))
fd = find_next_fd(fdt, fd);
/*
@@ -523,19 +523,21 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
* will limit the total number of files that can be opened.
*/
error = -EMFILE;
- if (fd >= end)
+ if (unlikely(fd >= end))
goto out;
- error = expand_files(files, fd);
- if (error < 0)
- goto out;
+ if (unlikely(fd >= fdt->max_fds)) {
+ error = expand_files(files, fd);
+ if (error < 0)
+ goto out;
- /*
- * If we needed to expand the fs array we
- * might have blocked - try again.
- */
- if (error)
- goto repeat;
+ /*
+ * If we needed to expand the fs array we
+ * might have blocked - try again.
+ */
+ if (error)
+ goto repeat;
+ }
if (start <= files->next_fd)
files->next_fd = fd + 1;
@@ -546,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
else
__clear_close_on_exec(fd, fdt);
error = fd;
-#if 1
- /* Sanity check */
- if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
- printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
- rcu_assign_pointer(fdt->fd[fd], NULL);
- }
-#endif
out:
spin_unlock(&files->file_lock);
@@ -618,7 +613,7 @@ void fd_install(unsigned int fd, struct file *file)
rcu_read_unlock_sched();
spin_lock(&files->file_lock);
fdt = files_fdtable(files);
- BUG_ON(fdt->fd[fd] != NULL);
+ WARN_ON(fdt->fd[fd] != NULL);
rcu_assign_pointer(fdt->fd[fd], file);
spin_unlock(&files->file_lock);
return;
--
2.43.0
On Fri 12-07-24 22:39:15, Yu Ma wrote:
> alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> allocated fd is NULL. Remove this sanity check since it can be assured by
> exisitng zero initilization and NULL set when recycling fd. Meanwhile, add
> likely/unlikely and expand_file() call avoidance to reduce the work under
> file_lock.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
Looks good to me. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/file.c | 33 ++++++++++++++-------------------
> 1 file changed, 14 insertions(+), 19 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index a3b72aa64f11..e1b9d6df7941 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -515,7 +515,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> if (fd < files->next_fd)
> fd = files->next_fd;
>
> - if (fd < fdt->max_fds)
> + if (likely(fd < fdt->max_fds))
> fd = find_next_fd(fdt, fd);
>
> /*
> @@ -523,19 +523,21 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> * will limit the total number of files that can be opened.
> */
> error = -EMFILE;
> - if (fd >= end)
> + if (unlikely(fd >= end))
> goto out;
>
> - error = expand_files(files, fd);
> - if (error < 0)
> - goto out;
> + if (unlikely(fd >= fdt->max_fds)) {
> + error = expand_files(files, fd);
> + if (error < 0)
> + goto out;
>
> - /*
> - * If we needed to expand the fs array we
> - * might have blocked - try again.
> - */
> - if (error)
> - goto repeat;
> + /*
> + * If we needed to expand the fs array we
> + * might have blocked - try again.
> + */
> + if (error)
> + goto repeat;
> + }
>
> if (start <= files->next_fd)
> files->next_fd = fd + 1;
> @@ -546,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> else
> __clear_close_on_exec(fd, fdt);
> error = fd;
> -#if 1
> - /* Sanity check */
> - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> - rcu_assign_pointer(fdt->fd[fd], NULL);
> - }
> -#endif
>
> out:
> spin_unlock(&files->file_lock);
> @@ -618,7 +613,7 @@ void fd_install(unsigned int fd, struct file *file)
> rcu_read_unlock_sched();
> spin_lock(&files->file_lock);
> fdt = files_fdtable(files);
> - BUG_ON(fdt->fd[fd] != NULL);
> + WARN_ON(fdt->fd[fd] != NULL);
> rcu_assign_pointer(fdt->fd[fd], file);
> spin_unlock(&files->file_lock);
> return;
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
64 bits in open_fds are mapped to a common bit in full_fds_bits. It is very
likely that the corresponding bit in full_fds_bits has already been cleared
by a previous __clear_open_fd() operation. Check whether the bit in
full_fds_bits is set before clearing it, to avoid an unnecessary write and
the resulting cache bouncing. See commit fc90888d07b8 ("vfs: conditionally
clear close-on-exec flag") for a similar optimization.

Taking the stock kernel with patch 1 as baseline, this improves
pts/blogbench-1.1.0 read by 13% and write by 5% on Intel ICX 160 cores
configuration with v6.10-rc7.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/file.c b/fs/file.c
index e1b9d6df7941..1be2a5bcc7c4 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -268,7 +268,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
{
__clear_bit(fd, fdt->open_fds);
- __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
+ fd /= BITS_PER_LONG;
+ if (test_bit(fd, fdt->full_fds_bits))
+ __clear_bit(fd, fdt->full_fds_bits);
}
static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
--
2.43.0
Skip the 2-level search via find_next_zero_bit() when there is a free slot in
the word that contains next_fd, as:
(1) next_fd indicates the lower bound for the first free fd.
(2) There is a fast path inside find_next_zero_bit() when size <= 64 to speed
up the search.
(3) After fdt is expanded (the bitmap size is doubled on each expansion), it
is never shrunk. The search size increases, but there are few open fds
available here.

This fast path was proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed
to by Jan Kara <jack@suse.cz>, as it is more generic and scalable than
previous versions. On top of patches 1 and 2, it improves pts/blogbench-1.1.0
read by 8% and write by 4% on Intel ICX 160 cores configuration with
v6.10-rc7.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/fs/file.c b/fs/file.c
index 1be2a5bcc7c4..a3ce6ba30c8c 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -488,9 +488,20 @@ struct files_struct init_files = {
static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
{
+ unsigned int bitbit = start / BITS_PER_LONG;
+ unsigned int bit;
+
+ /*
+ * Try to avoid looking at the second level bitmap
+ */
+ bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
+ start & (BITS_PER_LONG -1));
+ if (bit < BITS_PER_LONG) {
+ return bit + bitbit * BITS_PER_LONG;
+ }
+
unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
unsigned int maxbit = maxfd / BITS_PER_LONG;
- unsigned int bitbit = start / BITS_PER_LONG;
bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
if (bitbit >= maxfd)
--
2.43.0
On Fri 12-07-24 22:39:17, Yu Ma wrote:
> Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
> word contains next_fd, as:
> (1) next_fd indicates the lower bound for the first free fd.
> (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> searching.
> (3) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
>
> This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
> Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
> versions. And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
> 8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
Looks good. Just some code style nits below.
> diff --git a/fs/file.c b/fs/file.c
> index 1be2a5bcc7c4..a3ce6ba30c8c 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -488,9 +488,20 @@ struct files_struct init_files = {
>
> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> {
> + unsigned int bitbit = start / BITS_PER_LONG;
> + unsigned int bit;
> +
> + /*
> + * Try to avoid looking at the second level bitmap
> + */
> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> + start & (BITS_PER_LONG -1));
^^ Either
(BITS_PER_LONG-1) or (BITS_PER_LONG - 1) please. Your combination looks
particularly weird :)
> + if (bit < BITS_PER_LONG) {
> + return bit + bitbit * BITS_PER_LONG;
> + }
No need for braces around the above block.
> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> unsigned int maxbit = maxfd / BITS_PER_LONG;
We keep declarations at the beginning of the block. Usually it keeps the
code more readable and the compiler should be clever enough to perform the
loads & arithmetics only when needed.
After fixing these style nits feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 7/16/2024 7:19 PM, Jan Kara wrote:
> On Fri 12-07-24 22:39:17, Yu Ma wrote:
>> Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
>> word contains next_fd, as:
>> (1) next_fd indicates the lower bound for the first free fd.
>> (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
>> searching.
>> (3) After fdt is expanded (the bitmap size doubled for each time of expansion),
>> it would never be shrunk. The search size increases but there are few open fds
>> available here.
>>
>> This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
>> Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
>> versions. And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
>> 8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.
>>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
> Looks good. Just some code style nits below.
Copy that, thanks Honza. I'll revise it and send out an updated version soon.
>
>> diff --git a/fs/file.c b/fs/file.c
>> index 1be2a5bcc7c4..a3ce6ba30c8c 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -488,9 +488,20 @@ struct files_struct init_files = {
>>
>> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>> {
>> + unsigned int bitbit = start / BITS_PER_LONG;
>> + unsigned int bit;
>> +
>> + /*
>> + * Try to avoid looking at the second level bitmap
>> + */
>> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
>> + start & (BITS_PER_LONG -1));
> ^^ Either
> (BITS_PER_LONG-1) or (BITS_PER_LONG - 1) please. Your combination looks
> particularly weird :)
>
>> + if (bit < BITS_PER_LONG) {
>> + return bit + bitbit * BITS_PER_LONG;
>> + }
> No need for braces around the above block.
>
>> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
>> unsigned int maxbit = maxfd / BITS_PER_LONG;
> We keep declarations at the beginning of the block. Usually it keeps the
> code more readable and the compiler should be clever enough to perform the
> loads & arithmetics only when needed.
>
> After fixing these style nits feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
> Honza
Yes, I'll polish this part of the code accordingly, thanks for all the
comments here :)
pts/blogbench-1.1.0 is a benchmark designed to replicate the load of a real-world busy file server by multiple threads of random reads, writes, and rewrites. When running default configuration with multiple parallel threads, hot spin lock contention is observed from alloc_fd(), file_closed_fd() and put_unused_fd() around file_lock. These 3 patches are created to reduce the critical section of file_lock in alloc_fd() and close_fd(). As a result, on top of patch 1, pts/blogbench-1.1.0 has been improved by 28% for read and 12% for write on Intel ICX 160 cores configuration with v6.10-rc6. v2 -> v3: 1. Rebased the patch set to latest v6.10-rc6 and updated the performance results 2. Reordered the patches as suggested by Mateusz Guzik <mjguzik gmail.com> 3. Updated the fast path from alloc_fd() to find_next_fd() as suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>, it is efficient and more concise than v2. v1 -> v2: 1. Rebased the patch set to latest v6.10-rc4 and updated the performance results 2. Fixed the bug identified by Mateusz Guzik in patch 1 with adding rlimit check for fast path 3. Updated patch 3 to remove sanity_check directly per the alignment with maintainer Yu Ma (3): fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() fs/file.c: conditionally clear full_fds fs/file.c: add fast path in find_next_fd() fs/file.c | 47 ++++++++++++++++++++++++----------------------- 1 file changed, 24 insertions(+), 23 deletions(-) -- 2.43.0
alloc_fd() has a sanity check inside to make sure the struct file mapped to the
allocated fd is NULL. Remove this sanity check since it is already assured by the
existing zero initialization and by setting the slot back to NULL when an fd is
recycled. Meanwhile, add likely/unlikely annotations and avoid an unnecessary
expand_files() call to reduce the work done under file_lock.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 38 ++++++++++++++++----------------------
1 file changed, 16 insertions(+), 22 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index a3b72aa64f11..5178b246e54b 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -515,28 +515,29 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
if (fd < files->next_fd)
fd = files->next_fd;
- if (fd < fdt->max_fds)
+ if (likely(fd < fdt->max_fds))
fd = find_next_fd(fdt, fd);
+ error = -EMFILE;
+ if (unlikely(fd >= fdt->max_fds)) {
+ error = expand_files(files, fd);
+ if (error < 0)
+ goto out;
+ /*
+ * If we needed to expand the fs array we
+ * might have blocked - try again.
+ */
+ if (error)
+ goto repeat;
+ }
+
/*
* N.B. For clone tasks sharing a files structure, this test
* will limit the total number of files that can be opened.
*/
- error = -EMFILE;
- if (fd >= end)
- goto out;
-
- error = expand_files(files, fd);
- if (error < 0)
+ if (unlikely(fd >= end))
goto out;
- /*
- * If we needed to expand the fs array we
- * might have blocked - try again.
- */
- if (error)
- goto repeat;
-
if (start <= files->next_fd)
files->next_fd = fd + 1;
@@ -546,13 +547,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
else
__clear_close_on_exec(fd, fdt);
error = fd;
-#if 1
- /* Sanity check */
- if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
- printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
- rcu_assign_pointer(fdt->fd[fd], NULL);
- }
-#endif
out:
spin_unlock(&files->file_lock);
@@ -618,7 +612,7 @@ void fd_install(unsigned int fd, struct file *file)
rcu_read_unlock_sched();
spin_lock(&files->file_lock);
fdt = files_fdtable(files);
- BUG_ON(fdt->fd[fd] != NULL);
+ WARN_ON(fdt->fd[fd] != NULL);
rcu_assign_pointer(fdt->fd[fd], file);
spin_unlock(&files->file_lock);
return;
--
2.43.0
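For context on the likely()/unlikely() annotations used in the patch above: in
the kernel these macros expand to __builtin_expect() hints so the compiler can
lay out the expected branch as the fall-through path. Below is a minimal
userspace sketch of that pattern (it assumes gcc or clang for the builtin); the
macro bodies follow the familiar kernel definitions, while check_fd() and its
bounds are made-up illustrations, not kernel code.
#include <stdio.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hint that the error branch is cold so the common case stays on the
 * straight-line (fall-through) path. */
static int check_fd(int fd, int max_fds)
{
	if (unlikely(fd < 0 || fd >= max_fds))
		return -1;          /* rare error path */
	return fd;                  /* common case */
}

int main(void)
{
	printf("%d\n", check_fd(3, 64));    /* prints 3  */
	printf("%d\n", check_fd(99, 64));   /* prints -1 */
	return 0;
}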
On Wed, Jul 03, 2024 at 10:33:09AM GMT, Yu Ma wrote:
> alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> allocated fd is NULL. Remove this sanity check since it can be assured by
> exisitng zero initilization and NULL set when recycling fd. Meanwhile, add
> likely/unlikely and expand_file() call avoidance to reduce the work under
> file_lock.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
> fs/file.c | 38 ++++++++++++++++----------------------
> 1 file changed, 16 insertions(+), 22 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index a3b72aa64f11..5178b246e54b 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -515,28 +515,29 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> if (fd < files->next_fd)
> fd = files->next_fd;
>
> - if (fd < fdt->max_fds)
> + if (likely(fd < fdt->max_fds))
> fd = find_next_fd(fdt, fd);
>
> + error = -EMFILE;
> + if (unlikely(fd >= fdt->max_fds)) {
> + error = expand_files(files, fd);
> + if (error < 0)
> + goto out;
> + /*
> + * If we needed to expand the fs array we
> + * might have blocked - try again.
> + */
> + if (error)
> + goto repeat;
> + }
So this ends up removing the expand_files() above the fd >= end check
which means that you can end up expanding the files_struct even though
the requested fd is past the provided end. That seems odd. What's the
reason for that reordering?
> +
> /*
> * N.B. For clone tasks sharing a files structure, this test
> * will limit the total number of files that can be opened.
> */
> - error = -EMFILE;
> - if (fd >= end)
> - goto out;
> -
> - error = expand_files(files, fd);
> - if (error < 0)
> + if (unlikely(fd >= end))
> goto out;
>
> - /*
> - * If we needed to expand the fs array we
> - * might have blocked - try again.
> - */
> - if (error)
> - goto repeat;
> -
> if (start <= files->next_fd)
> files->next_fd = fd + 1;
>
> @@ -546,13 +547,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> else
> __clear_close_on_exec(fd, fdt);
> error = fd;
> -#if 1
> - /* Sanity check */
> - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> - rcu_assign_pointer(fdt->fd[fd], NULL);
> - }
> -#endif
>
> out:
> spin_unlock(&files->file_lock);
> @@ -618,7 +612,7 @@ void fd_install(unsigned int fd, struct file *file)
> rcu_read_unlock_sched();
> spin_lock(&files->file_lock);
> fdt = files_fdtable(files);
> - BUG_ON(fdt->fd[fd] != NULL);
> + WARN_ON(fdt->fd[fd] != NULL);
> rcu_assign_pointer(fdt->fd[fd], file);
> spin_unlock(&files->file_lock);
> return;
> --
> 2.43.0
>
On Wed 03-07-24 16:34:49, Christian Brauner wrote:
> On Wed, Jul 03, 2024 at 10:33:09AM GMT, Yu Ma wrote:
> > alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> > allocated fd is NULL. Remove this sanity check since it can be assured by
> > exisitng zero initilization and NULL set when recycling fd. Meanwhile, add
> > likely/unlikely and expand_file() call avoidance to reduce the work under
> > file_lock.
> >
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> > fs/file.c | 38 ++++++++++++++++----------------------
> > 1 file changed, 16 insertions(+), 22 deletions(-)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index a3b72aa64f11..5178b246e54b 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -515,28 +515,29 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > if (fd < files->next_fd)
> > fd = files->next_fd;
> >
> > - if (fd < fdt->max_fds)
> > + if (likely(fd < fdt->max_fds))
> > fd = find_next_fd(fdt, fd);
> >
> > + error = -EMFILE;
> > + if (unlikely(fd >= fdt->max_fds)) {
> > + error = expand_files(files, fd);
> > + if (error < 0)
> > + goto out;
> > + /*
> > + * If we needed to expand the fs array we
> > + * might have blocked - try again.
> > + */
> > + if (error)
> > + goto repeat;
> > + }
>
> So this ends up removing the expand_files() above the fd >= end check
> which means that you can end up expanding the files_struct even though
> the request fd is past the provided end. That seems odd. What's the
> reason for that reordering?
Yeah, not only that but also:
> > /*
> > * N.B. For clone tasks sharing a files structure, this test
> > * will limit the total number of files that can be opened.
> > */
> > - error = -EMFILE;
> > - if (fd >= end)
> > - goto out;
> > -
> > - error = expand_files(files, fd);
> > - if (error < 0)
> > + if (unlikely(fd >= end))
> > goto out;
We could then exit here with error set to 0 instead of -EMFILE. So this
needs a bit of work. But otherwise the patch looks good to me.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 7/4/2024 6:11 PM, Jan Kara wrote:
> On Wed 03-07-24 16:34:49, Christian Brauner wrote:
>> On Wed, Jul 03, 2024 at 10:33:09AM GMT, Yu Ma wrote:
>>> alloc_fd() has a sanity check inside to make sure the struct file mapping to the
>>> allocated fd is NULL. Remove this sanity check since it can be assured by
>>> exisitng zero initilization and NULL set when recycling fd. Meanwhile, add
>>> likely/unlikely and expand_file() call avoidance to reduce the work under
>>> file_lock.
>>>
>>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>>> ---
>>> fs/file.c | 38 ++++++++++++++++----------------------
>>> 1 file changed, 16 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/fs/file.c b/fs/file.c
>>> index a3b72aa64f11..5178b246e54b 100644
>>> --- a/fs/file.c
>>> +++ b/fs/file.c
>>> @@ -515,28 +515,29 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>>> if (fd < files->next_fd)
>>> fd = files->next_fd;
>>>
>>> - if (fd < fdt->max_fds)
>>> + if (likely(fd < fdt->max_fds))
>>> fd = find_next_fd(fdt, fd);
>>>
>>> + error = -EMFILE;
>>> + if (unlikely(fd >= fdt->max_fds)) {
>>> + error = expand_files(files, fd);
>>> + if (error < 0)
>>> + goto out;
>>> + /*
>>> + * If we needed to expand the fs array we
>>> + * might have blocked - try again.
>>> + */
>>> + if (error)
>>> + goto repeat;
>>> + }
>> So this ends up removing the expand_files() above the fd >= end check
>> which means that you can end up expanding the files_struct even though
>> the request fd is past the provided end. That seems odd. What's the
>> reason for that reordering?
> Yeah, not only that but also:
>
>>> /*
>>> * N.B. For clone tasks sharing a files structure, this test
>>> * will limit the total number of files that can be opened.
>>> */
>>> - error = -EMFILE;
>>> - if (fd >= end)
>>> - goto out;
>>> -
>>> - error = expand_files(files, fd);
>>> - if (error < 0)
>>> + if (unlikely(fd >= end))
>>> goto out;
> We could then exit here with error set to 0 instead of -EMFILE. So this
> needs a bit of work. But otherwise the patch looks good to me.
>
> Honza
Do you mean that we return 0 here is fd >=end, I'm afraid that might
broke the original design of this function. And all the callers of it
are using ret < 0 for error handling, if ret=0, that should mean the fd
allocated is 0 ...
On Thu 04-07-24 22:45:32, Ma, Yu wrote:
>
> On 7/4/2024 6:11 PM, Jan Kara wrote:
> > On Wed 03-07-24 16:34:49, Christian Brauner wrote:
> > > On Wed, Jul 03, 2024 at 10:33:09AM GMT, Yu Ma wrote:
> > > > alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> > > > allocated fd is NULL. Remove this sanity check since it can be assured by
> > > > exisitng zero initilization and NULL set when recycling fd. Meanwhile, add
> > > > likely/unlikely and expand_file() call avoidance to reduce the work under
> > > > file_lock.
> > > >
> > > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > > > ---
> > > > fs/file.c | 38 ++++++++++++++++----------------------
> > > > 1 file changed, 16 insertions(+), 22 deletions(-)
> > > >
> > > > diff --git a/fs/file.c b/fs/file.c
> > > > index a3b72aa64f11..5178b246e54b 100644
> > > > --- a/fs/file.c
> > > > +++ b/fs/file.c
> > > > @@ -515,28 +515,29 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > > > if (fd < files->next_fd)
> > > > fd = files->next_fd;
> > > > - if (fd < fdt->max_fds)
> > > > + if (likely(fd < fdt->max_fds))
> > > > fd = find_next_fd(fdt, fd);
> > > > + error = -EMFILE;
> > > > + if (unlikely(fd >= fdt->max_fds)) {
> > > > + error = expand_files(files, fd);
> > > > + if (error < 0)
> > > > + goto out;
> > > > + /*
> > > > + * If we needed to expand the fs array we
> > > > + * might have blocked - try again.
> > > > + */
> > > > + if (error)
> > > > + goto repeat;
> > > > + }
> > > So this ends up removing the expand_files() above the fd >= end check
> > > which means that you can end up expanding the files_struct even though
> > > the request fd is past the provided end. That seems odd. What's the
> > > reason for that reordering?
> > Yeah, not only that but also:
> >
> > > > /*
> > > > * N.B. For clone tasks sharing a files structure, this test
> > > > * will limit the total number of files that can be opened.
> > > > */
> > > > - error = -EMFILE;
> > > > - if (fd >= end)
> > > > - goto out;
> > > > -
> > > > - error = expand_files(files, fd);
> > > > - if (error < 0)
> > > > + if (unlikely(fd >= end))
> > > > goto out;
> > We could then exit here with error set to 0 instead of -EMFILE. So this
> > needs a bit of work. But otherwise the patch looks good to me.
> >
> > Honza
>
> Do you mean that we return 0 here is fd >=end, I'm afraid that might broke
> the original design of this function. And all the callers of it are using
> ret < 0 for error handling, if ret=0, that should mean the fd allocated is 0
> ...
What I meant is that after your changes alloc_fd() could return 0 in the
fd >= end case. It could happen if we went through expand_files() which then
returned 0. Anyway, please just fix the ordering of the checks and we should
be fine.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 7/3/2024 10:34 PM, Christian Brauner wrote:
> On Wed, Jul 03, 2024 at 10:33:09AM GMT, Yu Ma wrote:
>> alloc_fd() has a sanity check inside to make sure the struct file mapping to the
>> allocated fd is NULL. Remove this sanity check since it can be assured by
>> exisitng zero initilization and NULL set when recycling fd. Meanwhile, add
>> likely/unlikely and expand_file() call avoidance to reduce the work under
>> file_lock.
>>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>> ---
>> fs/file.c | 38 ++++++++++++++++----------------------
>> 1 file changed, 16 insertions(+), 22 deletions(-)
>>
>> diff --git a/fs/file.c b/fs/file.c
>> index a3b72aa64f11..5178b246e54b 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -515,28 +515,29 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>> if (fd < files->next_fd)
>> fd = files->next_fd;
>>
>> - if (fd < fdt->max_fds)
>> + if (likely(fd < fdt->max_fds))
>> fd = find_next_fd(fdt, fd);
>>
>> + error = -EMFILE;
>> + if (unlikely(fd >= fdt->max_fds)) {
>> + error = expand_files(files, fd);
>> + if (error < 0)
>> + goto out;
>> + /*
>> + * If we needed to expand the fs array we
>> + * might have blocked - try again.
>> + */
>> + if (error)
>> + goto repeat;
>> + }
> So this ends up removing the expand_files() above the fd >= end check
> which means that you can end up expanding the files_struct even though
> the request fd is past the provided end. That seems odd. What's the
> reason for that reordering?
Yes, you are right, thanks Christian. This incorrect reordering is left over
from earlier versions that had the fast path inside alloc_fd(). I'll restore
the original order.
>> +
>> /*
>> * N.B. For clone tasks sharing a files structure, this test
>> * will limit the total number of files that can be opened.
>> */
>> - error = -EMFILE;
>> - if (fd >= end)
>> - goto out;
>> -
>> - error = expand_files(files, fd);
>> - if (error < 0)
>> + if (unlikely(fd >= end))
>> goto out;
>>
>> - /*
>> - * If we needed to expand the fs array we
>> - * might have blocked - try again.
>> - */
>> - if (error)
>> - goto repeat;
>> -
>> if (start <= files->next_fd)
>> files->next_fd = fd + 1;
>>
>> @@ -546,13 +547,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>> else
>> __clear_close_on_exec(fd, fdt);
>> error = fd;
>> -#if 1
>> - /* Sanity check */
>> - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
>> - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
>> - rcu_assign_pointer(fdt->fd[fd], NULL);
>> - }
>> -#endif
>>
>> out:
>> spin_unlock(&files->file_lock);
>> @@ -618,7 +612,7 @@ void fd_install(unsigned int fd, struct file *file)
>> rcu_read_unlock_sched();
>> spin_lock(&files->file_lock);
>> fdt = files_fdtable(files);
>> - BUG_ON(fdt->fd[fd] != NULL);
>> + WARN_ON(fdt->fd[fd] != NULL);
>> rcu_assign_pointer(fdt->fd[fd], file);
>> spin_unlock(&files->file_lock);
>> return;
>> --
>> 2.43.0
>>
Each group of 64 bits in open_fds is mapped to a common bit in full_fds_bits. It
is very likely that the bit in full_fds_bits has already been cleared by a
previous __clear_open_fd() call. Check whether the bit in full_fds_bits is
already clear before clearing it, to avoid an unnecessary write and the
resulting cache-line bouncing. See commit fc90888d07b8
("vfs: conditionally clear close-on-exec flag") for a similar optimization.
Taking the stock kernel with patch 1 as the baseline, this improves
pts/blogbench-1.1.0 read by 13% and write by 5% on an Intel ICX 160-core
configuration with v6.10-rc6.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/file.c b/fs/file.c
index 5178b246e54b..a15317db3119 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -268,7 +268,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
{
__clear_bit(fd, fdt->open_fds);
- __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
+ fd /= BITS_PER_LONG;
+ if (test_bit(fd, fdt->full_fds_bits))
+ __clear_bit(fd, fdt->full_fds_bits);
}
static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
--
2.43.0
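The key point of the patch above is that skipping the store when the bit is
already zero keeps the cache line holding full_fds_bits clean, so concurrent
openers and closers on other CPUs are less likely to bounce it. Here is a
minimal userspace sketch of that test-before-clear pattern; the bitmap, the
helpers and the word index below are simplified assumptions and do not
reproduce the kernel's fdtable layout.
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define BITS_PER_WORD 64

static uint64_t full_bits[4];   /* stands in for fdt->full_fds_bits */

static bool test_bit(unsigned int nr, const uint64_t *map)
{
	return map[nr / BITS_PER_WORD] & (1ULL << (nr % BITS_PER_WORD));
}

static void clear_bit(unsigned int nr, uint64_t *map)
{
	map[nr / BITS_PER_WORD] &= ~(1ULL << (nr % BITS_PER_WORD));
}

/* Unconditional version: always stores, dirtying the cache line even when
 * the bit is already 0 (the common case when closing fds). */
static void clear_full_bit_always(unsigned int word)
{
	clear_bit(word, full_bits);
}

/* Conditional version, as in the patch: read first, write only if needed. */
static void clear_full_bit_if_set(unsigned int word)
{
	if (test_bit(word, full_bits))
		clear_bit(word, full_bits);
}

int main(void)
{
	full_bits[0] = 1ULL << 2;       /* pretend word 2 of open_fds is full */
	clear_full_bit_if_set(2);       /* performs the write */
	clear_full_bit_if_set(2);       /* skips the write: bit already clear */
	clear_full_bit_always(2);       /* would store again regardless */
	printf("%llu\n", (unsigned long long)full_bits[0]);   /* prints 0 */
	return 0;
}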
In most cases there is a free fd available in the lower 64 bits of the open_fds
bitmap when we look for a free fd slot. Skip the 2-level search via
find_next_zero_bit() for this common fast path.
Look directly for an open bit in the lower 64 bits of the open_fds bitmap when a
free slot is available there, as:
(1) The fd allocation algorithm always allocates fds from small to large, so the
lower bits in the open_fds bitmap are used much more frequently than the higher
bits.
(2) After fdt is expanded (the bitmap size is doubled on each expansion), it is
never shrunk. The search size increases, but there are few open fds there.
(3) find_next_zero_bit() has an internal fast path when size <= 64 to speed up
the search.
As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
move the fast path from alloc_fd() to find_next_fd(). With this, on top of
patches 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
an Intel ICX 160-core configuration with v6.10-rc6.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/fs/file.c b/fs/file.c
index a15317db3119..f25eca311f51 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -488,6 +488,11 @@ struct files_struct init_files = {
static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
{
+ unsigned int bit;
+ bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
+ if (bit < BITS_PER_LONG)
+ return bit;
+
unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
unsigned int maxbit = maxfd / BITS_PER_LONG;
unsigned int bitbit = start / BITS_PER_LONG;
--
2.43.0
On Wed, Jul 3, 2024 at 4:07 PM Yu Ma <yu.ma@intel.com> wrote:
>
> There is available fd in the lower 64 bits of open_fds bitmap for most cases
> when we look for an available fd slot. Skip 2-levels searching via
> find_next_zero_bit() for this common fast path.
>
> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> free slot is available there, as:
> (1) The fd allocation algorithm would always allocate fd from small to large.
> Lower bits in open_fds bitmap would be used much more frequently than higher
> bits.
> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> searching.
>
> As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
> update the fast path from alloc_fd() to find_next_fd(). With which, on top of
> patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
> Intel ICX 160 cores configuration with v6.10-rc6.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
> fs/file.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/fs/file.c b/fs/file.c
> index a15317db3119..f25eca311f51 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -488,6 +488,11 @@ struct files_struct init_files = {
>
> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> {
> + unsigned int bit;
> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> + if (bit < BITS_PER_LONG)
> + return bit;
> +
> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> unsigned int maxbit = maxfd / BITS_PER_LONG;
> unsigned int bitbit = start / BITS_PER_LONG;
> --
> 2.43.0
>
I had something like this in mind:
diff --git a/fs/file.c b/fs/file.c
index a3b72aa64f11..4d3307e39db7 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -489,6 +489,16 @@ static unsigned int find_next_fd(struct fdtable
*fdt, unsigned int start)
unsigned int maxfd = fdt->max_fds; /* always multiple of
BITS_PER_LONG */
unsigned int maxbit = maxfd / BITS_PER_LONG;
unsigned int bitbit = start / BITS_PER_LONG;
+ unsigned int bit;
+
+ /*
+ * Try to avoid looking at the second level map.
+ */
+ bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
+ start & (BITS_PER_LONG - 1));
+ if (bit < BITS_PER_LONG) {
+ return bit + bitbit * BITS_PER_LONG;
+ }
bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit,
bitbit) * BITS_PER_LONG;
if (bitbit >= maxfd)
Can you please test it out? I expect it to provide a tiny improvement
over your patch.
--
Mateusz Guzik <mjguzik gmail.com>
On Thu 04-07-24 19:44:10, Mateusz Guzik wrote:
> On Wed, Jul 3, 2024 at 4:07 PM Yu Ma <yu.ma@intel.com> wrote:
> >
> > There is available fd in the lower 64 bits of open_fds bitmap for most cases
> > when we look for an available fd slot. Skip 2-levels searching via
> > find_next_zero_bit() for this common fast path.
> >
> > Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> > free slot is available there, as:
> > (1) The fd allocation algorithm would always allocate fd from small to large.
> > Lower bits in open_fds bitmap would be used much more frequently than higher
> > bits.
> > (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> > it would never be shrunk. The search size increases but there are few open fds
> > available here.
> > (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> > searching.
> >
> > As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
> > update the fast path from alloc_fd() to find_next_fd(). With which, on top of
> > patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
> > Intel ICX 160 cores configuration with v6.10-rc6.
> >
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> > fs/file.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index a15317db3119..f25eca311f51 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -488,6 +488,11 @@ struct files_struct init_files = {
> >
> > static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > {
> > + unsigned int bit;
> > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > + if (bit < BITS_PER_LONG)
> > + return bit;
> > +
> > unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > unsigned int bitbit = start / BITS_PER_LONG;
> > --
> > 2.43.0
> >
>
> I had something like this in mind:
> diff --git a/fs/file.c b/fs/file.c
> index a3b72aa64f11..4d3307e39db7 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -489,6 +489,16 @@ static unsigned int find_next_fd(struct fdtable
> *fdt, unsigned int start)
> unsigned int maxfd = fdt->max_fds; /* always multiple of
> BITS_PER_LONG */
> unsigned int maxbit = maxfd / BITS_PER_LONG;
> unsigned int bitbit = start / BITS_PER_LONG;
> + unsigned int bit;
> +
> + /*
> + * Try to avoid looking at the second level map.
> + */
> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> + start & (BITS_PER_LONG - 1));
> + if (bit < BITS_PER_LONG) {
> + return bit + bitbit * BITS_PER_LONG;
> + }
Drat, you're right. I missed that Ma did not add the proper offset to
open_fds. *This* is what I meant :)
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 7/5/2024 5:55 AM, Jan Kara wrote:
> On Thu 04-07-24 19:44:10, Mateusz Guzik wrote:
>> On Wed, Jul 3, 2024 at 4:07 PM Yu Ma <yu.ma@intel.com> wrote:
>>> There is available fd in the lower 64 bits of open_fds bitmap for most cases
>>> when we look for an available fd slot. Skip 2-levels searching via
>>> find_next_zero_bit() for this common fast path.
>>>
>>> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
>>> free slot is available there, as:
>>> (1) The fd allocation algorithm would always allocate fd from small to large.
>>> Lower bits in open_fds bitmap would be used much more frequently than higher
>>> bits.
>>> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
>>> it would never be shrunk. The search size increases but there are few open fds
>>> available here.
>>> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
>>> searching.
>>>
>>> As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
>>> update the fast path from alloc_fd() to find_next_fd(). With which, on top of
>>> patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
>>> Intel ICX 160 cores configuration with v6.10-rc6.
>>>
>>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>>> ---
>>> fs/file.c | 5 +++++
>>> 1 file changed, 5 insertions(+)
>>>
>>> diff --git a/fs/file.c b/fs/file.c
>>> index a15317db3119..f25eca311f51 100644
>>> --- a/fs/file.c
>>> +++ b/fs/file.c
>>> @@ -488,6 +488,11 @@ struct files_struct init_files = {
>>>
>>> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>>> {
>>> + unsigned int bit;
>>> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
>>> + if (bit < BITS_PER_LONG)
>>> + return bit;
>>> +
>>> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
>>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>>> unsigned int bitbit = start / BITS_PER_LONG;
>>> --
>>> 2.43.0
>>>
>> I had something like this in mind:
>> diff --git a/fs/file.c b/fs/file.c
>> index a3b72aa64f11..4d3307e39db7 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -489,6 +489,16 @@ static unsigned int find_next_fd(struct fdtable
>> *fdt, unsigned int start)
>> unsigned int maxfd = fdt->max_fds; /* always multiple of
>> BITS_PER_LONG */
>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>> unsigned int bitbit = start / BITS_PER_LONG;
>> + unsigned int bit;
>> +
>> + /*
>> + * Try to avoid looking at the second level map.
>> + */
>> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
>> + start & (BITS_PER_LONG - 1));
>> + if (bit < BITS_PER_LONG) {
>> + return bit + bitbit * BITS_PER_LONG;
>> + }
> Drat, you're right. I missed that Ma did not add the proper offset to
> open_fds. *This* is what I meant :)
>
> Honza
Just tried this on v6.10-rc6. The improvement on top of patch 1 and
patch 2 is 7% for read and 3% for write, less than just checking the first word.
Per my understanding, its performance would be better if there were a high
probability of finding a free bit in the same word as next_fd, but next_fd
only represents the lowest possible free fd. If fds are opened and closed
frequently and randomly, that might not always be the case and next_fd may
be distributed randomly. For example, if fds 0-65 are occupied and fd 3 is
then returned (released), next_fd is set to 3; the next time fd 3 is
allocated, next_fd is set to 4 while the actual first free bit is 66. When
66 is allocated and fd 5 is returned, the whole process repeats.
Yu
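To make the scenario above concrete, here is a small self-contained userspace
trace of it. It is an illustration under simplified assumptions only: a 128-slot
bitmap, a hand-rolled word-local probe standing in for the proposed
find_next_zero_bit() call, and simplified next_fd bookkeeping; none of this is
the kernel code.
#include <stdio.h>
#include <stdint.h>

#define BITS_PER_WORD 64

static uint64_t open_fds[2];             /* 128 fd slots for the example */
static unsigned int next_fd;             /* lowest possibly-free fd      */

static void set_fd(unsigned int fd)
{
	open_fds[fd / BITS_PER_WORD] |= 1ULL << (fd % BITS_PER_WORD);
	if (fd == next_fd)
		next_fd = fd + 1;        /* simplified bookkeeping */
}

static void put_fd(unsigned int fd)
{
	open_fds[fd / BITS_PER_WORD] &= ~(1ULL << (fd % BITS_PER_WORD));
	if (fd < next_fd)
		next_fd = fd;
}

/* Word-local probe: first zero bit >= start within start's word, or -1. */
static int probe_word(unsigned int start)
{
	unsigned int word = start / BITS_PER_WORD;
	unsigned int b;

	for (b = start % BITS_PER_WORD; b < BITS_PER_WORD; b++)
		if (!(open_fds[word] & (1ULL << b)))
			return (int)(word * BITS_PER_WORD + b);
	return -1;                       /* word is full from 'start' on */
}

int main(void)
{
	unsigned int fd;

	for (fd = 0; fd <= 65; fd++)     /* fds 0..65 in use */
		set_fd(fd);

	put_fd(3);                       /* fd 3 released: next_fd = 3 */
	printf("probe from next_fd=%u -> %d (fast path hits)\n",
	       next_fd, probe_word(next_fd));

	set_fd(3);                       /* fd 3 allocated again: next_fd = 4 */
	printf("probe from next_fd=%u -> %d (word full, fall back)\n",
	       next_fd, probe_word(next_fd));
	return 0;
}
Run as-is, it first reports a fast-path hit at fd 3, then a miss from
next_fd = 4 because that word is full again; at that point the two-level
search has to take over and would return fd 66.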
On 7/5/2024 3:56 PM, Ma, Yu wrote:
> I had something like this in mind:
>>> diff --git a/fs/file.c b/fs/file.c
>>> index a3b72aa64f11..4d3307e39db7 100644
>>> --- a/fs/file.c
>>> +++ b/fs/file.c
>>> @@ -489,6 +489,16 @@ static unsigned int find_next_fd(struct fdtable
>>> *fdt, unsigned int start)
>>> unsigned int maxfd = fdt->max_fds; /* always multiple of
>>> BITS_PER_LONG */
>>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>>> unsigned int bitbit = start / BITS_PER_LONG;
>>> + unsigned int bit;
>>> +
>>> + /*
>>> + * Try to avoid looking at the second level map.
>>> + */
>>> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
>>> + start & (BITS_PER_LONG - 1));
>>> + if (bit < BITS_PER_LONG) {
>>> + return bit + bitbit * BITS_PER_LONG;
>>> + }
>> Drat, you're right. I missed that Ma did not add the proper offset to
>> open_fds. *This* is what I meant :)
>>
>> Honza
>
> Just tried this on v6.10-rc6, the improvement on top of patch 1 and
> patch 2 is 7% for read and 3% for write, less than just check first word.
>
> Per my understanding, its performance would be better if we can find
> free bit in the same word of next_fd with high possibility, but
> next_fd just represents the lowest possible free bit. If fds are
> open/close frequently and randomly, that might not always be the case,
> next_fd may be distributed randomly, for example, 0-65 are occupied,
> fd=3 is returned, next_fd will be set to 3, next time when 3 is
> allocated, next_fd will be set to 4, while the actual first free bit
> is 66 , when 66 is allocated, and fd=5 is returned, then the above
> process would be went through again.
>
> Yu
>
Hi Guzik, Honza,
Do we have any more comments or ideas regarding the fast path? Thanks
for your time and any feedback :)
Regards
Yu
Right, forgot to respond.
I suspect the different result is either because of mere variance
between reboots or blogbench using significantly less than 100 fds at
any given time -- I don't have an easy way to test at your scale at
the moment. You could probably test that by benching both approaches
while switching them at runtime with a static_branch. However, I don't
know if that effort is warranted atm.
So happens I'm busy with other stuff and it is not my call to either
block or let this in, so I'm buggering off.
On Tue, Jul 9, 2024 at 10:32 AM Ma, Yu <yu.ma@intel.com> wrote:
>
>
> On 7/5/2024 3:56 PM, Ma, Yu wrote:
> > I had something like this in mind:
> >>> diff --git a/fs/file.c b/fs/file.c
> >>> index a3b72aa64f11..4d3307e39db7 100644
> >>> --- a/fs/file.c
> >>> +++ b/fs/file.c
> >>> @@ -489,6 +489,16 @@ static unsigned int find_next_fd(struct fdtable
> >>> *fdt, unsigned int start)
> >>> unsigned int maxfd = fdt->max_fds; /* always multiple of
> >>> BITS_PER_LONG */
> >>> unsigned int maxbit = maxfd / BITS_PER_LONG;
> >>> unsigned int bitbit = start / BITS_PER_LONG;
> >>> + unsigned int bit;
> >>> +
> >>> + /*
> >>> + * Try to avoid looking at the second level map.
> >>> + */
> >>> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> >>> + start & (BITS_PER_LONG - 1));
> >>> + if (bit < BITS_PER_LONG) {
> >>> + return bit + bitbit * BITS_PER_LONG;
> >>> + }
> >> Drat, you're right. I missed that Ma did not add the proper offset to
> >> open_fds. *This* is what I meant :)
> >>
> >> Honza
> >
> > Just tried this on v6.10-rc6, the improvement on top of patch 1 and
> > patch 2 is 7% for read and 3% for write, less than just check first word.
> >
> > Per my understanding, its performance would be better if we can find
> > free bit in the same word of next_fd with high possibility, but
> > next_fd just represents the lowest possible free bit. If fds are
> > open/close frequently and randomly, that might not always be the case,
> > next_fd may be distributed randomly, for example, 0-65 are occupied,
> > fd=3 is returned, next_fd will be set to 3, next time when 3 is
> > allocated, next_fd will be set to 4, while the actual first free bit
> > is 66 , when 66 is allocated, and fd=5 is returned, then the above
> > process would be went through again.
> >
> > Yu
> >
> Hi Guzik, Honza,
>
> Do we have any more comment or idea regarding to the fast path? Thanks
> for your time and any feedback :)
>
>
> Regards
>
> Yu
>
--
Mateusz Guzik <mjguzik gmail.com>
On Tue, 2024-07-09 at 12:17 +0200, Mateusz Guzik wrote:
> Right, forgot to respond.
>
> I suspect the different result is either because of mere variance
> between reboots or blogbench using significantly less than 100 fds at
> any given time -- I don't have an easy way to test at your scale at
> the moment. You could probably test that by benching both approaches
> while switching them at runtime with a static_branch. However, I don't
> know if that effort is warranted atm.
>
> So happens I'm busy with other stuff and it is not my call to either
> block or let this in, so I'm buggering off.
>
> On Tue, Jul 9, 2024 at 10:32 AM Ma, Yu <yu.ma@intel.com> wrote:
> >
> >
> > On 7/5/2024 3:56 PM, Ma, Yu wrote:
> > > I had something like this in mind:
> > > > > diff --git a/fs/file.c b/fs/file.c
> > > > > index a3b72aa64f11..4d3307e39db7 100644
> > > > > --- a/fs/file.c
> > > > > +++ b/fs/file.c
> > > > > @@ -489,6 +489,16 @@ static unsigned int find_next_fd(struct fdtable
> > > > > *fdt, unsigned int start)
> > > > > unsigned int maxfd = fdt->max_fds; /* always multiple of
> > > > > BITS_PER_LONG */
> > > > > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > > > > unsigned int bitbit = start / BITS_PER_LONG;
> > > > > + unsigned int bit;
> > > > > +
> > > > > + /*
> > > > > + * Try to avoid looking at the second level map.
> > > > > + */
> > > > > + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> > > > > + start & (BITS_PER_LONG - 1));
> > > > > + if (bit < BITS_PER_LONG) {
> > > > > + return bit + bitbit * BITS_PER_LONG;
> > > > > + }
I think this approach based on a quick check of the word around next_fd is more
generic and scalable.
It just happens that for blogbench, checking only the first 64 bits allows a
quicker skip to the two-level search, whereas with this approach next_fd may be
left in a 64-bit word that actually has no open bits, so we end up doing a
useless search in find_next_zero_bit(). Perhaps we should check full_fds_bits
to make sure there are empty slots before we take the find_next_zero_bit()
fast path. Something like:
	if (!test_bit(bitbit, fdt->full_fds_bits)) {
		bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
					 start & (BITS_PER_LONG - 1));
		if (bit < BITS_PER_LONG)
			return bit + bitbit * BITS_PER_LONG;
	}
Tim
> > > > Drat, you're right. I missed that Ma did not add the proper offset to
> > > > open_fds. *This* is what I meant :)
> > > >
> > > > Honza
> > >
> > > Just tried this on v6.10-rc6, the improvement on top of patch 1 and
> > > patch 2 is 7% for read and 3% for write, less than just check first word.
> > >
> > > Per my understanding, its performance would be better if we can find
> > > free bit in the same word of next_fd with high possibility, but
> > > next_fd just represents the lowest possible free bit. If fds are
> > > open/close frequently and randomly, that might not always be the case,
> > > next_fd may be distributed randomly, for example, 0-65 are occupied,
> > > fd=3 is returned, next_fd will be set to 3, next time when 3 is
> > > allocated, next_fd will be set to 4, while the actual first free bit
> > > is 66 , when 66 is allocated, and fd=5 is returned, then the above
> > > process would be went through again.
> > >
> > > Yu
> > >
> > Hi Guzik, Honza,
> >
> > Do we have any more comment or idea regarding to the fast path? Thanks
> > for your time and any feedback :)
> >
> >
> > Regards
> >
> > Yu
> >
>
>
On 7/11/2024 7:40 AM, Tim Chen wrote:
> On Tue, 2024-07-09 at 12:17 +0200, Mateusz Guzik wrote:
>> Right, forgot to respond.
>>
>> I suspect the different result is either because of mere variance
>> between reboots or blogbench using significantly less than 100 fds at
>> any given time -- I don't have an easy way to test at your scale at
>> the moment. You could probably test that by benching both approaches
>> while switching them at runtime with a static_branch. However, I don't
>> know if that effort is warranted atm.
>>
>> So happens I'm busy with other stuff and it is not my call to either
>> block or let this in, so I'm buggering off.
>>
>> On Tue, Jul 9, 2024 at 10:32 AM Ma, Yu <yu.ma@intel.com> wrote:
>>>
>>> On 7/5/2024 3:56 PM, Ma, Yu wrote:
>>>> I had something like this in mind:
>>>>>> diff --git a/fs/file.c b/fs/file.c
>>>>>> index a3b72aa64f11..4d3307e39db7 100644
>>>>>> --- a/fs/file.c
>>>>>> +++ b/fs/file.c
>>>>>> @@ -489,6 +489,16 @@ static unsigned int find_next_fd(struct fdtable
>>>>>> *fdt, unsigned int start)
>>>>>> unsigned int maxfd = fdt->max_fds; /* always multiple of
>>>>>> BITS_PER_LONG */
>>>>>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>>>>>> unsigned int bitbit = start / BITS_PER_LONG;
>>>>>> + unsigned int bit;
>>>>>> +
>>>>>> + /*
>>>>>> + * Try to avoid looking at the second level map.
>>>>>> + */
>>>>>> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
>>>>>> + start & (BITS_PER_LONG - 1));
>>>>>> + if (bit < BITS_PER_LONG) {
>>>>>> + return bit + bitbit * BITS_PER_LONG;
>>>>>> + }
> I think this approach based on next_fd quick check is more generic and scalable.
>
> It just happen for blogbench, just checking the first 64 bit allow a quicker
> skip to the two level search where this approach, next_fd may be left
> in a 64 word that actually has no open bits and we are doing useless search
> in find_next_zero_bit(). Perhaps we should check full_fds_bits to make sure
> there are empty slots before we do
> find_next_zero_bit() fast path. Something like
>
> if (!test_bit(bitbit, fdt->full_fds_bits)) {
> bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> start & (BITS_PER_LONG - 1));
> if (bit < BITS_PER_LONG)
> return bit + bitbit * BITS_PER_LONG;
> }
> Tim
Yes, I agree that it scales better. I'll update v4 with the fast path for the
word containing next_fd and send it out for review soon.
>>>>> Drat, you're right. I missed that Ma did not add the proper offset to
>>>>> open_fds. *This* is what I meant :)
>>>>>
>>>>> Honza
>>>> Just tried this on v6.10-rc6, the improvement on top of patch 1 and
>>>> patch 2 is 7% for read and 3% for write, less than just check first word.
>>>>
>>>> Per my understanding, its performance would be better if we can find
>>>> free bit in the same word of next_fd with high possibility, but
>>>> next_fd just represents the lowest possible free bit. If fds are
>>>> open/close frequently and randomly, that might not always be the case,
>>>> next_fd may be distributed randomly, for example, 0-65 are occupied,
>>>> fd=3 is returned, next_fd will be set to 3, next time when 3 is
>>>> allocated, next_fd will be set to 4, while the actual first free bit
>>>> is 66 , when 66 is allocated, and fd=5 is returned, then the above
>>>> process would be went through again.
>>>>
>>>> Yu
>>>>
>>> Hi Guzik, Honza,
>>>
>>> Do we have any more comment or idea regarding to the fast path? Thanks
>>> for your time and any feedback :)
>>>
>>>
>>> Regards
>>>
>>> Yu
>>>
>>
On Wed 03-07-24 10:33:11, Yu Ma wrote:
> There is available fd in the lower 64 bits of open_fds bitmap for most cases
> when we look for an available fd slot. Skip 2-levels searching via
> find_next_zero_bit() for this common fast path.
>
> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> free slot is available there, as:
> (1) The fd allocation algorithm would always allocate fd from small to large.
> Lower bits in open_fds bitmap would be used much more frequently than higher
> bits.
> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> searching.
>
> As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
> update the fast path from alloc_fd() to find_next_fd(). With which, on top of
> patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
> Intel ICX 160 cores configuration with v6.10-rc6.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
Nice! The patch looks good to me. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
One style nit below:
> diff --git a/fs/file.c b/fs/file.c
> index a15317db3119..f25eca311f51 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -488,6 +488,11 @@ struct files_struct init_files = {
>
> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> {
> + unsigned int bit;
Empty line here please to separate variable declaration and code...
> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> + if (bit < BITS_PER_LONG)
> + return bit;
> +
> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> unsigned int maxbit = maxfd / BITS_PER_LONG;
> unsigned int bitbit = start / BITS_PER_LONG;
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 7/4/2024 6:03 PM, Jan Kara wrote:
> On Wed 03-07-24 10:33:11, Yu Ma wrote:
>> There is available fd in the lower 64 bits of open_fds bitmap for most cases
>> when we look for an available fd slot. Skip 2-levels searching via
>> find_next_zero_bit() for this common fast path.
>>
>> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
>> free slot is available there, as:
>> (1) The fd allocation algorithm would always allocate fd from small to large.
>> Lower bits in open_fds bitmap would be used much more frequently than higher
>> bits.
>> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
>> it would never be shrunk. The search size increases but there are few open fds
>> available here.
>> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
>> searching.
>>
>> As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
>> update the fast path from alloc_fd() to find_next_fd(). With which, on top of
>> patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
>> Intel ICX 160 cores configuration with v6.10-rc6.
>>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
> Nice! The patch looks good to me. Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
> One style nit below:
>
>> diff --git a/fs/file.c b/fs/file.c
>> index a15317db3119..f25eca311f51 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -488,6 +488,11 @@ struct files_struct init_files = {
>>
>> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>> {
>> + unsigned int bit;
> Empty line here please to separate variable declaration and code...
Thanks Honza, copy that :)
>
>> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
>> + if (bit < BITS_PER_LONG)
>> + return bit;
>> +
>> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>> unsigned int bitbit = start / BITS_PER_LONG;
> Honza
On Wed, Jul 3, 2024 at 4:07 PM Yu Ma <yu.ma@intel.com> wrote:
>
> There is available fd in the lower 64 bits of open_fds bitmap for most cases
> when we look for an available fd slot. Skip 2-levels searching via
> find_next_zero_bit() for this common fast path.
>
> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> free slot is available there, as:
> (1) The fd allocation algorithm would always allocate fd from small to large.
> Lower bits in open_fds bitmap would be used much more frequently than higher
> bits.
> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> searching.
>
> As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
> update the fast path from alloc_fd() to find_next_fd(). With which, on top of
> patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
> Intel ICX 160 cores configuration with v6.10-rc6.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
> fs/file.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/fs/file.c b/fs/file.c
> index a15317db3119..f25eca311f51 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -488,6 +488,11 @@ struct files_struct init_files = {
>
> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> {
> + unsigned int bit;
> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> + if (bit < BITS_PER_LONG)
> + return bit;
> +
The rest of the patchset looks good on cursory read.
As for this one, the suggestion was to make it work across the entire range.
Today I won't have time to write and test what we proposed, but will
probably find some time tomorrow. Perhaps Jan will do the needful(tm)
in the meantime.
That said, please stay tuned for a patch. :)
> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> unsigned int maxbit = maxfd / BITS_PER_LONG;
> unsigned int bitbit = start / BITS_PER_LONG;
> --
> 2.43.0
>
--
Mateusz Guzik <mjguzik gmail.com>
On Wed 03-07-24 16:17:01, Mateusz Guzik wrote:
> On Wed, Jul 3, 2024 at 4:07 PM Yu Ma <yu.ma@intel.com> wrote:
> >
> > There is available fd in the lower 64 bits of open_fds bitmap for most cases
> > when we look for an available fd slot. Skip 2-levels searching via
> > find_next_zero_bit() for this common fast path.
> >
> > Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> > free slot is available there, as:
> > (1) The fd allocation algorithm would always allocate fd from small to large.
> > Lower bits in open_fds bitmap would be used much more frequently than higher
> > bits.
> > (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> > it would never be shrunk. The search size increases but there are few open fds
> > available here.
> > (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> > searching.
> >
> > As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
> > update the fast path from alloc_fd() to find_next_fd(). With which, on top of
> > patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
> > Intel ICX 160 cores configuration with v6.10-rc6.
> >
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> > fs/file.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index a15317db3119..f25eca311f51 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -488,6 +488,11 @@ struct files_struct init_files = {
> >
> > static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > {
> > + unsigned int bit;
> > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > + if (bit < BITS_PER_LONG)
> > + return bit;
> > +
>
> The rest of the patchset looks good on cursory read.
>
> As for this one, the suggestion was to make it work across the entire range.
I'm not sure what exactly you mean by "make it work across the entire
range" because what Ma has implemented is exactly what I originally had in
mind - i.e., search the first word of open_fds starting from next_fd (note
that 'start' in this function is already set to max(start, next_fd)), if
that fails, go through the two level bitmap.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 7/3/2024 10:17 PM, Mateusz Guzik wrote:
> On Wed, Jul 3, 2024 at 4:07 PM Yu Ma <yu.ma@intel.com> wrote:
>> There is available fd in the lower 64 bits of open_fds bitmap for most cases
>> when we look for an available fd slot. Skip 2-levels searching via
>> find_next_zero_bit() for this common fast path.
>>
>> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
>> free slot is available there, as:
>> (1) The fd allocation algorithm would always allocate fd from small to large.
>> Lower bits in open_fds bitmap would be used much more frequently than higher
>> bits.
>> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
>> it would never be shrunk. The search size increases but there are few open fds
>> available here.
>> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
>> searching.
>>
>> As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
>> update the fast path from alloc_fd() to find_next_fd(). With which, on top of
>> patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
>> Intel ICX 160 cores configuration with v6.10-rc6.
>>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>> ---
>> fs/file.c | 5 +++++
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/fs/file.c b/fs/file.c
>> index a15317db3119..f25eca311f51 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -488,6 +488,11 @@ struct files_struct init_files = {
>>
>> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>> {
>> + unsigned int bit;
>> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
>> + if (bit < BITS_PER_LONG)
>> + return bit;
>> +
> The rest of the patchset looks good on cursory read.
>
> As for this one, the suggestion was to make it work across the entire range.
>
> Today I wont have time to write and test what we proposed, but will
> probably find some time tomorrow. Perhaps Jan will do the needful(tm)
> in the meantime.
>
> That said, please stay tuned for a patch. :)
Sure, understood, Guzik. Thanks for the quick feedback and for considering ways
to make it better and more versatile. I'll also give the previous proposal over
the entire fd range a try to double-check it.
>> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>> unsigned int bitbit = start / BITS_PER_LONG;
>> --
>> 2.43.0
>>
>
pts/blogbench-1.1.0 is a benchmark designed to replicate the load of a real-world busy file server by multiple threads of random reads, writes, and rewrites. When running default configuration with multiple parallel threads, hot spin lock contention is observed from alloc_fd(), file_closed_fd() and put_unused_fd() around file_lock. These 3 patches are created to reduce the critical section of file_lock in alloc_fd() and close_fd(). As a result, pts/blogbench-1.1.0 has been improved by 32% for read and 17% for write with over 30% kernel cycles reduced on ICX 160 cores configuration with v6.10-rc4. v1 -> v2: 1. Rebased the patch set to latest v6.10-rc4 and updated the performance results. 2. Updated patch 1 to resolve the bug identified by Mateusz Guzik with rlimit check and ensure allocated fd is greater than start. 3. Updated patch 3 to remove sanity_check directly per the alignment with maintainer Yu Ma (3): fs/file.c: add fast path in alloc_fd() fs/file.c: conditionally clear full_fds fs/file.c: remove sanity_check from alloc_fd() fs/file.c | 46 ++++++++++++++++++++++++---------------------- 1 file changed, 24 insertions(+), 22 deletions(-) base-commit: 6ba59ff4227927d3a8530fc2973b80e94b54d58f -- 2.43.0
In most cases there is a free fd available in the lower 64 bits of the open_fds
bitmap when we look for a free fd slot. Skip the 2-level search via
find_next_zero_bit() for this common fast path.
Look directly for an open bit in the lower 64 bits of the open_fds bitmap when a
free slot is available there, as:
(1) The fd allocation algorithm always allocates fds from small to large, so the
lower bits in the open_fds bitmap are used much more frequently than the higher
bits.
(2) After fdt is expanded (the bitmap size is doubled on each expansion), it is
never shrunk. The search size increases, but there are few open fds there.
(3) find_next_zero_bit() itself has an internal fast path to speed up the search
when size <= 64.
Besides, "!start" is added to the fast path condition to ensure the allocated fd
is not below start (i.e. any fd >= 0 when start == 0), given that alloc_fd() is
only called in two scenarios:
(1) Allocating a new fd (the most common usage) via get_unused_fd_flags(), which
searches for an fd starting from bit 0 in fdt (i.e. start == 0).
(2) Duplicating an fd (less common usage) via f_dupfd(), which searches for an
fd starting from old_fd's index in fdt and is only called from the fcntl
syscall.
With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved by
17% and write by 9% on an Intel ICX 160-core configuration with v6.10-rc4.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 35 +++++++++++++++++++++--------------
1 file changed, 21 insertions(+), 14 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index a3b72aa64f11..50e900a47107 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
if (fd < files->next_fd)
fd = files->next_fd;
- if (fd < fdt->max_fds)
+ error = -EMFILE;
+ if (likely(fd < fdt->max_fds)) {
+ if (~fdt->open_fds[0] && !start) {
+ fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
+ goto fastreturn;
+ }
fd = find_next_fd(fdt, fd);
+ }
+
+ if (unlikely(fd >= fdt->max_fds)) {
+ error = expand_files(files, fd);
+ if (error < 0)
+ goto out;
+ /*
+ * If we needed to expand the fs array we
+ * might have blocked - try again.
+ */
+ if (error)
+ goto repeat;
+ }
+fastreturn:
/*
* N.B. For clone tasks sharing a files structure, this test
* will limit the total number of files that can be opened.
*/
- error = -EMFILE;
- if (fd >= end)
+ if (unlikely(fd >= end))
goto out;
- error = expand_files(files, fd);
- if (error < 0)
- goto out;
-
- /*
- * If we needed to expand the fs array we
- * might have blocked - try again.
- */
- if (error)
- goto repeat;
-
if (start <= files->next_fd)
files->next_fd = fd + 1;
--
2.43.0
On Sat 22-06-24 11:49:02, Yu Ma wrote:
> There is available fd in the lower 64 bits of open_fds bitmap for most cases
> when we look for an available fd slot. Skip 2-levels searching via
> find_next_zero_bit() for this common fast path.
>
> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> free slot is available there, as:
> (1) The fd allocation algorithm would always allocate fd from small to large.
> Lower bits in open_fds bitmap would be used much more frequently than higher
> bits.
> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
> (3) find_next_zero_bit() itself has a fast path inside to speed up searching
> when size<=64.
>
> Besides, "!start" is added to fast path condition to ensure the allocated fd is
> greater than start (i.e. >=0), given alloc_fd() is only called in two scenarios:
> (1) Allocating a new fd (the most common usage scenario) via
> get_unused_fd_flags() to find fd start from bit 0 in fdt (i.e. start==0).
> (2) Duplicating a fd (less common usage) via dup_fd() to find a fd start from
> old_fd's index in fdt, which is only called by syscall fcntl.
>
> With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved
> by 17% and write by 9% on Intel ICX 160 cores configuration with v6.10-rc4.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
> fs/file.c | 35 +++++++++++++++++++++--------------
> 1 file changed, 21 insertions(+), 14 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index a3b72aa64f11..50e900a47107 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> if (fd < files->next_fd)
> fd = files->next_fd;
>
> - if (fd < fdt->max_fds)
> + error = -EMFILE;
> + if (likely(fd < fdt->max_fds)) {
> + if (~fdt->open_fds[0] && !start) {
> + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
So I don't think this is quite correct. If files->next_fd is set, we could
end up calling find_next_zero_bit() starting from quite high offset causing
a regression? Also because we don't expand in this case, we could cause access
beyond end of fdtable?
Finally, AFAIU this speeds up the lookup for cases where fd < 64 is
available at the cost of cases where the first long is full (there we
unnecessarily load open_fds[0] into cache). Did you check if the cost is
visible (e.g. by making blogbench occupy first 64 fds before starting its
load)?
Honza
> + goto fastreturn;
> + }
> fd = find_next_fd(fdt, fd);
> + }
> +
> + if (unlikely(fd >= fdt->max_fds)) {
> + error = expand_files(files, fd);
> + if (error < 0)
> + goto out;
> + /*
> + * If we needed to expand the fs array we
> + * might have blocked - try again.
> + */
> + if (error)
> + goto repeat;
> + }
>
> +fastreturn:
> /*
> * N.B. For clone tasks sharing a files structure, this test
> * will limit the total number of files that can be opened.
> */
> - error = -EMFILE;
> - if (fd >= end)
> + if (unlikely(fd >= end))
> goto out;
>
> - error = expand_files(files, fd);
> - if (error < 0)
> - goto out;
> -
> - /*
> - * If we needed to expand the fs array we
> - * might have blocked - try again.
> - */
> - if (error)
> - goto repeat;
> -
> if (start <= files->next_fd)
> files->next_fd = fd + 1;
>
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Tue 25-06-24 13:52:57, Jan Kara wrote:
> On Sat 22-06-24 11:49:02, Yu Ma wrote:
> > There is available fd in the lower 64 bits of open_fds bitmap for most cases
> > when we look for an available fd slot. Skip 2-levels searching via
> > find_next_zero_bit() for this common fast path.
> >
> > Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> > free slot is available there, as:
> > (1) The fd allocation algorithm would always allocate fd from small to large.
> > Lower bits in open_fds bitmap would be used much more frequently than higher
> > bits.
> > (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> > it would never be shrunk. The search size increases but there are few open fds
> > available here.
> > (3) find_next_zero_bit() itself has a fast path inside to speed up searching
> > when size<=64.
> >
> > Besides, "!start" is added to fast path condition to ensure the allocated fd is
> > greater than start (i.e. >=0), given alloc_fd() is only called in two scenarios:
> > (1) Allocating a new fd (the most common usage scenario) via
> > get_unused_fd_flags() to find fd start from bit 0 in fdt (i.e. start==0).
> > (2) Duplicating a fd (less common usage) via dup_fd() to find a fd start from
> > old_fd's index in fdt, which is only called by syscall fcntl.
> >
> > With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved
> > by 17% and write by 9% on Intel ICX 160 cores configuration with v6.10-rc4.
> >
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> > fs/file.c | 35 +++++++++++++++++++++--------------
> > 1 file changed, 21 insertions(+), 14 deletions(-)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index a3b72aa64f11..50e900a47107 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > if (fd < files->next_fd)
> > fd = files->next_fd;
> >
> > - if (fd < fdt->max_fds)
> > + error = -EMFILE;
> > + if (likely(fd < fdt->max_fds)) {
> > + if (~fdt->open_fds[0] && !start) {
> > + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
>
> So I don't think this is quite correct. If files->next_fd is set, we could
> end up calling find_next_zero_bit() starting from quite high offset causing
> a regression? Also because we don't expand in this case, we could cause access
> beyond end of fdtable?
OK, I've misunderstood the next_fd logic. next_fd is the lowest 0-bit in
the open_fds bitmap so if next_fd is big, the ~fdt->open_fds[0] should
be false. As such the above condition could be rewritten as:
if (!start && files->next_fd < BITS_PER_LONG)
to avoid loading the first bitmap long if we know it is full? Or we could
maybe go as far as:
if (!start && fd < BITS_PER_LONG && !test_bit(fd, fdt->open_fds))
goto fastreturn;
because AFAIU this should work in exactly the same cases as your code?
Honza
> > + goto fastreturn;
> > + }
> > fd = find_next_fd(fdt, fd);
> > + }
> > +
> > + if (unlikely(fd >= fdt->max_fds)) {
> > + error = expand_files(files, fd);
> > + if (error < 0)
> > + goto out;
> > + /*
> > + * If we needed to expand the fs array we
> > + * might have blocked - try again.
> > + */
> > + if (error)
> > + goto repeat;
> > + }
> >
> > +fastreturn:
> > /*
> > * N.B. For clone tasks sharing a files structure, this test
> > * will limit the total number of files that can be opened.
> > */
> > - error = -EMFILE;
> > - if (fd >= end)
> > + if (unlikely(fd >= end))
> > goto out;
> >
> > - error = expand_files(files, fd);
> > - if (error < 0)
> > - goto out;
> > -
> > - /*
> > - * If we needed to expand the fs array we
> > - * might have blocked - try again.
> > - */
> > - if (error)
> > - goto repeat;
> > -
> > if (start <= files->next_fd)
> > files->next_fd = fd + 1;
> >
> > --
> > 2.43.0
> >
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 6/25/2024 8:53 PM, Jan Kara wrote:
> On Tue 25-06-24 13:52:57, Jan Kara wrote:
>> On Sat 22-06-24 11:49:02, Yu Ma wrote:
>>> There is available fd in the lower 64 bits of open_fds bitmap for most cases
>>> when we look for an available fd slot. Skip 2-levels searching via
>>> find_next_zero_bit() for this common fast path.
>>>
>>> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
>>> free slot is available there, as:
>>> (1) The fd allocation algorithm would always allocate fd from small to large.
>>> Lower bits in open_fds bitmap would be used much more frequently than higher
>>> bits.
>>> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
>>> it would never be shrunk. The search size increases but there are few open fds
>>> available here.
>>> (3) find_next_zero_bit() itself has a fast path inside to speed up searching
>>> when size<=64.
>>>
>>> Besides, "!start" is added to fast path condition to ensure the allocated fd is
>>> greater than start (i.e. >=0), given alloc_fd() is only called in two scenarios:
>>> (1) Allocating a new fd (the most common usage scenario) via
>>> get_unused_fd_flags() to find fd start from bit 0 in fdt (i.e. start==0).
>>> (2) Duplicating a fd (less common usage) via dup_fd() to find a fd start from
>>> old_fd's index in fdt, which is only called by syscall fcntl.
>>>
>>> With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved
>>> by 17% and write by 9% on Intel ICX 160 cores configuration with v6.10-rc4.
>>>
>>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>>> ---
>>> fs/file.c | 35 +++++++++++++++++++++--------------
>>> 1 file changed, 21 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/fs/file.c b/fs/file.c
>>> index a3b72aa64f11..50e900a47107 100644
>>> --- a/fs/file.c
>>> +++ b/fs/file.c
>>> @@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>>> if (fd < files->next_fd)
>>> fd = files->next_fd;
>>>
>>> - if (fd < fdt->max_fds)
>>> + error = -EMFILE;
>>> + if (likely(fd < fdt->max_fds)) {
>>> + if (~fdt->open_fds[0] && !start) {
>>> + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
>> So I don't think this is quite correct. If files->next_fd is set, we could
>> end up calling find_next_zero_bit() starting from quite high offset causing
>> a regression? Also because we don't expand in this case, we could cause access
>> beyond end of fdtable?
> OK, I've misunderstood the next_fd logic. next_fd is the lowest 0-bit in
> the open_fds bitmap so if next_fd is big, the ~fdt->open_fds[0] should
> be false. As such the above condition could be rewritten as:
>
> if (!start && files->next_fd < BITS_PER_LONG)
>
> to avoid loading the first bitmap long if we know it is full? Or we could
> maybe go as far as:
>
> if (!start && fd < BITS_PER_LONG && !test_bit(fd, fdt->open_fds))
> goto fastreturn;
>
> because AFAIU this should work in exactly the same cases as your code?
>
> Honza
Thanks Honza for the good concern and suggestions here, while both above
conditions are not enough to ensure that there is available fd in the
first 64 bits of open_fds. As next_fd just means there is no available
fd before next_fd, just imagine that fd from 0 to 66 are already
occupied, now fd=3 is returned back, then next_fd would be set as 3 per
fd recycling logic (i.e. in __put_unused_fd()), next time when
alloc_fd() being called, it would return fd=3 to the caller and set
next_fd=4. Then next time when alloc_fd() being called again,
next_fd==4, but actually it's already been occupied. So
find_next_zero_bit() is needed to find the real 0 bit anyway. The
conditions should either be like it is in patch or if (!start &&
!test_bit(0, fdt->full_fds_bits)), the latter should also have the
bitmap loading cost, but another point is that a bit in full_fds_bits
represents 64 bits in open_fds, no matter fd >64 or not, full_fds_bits
should be loaded any way, maybe we can modify the condition to use
full_fds_bits ?
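The recycling logic referred to above is essentially this (a sketch of
__put_unused_fd() in fs/file.c; the real code may differ in detail):

	static void __put_unused_fd(struct files_struct *files, unsigned int fd)
	{
		struct fdtable *fdt = files_fdtable(files);

		/* Mark the slot free again (open_fds and the full_fds_bits word). */
		__clear_open_fd(fd, fdt);
		if (fd < files->next_fd)
			files->next_fd = fd;	/* next_fd is only a lower bound */
	}

So after fds 0-66 are opened and fd 3 is closed, next_fd becomes 3; once fd 3 is
handed out again, next_fd is bumped to 4 even though 4 is still in use, which is
why the bitmap search is still needed.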
>>> + goto fastreturn;
>>> + }
>>> fd = find_next_fd(fdt, fd);
>>> + }
>>> +
>>> + if (unlikely(fd >= fdt->max_fds)) {
>>> + error = expand_files(files, fd);
>>> + if (error < 0)
>>> + goto out;
>>> + /*
>>> + * If we needed to expand the fs array we
>>> + * might have blocked - try again.
>>> + */
>>> + if (error)
>>> + goto repeat;
>>> + }
>>>
>>> +fastreturn:
>>> /*
>>> * N.B. For clone tasks sharing a files structure, this test
>>> * will limit the total number of files that can be opened.
>>> */
>>> - error = -EMFILE;
>>> - if (fd >= end)
>>> + if (unlikely(fd >= end))
>>> goto out;
>>>
>>> - error = expand_files(files, fd);
>>> - if (error < 0)
>>> - goto out;
>>> -
>>> - /*
>>> - * If we needed to expand the fs array we
>>> - * might have blocked - try again.
>>> - */
>>> - if (error)
>>> - goto repeat;
>>> -
>>> if (start <= files->next_fd)
>>> files->next_fd = fd + 1;
>>>
>>> --
>>> 2.43.0
>>>
>> --
>> Jan Kara <jack@suse.com>
>> SUSE Labs, CR
>>
On Tue 25-06-24 23:33:34, Ma, Yu wrote:
> On 6/25/2024 8:53 PM, Jan Kara wrote:
> > On Tue 25-06-24 13:52:57, Jan Kara wrote:
> > > On Sat 22-06-24 11:49:02, Yu Ma wrote:
> > > > There is available fd in the lower 64 bits of open_fds bitmap for most cases
> > > > when we look for an available fd slot. Skip 2-levels searching via
> > > > find_next_zero_bit() for this common fast path.
> > > >
> > > > Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> > > > free slot is available there, as:
> > > > (1) The fd allocation algorithm would always allocate fd from small to large.
> > > > Lower bits in open_fds bitmap would be used much more frequently than higher
> > > > bits.
> > > > (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> > > > it would never be shrunk. The search size increases but there are few open fds
> > > > available here.
> > > > (3) find_next_zero_bit() itself has a fast path inside to speed up searching
> > > > when size<=64.
> > > >
> > > > Besides, "!start" is added to fast path condition to ensure the allocated fd is
> > > > greater than start (i.e. >=0), given alloc_fd() is only called in two scenarios:
> > > > (1) Allocating a new fd (the most common usage scenario) via
> > > > get_unused_fd_flags() to find fd start from bit 0 in fdt (i.e. start==0).
> > > > (2) Duplicating a fd (less common usage) via dup_fd() to find a fd start from
> > > > old_fd's index in fdt, which is only called by syscall fcntl.
> > > >
> > > > With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved
> > > > by 17% and write by 9% on Intel ICX 160 cores configuration with v6.10-rc4.
> > > >
> > > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > > > ---
> > > > fs/file.c | 35 +++++++++++++++++++++--------------
> > > > 1 file changed, 21 insertions(+), 14 deletions(-)
> > > >
> > > > diff --git a/fs/file.c b/fs/file.c
> > > > index a3b72aa64f11..50e900a47107 100644
> > > > --- a/fs/file.c
> > > > +++ b/fs/file.c
> > > > @@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > > > if (fd < files->next_fd)
> > > > fd = files->next_fd;
> > > > - if (fd < fdt->max_fds)
> > > > + error = -EMFILE;
> > > > + if (likely(fd < fdt->max_fds)) {
> > > > + if (~fdt->open_fds[0] && !start) {
> > > > + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
> > > So I don't think this is quite correct. If files->next_fd is set, we could
> > > end up calling find_next_zero_bit() starting from quite high offset causing
> > > a regression? Also because we don't expand in this case, we could cause access
> > > beyond end of fdtable?
> > OK, I've misunderstood the next_fd logic. next_fd is the lowest 0-bit in
> > the open_fds bitmap so if next_fd is big, the ~fdt->open_fds[0] should
> > be false. As such the above condition could be rewritten as:
> >
> > if (!start && files->next_fd < BITS_PER_LONG)
> >
> > to avoid loading the first bitmap long if we know it is full? Or we could
> > maybe go as far as:
> >
> > if (!start && fd < BITS_PER_LONG && !test_bit(fd, fdt->open_fds))
> > goto fastreturn;
> >
> > because AFAIU this should work in exactly the same cases as your code?
> >
> > Honza
>
> Thanks Honza for the good concern and suggestions here, while both above
> conditions are not enough to ensure that there is available fd in the first
> 64 bits of open_fds. As next_fd just means there is no available fd before
> next_fd, just imagine that fd from 0 to 66 are already occupied, now fd=3 is
> returned back, then next_fd would be set as 3 per fd recycling logic (i.e.
> in __put_unused_fd()), next time when alloc_fd() being called, it would
> return fd=3 to the caller and set next_fd=4. Then next time when alloc_fd()
> being called again, next_fd==4, but actually it's already been occupied. So
> find_next_zero_bit() is needed to find the real 0 bit anyway.
Indeed, thanks for correcting me! next_fd is just a lower bound for the
first free fd.
> The conditions
> should either be like it is in patch or if (!start && !test_bit(0,
> fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> but another point is that a bit in full_fds_bits represents 64 bits in
> open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> maybe we can modify the condition to use full_fds_bits ?
So maybe I'm wrong but I think the biggest benefit of your code compared to
plain find_next_fd() is exactly in that we don't have to load full_fds_bits
into cache. So I'm afraid that using full_fds_bits in the condition would
destroy your performance gains. Thinking about this with a fresh head how
about implementing your optimization like:
--- a/fs/file.c
+++ b/fs/file.c
@@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
unsigned int maxbit = maxfd / BITS_PER_LONG;
unsigned int bitbit = start / BITS_PER_LONG;
+ /*
+ * Optimistically search the first long of the open_fds bitmap. It
+ * saves us from loading full_fds_bits into cache in the common case
+ * and because BITS_PER_LONG > start >= files->next_fd, we have quite
+ * a good chance there's a bit free in there.
+ */
+ if (start < BITS_PER_LONG) {
+ unsigned int bit;
+
+ bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
+ if (bit < BITS_PER_LONG)
+ return bit;
+ }
+
bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
if (bitbit >= maxfd)
return maxfd;
Plus your optimizations with likely / unlikely. This way the code flow in
alloc_fd() stays more readable, we avoid loading the first open_fds long
into cache if it is full, and we should get all the performance benefits?
Honza
> > > > + goto fastreturn;
> > > > + }
> > > > fd = find_next_fd(fdt, fd);
> > > > + }
> > > > +
> > > > + if (unlikely(fd >= fdt->max_fds)) {
> > > > + error = expand_files(files, fd);
> > > > + if (error < 0)
> > > > + goto out;
> > > > + /*
> > > > + * If we needed to expand the fs array we
> > > > + * might have blocked - try again.
> > > > + */
> > > > + if (error)
> > > > + goto repeat;
> > > > + }
> > > > +fastreturn:
> > > > /*
> > > > * N.B. For clone tasks sharing a files structure, this test
> > > > * will limit the total number of files that can be opened.
> > > > */
> > > > - error = -EMFILE;
> > > > - if (fd >= end)
> > > > + if (unlikely(fd >= end))
> > > > goto out;
> > > > - error = expand_files(files, fd);
> > > > - if (error < 0)
> > > > - goto out;
> > > > -
> > > > - /*
> > > > - * If we needed to expand the fs array we
> > > > - * might have blocked - try again.
> > > > - */
> > > > - if (error)
> > > > - goto repeat;
> > > > -
> > > > if (start <= files->next_fd)
> > > > files->next_fd = fd + 1;
> > > > --
> > > > 2.43.0
> > > >
> > > --
> > > Jan Kara <jack@suse.com>
> > > SUSE Labs, CR
> > >
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@suse.cz> wrote:
> So maybe I'm wrong but I think the biggest benefit of your code compared to
> plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> into cache. So I'm afraid that using full_fds_bits in the condition would
> destroy your performance gains. Thinking about this with a fresh head how
> about putting implementing your optimization like:
>
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> unsigned int maxbit = maxfd / BITS_PER_LONG;
> unsigned int bitbit = start / BITS_PER_LONG;
>
> + /*
> + * Optimistically search the first long of the open_fds bitmap. It
> + * saves us from loading full_fds_bits into cache in the common case
> + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> + * a good chance there's a bit free in there.
> + */
> + if (start < BITS_PER_LONG) {
> + unsigned int bit;
> +
> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> + if (bit < BITS_PER_LONG)
> + return bit;
> + }
> +
> bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> if (bitbit >= maxfd)
> return maxfd;
>
> Plus your optimizations with likely / unlikely. This way the code flow in
> alloc_fd() stays more readable, we avoid loading the first open_fds long
> into cache if it is full, and we should get all the performance benefits?
>
Huh.
So when I read the patch previously I assumed this is testing the bit
word for the map containing next_fd (whatever it is), avoiding looking
at the higher level bitmap and inlining the op (instead of calling the
fully fledged func for bit scans).
I did not mentally register this is in fact only checking for the
beginning of the range of the entire thing. So apologies from my end
as based on my feedback some work was done and I'm going to ask to
further redo it.
blogbench spawns 100 or so workers, say total fd count hovers just
above 100. Say this lines up with about half or more of the cases in practice
for that benchmark.
Even so, that's a benchmark-specific optimization. A busy web server
can have literally tens of thousands of fds open (and this is a pretty
mundane case), making the 0-63 range not particularly interesting.
That aside I think the patchset is in the wrong order -- first patch
tries to not look at the higher level bitmap, while second reduces
stores made there. This makes it quite unclear how much it is worth to
reduce looking there if atomics are conditional.
So here is what I propose in terms of the patches:
1. NULL check removal, sprinkling of likely/unlikely and expand_files
call avoidance; no measurements done vs stock kernel for some effort
saving, just denote in the commit message there is less work under the
lock and treat it as baseline
2. conditional higher level bitmap clear as submitted; benchmarked against 1
3. open_fds check within the range containing fd, avoiding higher
level bitmap if a free slot is found. this should not result in any
func calls if successful; benchmarked against the above
Optionally the bitmap routines can grow variants which always inline
and are used here. If so that would probably land between 1 and 2 on
the list.
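One hypothetical shape for such an always-inline variant (illustration only;
this helper does not exist in the tree and the name is made up):

	/* Scan a single open_fds word for a free bit at or above 'offset';
	 * the caller guarantees offset < BITS_PER_LONG. */
	static __always_inline unsigned int
	scan_word_for_free_bit(unsigned long word, unsigned int offset)
	{
		word |= (1UL << offset) - 1;	/* treat bits below 'offset' as used */
		return word == ~0UL ? BITS_PER_LONG : ffz(word);
	}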
You noted you know about blogbench bugs and have them fixed. Would be
good to post a link to a pull request or some other spot for a
reference.
It'll be best if the vfs folk comment on what they want here.
--
Mateusz Guzik <mjguzik gmail.com>
On Wed, Jun 26, 2024 at 09:13:07PM GMT, Mateusz Guzik wrote:
> On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@suse.cz> wrote:
> > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > into cache. So I'm afraid that using full_fds_bits in the condition would
> > destroy your performance gains. Thinking about this with a fresh head how
> > about putting implementing your optimization like:
> >
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > unsigned int bitbit = start / BITS_PER_LONG;
> >
> > + /*
> > + * Optimistically search the first long of the open_fds bitmap. It
> > + * saves us from loading full_fds_bits into cache in the common case
> > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > + * a good chance there's a bit free in there.
> > + */
> > + if (start < BITS_PER_LONG) {
> > + unsigned int bit;
> > +
> > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > + if (bit < BITS_PER_LONG)
> > + return bit;
> > + }
> > +
> > bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> > if (bitbit >= maxfd)
> > return maxfd;
> >
> > Plus your optimizations with likely / unlikely. This way the code flow in
> > alloc_fd() stays more readable, we avoid loading the first open_fds long
> > into cache if it is full, and we should get all the performance benefits?
> >
>
> Huh.
>
> So when I read the patch previously I assumed this is testing the bit
> word for the map containing next_fd (whatever it is), avoiding looking
> at the higher level bitmap and inlining the op (instead of calling the
> fully fledged func for bit scans).
>
> I did not mentally register this is in fact only checking for the
> beginning of the range of the entire thing. So apologies from my end
> as based on my feedback some work was done and I'm going to ask to
> further redo it.
>
> blogbench spawns 100 or so workers, say total fd count hovers just
> above 100. say this lines up with about half of more cases in practice
> for that benchmark.
>
> Even so, that's a benchmark-specific optimization. A busy web server
> can have literally tens of thousands of fds open (and this is a pretty
> mundane case), making the 0-63 range not particularly interesting.
>
> That aside I think the patchset is in the wrong order -- first patch
> tries to not look at the higher level bitmap, while second reduces
> stores made there. This makes it quite unclear how much is it worth to
> reduce looking there if atomics are conditional.
>
> So here is what I propose in terms of the patches:
> 1. NULL check removal, sprinkling of likely/unlikely and expand_files
> call avoidance; no measurements done vs stock kernel for some effort
> saving, just denote in the commit message there is less work under the
> lock and treat it as baseline
> 2. conditional higher level bitmap clear as submitted; benchmarked against 1
> 3. open_fds check within the range containing fd, avoiding higher
> level bitmap if a free slot is found. this should not result in any
> func calls if successful; benchmarked against the above
>
> Optionally the bitmap routines can grow variants which always inline
> and are used here. If so that would probably land between 1 and 2 on
> the list.
>
> You noted you know about blogbench bugs and have them fixed. Would be
> good to post a link to a pull request or some other spot for a
> reference.
>
> I'll be best if the vfs folk comment on what they want here.
Optimizing only the < BITS_PER_LONG case seems less desirable than making it
work for an arbitrary next_fd. Imho, it'll also be easier to follow if
everything follows the same logic.
On 6/27/2024 11:33 PM, Christian Brauner wrote:
> On Wed, Jun 26, 2024 at 09:13:07PM GMT, Mateusz Guzik wrote:
>> On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@suse.cz> wrote:
>>> So maybe I'm wrong but I think the biggest benefit of your code compared to
>>> plain find_next_fd() is exactly in that we don't have to load full_fds_bits
>>> into cache. So I'm afraid that using full_fds_bits in the condition would
>>> destroy your performance gains. Thinking about this with a fresh head how
>>> about putting implementing your optimization like:
>>>
>>> --- a/fs/file.c
>>> +++ b/fs/file.c
>>> @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>>> unsigned int bitbit = start / BITS_PER_LONG;
>>>
>>> + /*
>>> + * Optimistically search the first long of the open_fds bitmap. It
>>> + * saves us from loading full_fds_bits into cache in the common case
>>> + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
>>> + * a good chance there's a bit free in there.
>>> + */
>>> + if (start < BITS_PER_LONG) {
>>> + unsigned int bit;
>>> +
>>> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
>>> + if (bit < BITS_PER_LONG)
>>> + return bit;
>>> + }
>>> +
>>> bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
>>> if (bitbit >= maxfd)
>>> return maxfd;
>>>
>>> Plus your optimizations with likely / unlikely. This way the code flow in
>>> alloc_fd() stays more readable, we avoid loading the first open_fds long
>>> into cache if it is full, and we should get all the performance benefits?
>>>
>> Huh.
>>
>> So when I read the patch previously I assumed this is testing the bit
>> word for the map containing next_fd (whatever it is), avoiding looking
>> at the higher level bitmap and inlining the op (instead of calling the
>> fully fledged func for bit scans).
>>
>> I did not mentally register this is in fact only checking for the
>> beginning of the range of the entire thing. So apologies from my end
>> as based on my feedback some work was done and I'm going to ask to
>> further redo it.
>>
>> blogbench spawns 100 or so workers, say total fd count hovers just
>> above 100. say this lines up with about half of more cases in practice
>> for that benchmark.
>>
>> Even so, that's a benchmark-specific optimization. A busy web server
>> can have literally tens of thousands of fds open (and this is a pretty
>> mundane case), making the 0-63 range not particularly interesting.
>>
>> That aside I think the patchset is in the wrong order -- first patch
>> tries to not look at the higher level bitmap, while second reduces
>> stores made there. This makes it quite unclear how much is it worth to
>> reduce looking there if atomics are conditional.
>>
>> So here is what I propose in terms of the patches:
>> 1. NULL check removal, sprinkling of likely/unlikely and expand_files
>> call avoidance; no measurements done vs stock kernel for some effort
>> saving, just denote in the commit message there is less work under the
>> lock and treat it as baseline
>> 2. conditional higher level bitmap clear as submitted; benchmarked against 1
>> 3. open_fds check within the range containing fd, avoiding higher
>> level bitmap if a free slot is found. this should not result in any
>> func calls if successful; benchmarked against the above
>>
>> Optionally the bitmap routines can grow variants which always inline
>> and are used here. If so that would probably land between 1 and 2 on
>> the list.
>>
>> You noted you know about blogbench bugs and have them fixed. Would be
>> good to post a link to a pull request or some other spot for a
>> reference.
>>
>> I'll be best if the vfs folk comment on what they want here.
> Optimizing only the < BIT_PER_LONG seems less desirable then making it
> work for arbitrary next_fd. Imho, it'll also be easier to follow if
> everything follows the same logic.
Sorry that this message is a bit long, and thanks for your time to review.
First, sincere thanks to all for the lively discussion and kind suggestions
to make the patch set better. At least we have already reached agreement on
removing sanity_check and adding the conditional clear (p.s. I'll revise the
BUG_ON to a WARN_ON in fd_install() as agreed). I fully agree with Guzik's
suggestion to reorder the patches. As the remaining focus of the discussion
is the fast path, I suggest we submit patches 1 & 2 (after reordering) for
upstreaming first (with data remeasured on the latest kernel version
accordingly), and then focus the discussion on the fast path.
For this fast path idea, here is a summary of why I still think it is
valuable, for further discussion:
1. The original intention of the fast path is to reduce function calls and
avoid unnecessary loads/stores on the members sharing the same cacheline
(such as file_lock, next_fd and the 3 bitmaps; BTW, we tried adding
__cacheline_aligned_in_smp for next_fd and the fd array, with no improvement
observed), especially since all these operations are inside the critical
section of file_lock.
2. For the fast path implementation, the essential and simple point is to
directly return an available bit if there is a free bit in [0-63]. I'd
emphasize that it does not only help the case of a low number of open fds
(even though that is the majority case on a system, as Honza agreed), but
also the cases where lots of fds are opened and closed frequently by
short-lived tasks (as per the algorithm, lower bits are preferred for
allocation after being recycled). This covers not only blogbench, a
synthetic benchmark, but also the realistic scenario claimed in
f3f86e33dc3d ("vfs: Fix pathological performance case for __alloc_fd()"),
which literally introduced the 2-level bitmap searching algorithm to vfs as
we see it now. We may ask Eric for help to see whether it's possible to get
some data on it. Besides, for workloads where lots of fds are allocated and
held for a long time, the lock contention would be much lower than in the
scenarios we're talking about here, so the impact of the change would also
be much smaller.
3. Now about the extra cost of the fast path as implemented in the patch we
submitted. To be honest, before the fast path we first noticed that
alloc_fd() is only called in two scenarios, as I mentioned in the commit:
(1) called by __get_unused_fd_flags() to find an available fd starting from
bit 0, which is the most common usage. That means start == 0 for alloc_fd(),
and with this premise the alloc_fd() logic can be simpler: the two branches
comparing against next_fd can be dropped. (2) duplicating an fd via dup_fd()
to find an fd starting from old_fd, which means "start" is not zero, but it
is only called by the fcntl syscall. The first impression that came to mind
was: why sacrifice the performance of the vast majority of cases for this
specific dup case? So we revised __get_unused_fd_flags() to not call
alloc_fd() directly, but to use the same logic as alloc_fd() with the
branches related to "start" omitted. Based on that version we then found the
fast path could be more efficient than the stock 2-level bitmap search,
based on the considerations stated in item 2 and in the commit message of
this thread. Leaving the benefit aside, the extra cost is one conditional
branch, but with the two branches related to "start" removed it is still
profitable, not to mention that the cost can be alleviated by the branch
predictor. However, with that draft version the code of
__get_unused_fd_flags() duplicated a lot of alloc_fd(), so we changed to the
current version for more concise code. What I want to say is that there is
room to make it faster at a cost lower than stock. As for using open_fds[0]
in the fast path condition, we think it's OK since all the bitmaps are
almost on the same cacheline, and we finally need to write open_fds to
allocate the bit anyway.
4. Based on the patch order suggested by Guzik, we've re-measured the data
on the latest kernel 6.10-rc5: removing sanity_check and adding
likely/unlikely gives a 6% gain for read and 2% for write. Combined with the
conditional clear, it gives a 14% gain for read and 8% for write. With the
fast path it might add another ~15% gain for read (we have not re-measured
this one yet due to time, and will make it up soon).
On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote:
> 2. For fast path implementation, the essential and simple point is to
> directly return an available bit if there is free bit in [0-63]. I'd
> emphasize that it does not only improve low number of open fds (even it
> is the majority case on system as Honza agreed), but also improve the
> cases that lots of fds open/close frequently with short task (as per the
> algorithm, lower bits will be prioritized to allocate after being
> recycled). Not only blogbench, a synthetic benchmark, but also the
> realistic scenario as claimed in f3f86e33dc3d("vfs: Fix pathological
> performance case for __alloc_fd()"), which literally introduced this
> 2-levels bitmap searching algorithm to vfs as we see now.
I don't understand how using next_fd instead is supposed to be inferior.
Maybe I should clarify that by API contract the kernel must return the
lowest free fd it can find. To that end it maintains the next_fd field
as a hint to hopefully avoid some of the search work.
In the stock kernel the first thing done in alloc_fd is setting it as
a starting point:
fdt = files_fdtable(files);
fd = start;
if (fd < files->next_fd)
fd = files->next_fd;
that is all the calls which come here with 0 start their search from
next_fd position.
Suppose you implemented the patch as suggested by me and next_fd fits
the range of 0-63. Then you get the benefit of lower level bitmap
check just like in the patch you submitted, but without having to
first branch on whether you happen to be in that range.
Suppose next_fd is somewhere higher up, say 80. With your general
approach the optimization won't be done whatsoever or it will be
attempted at the 0-63 range when it is an invariant it finds no free
fds.
With what I'm suggesting the general idea of taking a peek at the
lower level bitmap can be applied across the entire fd space. Some
manual mucking will be needed to make sure this never pulls more than
one cacheline, easiest way out I see would be to align next_fd to
BITS_PER_LONG for the bitmap search purposes.
Outside of the scope of this patchset, but definitely worth
considering, is an observation that this still pulls an entire
cacheline worth of a bitmap (assuming it grew). If one could assume
that the size is always a multiple of 64 bytes (which it is after
first expansion) the 64 byte scan could be entirely inlined -- there
is quite a bit of fd fields in this range we may as well scan in hopes
of avoiding looking at the higher level bitmap, after all we already
paid for fetching it. This would take the optimization to its logical
conclusion.
Perhaps it would be ok to special-case the lower bitmap to start with
64 bytes so that there would be no need to branch on it.
On 6/28/2024 3:59 AM, Mateusz Guzik wrote:
> On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote:
>> 2. For fast path implementation, the essential and simple point is to
>> directly return an available bit if there is free bit in [0-63]. I'd
>> emphasize that it does not only improve low number of open fds (even it
>> is the majority case on system as Honza agreed), but also improve the
>> cases that lots of fds open/close frequently with short task (as per the
>> algorithm, lower bits will be prioritized to allocate after being
>> recycled). Not only blogbench, a synthetic benchmark, but also the
>> realistic scenario as claimed in f3f86e33dc3d("vfs: Fix pathological
>> performance case for __alloc_fd()"), which literally introduced this
>> 2-levels bitmap searching algorithm to vfs as we see now.
> I don't understand how using next_fd instead is supposed to be inferior.
>
> Maybe I should clarify that by API contract the kernel must return the
> lowest free fd it can find. To that end it maintains the next_fd field
> as a hint to hopefully avoid some of the search work.
>
> In the stock kernel the first thing done in alloc_fd is setting it as
> a starting point:
> fdt = files_fdtable(files);
> fd = start;
> if (fd < files->next_fd)
> fd = files->next_fd;
>
> that is all the calls which come here with 0 start their search from
> next_fd position.
>
> Suppose you implemented the patch as suggested by me and next_fd fits
> the range of 0-63. Then you get the benefit of lower level bitmap
> check just like in the patch you submitted, but without having to
> first branch on whether you happen to be in that range.
>
> Suppose next_fd is somewhere higher up, say 80. With your general
> approach the optimization wont be done whatsoever or it will be
> attempted at the 0-63 range when it is an invariant it finds no free
> fds.
>
> With what I'm suggesting the general idea of taking a peek at the
> lower level bitmap can be applied across the entire fd space. Some
> manual mucking will be needed to make sure this never pulls more than
> one cacheline, easiest way out I see would be to align next_fd to
> BITS_PER_LONG for the bitmap search purposes.
Some misunderstanding here, Guzik; I thought your previous feedback meant
you felt the fast path was not worthwhile, so that whole message was just to
say we still think the original idea is reasonable. Back to the point here:
implementing it in find_next_fd() by searching the word containing next_fd
makes sense and is OK to me (roughly along the lines of the sketch below).
It's efficient, concise and should bring us the expected benefits. I'll
re-measure the data for reference based on the code proposed by you and
Honza.
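A sketch of that shape (assuming max_fds is always a multiple of BITS_PER_LONG,
as the fdtable sizing arranges; this is not the final posted patch):

	static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
	{
		unsigned int maxfd = fdt->max_fds;
		unsigned int maxbit = maxfd / BITS_PER_LONG;
		unsigned int bitbit = start / BITS_PER_LONG;
		unsigned int bit;

		/* Peek at the open_fds word containing 'start' first, to avoid
		 * touching the second-level full_fds_bits bitmap in the common case. */
		bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
					 start & (BITS_PER_LONG - 1));
		if (bit < BITS_PER_LONG)
			return bit + bitbit * BITS_PER_LONG;

		/* Fall back to the stock two-level search. */
		bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
		if (bitbit >= maxfd)
			return maxfd;
		if (bitbit > start)
			start = bitbit;
		return find_next_zero_bit(fdt->open_fds, maxfd, start);
	}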
> Outside of the scope of this patchset, but definitely worth
> considering, is an observation that this still pulls an entire
> cacheline worth of a bitmap (assuming it grew). If one could assume
> that the size is always a multiply of 64 bytes (which it is after
> first expansion) the 64 byte scan could be entirely inlined -- there
> is quite a bit of fd fields in this range we may as well scan in hopes
> of avoiding looking at the higher level bitmap, after all we already
> paid for fetching it. This would take the optimization to its logical
> conclusion.
>
> Perhaps it would be ok to special-case the lower bitmap to start with
> 64 bytes so that there would be no need to branch on it.
On Thu 27-06-24 21:59:12, Mateusz Guzik wrote:
> On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote:
> > 2. For fast path implementation, the essential and simple point is to
> > directly return an available bit if there is free bit in [0-63]. I'd
> > emphasize that it does not only improve low number of open fds (even it
> > is the majority case on system as Honza agreed), but also improve the
> > cases that lots of fds open/close frequently with short task (as per the
> > algorithm, lower bits will be prioritized to allocate after being
> > recycled). Not only blogbench, a synthetic benchmark, but also the
> > realistic scenario as claimed in f3f86e33dc3d("vfs: Fix pathological
> > performance case for __alloc_fd()"), which literally introduced this
> > 2-levels bitmap searching algorithm to vfs as we see now.
>
> I don't understand how using next_fd instead is supposed to be inferior.
>
> Maybe I should clarify that by API contract the kernel must return the
> lowest free fd it can find. To that end it maintains the next_fd field
> as a hint to hopefully avoid some of the search work.
>
> In the stock kernel the first thing done in alloc_fd is setting it as
> a starting point:
> fdt = files_fdtable(files);
> fd = start;
> if (fd < files->next_fd)
> fd = files->next_fd;
>
> that is all the calls which come here with 0 start their search from
> next_fd position.
Yup.
> Suppose you implemented the patch as suggested by me and next_fd fits
> the range of 0-63. Then you get the benefit of lower level bitmap
> check just like in the patch you submitted, but without having to
> first branch on whether you happen to be in that range.
>
> Suppose next_fd is somewhere higher up, say 80. With your general
> approach the optimization wont be done whatsoever or it will be
> attempted at the 0-63 range when it is an invariant it finds no free
> fds.
>
> With what I'm suggesting the general idea of taking a peek at the
> lower level bitmap can be applied across the entire fd space. Some
> manual mucking will be needed to make sure this never pulls more than
> one cacheline, easiest way out I see would be to align next_fd to
> BITS_PER_LONG for the bitmap search purposes.
Well, all you need to do is to call:
bit = find_next_zero_bit(fdt->open_fds[start / BITS_PER_LONG],
BITS_PER_LONG, start & (BITS_PER_LONG-1));
if (bit < BITS_PER_LONG)
return bit + (start & ~(BITS_PER_LONG - 1));
in find_next_fd(). Not sure if this is what you meant by aligning next_fd
to BITS_PER_LONG...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 6/28/2024 5:12 PM, Jan Kara wrote:
> On Thu 27-06-24 21:59:12, Mateusz Guzik wrote:
>> On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote:
>>> 2. For fast path implementation, the essential and simple point is to
>>> directly return an available bit if there is free bit in [0-63]. I'd
>>> emphasize that it does not only improve low number of open fds (even it
>>> is the majority case on system as Honza agreed), but also improve the
>>> cases that lots of fds open/close frequently with short task (as per the
>>> algorithm, lower bits will be prioritized to allocate after being
>>> recycled). Not only blogbench, a synthetic benchmark, but also the
>>> realistic scenario as claimed in f3f86e33dc3d("vfs: Fix pathological
>>> performance case for __alloc_fd()"), which literally introduced this
>>> 2-levels bitmap searching algorithm to vfs as we see now.
>> I don't understand how using next_fd instead is supposed to be inferior.
>>
>> Maybe I should clarify that by API contract the kernel must return the
>> lowest free fd it can find. To that end it maintains the next_fd field
>> as a hint to hopefully avoid some of the search work.
>>
>> In the stock kernel the first thing done in alloc_fd is setting it as
>> a starting point:
>> fdt = files_fdtable(files);
>> fd = start;
>> if (fd < files->next_fd)
>> fd = files->next_fd;
>>
>> that is all the calls which come here with 0 start their search from
>> next_fd position.
> Yup.
>
>> Suppose you implemented the patch as suggested by me and next_fd fits
>> the range of 0-63. Then you get the benefit of lower level bitmap
>> check just like in the patch you submitted, but without having to
>> first branch on whether you happen to be in that range.
>>
>> Suppose next_fd is somewhere higher up, say 80. With your general
>> approach the optimization wont be done whatsoever or it will be
>> attempted at the 0-63 range when it is an invariant it finds no free
>> fds.
>>
>> With what I'm suggesting the general idea of taking a peek at the
>> lower level bitmap can be applied across the entire fd space. Some
>> manual mucking will be needed to make sure this never pulls more than
>> one cacheline, easiest way out I see would be to align next_fd to
>> BITS_PER_LONG for the bitmap search purposes.
> Well, all you need to do is to call:
>
> bit = find_next_zero_bit(fdt->open_fds[start / BITS_PER_LONG],
> BITS_PER_LONG, start & (BITS_PER_LONG-1));
> if (bit < BITS_PER_LONG)
> return bit + (start & ~(BITS_PER_LONG - 1));
>
>
> in find_next_fd(). Not sure if this is what you meant by aligning next_fd
> to BITS_PER_LONG...
>
> Honza
So the idea here is to take a peek at the word containing next_fd and return
directly if a free bit is found there. Not sure about the efficiency if fds
are opened/closed frequently and next_fd is distributed very randomly. Just
gave this part of the code a quick try; it seems the kernel failed to boot
up 😳, kind of unexpected ...
How about you post a working version of the patchset, even if it only
looks at 0-63 and we are going to massage it afterwards.
On Sat, Jun 29, 2024 at 5:41 PM Ma, Yu <yu.ma@intel.com> wrote:
>
>
> On 6/28/2024 5:12 PM, Jan Kara wrote:
> > On Thu 27-06-24 21:59:12, Mateusz Guzik wrote:
> >> On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote:
> >>> 2. For fast path implementation, the essential and simple point is to
> >>> directly return an available bit if there is free bit in [0-63]. I'd
> >>> emphasize that it does not only improve low number of open fds (even it
> >>> is the majority case on system as Honza agreed), but also improve the
> >>> cases that lots of fds open/close frequently with short task (as per the
> >>> algorithm, lower bits will be prioritized to allocate after being
> >>> recycled). Not only blogbench, a synthetic benchmark, but also the
> >>> realistic scenario as claimed in f3f86e33dc3d("vfs: Fix pathological
> >>> performance case for __alloc_fd()"), which literally introduced this
> >>> 2-levels bitmap searching algorithm to vfs as we see now.
> >> I don't understand how using next_fd instead is supposed to be inferior.
> >>
> >> Maybe I should clarify that by API contract the kernel must return the
> >> lowest free fd it can find. To that end it maintains the next_fd field
> >> as a hint to hopefully avoid some of the search work.
> >>
> >> In the stock kernel the first thing done in alloc_fd is setting it as
> >> a starting point:
> >> fdt = files_fdtable(files);
> >> fd = start;
> >> if (fd < files->next_fd)
> >> fd = files->next_fd;
> >>
> >> that is all the calls which come here with 0 start their search from
> >> next_fd position.
> > Yup.
> >
> >> Suppose you implemented the patch as suggested by me and next_fd fits
> >> the range of 0-63. Then you get the benefit of lower level bitmap
> >> check just like in the patch you submitted, but without having to
> >> first branch on whether you happen to be in that range.
> >>
> >> Suppose next_fd is somewhere higher up, say 80. With your general
> >> approach the optimization wont be done whatsoever or it will be
> >> attempted at the 0-63 range when it is an invariant it finds no free
> >> fds.
> >>
> >> With what I'm suggesting the general idea of taking a peek at the
> >> lower level bitmap can be applied across the entire fd space. Some
> >> manual mucking will be needed to make sure this never pulls more than
> >> one cacheline, easiest way out I see would be to align next_fd to
> >> BITS_PER_LONG for the bitmap search purposes.
> > Well, all you need to do is to call:
> >
> > bit = find_next_zero_bit(fdt->open_fds[start / BITS_PER_LONG],
> > BITS_PER_LONG, start & (BITS_PER_LONG-1));
> > if (bit < BITS_PER_LONG)
> > return bit + (start & ~(BITS_PER_LONG - 1));
> >
> >
> > in find_next_fd(). Not sure if this is what you meant by aligning next_fd
> > to BITS_PER_LONG...
> >
> > Honza
>
> So the idea here is to take a peek at the word contains next_fd, return
> directly if get free bit there. Not sure about the efficiency here if
> open/close fd frequently and next_fd is distributed very randomly. Just
> give a quick try for this part of code, seems kernel failed to boot
> up😳, kind of out of expectation ...
>
--
Mateusz Guzik <mjguzik gmail.com>
On Wed 26-06-24 21:13:07, Mateusz Guzik wrote:
> On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@suse.cz> wrote:
> > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > into cache. So I'm afraid that using full_fds_bits in the condition would
> > destroy your performance gains. Thinking about this with a fresh head how
> > about putting implementing your optimization like:
> >
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > unsigned int bitbit = start / BITS_PER_LONG;
> >
> > + /*
> > + * Optimistically search the first long of the open_fds bitmap. It
> > + * saves us from loading full_fds_bits into cache in the common case
> > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > + * a good chance there's a bit free in there.
> > + */
> > + if (start < BITS_PER_LONG) {
> > + unsigned int bit;
> > +
> > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > + if (bit < BITS_PER_LONG)
> > + return bit;
> > + }
> > +
> > bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> > if (bitbit >= maxfd)
> > return maxfd;
> >
> > Plus your optimizations with likely / unlikely. This way the code flow in
> > alloc_fd() stays more readable, we avoid loading the first open_fds long
> > into cache if it is full, and we should get all the performance benefits?
> >
>
> Huh.
>
> So when I read the patch previously I assumed this is testing the bit
> word for the map containing next_fd (whatever it is), avoiding looking
> at the higher level bitmap and inlining the op (instead of calling the
> fully fledged func for bit scans).
>
> I did not mentally register this is in fact only checking for the
> beginning of the range of the entire thing. So apologies from my end
> as based on my feedback some work was done and I'm going to ask to
> further redo it.
Well, just confirming the fact that the way the code was written was
somewhat confusing ;)
> blogbench spawns 100 or so workers, say total fd count hovers just
> above 100. say this lines up with about half of more cases in practice
> for that benchmark.
>
> Even so, that's a benchmark-specific optimization. A busy web server
> can have literally tens of thousands of fds open (and this is a pretty
> mundane case), making the 0-63 range not particularly interesting.
I agree this optimization only helps processes with a low number of open fds.
On the other hand that is usually the majority of the processes on the
system so the optimization makes sense to me. That being said your idea of
searching the word with next_fd makes sense as well...
> That aside I think the patchset is in the wrong order -- first patch
> tries to not look at the higher level bitmap, while second reduces
> stores made there. This makes it quite unclear how much is it worth to
> reduce looking there if atomics are conditional.
>
> So here is what I propose in terms of the patches:
> 1. NULL check removal, sprinkling of likely/unlikely and expand_files
> call avoidance; no measurements done vs stock kernel for some effort
> saving, just denote in the commit message there is less work under the
> lock and treat it as baseline
> 2. conditional higher level bitmap clear as submitted; benchmarked against 1
> 3. open_fds check within the range containing fd, avoiding higher
> level bitmap if a free slot is found. this should not result in any
> func calls if successful; benchmarked against the above
Yeah, I guess this ordering is the most obvious -> the least obvious so it
makes sense to me.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote:
>
>
> Indeed, thanks for correcting me! next_fd is just a lower bound for the
> first free fd.
>
> > The conditions
> > should either be like it is in patch or if (!start && !test_bit(0,
> > fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> > but another point is that a bit in full_fds_bits represents 64 bits in
> > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> > maybe we can modify the condition to use full_fds_bits ?
>
> So maybe I'm wrong but I think the biggest benefit of your code compared to
> plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> into cache. So I'm afraid that using full_fds_bits in the condition would
> destroy your performance gains. Thinking about this with a fresh head how
> about putting implementing your optimization like:
>
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> unsigned int maxbit = maxfd / BITS_PER_LONG;
> unsigned int bitbit = start / BITS_PER_LONG;
>
> + /*
> + * Optimistically search the first long of the open_fds bitmap. It
> + * saves us from loading full_fds_bits into cache in the common case
> + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> + * a good chance there's a bit free in there.
> + */
> + if (start < BITS_PER_LONG) {
> + unsigned int bit;
> +
> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
Say start is 31 (< BITS_PER_LONG)
bit found here could be 32 and greater than start. Do we care if we return bit > start?
Tim
> + if (bit < BITS_PER_LONG)
> + return bit;
> + }
> +
> bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> if (bitbit >= maxfd)
> return maxfd;
>
> Plus your optimizations with likely / unlikely. This way the code flow in
> alloc_fd() stays more readable, we avoid loading the first open_fds long
> into cache if it is full, and we should get all the performance benefits?
>
> Honza
>
>
> > > > > + goto fastreturn;
> > > > > + }
> > > > > fd = find_next_fd(fdt, fd);
> > > > > + }
> > > > > +
> > > > > + if (unlikely(fd >= fdt->max_fds)) {
> > > > > + error = expand_files(files, fd);
> > > > > + if (error < 0)
> > > > > + goto out;
> > > > > + /*
> > > > > + * If we needed to expand the fs array we
> > > > > + * might have blocked - try again.
> > > > > + */
> > > > > + if (error)
> > > > > + goto repeat;
> > > > > + }
> > > > > +fastreturn:
> > > > > /*
> > > > > * N.B. For clone tasks sharing a files structure, this test
> > > > > * will limit the total number of files that can be opened.
> > > > > */
> > > > > - error = -EMFILE;
> > > > > - if (fd >= end)
> > > > > + if (unlikely(fd >= end))
> > > > > goto out;
> > > > > - error = expand_files(files, fd);
> > > > > - if (error < 0)
> > > > > - goto out;
> > > > > -
> > > > > - /*
> > > > > - * If we needed to expand the fs array we
> > > > > - * might have blocked - try again.
> > > > > - */
> > > > > - if (error)
> > > > > - goto repeat;
> > > > > -
> > > > > if (start <= files->next_fd)
> > > > > files->next_fd = fd + 1;
> > > > > --
> > > > > 2.43.0
> > > > >
> > > > --
> > > > Jan Kara <jack@suse.com>
> > > > SUSE Labs, CR
> > > >
> >
On Wed, 2024-06-26 at 09:43 -0700, Tim Chen wrote:
> On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote:
> >
> >
> > Indeed, thanks for correcting me! next_fd is just a lower bound for the
> > first free fd.
> >
> > > The conditions
> > > should either be like it is in patch or if (!start && !test_bit(0,
> > > fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> > > but another point is that a bit in full_fds_bits represents 64 bits in
> > > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> > > maybe we can modify the condition to use full_fds_bits ?
> >
> > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > into cache. So I'm afraid that using full_fds_bits in the condition would
> > destroy your performance gains. Thinking about this with a fresh head how
> > about implementing your optimization like:
> >
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > unsigned int bitbit = start / BITS_PER_LONG;
> >
> > + /*
> > + * Optimistically search the first long of the open_fds bitmap. It
> > + * saves us from loading full_fds_bits into cache in the common case
> > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > + * a good chance there's a bit free in there.
> > + */
> > + if (start < BITS_PER_LONG) {
> > + unsigned int bit;
> > +
> > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
>
> Say start is 31 (< BITS_PER_LONG)
> bit found here could be 32 and greater than start. Do we care if we return bit > start?
Sorry, I mean to say that we could find a bit like 30 that is less than start instead
of the other way round.
Tim
>
On Wed 26-06-24 09:52:50, Tim Chen wrote:
> On Wed, 2024-06-26 at 09:43 -0700, Tim Chen wrote:
> > On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote:
> > >
> > >
> > > Indeed, thanks for correcting me! next_fd is just a lower bound for the
> > > first free fd.
> > >
> > > > The conditions
> > > > should either be like it is in patch or if (!start && !test_bit(0,
> > > > fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> > > > but another point is that a bit in full_fds_bits represents 64 bits in
> > > > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> > > > maybe we can modify the condition to use full_fds_bits ?
> > >
> > > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > > into cache. So I'm afraid that using full_fds_bits in the condition would
> > > destroy your performance gains. Thinking about this with a fresh head how
> > > about implementing your optimization like:
> > >
> > > --- a/fs/file.c
> > > +++ b/fs/file.c
> > > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > > unsigned int bitbit = start / BITS_PER_LONG;
> > >
> > > + /*
> > > + * Optimistically search the first long of the open_fds bitmap. It
> > > + * saves us from loading full_fds_bits into cache in the common case
> > > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > > + * a good chance there's a bit free in there.
> > > + */
> > > + if (start < BITS_PER_LONG) {
> > > + unsigned int bit;
> > > +
> > > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> >
> > Say start is 31 (< BITS_PER_LONG)
> > bit found here could be 32 and greater than start. Do we care if we return bit > start?
>
> Sorry, I mean to say that we could find a bit like 30 that is less than
> start instead of the other way round.
Well, I propose calling find_next_zero_bit() with offset set to 'start' so
it cannot possibly happen that the returned bit number is smaller than
start... But maybe I'm missing something?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
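For readers less familiar with the bitops API: find_next_zero_bit(addr, size, offset) scans upward from offset and returns the first clear bit at or above offset, or size if the range [offset, size) is fully set, which is why the result can never be below the start value Jan passes in. Below is a tiny userspace analog, purely for illustration; the helper name next_zero_bit() and the bitmap contents are made up here and this is not kernel code:

/*
 * Userspace analog of find_next_zero_bit(): scan upward from 'offset' and
 * return the first clear bit, or 'size' if none exists in [offset, size).
 */
#include <stdio.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

static unsigned long next_zero_bit(const unsigned long *addr,
				   unsigned long size, unsigned long offset)
{
	for (; offset < size; offset++)
		if (!(addr[offset / BITS_PER_LONG] &
		      (1UL << (offset % BITS_PER_LONG))))
			return offset;
	return size;
}

int main(void)
{
	unsigned long open_fds[1] = { 0xffffffffUL };	/* fds 0-31 busy */

	/* offset 31: the first free bit at or above 31 is 32, never below. */
	printf("%lu\n", next_zero_bit(open_fds, BITS_PER_LONG, 31));
	return 0;
}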
On Thu, 2024-06-27 at 14:09 +0200, Jan Kara wrote:
> On Wed 26-06-24 09:52:50, Tim Chen wrote:
> > On Wed, 2024-06-26 at 09:43 -0700, Tim Chen wrote:
> > > On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote:
> > > >
> > > >
> > > > Indeed, thanks for correcting me! next_fd is just a lower bound for the
> > > > first free fd.
> > > >
> > > > > The conditions
> > > > > should either be like it is in patch or if (!start && !test_bit(0,
> > > > > fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> > > > > but another point is that a bit in full_fds_bits represents 64 bits in
> > > > > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> > > > > maybe we can modify the condition to use full_fds_bits ?
> > > >
> > > > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > > > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > > > into cache. So I'm afraid that using full_fds_bits in the condition would
> > > > destroy your performance gains. Thinking about this with a fresh head how
> > > > about implementing your optimization like:
> > > >
> > > > --- a/fs/file.c
> > > > +++ b/fs/file.c
> > > > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > > > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > > > unsigned int bitbit = start / BITS_PER_LONG;
> > > >
> > > > + /*
> > > > + * Optimistically search the first long of the open_fds bitmap. It
> > > > + * saves us from loading full_fds_bits into cache in the common case
> > > > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > > > + * a good chance there's a bit free in there.
> > > > + */
> > > > + if (start < BITS_PER_LONG) {
> > > > + unsigned int bit;
> > > > +
> > > > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > >
> > > Say start is 31 (< BITS_PER_LONG)
> > > bit found here could be 32 and greater than start. Do we care if we return bit > start?
> >
> > Sorry, I mean to say that we could find a bit like 30 that is less than
> > start instead of the other way round.
>
> Well, I propose calling find_next_zero_bit() with offset set to 'start' so
> it cannot possibly happen that the returned bit number is smaller than
> start... But maybe I'm missing something?
Yes, you're right. The search begins at the start bit, so it is okay.
Tim
>
> Honza
On Thu, Jun 27, 2024 at 2:09 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 26-06-24 09:52:50, Tim Chen wrote:
> > On Wed, 2024-06-26 at 09:43 -0700, Tim Chen wrote:
> > > On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote:
> > > >
> > > >
> > > > Indeed, thanks for correcting me! next_fd is just a lower bound for the
> > > > first free fd.
> > > >
> > > > > The conditions
> > > > > should either be like it is in patch or if (!start && !test_bit(0,
> > > > > fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> > > > > but another point is that a bit in full_fds_bits represents 64 bits in
> > > > > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> > > > > maybe we can modify the condition to use full_fds_bits ?
> > > >
> > > > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > > > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > > > into cache. So I'm afraid that using full_fds_bits in the condition would
> > > > destroy your performance gains. Thinking about this with a fresh head how
> > > > about implementing your optimization like:
> > > >
> > > > --- a/fs/file.c
> > > > +++ b/fs/file.c
> > > > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > > > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > > > unsigned int bitbit = start / BITS_PER_LONG;
> > > >
> > > > + /*
> > > > + * Optimistically search the first long of the open_fds bitmap. It
> > > > + * saves us from loading full_fds_bits into cache in the common case
> > > > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > > > + * a good chance there's a bit free in there.
> > > > + */
> > > > + if (start < BITS_PER_LONG) {
> > > > + unsigned int bit;
> > > > +
> > > > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > >
> > > Say start is 31 (< BITS_PER_LONG)
> > > bit found here could be 32 and greater than start. Do we care if we return bit > start?
> >
> > Sorry, I mean to say that we could find a bit like 30 that is less than
> > start instead of the other way round.
>
> Well, I propose calling find_next_zero_bit() with offset set to 'start' so
> it cannot possibly happen that the returned bit number is smaller than
> start... But maybe I'm missing something?
>
You gate it with " if (start < BITS_PER_LONG)" which only covers the
small initial range, while I'm arguing this should work for any fd.
--
Mateusz Guzik <mjguzik gmail.com>
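One possible shape of that more general fast path, sketched here only to illustrate the direction under discussion rather than the code that was eventually posted: probe the open_fds word containing 'start' before consulting full_fds_bits, so the shortcut applies wherever next_fd happens to sit. The sketch assumes, as in alloc_fd(), that find_next_fd() is only called with start below fdt->max_fds:

static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
{
	unsigned int maxfd = fdt->max_fds;
	unsigned int maxbit = maxfd / BITS_PER_LONG;
	unsigned int bitbit = start / BITS_PER_LONG;
	unsigned int bit;

	/*
	 * Optimistically search only the open_fds word containing 'start';
	 * if it has a free slot, full_fds_bits is never touched at all.
	 */
	bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
				 start & (BITS_PER_LONG - 1));
	if (bit < BITS_PER_LONG)
		return bit + bitbit * BITS_PER_LONG;

	/* Slow path: use the summary bitmap to skip over full words. */
	bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
	if (bitbit >= maxfd)
		return maxfd;
	return find_next_zero_bit(fdt->open_fds, maxfd, bitbit);
}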
64 bits in open_fds are mapped to a common bit in full_fds_bits. It is very
likely that the corresponding bit in full_fds_bits has already been cleared by
an earlier __clear_open_fd() call. Check whether the bit in full_fds_bits is
already clear before clearing it, to avoid an unnecessary write and the cache
bouncing it causes. See commit fc90888d07b8 ("vfs: conditionally clear
close-on-exec flag") for a similar optimization. Together with patch 1, this
improves pts/blogbench-1.1.0 read by 27% and write by 14% on an Intel ICX
160-core configuration with v6.10-rc4.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/file.c b/fs/file.c
index 50e900a47107..b4d25f6d4c19 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -268,7 +268,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
{
__clear_bit(fd, fdt->open_fds);
- __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
+ fd /= BITS_PER_LONG;
+ if (test_bit(fd, fdt->full_fds_bits))
+ __clear_bit(fd, fdt->full_fds_bits);
}
static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
--
2.43.0
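To put numbers on the mapping described above (an illustration added here, not part of the patch): each bit of full_fds_bits summarizes one word of open_fds, so a single unsigned long of full_fds_bits covers 64 * 64 = 4096 fds, and closing fd 130 touches open_fds word 2 and, only when that word was marked full, bit 2 of full_fds_bits. The hypothetical helper below just restates that arithmetic; __clear_bit() itself is a plain read-modify-write, which is why skipping it when the bit is already clear keeps the common case read-only:

/* Hypothetical helper, not in fs/file.c: does the word covering 'fd'
 * currently claim to be full?  E.g. fd 130 -> bit 130 / 64 == 2. */
static inline bool covering_word_marked_full(unsigned int fd,
					     const struct fdtable *fdt)
{
	return test_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
}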
On Sat 22-06-24 11:49:03, Yu Ma wrote:
> 64 bits in open_fds are mapped to a common bit in full_fds_bits. It is very
> likely that the corresponding bit in full_fds_bits has already been cleared by
> an earlier __clear_open_fd() call. Check whether the bit in full_fds_bits is
> already clear before clearing it, to avoid an unnecessary write and the cache
> bouncing it causes. See commit fc90888d07b8 ("vfs: conditionally clear
> close-on-exec flag") for a similar optimization. Together with patch 1, this
> improves pts/blogbench-1.1.0 read by 27% and write by 14% on an Intel ICX
> 160-core configuration with v6.10-rc4.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
Nice. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/file.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index 50e900a47107..b4d25f6d4c19 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -268,7 +268,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
> static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
> {
> __clear_bit(fd, fdt->open_fds);
> - __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
> + fd /= BITS_PER_LONG;
> + if (test_bit(fd, fdt->full_fds_bits))
> + __clear_bit(fd, fdt->full_fds_bits);
> }
>
> static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 6/25/2024 7:54 PM, Jan Kara wrote:
> On Sat 22-06-24 11:49:03, Yu Ma wrote:
>> 64 bits in open_fds are mapped to a common bit in full_fds_bits. It is very
>> likely that the corresponding bit in full_fds_bits has already been cleared by
>> an earlier __clear_open_fd() call. Check whether the bit in full_fds_bits is
>> already clear before clearing it, to avoid an unnecessary write and the cache
>> bouncing it causes. See commit fc90888d07b8 ("vfs: conditionally clear
>> close-on-exec flag") for a similar optimization. Together with patch 1, this
>> improves pts/blogbench-1.1.0 read by 27% and write by 14% on an Intel ICX
>> 160-core configuration with v6.10-rc4.
>>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
> Nice. Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
> Honza
Copy that, thanks Honza :)
>> ---
>> fs/file.c | 4 +++-
>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/file.c b/fs/file.c
>> index 50e900a47107..b4d25f6d4c19 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -268,7 +268,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
>> static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
>> {
>> __clear_bit(fd, fdt->open_fds);
>> - __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
>> + fd /= BITS_PER_LONG;
>> + if (test_bit(fd, fdt->full_fds_bits))
>> + __clear_bit(fd, fdt->full_fds_bits);
>> }
>>
>> static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
>> --
>> 2.43.0
>>
alloc_fd() has a sanity check inside to make sure the struct file mapping to the
allocated fd is NULL. Remove this sanity check since it can be assured by
exisitng zero initilization and NULL set when recycling fd.
Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
32%, write improved by 17% on Intel ICX 160 cores configuration with v6.10-rc4.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index b4d25f6d4c19..1153b0b7ba3d 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -555,13 +555,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
else
__clear_close_on_exec(fd, fdt);
error = fd;
-#if 1
- /* Sanity check */
- if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
- printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
- rcu_assign_pointer(fdt->fd[fd], NULL);
- }
-#endif
out:
spin_unlock(&files->file_lock);
--
2.43.0
On Sat 22-06-24 11:49:04, Yu Ma wrote:
> alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> allocated fd is NULL. Remove this sanity check since it can be assured by
> exisitng zero initilization and NULL set when recycling fd.
^^^ existing ^^^ initialization
Well, since this is a sanity check, it is expected it never hits. Yet
searching the web shows it has hit a few times in the past :). So would
wrapping this with unlikely() give a similar performance gain while keeping
debugability? If unlikely() does not help, I agree we can remove this since
fd_install() actually has the same check:
BUG_ON(fdt->fd[fd] != NULL);
and there we need the cacheline anyway so performance impact is minimal.
Now, this condition in alloc_fd() is nice that it does not take the kernel
down so perhaps we could change the BUG_ON to WARN() dumping similar kind
of info as alloc_fd()?
Honza
> Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> 32%, write improved by 17% on Intel ICX 160 cores configuration with v6.10-rc4.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
> fs/file.c | 7 -------
> 1 file changed, 7 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index b4d25f6d4c19..1153b0b7ba3d 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -555,13 +555,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> else
> __clear_close_on_exec(fd, fdt);
> error = fd;
> -#if 1
> - /* Sanity check */
> - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> - rcu_assign_pointer(fdt->fd[fd], NULL);
> - }
> -#endif
>
> out:
> spin_unlock(&files->file_lock);
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Tue, Jun 25, 2024 at 2:08 PM Jan Kara <jack@suse.cz> wrote:
>
> On Sat 22-06-24 11:49:04, Yu Ma wrote:
> > alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> > allocated fd is NULL. Remove this sanity check since it can be assured by
> > exisitng zero initilization and NULL set when recycling fd.
> ^^^ existing ^^^ initialization
>
> Well, since this is a sanity check, it is expected it never hits. Yet
> searching the web shows it has hit a few times in the past :). So would
> wrapping this with unlikely() give a similar performance gain while keeping
> debugability? If unlikely() does not help, I agree we can remove this since
> fd_install() actually has the same check:
>
> BUG_ON(fdt->fd[fd] != NULL);
>
> and there we need the cacheline anyway so performance impact is minimal.
> Now, this condition in alloc_fd() is nice that it does not take the kernel
> down so perhaps we could change the BUG_ON to WARN() dumping similar kind
> of info as alloc_fd()?
>
Christian suggested just removing it.
To my understanding the problem is not the branch per se, but the
cacheline bounce of the fd array induced by reading the status.
Note the thing also nullifies the pointer, kind of defeating the
BUG_ON in fd_install.
I'm guessing it's not going to hurt to branch on it after releasing
the lock and forego nullifying, more or less:
diff --git a/fs/file.c b/fs/file.c
index a3b72aa64f11..d22b867db246 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -524,11 +524,11 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 	 */
 	error = -EMFILE;
 	if (fd >= end)
-		goto out;
+		goto out_locked;
 	error = expand_files(files, fd);
 	if (error < 0)
-		goto out;
+		goto out_locked;
 	/*
 	 * If we needed to expand the fs array we
@@ -546,15 +546,15 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 	else
 		__clear_close_on_exec(fd, fdt);
 	error = fd;
-#if 1
-	/* Sanity check */
-	if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
+	spin_unlock(&files->file_lock);
+
+	if (unlikely(rcu_access_pointer(fdt->fd[fd]) != NULL)) {
 		printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
-		rcu_assign_pointer(fdt->fd[fd], NULL);
 	}
-#endif
-out:
+	return error;
+
+out_locked:
 	spin_unlock(&files->file_lock);
 	return error;
 }
> Honza
>
> > Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> > 32%, write improved by 17% on Intel ICX 160 cores configuration with v6.10-rc4.
> >
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> > fs/file.c | 7 -------
> > 1 file changed, 7 deletions(-)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index b4d25f6d4c19..1153b0b7ba3d 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -555,13 +555,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > else
> > __clear_close_on_exec(fd, fdt);
> > error = fd;
> > -#if 1
> > - /* Sanity check */
> > - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> > - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> > - rcu_assign_pointer(fdt->fd[fd], NULL);
> > - }
> > -#endif
> >
> > out:
> > spin_unlock(&files->file_lock);
> > --
> > 2.43.0
> >
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
--
Mateusz Guzik <mjguzik gmail.com>
On Tue, Jun 25, 2024 at 3:09 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Tue, Jun 25, 2024 at 2:08 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Sat 22-06-24 11:49:04, Yu Ma wrote:
> > > alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> > > allocated fd is NULL. Remove this sanity check since it can be assured by
> > > exisitng zero initilization and NULL set when recycling fd.
> > ^^^ existing ^^^ initialization
> >
> > Well, since this is a sanity check, it is expected it never hits. Yet
> > searching the web shows it has hit a few times in the past :). So would
> > wrapping this with unlikely() give a similar performance gain while keeping
> > debugability? If unlikely() does not help, I agree we can remove this since
> > fd_install() actually has the same check:
> >
> > BUG_ON(fdt->fd[fd] != NULL);
> >
> > and there we need the cacheline anyway so performance impact is minimal.
> > Now, this condition in alloc_fd() is nice that it does not take the kernel
> > down so perhaps we could change the BUG_ON to WARN() dumping similar kind
> > of info as alloc_fd()?
> >
>
> Christian suggested just removing it.
>
> To my understanding the problem is not the branch per se, but the
> cacheline bounce of the fd array induced by reading the status.
>
> Note the thing also nullifies the pointer, kind of defeating the
> BUG_ON in fd_install.
>
> I'm guessing it's not going to hurt to branch on it after releasing
> the lock and forego nullifying, more or less:
> diff --git a/fs/file.c b/fs/file.c
> index a3b72aa64f11..d22b867db246 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -524,11 +524,11 @@ static int alloc_fd(unsigned start, unsigned
> end, unsigned flags)
> */
> error = -EMFILE;
> if (fd >= end)
> - goto out;
> + goto out_locked;
>
> error = expand_files(files, fd);
> if (error < 0)
> - goto out;
> + goto out_locked;
>
> /*
> * If we needed to expand the fs array we
> @@ -546,15 +546,15 @@ static int alloc_fd(unsigned start, unsigned
> end, unsigned flags)
> else
> __clear_close_on_exec(fd, fdt);
> error = fd;
> -#if 1
> - /* Sanity check */
> - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> + spin_unlock(&files->file_lock);
> +
> + if (unlikely(rcu_access_pointer(fdt->fd[fd]) != NULL)) {
> printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> - rcu_assign_pointer(fdt->fd[fd], NULL);
> }
> -#endif
Now that I sent it, it is of course not safe to deref without
protection from either rcu or the lock, so this would have to be
wrapped with rcu_read_lock, which makes it even less appealing.
Whacking the thing as in the submitted patch seems like the best way
forward here. :)
>
> -out:
> + return error;
> +
> +out_locked:
> spin_unlock(&files->file_lock);
> return error;
> }
>
>
>
> > Honza
> >
> > > Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> > > 32%, write improved by 17% on Intel ICX 160 cores configuration with v6.10-rc4.
> > >
> > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > > ---
> > > fs/file.c | 7 -------
> > > 1 file changed, 7 deletions(-)
> > >
> > > diff --git a/fs/file.c b/fs/file.c
> > > index b4d25f6d4c19..1153b0b7ba3d 100644
> > > --- a/fs/file.c
> > > +++ b/fs/file.c
> > > @@ -555,13 +555,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > > else
> > > __clear_close_on_exec(fd, fdt);
> > > error = fd;
> > > -#if 1
> > > - /* Sanity check */
> > > - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> > > - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> > > - rcu_assign_pointer(fdt->fd[fd], NULL);
> > > - }
> > > -#endif
> > >
> > > out:
> > > spin_unlock(&files->file_lock);
> > > --
> > > 2.43.0
> > >
> > --
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
>
>
>
> --
> Mateusz Guzik <mjguzik gmail.com>
--
Mateusz Guzik <mjguzik gmail.com>
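For completeness, the rough shape the deferred check would have to take (a sketch only, never posted as a patch): once file_lock is dropped, the fdtable has to be re-looked-up under rcu_read_lock(), which is exactly what makes the variant above less appealing:

	/* Sketch of the tail of alloc_fd() with the check moved past unlock. */
	error = fd;
	spin_unlock(&files->file_lock);

	/*
	 * With file_lock dropped the fdtable may be resized concurrently,
	 * so the sanity check needs its own RCU section and a fresh lookup.
	 */
	rcu_read_lock();
	fdt = files_fdtable(files);
	if (unlikely(rcu_access_pointer(fdt->fd[fd]) != NULL))
		printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
	rcu_read_unlock();

	return error;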
On Tue 25-06-24 15:11:23, Mateusz Guzik wrote:
> On Tue, Jun 25, 2024 at 3:09 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > On Tue, Jun 25, 2024 at 2:08 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Sat 22-06-24 11:49:04, Yu Ma wrote:
> > > > alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> > > > allocated fd is NULL. Remove this sanity check since it can be assured by
> > > > exisitng zero initilization and NULL set when recycling fd.
> > > ^^^ existing ^^^ initialization
> > >
> > > Well, since this is a sanity check, it is expected it never hits. Yet
> > > searching the web shows it has hit a few times in the past :). So would
> > > wrapping this with unlikely() give a similar performance gain while keeping
> > > debugability? If unlikely() does not help, I agree we can remove this since
> > > fd_install() actually has the same check:
> > >
> > > BUG_ON(fdt->fd[fd] != NULL);
> > >
> > > and there we need the cacheline anyway so performance impact is minimal.
> > > Now, this condition in alloc_fd() is nice that it does not take the kernel
> > > down so perhaps we could change the BUG_ON to WARN() dumping similar kind
> > > of info as alloc_fd()?
> > >
> >
> > Christian suggested just removing it.
> >
> > To my understanding the problem is not the branch per se, but the
> > cacheline bounce of the fd array induced by reading the status.
> >
> > Note the thing also nullifies the pointer, kind of defeating the
> > BUG_ON in fd_install.
> >
> > I'm guessing it's not going to hurt to branch on it after releasing
> > the lock and forego nullifying, more or less:
> > diff --git a/fs/file.c b/fs/file.c
> > index a3b72aa64f11..d22b867db246 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -524,11 +524,11 @@ static int alloc_fd(unsigned start, unsigned
> > end, unsigned flags)
> > */
> > error = -EMFILE;
> > if (fd >= end)
> > - goto out;
> > + goto out_locked;
> >
> > error = expand_files(files, fd);
> > if (error < 0)
> > - goto out;
> > + goto out_locked;
> >
> > /*
> > * If we needed to expand the fs array we
> > @@ -546,15 +546,15 @@ static int alloc_fd(unsigned start, unsigned
> > end, unsigned flags)
> > else
> > __clear_close_on_exec(fd, fdt);
> > error = fd;
> > -#if 1
> > - /* Sanity check */
> > - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> > + spin_unlock(&files->file_lock);
> > +
> > + if (unlikely(rcu_access_pointer(fdt->fd[fd]) != NULL)) {
> > printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> > - rcu_assign_pointer(fdt->fd[fd], NULL);
> > }
> > -#endif
>
> Now that I sent it, it is of course not safe to deref without
> protection from either rcu or the lock, so this would have to be
> wrapped with rcu_read_lock, which makes it even less appealing.
>
> Whacking the thing as in the submitted patch seems like the best way
> forward here. :)
Yeah, as I wrote, I'm fine removing it, in particular if Christian is of
the same opinion. I was more musing about whether we should make the check
in fd_install() less aggressive since it is now more likely to trigger...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Tue, Jun 25, 2024 at 03:30:31PM GMT, Jan Kara wrote:
> On Tue 25-06-24 15:11:23, Mateusz Guzik wrote:
> > On Tue, Jun 25, 2024 at 3:09 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >
> > > On Tue, Jun 25, 2024 at 2:08 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Sat 22-06-24 11:49:04, Yu Ma wrote:
> > > > > alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> > > > > allocated fd is NULL. Remove this sanity check since it can be assured by
> > > > > exisitng zero initilization and NULL set when recycling fd.
> > > > ^^^ existing ^^^ initialization
> > > >
> > > > Well, since this is a sanity check, it is expected it never hits. Yet
> > > > searching the web shows it has hit a few times in the past :). So would
> > > > wrapping this with unlikely() give a similar performance gain while keeping
> > > > debugability? If unlikely() does not help, I agree we can remove this since
> > > > fd_install() actually has the same check:
> > > >
> > > > BUG_ON(fdt->fd[fd] != NULL);
> > > >
> > > > and there we need the cacheline anyway so performance impact is minimal.
> > > > Now, this condition in alloc_fd() is nice that it does not take the kernel
> > > > down so perhaps we could change the BUG_ON to WARN() dumping similar kind
> > > > of info as alloc_fd()?
> > > >
> > >
> > > Christian suggested just removing it.
> > >
> > > To my understanding the problem is not the branch per se, but the
> > > cacheline bounce of the fd array induced by reading the status.
> > >
> > > Note the thing also nullifies the pointer, kind of defeating the
> > > BUG_ON in fd_install.
> > >
> > > I'm guessing it's not going to hurt to branch on it after releasing
> > > the lock and forego nullifying, more or less:
> > > diff --git a/fs/file.c b/fs/file.c
> > > index a3b72aa64f11..d22b867db246 100644
> > > --- a/fs/file.c
> > > +++ b/fs/file.c
> > > @@ -524,11 +524,11 @@ static int alloc_fd(unsigned start, unsigned
> > > end, unsigned flags)
> > > */
> > > error = -EMFILE;
> > > if (fd >= end)
> > > - goto out;
> > > + goto out_locked;
> > >
> > > error = expand_files(files, fd);
> > > if (error < 0)
> > > - goto out;
> > > + goto out_locked;
> > >
> > > /*
> > > * If we needed to expand the fs array we
> > > @@ -546,15 +546,15 @@ static int alloc_fd(unsigned start, unsigned
> > > end, unsigned flags)
> > > else
> > > __clear_close_on_exec(fd, fdt);
> > > error = fd;
> > > -#if 1
> > > - /* Sanity check */
> > > - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> > > + spin_unlock(&files->file_lock);
> > > +
> > > + if (unlikely(rcu_access_pointer(fdt->fd[fd]) != NULL)) {
> > > printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> > > - rcu_assign_pointer(fdt->fd[fd], NULL);
> > > }
> > > -#endif
> >
> > Now that I sent it, it is of course not safe to deref without
> > protection from either rcu or the lock, so this would have to be
> > wrapped with rcu_read_lock, which makes it even less appealing.
> >
> > Whacking the thing as in the submitted patch seems like the best way
> > forward here. :)
>
> Yeah, as I wrote, I'm fine removing it, in particular if Christian is of
> the same opinion. I was more musing about whether we should make the check
> in fd_install() less aggressive since it is now more likely to trigger...
We could change it to WARN_ON() and then people can get BUG_ON()
behavior when they turn WARN into BUG, which apparently is a thing that
we support.
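A minimal sketch of that suggestion in fd_install(), assuming the check stays where the existing BUG_ON() sits (illustration only, not a submitted patch):

	/* Warn on a non-NULL slot instead of taking the machine down; with
	 * panic_on_warn set this still escalates to a crash. */
	WARN_ON(rcu_access_pointer(fdt->fd[fd]) != NULL);
	rcu_assign_pointer(fdt->fd[fd], file);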