* [PATCH 0/3] fs/file.c: optimize the critical section of
@ 2024-06-14 16:34 Yu Ma
2024-06-14 16:34 ` [PATCH 1/3] fs/file.c: add fast path in alloc_fd() Yu Ma
` (6 more replies)
0 siblings, 7 replies; 103+ messages in thread
From: Yu Ma @ 2024-06-14 16:34 UTC (permalink / raw)
To: viro, brauner, jack
Cc: linux-fsdevel, linux-kernel, tim.c.chen, tim.c.chen, pan.deng,
tianyou.li, yu.ma
pts/blogbench-1.1.0 is a benchmark designed to replicate the load of a
real-world busy file server, with multiple threads performing random
reads, writes, and rewrites. When running the default configuration with
multiple parallel threads, hot spinlock contention around file_lock is
observed in alloc_fd(), file_close_fd() and put_unused_fd(). These 3
patches reduce the critical section of file_lock in alloc_fd() and
close_fd(). As a result, pts/blogbench-1.1.0 read improved by 32% and
write by 15%, with over 30% of kernel cycles saved, on an Intel ICX
160-core configuration with v6.8-rc6.
Yu Ma (3):
fs/file.c: add fast path in alloc_fd()
fs/file.c: conditionally clear full_fds
fs/file.c: move sanity_check from alloc_fd() to put_unused_fd()
fs/file.c | 27 ++++++++++++++++-----------
1 file changed, 16 insertions(+), 11 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCH 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-14 16:34 [PATCH 0/3] fs/file.c: optimize the critical section of Yu Ma
@ 2024-06-14 16:34 ` Yu Ma
2024-06-15 6:31 ` Mateusz Guzik
2024-06-19 10:36 ` David Laight
2024-06-14 16:34 ` [PATCH 2/3] fs/file.c: conditionally clear full_fds Yu Ma
` (5 subsequent siblings)
6 siblings, 2 replies; 103+ messages in thread
From: Yu Ma @ 2024-06-14 16:34 UTC (permalink / raw)
To: viro, brauner, jack
Cc: linux-fsdevel, linux-kernel, tim.c.chen, tim.c.chen, pan.deng,
tianyou.li, yu.ma
In most cases there is an available fd in the lower 64 bits of the
open_fds bitmap when we look for a free fd slot. Skip the two-level
search via find_next_zero_bit() for this common fast path.

Look directly for an open bit in the lower 64 bits of the open_fds
bitmap when a free slot is available there, as:
(1) The fd allocation algorithm always allocates fds from small to
large, so the lower bits in the open_fds bitmap are used much more
frequently than the higher bits.
(2) After the fdt is expanded (the bitmap size doubles on each
expansion), it is never shrunk. The search size increases, but few open
fds are available there.
(3) find_next_zero_bit() has an internal fast path for size <= 64 that
speeds up the search.

With the fast path added in alloc_fd() through a one-time bitmap search,
pts/blogbench-1.1.0 read improved by 20% and write by 10% on an Intel
ICX 160-core configuration with v6.8-rc6.
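The low-word test can be sketched in userspace as follows. This is only
an illustrative model: the helper names below are invented, while the
kernel itself tests ~fdt->open_fds[0] and calls find_next_zero_bit()
directly.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative userspace model of the patch's fast path. A nonzero
 * ~open_fds0 means at least one of fds 0..63 is free, so a single-word
 * scan can replace the two-level find_next_fd() walk. */
static int low_word_has_free_fd(uint64_t open_fds0)
{
	return ~open_fds0 != 0;	/* any zero bit in the lowest 64 slots? */
}

/* Simplified stand-in for find_next_zero_bit() limited to one word. */
static unsigned int next_zero_bit64(uint64_t word, unsigned int start)
{
	for (unsigned int i = start; i < 64; i++)
		if (!(word & ((uint64_t)1 << i)))
			return i;
	return 64;	/* no free slot at or above start */
}
```

With fds 0..2 open (word 0x7), the first free slot found from bit 0 is
fd 3; with all 64 bits set the fast path is skipped entirely.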
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index 3b683b9101d8..e8d2f9ef7fd1 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -510,8 +510,13 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
if (fd < files->next_fd)
fd = files->next_fd;
- if (fd < fdt->max_fds)
+ if (fd < fdt->max_fds) {
+ if (~fdt->open_fds[0]) {
+ fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
+ goto success;
+ }
fd = find_next_fd(fdt, fd);
+ }
/*
* N.B. For clone tasks sharing a files structure, this test
@@ -531,7 +536,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
*/
if (error)
goto repeat;
-
+success:
if (start <= files->next_fd)
files->next_fd = fd + 1;
--
2.43.0
* [PATCH 2/3] fs/file.c: conditionally clear full_fds
2024-06-14 16:34 [PATCH 0/3] fs/file.c: optimize the critical section of Yu Ma
2024-06-14 16:34 ` [PATCH 1/3] fs/file.c: add fast path in alloc_fd() Yu Ma
@ 2024-06-14 16:34 ` Yu Ma
2024-06-14 16:34 ` [PATCH 3/3] fs/file.c: move sanity_check from alloc_fd() to put_unused_fd() Yu Ma
` (4 subsequent siblings)
6 siblings, 0 replies; 103+ messages in thread
From: Yu Ma @ 2024-06-14 16:34 UTC (permalink / raw)
To: viro, brauner, jack
Cc: linux-fsdevel, linux-kernel, tim.c.chen, tim.c.chen, pan.deng,
tianyou.li, yu.ma
Each group of 64 bits in open_fds maps to one bit in full_fds_bits. It
is very likely that the corresponding bit in full_fds_bits has already
been cleared by the time __clear_open_fd() runs. Check that the bit is
set in full_fds_bits before clearing it, to avoid an unnecessary write
and the resulting cache-line bouncing. See commit fc90888d07b8 ("vfs:
conditionally clear close-on-exec flag") for a similar optimization.

Together with patch 1, this improves pts/blogbench-1.1.0 read by 28% and
write by 14% on an Intel ICX 160-core configuration with v6.8-rc6.
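The test-before-clear idea can be modeled in userspace as below (names
invented for illustration): an already-clear summary bit costs only a
read of the cacheline, never a dirtying write, which is what the write
counter makes visible here.

```c
#include <assert.h>
#include <stdint.h>

static unsigned long full_bits_writes;	/* counts writes, for illustration */

/* Model of the patched __clear_open_fd(): each group of 64 open_fds
 * bits maps to one bit in the full_fds_bits summary word; write the
 * word back only when the bit is actually set. */
static void clear_full_fds_bit(uint64_t *full_fds_bits, unsigned int fd)
{
	unsigned int bit = fd / 64;
	uint64_t mask = (uint64_t)1 << bit;

	if (*full_fds_bits & mask) {		/* test first: read only */
		*full_fds_bits &= ~mask;	/* write only if needed */
		full_bits_writes++;
	}
}
```

Clearing two fds from the same 64-fd group performs exactly one write to
the summary word; the unconditional __clear_bit() would have written
twice and dirtied the shared cacheline both times.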
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/file.c b/fs/file.c
index e8d2f9ef7fd1..a0e94a178c0b 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -268,7 +268,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
{
__clear_bit(fd, fdt->open_fds);
- __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
+ fd /= BITS_PER_LONG;
+ if (test_bit(fd, fdt->full_fds_bits))
+ __clear_bit(fd, fdt->full_fds_bits);
}
static unsigned int count_open_files(struct fdtable *fdt)
--
2.43.0
* [PATCH 3/3] fs/file.c: move sanity_check from alloc_fd() to put_unused_fd()
2024-06-14 16:34 [PATCH 0/3] fs/file.c: optimize the critical section of Yu Ma
2024-06-14 16:34 ` [PATCH 1/3] fs/file.c: add fast path in alloc_fd() Yu Ma
2024-06-14 16:34 ` [PATCH 2/3] fs/file.c: conditionally clear full_fds Yu Ma
@ 2024-06-14 16:34 ` Yu Ma
2024-06-15 4:41 ` Mateusz Guzik
2024-06-22 15:49 ` [PATCH v2 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
` (3 subsequent siblings)
6 siblings, 1 reply; 103+ messages in thread
From: Yu Ma @ 2024-06-14 16:34 UTC (permalink / raw)
To: viro, brauner, jack
Cc: linux-fsdevel, linux-kernel, tim.c.chen, tim.c.chen, pan.deng,
tianyou.li, yu.ma
alloc_fd() has a sanity check inside to make sure the FILE object mapped
to the allocated fd is NULL. Move this sanity check from the
performance-critical alloc_fd() path to the non-performance-critical
put_unused_fd() path.

As the initial NULL FILE object condition is assured by zero
initialization in init_file(), we just need to make sure that the slot
is NULL when the fd is recycled. Three functions call __put_unused_fd()
to return an fd: file_close_fd_locked(), do_close_on_exec() and
put_unused_fd(). file_close_fd_locked() and do_close_on_exec() already
implement the NULL check, so add a NULL check to put_unused_fd() to
cover all release paths.

Combined with patches 1 and 2 in the series, pts/blogbench-1.1.0 read
improved by 32% and write by 15% on an Intel ICX 160-core configuration
with v6.8-rc6.
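The move can be pictured with a toy userspace model (all names below are
invented, not the kernel's): the allocation side trusts zero
initialization and skips the slot check, while the release side now
carries the not-NULL check.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_TABLE_SIZE 8
static void *toy_slots[TOY_TABLE_SIZE];	/* stand-in for fdt->fd[], zero-init */
static int toy_warnings;		/* counts sanity-check hits */

/* Allocation side: no slot check needed, zero init guarantees NULL. */
static int toy_alloc_fd(void)
{
	for (int fd = 0; fd < TOY_TABLE_SIZE; fd++)
		if (toy_slots[fd] == NULL)
			return fd;
	return -1;	/* -EMFILE analogue */
}

/* Release side: the sanity check now lives here. */
static void toy_put_unused_fd(int fd)
{
	if (toy_slots[fd] != NULL) {	/* the "slot %d not NULL" case */
		toy_warnings++;
		toy_slots[fd] = NULL;
	}
}
```

Releasing a never-installed fd raises no warning; releasing a slot that
still holds a pointer trips the check once and repairs the slot, which
mirrors what the moved printk/rcu_assign_pointer pair does.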
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index a0e94a178c0b..59d62909e2e3 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -548,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
else
__clear_close_on_exec(fd, fdt);
error = fd;
-#if 1
- /* Sanity check */
- if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
- printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
- rcu_assign_pointer(fdt->fd[fd], NULL);
- }
-#endif
out:
spin_unlock(&files->file_lock);
@@ -572,7 +565,7 @@ int get_unused_fd_flags(unsigned flags)
}
EXPORT_SYMBOL(get_unused_fd_flags);
-static void __put_unused_fd(struct files_struct *files, unsigned int fd)
+static inline void __put_unused_fd(struct files_struct *files, unsigned int fd)
{
struct fdtable *fdt = files_fdtable(files);
__clear_open_fd(fd, fdt);
@@ -583,7 +576,12 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
void put_unused_fd(unsigned int fd)
{
struct files_struct *files = current->files;
+ struct fdtable *fdt = files_fdtable(files);
spin_lock(&files->file_lock);
+ if (unlikely(rcu_access_pointer(fdt->fd[fd]))) {
+ printk(KERN_WARNING "put_unused_fd: slot %d not NULL!\n", fd);
+ rcu_assign_pointer(fdt->fd[fd], NULL);
+ }
__put_unused_fd(files, fd);
spin_unlock(&files->file_lock);
}
--
2.43.0
* Re: [PATCH 3/3] fs/file.c: move sanity_check from alloc_fd() to put_unused_fd()
2024-06-14 16:34 ` [PATCH 3/3] fs/file.c: move sanity_check from alloc_fd() to put_unused_fd() Yu Ma
@ 2024-06-15 4:41 ` Mateusz Guzik
2024-06-15 5:07 ` Mateusz Guzik
` (2 more replies)
0 siblings, 3 replies; 103+ messages in thread
From: Mateusz Guzik @ 2024-06-15 4:41 UTC (permalink / raw)
To: Yu Ma
Cc: viro, brauner, jack, linux-fsdevel, linux-kernel, tim.c.chen,
tim.c.chen, pan.deng, tianyou.li
On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
> alloc_fd() has a sanity check inside to make sure the FILE object mapping to the
Total nitpick: FILE is the libc thing, I would refer to it as 'struct
file'. See below for the actual point.
> Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> 32%, write improved by 15% on Intel ICX 160 cores configuration with v6.8-rc6.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
> fs/file.c | 14 ++++++--------
> 1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index a0e94a178c0b..59d62909e2e3 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -548,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> else
> __clear_close_on_exec(fd, fdt);
> error = fd;
> -#if 1
> - /* Sanity check */
> - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> - rcu_assign_pointer(fdt->fd[fd], NULL);
> - }
> -#endif
>
I was going to ask when was the last time anyone saw this fire and
suggest getting rid of it if enough time(tm) has passed. Turns out it
does show up sometimes; the latest result I found is 2017 vintage:
https://groups.google.com/g/syzkaller-bugs/c/jfQ7upCDf9s/m/RQjhDrZ7AQAJ
So you are moving this to another locked area, but one which does not
execute in the benchmark?
Patch 2/3 states 28% read and 14% write increase, this commit message
claims it goes up to 32% and 15% respectively -- pretty big. I presume
this has to do with bouncing a line containing the fd.
I would argue moving this check elsewhere is about as good as removing
it altogether, but that's for the vfs overlords to decide.
All that aside, looking at disasm of alloc_fd it is pretty clear there
is time to save, for example:
if (unlikely(nr >= fdt->max_fds)) {
if (fd >= end) {
error = -EMFILE;
goto out;
}
error = expand_files(fd, fd);
if (error < 0)
goto out;
if (error)
goto repeat;
}
This elides 2 branches and a func call in the common case. Completely
untested, maybe has some brainfarts, feel free to take without credit
and further massage the routine.
Moreover my disasm shows that even looking for a bit results in
a func call(!) to _find_next_zero_bit -- someone(tm) should probably
massage it into another inline.
After the above massaging is done and if it turns out the check has to
stay, you can plausibly damage-control it with prefetch -- issue it
immediately after finding the fd number, before any other work.
All that said, by the above I'm confident there is still *some*
performance left on the table despite the lock.
> out:
> spin_unlock(&files->file_lock);
> @@ -572,7 +565,7 @@ int get_unused_fd_flags(unsigned flags)
> }
> EXPORT_SYMBOL(get_unused_fd_flags);
>
> -static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> +static inline void __put_unused_fd(struct files_struct *files, unsigned int fd)
> {
> struct fdtable *fdt = files_fdtable(files);
> __clear_open_fd(fd, fdt);
> @@ -583,7 +576,12 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> void put_unused_fd(unsigned int fd)
> {
> struct files_struct *files = current->files;
> + struct fdtable *fdt = files_fdtable(files);
> spin_lock(&files->file_lock);
> + if (unlikely(rcu_access_pointer(fdt->fd[fd]))) {
> + printk(KERN_WARNING "put_unused_fd: slot %d not NULL!\n", fd);
> + rcu_assign_pointer(fdt->fd[fd], NULL);
> + }
> __put_unused_fd(files, fd);
> spin_unlock(&files->file_lock);
> }
* Re: [PATCH 3/3] fs/file.c: move sanity_check from alloc_fd() to put_unused_fd()
2024-06-15 4:41 ` Mateusz Guzik
@ 2024-06-15 5:07 ` Mateusz Guzik
2024-06-17 17:55 ` Tim Chen
2024-06-16 3:47 ` Ma, Yu
2024-06-17 8:36 ` Christian Brauner
2 siblings, 1 reply; 103+ messages in thread
From: Mateusz Guzik @ 2024-06-15 5:07 UTC (permalink / raw)
To: Yu Ma
Cc: viro, brauner, jack, linux-fsdevel, linux-kernel, tim.c.chen,
tim.c.chen, pan.deng, tianyou.li
On Sat, Jun 15, 2024 at 06:41:45AM +0200, Mateusz Guzik wrote:
> On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
> > alloc_fd() has a sanity check inside to make sure the FILE object mapping to the
>
> Total nitpick: FILE is the libc thing, I would refer to it as 'struct
> file'. See below for the actual point.
>
> > Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> > 32%, write improved by 15% on Intel ICX 160 cores configuration with v6.8-rc6.
> >
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> > fs/file.c | 14 ++++++--------
> > 1 file changed, 6 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index a0e94a178c0b..59d62909e2e3 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -548,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > else
> > __clear_close_on_exec(fd, fdt);
> > error = fd;
> > -#if 1
> > - /* Sanity check */
> > - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> > - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> > - rcu_assign_pointer(fdt->fd[fd], NULL);
> > - }
> > -#endif
> >
>
> I was going to ask when was the last time anyone seen this fire and
> suggest getting rid of it if enough time(tm) passed. Turns out it does
> show up sometimes, latest result I found is 2017 vintage:
> https://groups.google.com/g/syzkaller-bugs/c/jfQ7upCDf9s/m/RQjhDrZ7AQAJ
>
> So you are moving this to another locked area, but one which does not
> execute in the benchmark?
>
> Patch 2/3 states 28% read and 14% write increase, this commit message
> claims it goes up to 32% and 15% respectively -- pretty big. I presume
> this has to do with bouncing a line containing the fd.
>
> I would argue moving this check elsewhere is about as good as removing
> it altogether, but that's for the vfs overlords to decide.
>
> All that aside, looking at disasm of alloc_fd it is pretty clear there
> is time to save, for example:
>
> if (unlikely(nr >= fdt->max_fds)) {
> if (fd >= end) {
> error = -EMFILE;
> goto out;
> }
> error = expand_files(fd, fd);
> if (error < 0)
> goto out;
> if (error)
> goto repeat;
> }
>
Now that I wrote it I noticed the fd < end check has to be performed
regardless of max_fds -- someone could have changed rlimit to a lower
value after using a higher fd. But the main point stands: the call to
expand_files and associated error handling don't have to be there.
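The scenario is reproducible from userspace. The sketch below (a
standalone demo, not kernel code) occupies fds past 15, lowers
RLIMIT_NOFILE below them, and shows the next allocation failing with
EMFILE even though the fdtable sized for the old limit still exists --
which is why the fd < end check cannot be folded under the max_fds test.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

/* Returns 1 if, after lowering RLIMIT_NOFILE below fds we already
 * hold, the next allocation fails with EMFILE. */
static int demo_rlimit_lowered(void)
{
	struct rlimit old, low;
	int fd = -1, failed;

	if (getrlimit(RLIMIT_NOFILE, &old) != 0)
		return -1;

	/* occupy every fd up to at least 15 */
	while (fd < 15) {
		fd = open("/dev/null", O_RDONLY);
		if (fd < 0)
			return -1;
	}

	low.rlim_cur = 10;		/* now below fds we already hold */
	low.rlim_max = old.rlim_max;
	if (setrlimit(RLIMIT_NOFILE, &low) != 0)
		return -1;

	fd = open("/dev/null", O_RDONLY);	/* next free slot is > 10 */
	failed = (fd < 0 && errno == EMFILE);

	setrlimit(RLIMIT_NOFILE, &old);		/* restore for the caller */
	return failed;
}
```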
> This elides 2 branches and a func call in the common case. Completely
> untested, maybe has some brainfarts, feel free to take without credit
> and further massage the routine.
>
> Moreover my disasm shows that even looking for a bit results in
> a func call(!) to _find_next_zero_bit -- someone(tm) should probably
> massage it into another inline.
>
> After the above massaging is done and if it turns out the check has to
> stay, you can plausibly damage-control it with prefetch -- issue it
> immediately after finding the fd number, before any other work.
>
> All that said, by the above I'm confident there is still *some*
> performance left on the table despite the lock.
>
> > out:
> > spin_unlock(&files->file_lock);
> > @@ -572,7 +565,7 @@ int get_unused_fd_flags(unsigned flags)
> > }
> > EXPORT_SYMBOL(get_unused_fd_flags);
> >
> > -static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > +static inline void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > {
> > struct fdtable *fdt = files_fdtable(files);
> > __clear_open_fd(fd, fdt);
> > @@ -583,7 +576,12 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > void put_unused_fd(unsigned int fd)
> > {
> > struct files_struct *files = current->files;
> > + struct fdtable *fdt = files_fdtable(files);
> > spin_lock(&files->file_lock);
> > + if (unlikely(rcu_access_pointer(fdt->fd[fd]))) {
> > + printk(KERN_WARNING "put_unused_fd: slot %d not NULL!\n", fd);
> > + rcu_assign_pointer(fdt->fd[fd], NULL);
> > + }
> > __put_unused_fd(files, fd);
> > spin_unlock(&files->file_lock);
> > }
* Re: [PATCH 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-14 16:34 ` [PATCH 1/3] fs/file.c: add fast path in alloc_fd() Yu Ma
@ 2024-06-15 6:31 ` Mateusz Guzik
2024-06-16 4:01 ` Ma, Yu
2024-06-17 17:49 ` Tim Chen
2024-06-19 10:36 ` David Laight
1 sibling, 2 replies; 103+ messages in thread
From: Mateusz Guzik @ 2024-06-15 6:31 UTC (permalink / raw)
To: Yu Ma
Cc: viro, brauner, jack, linux-fsdevel, linux-kernel, tim.c.chen,
tim.c.chen, pan.deng, tianyou.li
On Fri, Jun 14, 2024 at 12:34:14PM -0400, Yu Ma wrote:
> There is available fd in the lower 64 bits of open_fds bitmap for most cases
> when we look for an available fd slot. Skip 2-levels searching via
> find_next_zero_bit() for this common fast path.
>
> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> free slot is available there, as:
> (1) The fd allocation algorithm would always allocate fd from small to large.
> Lower bits in open_fds bitmap would be used much more frequently than higher
> bits.
> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> searching.
>
> With the fast path added in alloc_fd() through one-time bitmap searching,
> pts/blogbench-1.1.0 read is improved by 20% and write by 10% on Intel ICX 160
> cores configuration with v6.8-rc6.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
> fs/file.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index 3b683b9101d8..e8d2f9ef7fd1 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -510,8 +510,13 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> if (fd < files->next_fd)
> fd = files->next_fd;
>
> - if (fd < fdt->max_fds)
> + if (fd < fdt->max_fds) {
> + if (~fdt->open_fds[0]) {
> + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
> + goto success;
> + }
> fd = find_next_fd(fdt, fd);
> + }
>
> /*
> * N.B. For clone tasks sharing a files structure, this test
> @@ -531,7 +536,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> */
> if (error)
> goto repeat;
> -
> +success:
> if (start <= files->next_fd)
> files->next_fd = fd + 1;
>
As indicated in my other e-mail, a process may reach a certain fd number
and then lower its rlimit(NOFILE). In that case the max_fds field can
happen to be higher, and the above patch will fail to check for the
(fd < end) case.
> --
> 2.43.0
>
* Re: [PATCH 3/3] fs/file.c: move sanity_check from alloc_fd() to put_unused_fd()
2024-06-15 4:41 ` Mateusz Guzik
2024-06-15 5:07 ` Mateusz Guzik
@ 2024-06-16 3:47 ` Ma, Yu
2024-06-17 11:23 ` Mateusz Guzik
2024-06-17 8:36 ` Christian Brauner
2 siblings, 1 reply; 103+ messages in thread
From: Ma, Yu @ 2024-06-16 3:47 UTC (permalink / raw)
To: Mateusz Guzik
Cc: viro, brauner, jack, linux-fsdevel, linux-kernel, tim.c.chen,
tim.c.chen, pan.deng, tianyou.li, yu.ma
On 6/15/2024 12:41 PM, Mateusz Guzik wrote:
> On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
>> alloc_fd() has a sanity check inside to make sure the FILE object mapping to the
> Total nitpick: FILE is the libc thing, I would refer to it as 'struct
> file'. See below for the actual point.
Good point, not a nitpick at all ;) , will update the wording in the commit message.
>> Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
>> 32%, write improved by 15% on Intel ICX 160 cores configuration with v6.8-rc6.
>>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>> ---
>> fs/file.c | 14 ++++++--------
>> 1 file changed, 6 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/file.c b/fs/file.c
>> index a0e94a178c0b..59d62909e2e3 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -548,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>> else
>> __clear_close_on_exec(fd, fdt);
>> error = fd;
>> -#if 1
>> - /* Sanity check */
>> - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
>> - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
>> - rcu_assign_pointer(fdt->fd[fd], NULL);
>> - }
>> -#endif
>>
> I was going to ask when was the last time anyone seen this fire and
> suggest getting rid of it if enough time(tm) passed. Turns out it does
> show up sometimes, latest result I found is 2017 vintage:
> https://groups.google.com/g/syzkaller-bugs/c/jfQ7upCDf9s/m/RQjhDrZ7AQAJ
>
> So you are moving this to another locked area, but one which does not
> execute in the benchmark?
The consideration here, as mentioned, is to reduce the performance
impact while keeping the sanity check and the same functionality, by
moving it from the critical path to a non-critical one: put_unused_fd()
is mostly used for error handling when an fd is allocated successfully
but the struct file fails to be obtained in the next step.
>
> Patch 2/3 states 28% read and 14% write increase, this commit message
> claims it goes up to 32% and 15% respectively -- pretty big. I presume
> this has to do with bouncing a line containing the fd.
>
> I would argue moving this check elsewhere is about as good as removing
> it altogether, but that's for the vfs overlords to decide
>
> All that aside, looking at disasm of alloc_fd it is pretty clear there
> is time to save, for example:
>
> if (unlikely(nr >= fdt->max_fds)) {
> if (fd >= end) {
> error = -EMFILE;
> goto out;
> }
> error = expand_files(fd, fd);
> if (error < 0)
> goto out;
> if (error)
> goto repeat;
> }
>
> This elides 2 branches and a func call in the common case. Completely
> untested, maybe has some brainfarts, feel free to take without credit
> and further massage the routine.
>
> Moreover my disasm shows that even looking for a bit results in
> a func call(!) to _find_next_zero_bit -- someone(tm) should probably
> massage it into another inline.
>
> After the above massaging is done and if it turns out the check has to
> stay, you can plausibly damage-control it with prefetch -- issue it
> immediately after finding the fd number, before any other work.
>
> All that said, by the above I'm confident there is still *some*
> performance left on the table despite the lock.
Thank you Guzik for the quick check and good suggestions :) Yes, there
are extra branches and a function call that can be reduced for better
performance. Considering the fix for the fast path, how about a flow
like the draft below (besides the massaging of find_next_fd()):
	error = -EMFILE;
	if (fd < fdt->max_fds) {
		if (~fdt->open_fds[0]) {
			fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
			goto fastreturn;
		}
		fd = find_next_fd(fdt, fd);
	}

	if (unlikely(fd >= fdt->max_fds)) {
		error = expand_files(files, fd);
		if (error < 0)
			goto out;
		if (error)
			goto repeat;
	}
fastreturn:
	if (unlikely(fd >= end))
		goto out;
	if (start <= files->next_fd)
		files->next_fd = fd + 1;

	....
>> out:
>> spin_unlock(&files->file_lock);
>> @@ -572,7 +565,7 @@ int get_unused_fd_flags(unsigned flags)
>> }
>> EXPORT_SYMBOL(get_unused_fd_flags);
>>
>> -static void __put_unused_fd(struct files_struct *files, unsigned int fd)
>> +static inline void __put_unused_fd(struct files_struct *files, unsigned int fd)
>> {
>> struct fdtable *fdt = files_fdtable(files);
>> __clear_open_fd(fd, fdt);
>> @@ -583,7 +576,12 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
>> void put_unused_fd(unsigned int fd)
>> {
>> struct files_struct *files = current->files;
>> + struct fdtable *fdt = files_fdtable(files);
>> spin_lock(&files->file_lock);
>> + if (unlikely(rcu_access_pointer(fdt->fd[fd]))) {
>> + printk(KERN_WARNING "put_unused_fd: slot %d not NULL!\n", fd);
>> + rcu_assign_pointer(fdt->fd[fd], NULL);
>> + }
>> __put_unused_fd(files, fd);
>> spin_unlock(&files->file_lock);
>> }
* Re: [PATCH 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-15 6:31 ` Mateusz Guzik
@ 2024-06-16 4:01 ` Ma, Yu
2024-06-17 17:49 ` Tim Chen
1 sibling, 0 replies; 103+ messages in thread
From: Ma, Yu @ 2024-06-16 4:01 UTC (permalink / raw)
To: Mateusz Guzik
Cc: viro, brauner, jack, linux-fsdevel, linux-kernel, tim.c.chen,
tim.c.chen, pan.deng, tianyou.li, yu.ma
On 6/15/2024 2:31 PM, Mateusz Guzik wrote:
> On Fri, Jun 14, 2024 at 12:34:14PM -0400, Yu Ma wrote:
>> There is available fd in the lower 64 bits of open_fds bitmap for most cases
>> when we look for an available fd slot. Skip 2-levels searching via
>> find_next_zero_bit() for this common fast path.
>>
>> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
>> free slot is available there, as:
>> (1) The fd allocation algorithm would always allocate fd from small to large.
>> Lower bits in open_fds bitmap would be used much more frequently than higher
>> bits.
>> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
>> it would never be shrunk. The search size increases but there are few open fds
>> available here.
>> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
>> searching.
>>
>> With the fast path added in alloc_fd() through one-time bitmap searching,
>> pts/blogbench-1.1.0 read is improved by 20% and write by 10% on Intel ICX 160
>> cores configuration with v6.8-rc6.
>>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>> ---
>> fs/file.c | 9 +++++++--
>> 1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/file.c b/fs/file.c
>> index 3b683b9101d8..e8d2f9ef7fd1 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -510,8 +510,13 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>> if (fd < files->next_fd)
>> fd = files->next_fd;
>>
>> - if (fd < fdt->max_fds)
>> + if (fd < fdt->max_fds) {
>> + if (~fdt->open_fds[0]) {
>> + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
>> + goto success;
>> + }
>> fd = find_next_fd(fdt, fd);
>> + }
>>
>> /*
>> * N.B. For clone tasks sharing a files structure, this test
>> @@ -531,7 +536,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>> */
>> if (error)
>> goto repeat;
>> -
>> +success:
>> if (start <= files->next_fd)
>> files->next_fd = fd + 1;
>>
> As indicated in my other e-mail it may be a process can reach a certain
> fd number and then lower its rlimit(NOFILE). In that case the max_fds
> field can happen to be higher and the above patch will fail to check for
> the (fd < end) case.
Thanks for the good catch, replied in that mail thread for details.
>
>> --
>> 2.43.0
>>
* Re: [PATCH 3/3] fs/file.c: move sanity_check from alloc_fd() to put_unused_fd()
2024-06-15 4:41 ` Mateusz Guzik
2024-06-15 5:07 ` Mateusz Guzik
2024-06-16 3:47 ` Ma, Yu
@ 2024-06-17 8:36 ` Christian Brauner
2 siblings, 0 replies; 103+ messages in thread
From: Christian Brauner @ 2024-06-17 8:36 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Yu Ma, viro, jack, linux-fsdevel, linux-kernel, tim.c.chen,
tim.c.chen, pan.deng, tianyou.li
On Sat, Jun 15, 2024 at 06:41:45AM GMT, Mateusz Guzik wrote:
> On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
> > alloc_fd() has a sanity check inside to make sure the FILE object mapping to the
>
> Total nitpick: FILE is the libc thing, I would refer to it as 'struct
> file'. See below for the actual point.
>
> > Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> > 32%, write improved by 15% on Intel ICX 160 cores configuration with v6.8-rc6.
> >
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> > fs/file.c | 14 ++++++--------
> > 1 file changed, 6 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index a0e94a178c0b..59d62909e2e3 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -548,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > else
> > __clear_close_on_exec(fd, fdt);
> > error = fd;
> > -#if 1
> > - /* Sanity check */
> > - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> > - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> > - rcu_assign_pointer(fdt->fd[fd], NULL);
> > - }
> > -#endif
> >
>
> I was going to ask when was the last time anyone seen this fire and
> suggest getting rid of it if enough time(tm) passed. Turns out it does
> show up sometimes, latest result I found is 2017 vintage:
> https://groups.google.com/g/syzkaller-bugs/c/jfQ7upCDf9s/m/RQjhDrZ7AQAJ
>
> So you are moving this to another locked area, but one which does not
> execute in the benchmark?
>
> Patch 2/3 states 28% read and 14% write increase, this commit message
> claims it goes up to 32% and 15% respectively -- pretty big. I presume
> this has to do with bouncing a line containing the fd.
>
> I would argue moving this check elsewhere is about as good as removing
> it altogether, but that's for the vfs overlords to decide.
This all dates back to 9d11a5176cc5 ("I just put a pre-90 on
ftp.kernel.org, and I'm happy to report that Davem"), which is pre-git.
I think removing it is fine.
* Re: [PATCH 3/3] fs/file.c: move sanity_check from alloc_fd() to put_unused_fd()
2024-06-16 3:47 ` Ma, Yu
@ 2024-06-17 11:23 ` Mateusz Guzik
2024-06-17 17:22 ` Ma, Yu
0 siblings, 1 reply; 103+ messages in thread
From: Mateusz Guzik @ 2024-06-17 11:23 UTC (permalink / raw)
To: Ma, Yu
Cc: viro, brauner, jack, linux-fsdevel, linux-kernel, tim.c.chen,
tim.c.chen, pan.deng, tianyou.li
On Sun, Jun 16, 2024 at 11:47:40AM +0800, Ma, Yu wrote:
>
> On 6/15/2024 12:41 PM, Mateusz Guzik wrote:
> > So you are moving this to another locked area, but one which does not
> > execute in the benchmark?
>
> The consideration here as mentioned is to reduce the performance impact (if
> to reserve the sanity check, and have the same functionality) by moving it
> from critical path to non-critical, as put_unused_fd() is mostly used for
> error handling when fd is allocated successfully but struct file failed to
> obtained in the next step.
>
As mentioned by Christian in his mail, this check can just be removed.
> error = -EMFILE;
> if (fd < fdt->max_fds) {
I would likely() on it.
> if (~fdt->open_fds[0]) {
> fd = find_next_zero_bit(fdt->open_fds,
> BITS_PER_LONG, fd);
> goto fastreturn;
> }
> fd = find_next_fd(fdt, fd);
> }
>
> if (unlikely(fd >= fdt->max_fds)) {
> error = expand_files(files, fd);
> if (error < 0)
> goto out;
> if (error)
> goto repeat;
> }
> fastreturn:
> if (unlikely(fd >= end))
> goto out;
> if (start <= files->next_fd)
> files->next_fd = fd + 1;
>
> ....
>
This is not a review, but it does read fine.
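For readers following along, the `~fdt->open_fds[0]` gate in the sketch
above can be modeled in plain userspace C. The model below is
illustrative only (NWORDS, bitmap_alloc_fd() and the explicit fallback
are assumptions, not kernel code); it shows how a clear bit anywhere in
the low word permits a single-word search, with the full-table scan as
the slow path:

```c
/*
 * Userspace model of the low-word fast path.  Unlike the posted patch,
 * this sketch falls back to the slow path when the low word has no free
 * bit at or above 'start'.
 */
#include <stdint.h>

#define NWORDS 4                      /* models a 256-slot fd table */

static uint64_t open_fds[NWORDS];     /* bit set => fd in use */

/* Return the lowest free slot >= start, or -1 if the table is full. */
static int bitmap_alloc_fd(unsigned start)
{
	/* Fast path: some bit in the low 64 is still clear. */
	if (start < 64 && ~open_fds[0]) {
		for (unsigned fd = start; fd < 64; fd++) {
			if (!(open_fds[0] & (1ULL << fd))) {
				open_fds[0] |= 1ULL << fd;
				return (int)fd;
			}
		}
		/* All free bits were below 'start': fall through. */
	}
	/* Slow path: scan the whole table, as find_next_fd() would. */
	for (unsigned fd = start; fd < 64u * NWORDS; fd++) {
		if (!(open_fds[fd / 64] & (1ULL << (fd % 64)))) {
			open_fds[fd / 64] |= 1ULL << (fd % 64);
			return (int)fd;
		}
	}
	return -1;
}
```

The allocate-small-to-large behavior the commit message relies on falls
out directly: the fast path keeps winning until the low word fills up.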
LTP (https://github.com/linux-test-project/ltp.git) has a bunch of fd
tests, I would make sure they still pass before posting a v2.
I would definitely try moving out the lock to its own cacheline --
currently it resides with the bitmaps (and first 4 fds of the embedded
array).
I expect it to provide some win on top of the current patchset, but
whether it will be sufficient to justify it I have no idea.
Something of this sort:
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 2944d4aa413b..423cb599268a 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -50,11 +50,11 @@ struct files_struct {
* written part on a separate cache line in SMP
*/
spinlock_t file_lock ____cacheline_aligned_in_smp;
- unsigned int next_fd;
+ unsigned int next_fd ____cacheline_aligned_in_smp;
unsigned long close_on_exec_init[1];
unsigned long open_fds_init[1];
unsigned long full_fds_bits_init[1];
- struct file __rcu * fd_array[NR_OPEN_DEFAULT];
+ struct file __rcu * fd_array[NR_OPEN_DEFAULT] ____cacheline_aligned_in_smp;
};
struct file_operations;
(feel free to just take)
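A userspace sketch can sanity-check where the fields land after such a
change. Here alignas(64) stands in for ____cacheline_aligned_in_smp and
the field types are stand-ins (assumptions for illustration, not the
kernel definitions):

```c
/*
 * Userspace analogue of the suggested files_struct layout: the lock on
 * its own cache line, next_fd sharing a line with the bitmaps, and the
 * embedded fd array on yet another line.
 */
#include <stdalign.h>
#include <stddef.h>

#define NR_OPEN_DEFAULT 4   /* illustrative */

struct files_layout {
	alignas(64) int file_lock;          /* spinlock stand-in       */
	alignas(64) unsigned int next_fd;   /* moved off the lock line */
	unsigned long close_on_exec_init[1];
	unsigned long open_fds_init[1];
	unsigned long full_fds_bits_init[1];
	alignas(64) void *fd_array[NR_OPEN_DEFAULT];
};
```

offsetof() checks (e.g. in a static assert) can then confirm that the
lock no longer shares a cache line with the bitmaps or the array.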
All that said, I have to note blogbench has numerous bugs. For example
it collects stats by merely bumping a global variable (not even with
atomics), so some updates are straight up lost.
To my understanding it was meant to test filesystems and I'm guessing
threading (instead of separate processes) was only used to make it
easier to share statistics. Given the current scales this
unintentionally transitioned into bottlenecking on the file_lock
instead.
There were scalability changes made about 9 years ago and according to
git log they were benchmarked by Eric Dumazet, while benchmark code was
not shared. I suggest CC'ing the guy with your v2 and asking if he can
bench. Even if he is no longer in position to do it perhaps he can rope
someone in (or even better share his benchmark).
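The lost-update problem is easy to reproduce outside blogbench. The
sketch below is illustrative (it is not blogbench's code): several
threads bump one counter with a plain increment, which is technically a
data race, alongside an atomic counter that never loses updates:

```c
/*
 * Minimal reproduction of the lost-update bug: a plain shared counter
 * versus an atomic one.  The plain increment is deliberately a data
 * race (undefined behavior) to mirror the benchmark's stats code.
 */
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 4
#define ITERS    100000

static long        plain_hits;   /* racy: load/add/store can lose updates */
static atomic_long atomic_hits;  /* safe: every increment is counted      */

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		plain_hits++;   /* non-atomic read-modify-write */
		atomic_fetch_add_explicit(&atomic_hits, 1,
					  memory_order_relaxed);
	}
	return NULL;
}

/* Run the workers and report both totals. */
static void run_counters(long *plain_out, long *atomic_out)
{
	pthread_t t[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	*plain_out  = plain_hits;
	*atomic_out = atomic_load(&atomic_hits);
}
```

On a multi-core machine the plain total typically comes out well below
NTHREADS * ITERS, which is exactly the kind of silent undercounting
described above.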
* Re: [PATCH 3/3] fs/file.c: move sanity_check from alloc_fd() to put_unused_fd()
2024-06-17 11:23 ` Mateusz Guzik
@ 2024-06-17 17:22 ` Ma, Yu
0 siblings, 0 replies; 103+ messages in thread
From: Ma, Yu @ 2024-06-17 17:22 UTC (permalink / raw)
To: Mateusz Guzik
Cc: viro, brauner, jack, linux-fsdevel, linux-kernel, tim.c.chen,
tim.c.chen, pan.deng, tianyou.li, yu.ma
On 6/17/2024 7:23 PM, Mateusz Guzik wrote:
> On Sun, Jun 16, 2024 at 11:47:40AM +0800, Ma, Yu wrote:
>> On 6/15/2024 12:41 PM, Mateusz Guzik wrote:
>>> So you are moving this to another locked area, but one which does not
>>> execute in the benchmark?
>> The consideration here, as mentioned, is to keep the sanity check (and
>> the same functionality) while reducing its performance impact by moving
>> it from the critical path to a non-critical one, as put_unused_fd() is
>> mostly used for error handling when an fd was allocated successfully
>> but the struct file failed to be obtained in the next step.
>>
> As mentioned by Christian in his mail this check can just be removed.
Yes, that's great, I'll update in v2
>
>> error = -EMFILE;
>> if (fd < fdt->max_fds) {
> I would likely() on it.
That's better :)
>
>> if (~fdt->open_fds[0]) {
>> fd = find_next_zero_bit(fdt->open_fds,
>> BITS_PER_LONG, fd);
>> goto fastreturn;
>> }
>> fd = find_next_fd(fdt, fd);
>> }
>>
>> if (unlikely(fd >= fdt->max_fds)) {
>> error = expand_files(files, fd);
>> if (error < 0)
>> goto out;
>> if (error)
>> goto repeat;
>> }
>> fastreturn:
>> if (unlikely(fd >= end))
>> goto out;
>> if (start <= files->next_fd)
>> files->next_fd = fd + 1;
>>
>> ....
>>
> This is not a review, but it does read fine.
>
> LTP (https://github.com/linux-test-project/ltp.git) has a bunch of fd
> tests, I would make sure they still pass before posting a v2.
Got it, thanks for the kind reminder.
>
> I would definitely try moving out the lock to its own cacheline --
> currently it resides with the bitmaps (and first 4 fds of the embedded
> array).
>
> I expect it to provide some win on top of the current patchset, but
> whether it will be sufficient to justify it I have no idea.
>
> Something of this sort:
> diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
> index 2944d4aa413b..423cb599268a 100644
> --- a/include/linux/fdtable.h
> +++ b/include/linux/fdtable.h
> @@ -50,11 +50,11 @@ struct files_struct {
> * written part on a separate cache line in SMP
> */
> spinlock_t file_lock ____cacheline_aligned_in_smp;
> - unsigned int next_fd;
> + unsigned int next_fd ____cacheline_aligned_in_smp;
> unsigned long close_on_exec_init[1];
> unsigned long open_fds_init[1];
> unsigned long full_fds_bits_init[1];
> - struct file __rcu * fd_array[NR_OPEN_DEFAULT];
> + struct file __rcu * fd_array[NR_OPEN_DEFAULT] ____cacheline_aligned_in_smp;
> };
>
> struct file_operations;
>
> (feel free to just take)
Thanks Guzik for the thoughtful suggestions. We have previously tried
separating file_lock and next_fd onto different cachelines, but observed
no obvious improvement; we'll double check and verify these two tips to
see how it goes.
>
> All that said, I have to note blogbench has numerous bugs. For example
> it collects stats by merely bumping a global variable (not even with
> atomics), so some updates are straight up lost.
>
> To my understanding it was meant to test filesystems and I'm guessing
> threading (instead of separate processes) was only used to make it
> easier to share statistics. Given the current scales this
> unintentionally transitioned into bottlenecking on the file_lock
> instead.
>
> There were scalability changes made about 9 years ago and according to
> git log they were benchmarked by Eric Dumazet, while benchmark code was
> not shared. I suggest CC'ing the guy with your v2 and asking if he can
> bench. Even if he is no longer in position to do it perhaps he can rope
> someone in (or even better share his benchmark).
Good advice. We also noticed the problems with blogbench; the data in
the current commits was collected with a revised version that makes the
score statistics/reporting variables safe, and we will submit a patch to
blogbench with that update. We'll CC Eric for more information on his
benchmark and check the details if it is available.
* Re: [PATCH 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-15 6:31 ` Mateusz Guzik
2024-06-16 4:01 ` Ma, Yu
@ 2024-06-17 17:49 ` Tim Chen
1 sibling, 0 replies; 103+ messages in thread
From: Tim Chen @ 2024-06-17 17:49 UTC (permalink / raw)
To: Mateusz Guzik, Yu Ma
Cc: viro, brauner, jack, linux-fsdevel, linux-kernel, tim.c.chen,
pan.deng, tianyou.li
On Sat, 2024-06-15 at 08:31 +0200, Mateusz Guzik wrote:
> On Fri, Jun 14, 2024 at 12:34:14PM -0400, Yu Ma wrote:
> > There is available fd in the lower 64 bits of open_fds bitmap for most cases
> > when we look for an available fd slot. Skip 2-levels searching via
> > find_next_zero_bit() for this common fast path.
> >
> > Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> > free slot is available there, as:
> > (1) The fd allocation algorithm would always allocate fd from small to large.
> > Lower bits in open_fds bitmap would be used much more frequently than higher
> > bits.
> > (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> > it would never be shrunk. The search size increases but there are few open fds
> > available here.
> > (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> > searching.
> >
> > With the fast path added in alloc_fd() through one-time bitmap searching,
> > pts/blogbench-1.1.0 read is improved by 20% and write by 10% on Intel ICX 160
> > cores configuration with v6.8-rc6.
> >
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> > fs/file.c | 9 +++++++--
> > 1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index 3b683b9101d8..e8d2f9ef7fd1 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -510,8 +510,13 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > if (fd < files->next_fd)
> > fd = files->next_fd;
> >
> > - if (fd < fdt->max_fds)
> > + if (fd < fdt->max_fds) {
> > + if (~fdt->open_fds[0]) {
> > + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
Will adding a check here work to ensure fd < end?
if (unlikely(fd >= end)) {
error = -EMFILE;
goto out;
}
> > + goto success;
> > + }
> > fd = find_next_fd(fdt, fd);
> > + }
> >
> > /*
> > * N.B. For clone tasks sharing a files structure, this test
> > @@ -531,7 +536,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > */
> > if (error)
> > goto repeat;
> > -
> > +success:
> > if (start <= files->next_fd)
> > files->next_fd = fd + 1;
> >
>
> As indicated in my other e-mail it may be a process can reach a certain
> fd number and then lower its rlimit(NOFILE). In that case the max_fds
> field can happen to be higher and the above patch will fail to check for
> the (fd < end) case.
>
Thanks.
Tim
* Re: [PATCH 3/3] fs/file.c: move sanity_check from alloc_fd() to put_unused_fd()
2024-06-15 5:07 ` Mateusz Guzik
@ 2024-06-17 17:55 ` Tim Chen
2024-06-17 17:59 ` Mateusz Guzik
2024-06-17 18:04 ` Tim Chen
0 siblings, 2 replies; 103+ messages in thread
From: Tim Chen @ 2024-06-17 17:55 UTC (permalink / raw)
To: Mateusz Guzik, Yu Ma
Cc: viro, brauner, jack, linux-fsdevel, linux-kernel, tim.c.chen,
pan.deng, tianyou.li
On Sat, 2024-06-15 at 07:07 +0200, Mateusz Guzik wrote:
> On Sat, Jun 15, 2024 at 06:41:45AM +0200, Mateusz Guzik wrote:
> > On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
> > > alloc_fd() has a sanity check inside to make sure the FILE object mapping to the
> >
> > Total nitpick: FILE is the libc thing, I would refer to it as 'struct
> > file'. See below for the actual point.
> >
> > > Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> > > 32%, write improved by 15% on Intel ICX 160 cores configuration with v6.8-rc6.
> > >
> > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > > ---
> > > fs/file.c | 14 ++++++--------
> > > 1 file changed, 6 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/fs/file.c b/fs/file.c
> > > index a0e94a178c0b..59d62909e2e3 100644
> > > --- a/fs/file.c
> > > +++ b/fs/file.c
> > > @@ -548,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > > else
> > > __clear_close_on_exec(fd, fdt);
> > > error = fd;
> > > -#if 1
> > > - /* Sanity check */
> > > - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> > > - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> > > - rcu_assign_pointer(fdt->fd[fd], NULL);
> > > - }
> > > -#endif
> > >
> >
> > I was going to ask when was the last time anyone seen this fire and
> > suggest getting rid of it if enough time(tm) passed. Turns out it does
> > show up sometimes, latest result I found is 2017 vintage:
> > https://groups.google.com/g/syzkaller-bugs/c/jfQ7upCDf9s/m/RQjhDrZ7AQAJ
> >
> > So you are moving this to another locked area, but one which does not
> > execute in the benchmark?
> >
> > Patch 2/3 states 28% read and 14% write increase, this commit message
> > claims it goes up to 32% and 15% respectively -- pretty big. I presume
> > this has to do with bouncing a line containing the fd.
> >
> > I would argue moving this check elsewhere is about as good as removing
> > it altogether, but that's for the vfs overlords to decide.
> >
> > All that aside, looking at disasm of alloc_fd it is pretty clear there
> > is time to save, for example:
> >
> > if (unlikely(nr >= fdt->max_fds)) {
> > if (fd >= end) {
> > error = -EMFILE;
> > goto out;
> > }
> > error = expand_files(fd, fd);
> > if (error < 0)
> > goto out;
> > if (error)
> > goto repeat;
> > }
> >
>
> Now that I wrote it I noticed the fd < end check has to be performed
> regardless of max_fds -- someone could have changed rlimit to a lower
> value after using a higher fd. But the main point stands: the call to
> expand_files and associated error handling don't have to be there.
To really prevent someone from mucking with rlimit, we should probably
take the task_lock to prevent do_prlimit() racing with this function.
task_lock(current->group_leader);
Tim
>
> > This elides 2 branches and a func call in the common case. Completely
> > untested, maybe has some brainfarts, feel free to take without credit
> > and further massage the routine.
> >
> > Moreover my disasm shows that even looking for a bit results in
> > a func call(!) to _find_next_zero_bit -- someone(tm) should probably
> > massage it into another inline.
> >
> > After the above massaging is done and if it turns out the check has to
> > stay, you can plausibly damage-control it with prefetch -- issue it
> > immediately after finding the fd number, before any other work.
> >
> > All that said, by the above I'm confident there is still *some*
> > performance left on the table despite the lock.
> >
> > > out:
> > > spin_unlock(&files->file_lock);
> > > @@ -572,7 +565,7 @@ int get_unused_fd_flags(unsigned flags)
> > > }
> > > EXPORT_SYMBOL(get_unused_fd_flags);
> > >
> > > -static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > > +static inline void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > > {
> > > struct fdtable *fdt = files_fdtable(files);
> > > __clear_open_fd(fd, fdt);
> > > @@ -583,7 +576,12 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > > void put_unused_fd(unsigned int fd)
> > > {
> > > struct files_struct *files = current->files;
> > > + struct fdtable *fdt = files_fdtable(files);
> > > spin_lock(&files->file_lock);
> > > + if (unlikely(rcu_access_pointer(fdt->fd[fd]))) {
> > > + printk(KERN_WARNING "put_unused_fd: slot %d not NULL!\n", fd);
> > > + rcu_assign_pointer(fdt->fd[fd], NULL);
> > > + }
> > > __put_unused_fd(files, fd);
> > > spin_unlock(&files->file_lock);
> > > }
>
* Re: [PATCH 3/3] fs/file.c: move sanity_check from alloc_fd() to put_unused_fd()
2024-06-17 17:55 ` Tim Chen
@ 2024-06-17 17:59 ` Mateusz Guzik
2024-06-17 18:04 ` Tim Chen
1 sibling, 0 replies; 103+ messages in thread
From: Mateusz Guzik @ 2024-06-17 17:59 UTC (permalink / raw)
To: Tim Chen
Cc: Yu Ma, viro, brauner, jack, linux-fsdevel, linux-kernel,
tim.c.chen, pan.deng, tianyou.li
On Mon, Jun 17, 2024 at 7:55 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On Sat, 2024-06-15 at 07:07 +0200, Mateusz Guzik wrote:
> > On Sat, Jun 15, 2024 at 06:41:45AM +0200, Mateusz Guzik wrote:
> > > On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
> > > > alloc_fd() has a sanity check inside to make sure the FILE object mapping to the
> > >
> > > Total nitpick: FILE is the libc thing, I would refer to it as 'struct
> > > file'. See below for the actual point.
> > >
> > > > Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> > > > 32%, write improved by 15% on Intel ICX 160 cores configuration with v6.8-rc6.
> > > >
> > > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > > > ---
> > > > fs/file.c | 14 ++++++--------
> > > > 1 file changed, 6 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/fs/file.c b/fs/file.c
> > > > index a0e94a178c0b..59d62909e2e3 100644
> > > > --- a/fs/file.c
> > > > +++ b/fs/file.c
> > > > @@ -548,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > > > else
> > > > __clear_close_on_exec(fd, fdt);
> > > > error = fd;
> > > > -#if 1
> > > > - /* Sanity check */
> > > > - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> > > > - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> > > > - rcu_assign_pointer(fdt->fd[fd], NULL);
> > > > - }
> > > > -#endif
> > > >
> > >
> > > I was going to ask when was the last time anyone seen this fire and
> > > suggest getting rid of it if enough time(tm) passed. Turns out it does
> > > show up sometimes, latest result I found is 2017 vintage:
> > > https://groups.google.com/g/syzkaller-bugs/c/jfQ7upCDf9s/m/RQjhDrZ7AQAJ
> > >
> > > So you are moving this to another locked area, but one which does not
> > > execute in the benchmark?
> > >
> > > Patch 2/3 states 28% read and 14% write increase, this commit message
> > > claims it goes up to 32% and 15% respectively -- pretty big. I presume
> > > this has to do with bouncing a line containing the fd.
> > >
> > > I would argue moving this check elsewhere is about as good as removing
> > > it altogether, but that's for the vfs overlords to decide.
> > >
> > > All that aside, looking at disasm of alloc_fd it is pretty clear there
> > > is time to save, for example:
> > >
> > > if (unlikely(nr >= fdt->max_fds)) {
> > > if (fd >= end) {
> > > error = -EMFILE;
> > > goto out;
> > > }
> > > error = expand_files(fd, fd);
> > > if (error < 0)
> > > goto out;
> > > if (error)
> > > goto repeat;
> > > }
> > >
> >
> > Now that I wrote it I noticed the fd < end check has to be performed
> > regardless of max_fds -- someone could have changed rlimit to a lower
> > value after using a higher fd. But the main point stands: the call to
> > expand_files and associated error handling don't have to be there.
>
> To really prevent someone from mucking with rlimit, we should probably
> take the task_lock to prevent do_prlimit() racing with this function.
>
> task_lock(current->group_leader);
>
It's fine to race against rlimit adjustments.
The problem here is that both in my toy refactoring above and the
posted patch the thread can use a high fd, lower the rlimit on its own
and not have it respected on calls made later.
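That corner case, a free slot existing in the table above the currently
observed 'end', can be modeled in userspace. alloc_below() and the table
shape below are illustrative assumptions, not the kernel code:

```c
/*
 * Model of the fd < end corner case: the table (max_fds) can be larger
 * than the rlimit-derived 'end', e.g. after the process used a high fd
 * and then lowered RLIMIT_NOFILE, so a free slot found in the table
 * must still be range-checked against 'end'.
 */
#include <stdint.h>

#define TBL_WORDS 4                          /* 256-slot table */
static uint64_t table[TBL_WORDS];

/* Return the allocated fd, or -1 for EMFILE. */
static int alloc_below(unsigned start, unsigned end)
{
	for (unsigned fd = start; fd < 64u * TBL_WORDS; fd++) {
		if (table[fd / 64] & (1ULL << (fd % 64)))
			continue;
		if (fd >= end)               /* the check that must stay */
			return -1;
		table[fd / 64] |= 1ULL << (fd % 64);
		return (int)fd;
	}
	return -1;
}
```

Dropping the fd >= end test would let the allocation above succeed even
after the caller lowered its limit, which is the bug being discussed.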
>
> >
> > > This elides 2 branches and a func call in the common case. Completely
> > > untested, maybe has some brainfarts, feel free to take without credit
> > > and further massage the routine.
> > >
> > > Moreover my disasm shows that even looking for a bit results in
> > > a func call(!) to _find_next_zero_bit -- someone(tm) should probably
> > > massage it into another inline.
> > >
> > > After the above massaging is done and if it turns out the check has to
> > > stay, you can plausibly damage-control it with prefetch -- issue it
> > > immediately after finding the fd number, before any other work.
> > >
> > > All that said, by the above I'm confident there is still *some*
> > > performance left on the table despite the lock.
> > >
> > > > out:
> > > > spin_unlock(&files->file_lock);
> > > > @@ -572,7 +565,7 @@ int get_unused_fd_flags(unsigned flags)
> > > > }
> > > > EXPORT_SYMBOL(get_unused_fd_flags);
> > > >
> > > > -static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > > > +static inline void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > > > {
> > > > struct fdtable *fdt = files_fdtable(files);
> > > > __clear_open_fd(fd, fdt);
> > > > @@ -583,7 +576,12 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > > > void put_unused_fd(unsigned int fd)
> > > > {
> > > > struct files_struct *files = current->files;
> > > > + struct fdtable *fdt = files_fdtable(files);
> > > > spin_lock(&files->file_lock);
> > > > + if (unlikely(rcu_access_pointer(fdt->fd[fd]))) {
> > > > + printk(KERN_WARNING "put_unused_fd: slot %d not NULL!\n", fd);
> > > > + rcu_assign_pointer(fdt->fd[fd], NULL);
> > > > + }
> > > > __put_unused_fd(files, fd);
> > > > spin_unlock(&files->file_lock);
> > > > }
> >
>
--
Mateusz Guzik <mjguzik gmail.com>
* Re: [PATCH 3/3] fs/file.c: move sanity_check from alloc_fd() to put_unused_fd()
2024-06-17 17:55 ` Tim Chen
2024-06-17 17:59 ` Mateusz Guzik
@ 2024-06-17 18:04 ` Tim Chen
2024-06-18 8:35 ` Michal Hocko
1 sibling, 1 reply; 103+ messages in thread
From: Tim Chen @ 2024-06-17 18:04 UTC (permalink / raw)
To: Mateusz Guzik, Yu Ma
Cc: viro, brauner, jack, linux-fsdevel, linux-kernel, tim.c.chen,
pan.deng, tianyou.li
On Mon, 2024-06-17 at 10:55 -0700, Tim Chen wrote:
> On Sat, 2024-06-15 at 07:07 +0200, Mateusz Guzik wrote:
> > On Sat, Jun 15, 2024 at 06:41:45AM +0200, Mateusz Guzik wrote:
> > > On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
> > > > alloc_fd() has a sanity check inside to make sure the FILE object mapping to the
> > >
> > >
> >
> > Now that I wrote it I noticed the fd < end check has to be performed
> > regardless of max_fds -- someone could have changed rlimit to a lower
> > value after using a higher fd. But the main point stands: the call to
> > expand_files and associated error handling don't have to be there.
>
> To really prevent someone from mucking with rlimit, we should probably
> take the task_lock to prevent do_prlimit() racing with this function.
>
> task_lock(current->group_leader);
>
> Tim
And also move the task_lock in do_prlimit() before the RLIMIT_NOFILE check.
diff --git a/kernel/sys.c b/kernel/sys.c
index 3a2df1bd9f64..b4e523728c3e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1471,6 +1471,7 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
return -EINVAL;
resource = array_index_nospec(resource, RLIM_NLIMITS);
+ task_lock(tsk->group_leader);
if (new_rlim) {
if (new_rlim->rlim_cur > new_rlim->rlim_max)
return -EINVAL;
@@ -1481,7 +1482,6 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
/* Holding a refcount on tsk protects tsk->signal from disappearing. */
rlim = tsk->signal->rlim + resource;
- task_lock(tsk->group_leader);
if (new_rlim) {
/*
* Keep the capable check against init_user_ns until cgroups can
Tim
> >
> > > This elides 2 branches and a func call in the common case. Completely
> > > untested, maybe has some brainfarts, feel free to take without credit
> > > and further massage the routine.
> > >
> > > Moreover my disasm shows that even looking for a bit results in
> > > a func call(!) to _find_next_zero_bit -- someone(tm) should probably
> > > massage it into another inline.
> > >
> > > After the above massaging is done and if it turns out the check has to
> > > stay, you can plausibly damage-control it with prefetch -- issue it
> > > immediately after finding the fd number, before any other work.
> > >
> > > All that said, by the above I'm confident there is still *some*
> > > performance left on the table despite the lock.
> > >
> > > > out:
> > > > spin_unlock(&files->file_lock);
> > > > @@ -572,7 +565,7 @@ int get_unused_fd_flags(unsigned flags)
> > > > }
> > > > EXPORT_SYMBOL(get_unused_fd_flags);
> > > >
> > > > -static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > > > +static inline void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > > > {
> > > > struct fdtable *fdt = files_fdtable(files);
> > > > __clear_open_fd(fd, fdt);
> > > > @@ -583,7 +576,12 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > > > void put_unused_fd(unsigned int fd)
> > > > {
> > > > struct files_struct *files = current->files;
> > > > + struct fdtable *fdt = files_fdtable(files);
> > > > spin_lock(&files->file_lock);
> > > > + if (unlikely(rcu_access_pointer(fdt->fd[fd]))) {
> > > > + printk(KERN_WARNING "put_unused_fd: slot %d not NULL!\n", fd);
> > > > + rcu_assign_pointer(fdt->fd[fd], NULL);
> > > > + }
> > > > __put_unused_fd(files, fd);
> > > > spin_unlock(&files->file_lock);
> > > > }
> >
>
* Re: [PATCH 3/3] fs/file.c: move sanity_check from alloc_fd() to put_unused_fd()
2024-06-17 18:04 ` Tim Chen
@ 2024-06-18 8:35 ` Michal Hocko
2024-06-18 9:06 ` Mateusz Guzik
2024-06-18 20:40 ` Tim Chen
0 siblings, 2 replies; 103+ messages in thread
From: Michal Hocko @ 2024-06-18 8:35 UTC (permalink / raw)
To: Tim Chen
Cc: Mateusz Guzik, Yu Ma, viro, brauner, jack, linux-fsdevel,
linux-kernel, tim.c.chen, pan.deng, tianyou.li
On Mon 17-06-24 11:04:41, Tim Chen wrote:
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 3a2df1bd9f64..b4e523728c3e 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1471,6 +1471,7 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
> return -EINVAL;
> resource = array_index_nospec(resource, RLIM_NLIMITS);
>
> + task_lock(tsk->group_leader);
> if (new_rlim) {
> if (new_rlim->rlim_cur > new_rlim->rlim_max)
> return -EINVAL;
This is clearly broken as it leaves the lock behind on the error, no?
--
Michal Hocko
SUSE Labs
* Re: [PATCH 3/3] fs/file.c: move sanity_check from alloc_fd() to put_unused_fd()
2024-06-18 8:35 ` Michal Hocko
@ 2024-06-18 9:06 ` Mateusz Guzik
2024-06-18 20:40 ` Tim Chen
1 sibling, 0 replies; 103+ messages in thread
From: Mateusz Guzik @ 2024-06-18 9:06 UTC (permalink / raw)
To: Michal Hocko
Cc: Tim Chen, Yu Ma, viro, brauner, jack, linux-fsdevel, linux-kernel,
tim.c.chen, pan.deng, tianyou.li
On Tue, Jun 18, 2024 at 10:35 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 17-06-24 11:04:41, Tim Chen wrote:
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 3a2df1bd9f64..b4e523728c3e 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -1471,6 +1471,7 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > return -EINVAL;
> > resource = array_index_nospec(resource, RLIM_NLIMITS);
> >
> > + task_lock(tsk->group_leader);
> > if (new_rlim) {
> > if (new_rlim->rlim_cur > new_rlim->rlim_max)
> > return -EINVAL;
>
> This is clearly broken as it leaves the lock behind on the error, no?
As I explained in my other e-mail, there is no need to synchronize
against rlimit changes; the code merely needs to honor the observed
value, whatever it is.
This holds for the stock kernel, does not hold for v1 of the patchset
and presumably will be addressed in v2.
Even if some synchronization was required, moving the lock higher does
not buy anything because the newly protected area only validates the
new limit before there is any attempt at setting it and the old limit
is not looked at.
I think we can safely drop this area and patiently wait for v2.
--
Mateusz Guzik <mjguzik gmail.com>
* Re: [PATCH 3/3] fs/file.c: move sanity_check from alloc_fd() to put_unused_fd()
2024-06-18 8:35 ` Michal Hocko
2024-06-18 9:06 ` Mateusz Guzik
@ 2024-06-18 20:40 ` Tim Chen
1 sibling, 0 replies; 103+ messages in thread
From: Tim Chen @ 2024-06-18 20:40 UTC (permalink / raw)
To: Michal Hocko
Cc: Mateusz Guzik, Yu Ma, viro, brauner, jack, linux-fsdevel,
linux-kernel, tim.c.chen, pan.deng, tianyou.li
On Tue, 2024-06-18 at 10:35 +0200, Michal Hocko wrote:
> On Mon 17-06-24 11:04:41, Tim Chen wrote:
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 3a2df1bd9f64..b4e523728c3e 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -1471,6 +1471,7 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > return -EINVAL;
> > resource = array_index_nospec(resource, RLIM_NLIMITS);
> >
> > + task_lock(tsk->group_leader);
> > if (new_rlim) {
> > if (new_rlim->rlim_cur > new_rlim->rlim_max)
> > return -EINVAL;
>
> This is clearly broken as it leaves the lock behind on the error, no?
Thanks for pointing that out. It needs an unlock before the return. I
don't like adding task_lock in the alloc_fd() path anyway, as multiple
concurrent alloc_fd() calls would cause contention on it.
The race with an rlimit change should be a very rare thing. It should be
sufficient for patch 1 to check that fd does not go beyond the observed
rlimit.
Tim
* RE: [PATCH 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-14 16:34 ` [PATCH 1/3] fs/file.c: add fast path in alloc_fd() Yu Ma
2024-06-15 6:31 ` Mateusz Guzik
@ 2024-06-19 10:36 ` David Laight
2024-06-19 17:09 ` Ma, Yu
1 sibling, 1 reply; 103+ messages in thread
From: David Laight @ 2024-06-19 10:36 UTC (permalink / raw)
To: 'Yu Ma', viro@zeniv.linux.org.uk, brauner@kernel.org,
jack@suse.cz
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
tim.c.chen@linux.intel.com, tim.c.chen@intel.com,
pan.deng@intel.com, tianyou.li@intel.com
From: Yu Ma <yu.ma@intel.com>
> Sent: 14 June 2024 17:34
>
> There is available fd in the lower 64 bits of open_fds bitmap for most cases
> when we look for an available fd slot. Skip 2-levels searching via
> find_next_zero_bit() for this common fast path.
>
> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> free slot is available there, as:
> (1) The fd allocation algorithm would always allocate fd from small to large.
> Lower bits in open_fds bitmap would be used much more frequently than higher
> bits.
> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> searching.
>
> With the fast path added in alloc_fd() through one-time bitmap searching,
> pts/blogbench-1.1.0 read is improved by 20% and write by 10% on Intel ICX 160
> cores configuration with v6.8-rc6.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
> fs/file.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index 3b683b9101d8..e8d2f9ef7fd1 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -510,8 +510,13 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> if (fd < files->next_fd)
> fd = files->next_fd;
>
> - if (fd < fdt->max_fds)
> + if (fd < fdt->max_fds) {
> + if (~fdt->open_fds[0]) {
> + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
> + goto success;
> + }
> fd = find_next_fd(fdt, fd);
> + }
Hmm...
How well does that work when the initial fd is > 64?
Since there is exactly one call to find_next_fd() and it is static and should
be inlined, doesn't this optimisation belong inside find_next_fd()?
Plausibly find_next_fd() just needs rewriting.
Or, possibly, even inside an inlinable copy of find_next_zero_bit()
(although a lot of callers won't be 'hot' enough for the inlined bloat
being worth while).
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
* Re: RE: [PATCH 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-19 10:36 ` David Laight
@ 2024-06-19 17:09 ` Ma, Yu
0 siblings, 0 replies; 103+ messages in thread
From: Ma, Yu @ 2024-06-19 17:09 UTC (permalink / raw)
To: David Laight, viro@zeniv.linux.org.uk, brauner@kernel.org,
jack@suse.cz, Mateusz Guzik
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
tim.c.chen@linux.intel.com, tim.c.chen@intel.com,
pan.deng@intel.com, tianyou.li@intel.com, yu.ma
On 6/19/2024 6:36 PM, David Laight wrote:
> From: Yu Ma <yu.ma@intel.com>
>> Sent: 14 June 2024 17:34
>>
>> There is available fd in the lower 64 bits of open_fds bitmap for most cases
>> when we look for an available fd slot. Skip 2-levels searching via
>> find_next_zero_bit() for this common fast path.
>>
>> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
>> free slot is available there, as:
>> (1) The fd allocation algorithm would always allocate fd from small to large.
>> Lower bits in open_fds bitmap would be used much more frequently than higher
>> bits.
>> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
>> it would never be shrunk. The search size increases but there are few open fds
>> available here.
>> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
>> searching.
>>
>> With the fast path added in alloc_fd() through one-time bitmap searching,
>> pts/blogbench-1.1.0 read is improved by 20% and write by 10% on Intel ICX 160
>> cores configuration with v6.8-rc6.
>>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>> ---
>> fs/file.c | 9 +++++++--
>> 1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/file.c b/fs/file.c
>> index 3b683b9101d8..e8d2f9ef7fd1 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -510,8 +510,13 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>> if (fd < files->next_fd)
>> fd = files->next_fd;
>>
>> - if (fd < fdt->max_fds)
>> + if (fd < fdt->max_fds) {
>> + if (~fdt->open_fds[0]) {
>> + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
>> + goto success;
>> + }
>> fd = find_next_fd(fdt, fd);
>> + }
> Hmm...
> How well does that work when the initial fd is > 64?
>
> Since there is exactly one call to find_next_fd() and it is static and should
> be inlined doesn't this optimisation belong inside find_next_fd().
>
> Plausibly find_next_fd() just needs rewriting.
The rationale for this fast path is as stated in the commit message: for
scenarios where fd > 64, it means the fast path has already served the
first 64 bits, and it serves again whenever any fd < 64 is recycled and
then reallocated. For cases where a process keeps more than 64 fds open,
the extra cost is one conditional statement, which benefits from branch
prediction. As Guzik suggests, we'll copy Eric for a benchmark to check
the effect if it is available. Code-wise, it is more efficient to keep
this outside of find_next_fd() so the fast path can jump straight to the
return. Besides, as Guzik identified, find_next_fd() itself could be
improved with inlined calls inside for better performance, but that's a
story for another patch :)
>
> Or, possibly. even inside an inlinable copy of find_next_zero-bit()
> (although a lot of callers won't be 'hot' enough for the inlined bloat
> being worth while).
As mentioned, the current find_next_zero_bit() already has an internal
fast path for search sizes <= 64, and that is exactly what is utilized
here for the fast return.
>
> David
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>
^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCH v2 0/3] fs/file.c: optimize the critical section of file_lock in
2024-06-14 16:34 [PATCH 0/3] fs/file.c: optimize the critical section of Yu Ma
` (2 preceding siblings ...)
2024-06-14 16:34 ` [PATCH 3/3] fs/file.c: move sanity_check from alloc_fd() to put_unused_fd() Yu Ma
@ 2024-06-22 15:49 ` Yu Ma
2024-06-22 15:49 ` [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd() Yu Ma
` (2 more replies)
2024-07-03 14:33 ` [PATCH v3 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
` (2 subsequent siblings)
6 siblings, 3 replies; 103+ messages in thread
From: Yu Ma @ 2024-06-22 15:49 UTC (permalink / raw)
To: viro, brauner, jack, mjguzik, edumazet
Cc: yu.ma, linux-fsdevel, linux-kernel, pan.deng, tianyou.li,
tim.c.chen, tim.c.chen
pts/blogbench-1.1.0 is a benchmark designed to replicate the
load of a real-world busy file server by multiple threads of
random reads, writes, and rewrites. When running default configuration
with multiple parallel threads, hot spin lock contention is observed
from alloc_fd(), file_closed_fd() and put_unused_fd() around file_lock.
These 3 patches are created to reduce the critical section of file_lock
in alloc_fd() and close_fd(). As a result, pts/blogbench-1.1.0 has been
improved by 32% for read and 17% for write with over 30% kernel cycles
reduced on ICX 160 cores configuration with v6.10-rc4.
v1 -> v2:
1. Rebased the patch set to latest v6.10-rc4 and updated the performance
results.
2. Updated patch 1 to resolve the bug identified by Mateusz Guzik with rlimit
check and ensure allocated fd is greater than start.
3. Updated patch 3 to remove sanity_check directly per the alignment with
maintainer
Yu Ma (3):
fs/file.c: add fast path in alloc_fd()
fs/file.c: conditionally clear full_fds
fs/file.c: remove sanity_check from alloc_fd()
fs/file.c | 46 ++++++++++++++++++++++++----------------------
1 file changed, 24 insertions(+), 22 deletions(-)
base-commit: 6ba59ff4227927d3a8530fc2973b80e94b54d58f
--
2.43.0
^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-22 15:49 ` [PATCH v2 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
@ 2024-06-22 15:49 ` Yu Ma
2024-06-25 11:52 ` Jan Kara
2024-06-22 15:49 ` [PATCH v2 2/3] fs/file.c: conditionally clear full_fds Yu Ma
2024-06-22 15:49 ` [PATCH v2 3/3] fs/file.c: remove sanity_check from alloc_fd() Yu Ma
2 siblings, 1 reply; 103+ messages in thread
From: Yu Ma @ 2024-06-22 15:49 UTC (permalink / raw)
To: viro, brauner, jack, mjguzik, edumazet
Cc: yu.ma, linux-fsdevel, linux-kernel, pan.deng, tianyou.li,
tim.c.chen, tim.c.chen
There is an available fd in the lower 64 bits of the open_fds bitmap in most
cases when we look for a free fd slot. Skip the 2-level search via
find_next_zero_bit() for this common fast path.
Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
free slot is available there, as:
(1) The fd allocation algorithm always allocates fds from small to large, so
lower bits in the open_fds bitmap are used much more frequently than higher
bits.
(2) After the fdt is expanded (the bitmap size doubles with each expansion),
it is never shrunk. The search size increases, but few free fds show up in
the higher bits.
(3) find_next_zero_bit() itself has an internal fast path to speed up
searching when size <= 64.
Besides, "!start" is added to the fast path condition to ensure the allocated
fd is not less than start (with start == 0, any fd >= 0 qualifies), given that
alloc_fd() is only called in two scenarios:
(1) Allocating a new fd (the most common usage) via get_unused_fd_flags(),
which searches from bit 0 in the fdt (i.e. start == 0).
(2) Duplicating an fd (less common usage) via dup_fd(), which searches from
the old fd's index in the fdt and is only called by the fcntl syscall.
With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved
by 17% and write by 9% on Intel ICX 160 cores configuration with v6.10-rc4.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 35 +++++++++++++++++++++--------------
1 file changed, 21 insertions(+), 14 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index a3b72aa64f11..50e900a47107 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
if (fd < files->next_fd)
fd = files->next_fd;
- if (fd < fdt->max_fds)
+ error = -EMFILE;
+ if (likely(fd < fdt->max_fds)) {
+ if (~fdt->open_fds[0] && !start) {
+ fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
+ goto fastreturn;
+ }
fd = find_next_fd(fdt, fd);
+ }
+
+ if (unlikely(fd >= fdt->max_fds)) {
+ error = expand_files(files, fd);
+ if (error < 0)
+ goto out;
+ /*
+ * If we needed to expand the fs array we
+ * might have blocked - try again.
+ */
+ if (error)
+ goto repeat;
+ }
+fastreturn:
/*
* N.B. For clone tasks sharing a files structure, this test
* will limit the total number of files that can be opened.
*/
- error = -EMFILE;
- if (fd >= end)
+ if (unlikely(fd >= end))
goto out;
- error = expand_files(files, fd);
- if (error < 0)
- goto out;
-
- /*
- * If we needed to expand the fs array we
- * might have blocked - try again.
- */
- if (error)
- goto repeat;
-
if (start <= files->next_fd)
files->next_fd = fd + 1;
--
2.43.0
^ permalink raw reply related [flat|nested] 103+ messages in thread
* [PATCH v2 2/3] fs/file.c: conditionally clear full_fds
2024-06-22 15:49 ` [PATCH v2 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
2024-06-22 15:49 ` [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd() Yu Ma
@ 2024-06-22 15:49 ` Yu Ma
2024-06-25 11:54 ` Jan Kara
2024-06-22 15:49 ` [PATCH v2 3/3] fs/file.c: remove sanity_check from alloc_fd() Yu Ma
2 siblings, 1 reply; 103+ messages in thread
From: Yu Ma @ 2024-06-22 15:49 UTC (permalink / raw)
To: viro, brauner, jack, mjguzik, edumazet
Cc: yu.ma, linux-fsdevel, linux-kernel, pan.deng, tianyou.li,
tim.c.chen, tim.c.chen
Each group of 64 bits in open_fds maps to a single bit in full_fds_bits. It
is very likely that this bit has already been cleared by a previous
__clear_open_fd() call. Check the bit in full_fds_bits before clearing it to
avoid an unnecessary write and the resulting cache-line bouncing. See commit
fc90888d07b8 ("vfs: conditionally clear close-on-exec flag") for a similar
optimization.
Together with patch 1, this improves pts/blogbench-1.1.0 read by 27% and write
by 14% on Intel ICX 160 cores configuration with v6.10-rc4.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/file.c b/fs/file.c
index 50e900a47107..b4d25f6d4c19 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -268,7 +268,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
{
__clear_bit(fd, fdt->open_fds);
- __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
+ fd /= BITS_PER_LONG;
+ if (test_bit(fd, fdt->full_fds_bits))
+ __clear_bit(fd, fdt->full_fds_bits);
}
static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
--
2.43.0
^ permalink raw reply related [flat|nested] 103+ messages in thread
* [PATCH v2 3/3] fs/file.c: remove sanity_check from alloc_fd()
2024-06-22 15:49 ` [PATCH v2 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
2024-06-22 15:49 ` [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd() Yu Ma
2024-06-22 15:49 ` [PATCH v2 2/3] fs/file.c: conditionally clear full_fds Yu Ma
@ 2024-06-22 15:49 ` Yu Ma
2024-06-25 12:08 ` Jan Kara
2 siblings, 1 reply; 103+ messages in thread
From: Yu Ma @ 2024-06-22 15:49 UTC (permalink / raw)
To: viro, brauner, jack, mjguzik, edumazet
Cc: yu.ma, linux-fsdevel, linux-kernel, pan.deng, tianyou.li,
tim.c.chen, tim.c.chen
alloc_fd() has a sanity check inside to make sure the struct file mapped to
the allocated fd is NULL. Remove this sanity check since it is already assured
by the existing zero initialization and by setting the pointer to NULL when an
fd is recycled.
Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
32%, write improved by 17% on Intel ICX 160 cores configuration with v6.10-rc4.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index b4d25f6d4c19..1153b0b7ba3d 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -555,13 +555,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
else
__clear_close_on_exec(fd, fdt);
error = fd;
-#if 1
- /* Sanity check */
- if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
- printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
- rcu_assign_pointer(fdt->fd[fd], NULL);
- }
-#endif
out:
spin_unlock(&files->file_lock);
--
2.43.0
^ permalink raw reply related [flat|nested] 103+ messages in thread
* Re: [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-22 15:49 ` [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd() Yu Ma
@ 2024-06-25 11:52 ` Jan Kara
2024-06-25 12:53 ` Jan Kara
0 siblings, 1 reply; 103+ messages in thread
From: Jan Kara @ 2024-06-25 11:52 UTC (permalink / raw)
To: Yu Ma
Cc: viro, brauner, jack, mjguzik, edumazet, linux-fsdevel,
linux-kernel, pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Sat 22-06-24 11:49:02, Yu Ma wrote:
> There is available fd in the lower 64 bits of open_fds bitmap for most cases
> when we look for an available fd slot. Skip 2-levels searching via
> find_next_zero_bit() for this common fast path.
>
> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> free slot is available there, as:
> (1) The fd allocation algorithm would always allocate fd from small to large.
> Lower bits in open_fds bitmap would be used much more frequently than higher
> bits.
> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
> (3) find_next_zero_bit() itself has a fast path inside to speed up searching
> when size<=64.
>
> Besides, "!start" is added to fast path condition to ensure the allocated fd is
> greater than start (i.e. >=0), given alloc_fd() is only called in two scenarios:
> (1) Allocating a new fd (the most common usage scenario) via
> get_unused_fd_flags() to find fd start from bit 0 in fdt (i.e. start==0).
> (2) Duplicating a fd (less common usage) via dup_fd() to find a fd start from
> old_fd's index in fdt, which is only called by syscall fcntl.
>
> With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved
> by 17% and write by 9% on Intel ICX 160 cores configuration with v6.10-rc4.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
> fs/file.c | 35 +++++++++++++++++++++--------------
> 1 file changed, 21 insertions(+), 14 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index a3b72aa64f11..50e900a47107 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> if (fd < files->next_fd)
> fd = files->next_fd;
>
> - if (fd < fdt->max_fds)
> + error = -EMFILE;
> + if (likely(fd < fdt->max_fds)) {
> + if (~fdt->open_fds[0] && !start) {
> + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
So I don't think this is quite correct. If files->next_fd is set, we could
end up calling find_next_zero_bit() starting from quite high offset causing
a regression? Also because we don't expand in this case, we could cause access
beyond end of fdtable?
Finally, AFAIU this speeds up the lookup for cases where fd < 64 is
available at the cost of cases where the first long is full (there we
unnecessarily load open_fds[0] into cache). Did you check if the cost is
visible (e.g. by making blogbench occupy first 64 fds before starting its
load)?
Honza
> + goto fastreturn;
> + }
> fd = find_next_fd(fdt, fd);
> + }
> +
> + if (unlikely(fd >= fdt->max_fds)) {
> + error = expand_files(files, fd);
> + if (error < 0)
> + goto out;
> + /*
> + * If we needed to expand the fs array we
> + * might have blocked - try again.
> + */
> + if (error)
> + goto repeat;
> + }
>
> +fastreturn:
> /*
> * N.B. For clone tasks sharing a files structure, this test
> * will limit the total number of files that can be opened.
> */
> - error = -EMFILE;
> - if (fd >= end)
> + if (unlikely(fd >= end))
> goto out;
>
> - error = expand_files(files, fd);
> - if (error < 0)
> - goto out;
> -
> - /*
> - * If we needed to expand the fs array we
> - * might have blocked - try again.
> - */
> - if (error)
> - goto repeat;
> -
> if (start <= files->next_fd)
> files->next_fd = fd + 1;
>
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 2/3] fs/file.c: conditionally clear full_fds
2024-06-22 15:49 ` [PATCH v2 2/3] fs/file.c: conditionally clear full_fds Yu Ma
@ 2024-06-25 11:54 ` Jan Kara
2024-06-25 15:41 ` Ma, Yu
0 siblings, 1 reply; 103+ messages in thread
From: Jan Kara @ 2024-06-25 11:54 UTC (permalink / raw)
To: Yu Ma
Cc: viro, brauner, jack, mjguzik, edumazet, linux-fsdevel,
linux-kernel, pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Sat 22-06-24 11:49:03, Yu Ma wrote:
> 64 bits in open_fds are mapped to a common bit in full_fds_bits. It is very
> likely that a bit in full_fds_bits has been cleared before in
> __clear_open_fds()'s operation. Check the clear bit in full_fds_bits before
> clearing to avoid unnecessary write and cache bouncing. See commit fc90888d07b8
> ("vfs: conditionally clear close-on-exec flag") for a similar optimization.
> Together with patch 1, they improves pts/blogbench-1.1.0 read for 27%, and write
> for 14% on Intel ICX 160 cores configuration with v6.10-rc4.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
Nice. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/file.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index 50e900a47107..b4d25f6d4c19 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -268,7 +268,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
> static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
> {
> __clear_bit(fd, fdt->open_fds);
> - __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
> + fd /= BITS_PER_LONG;
> + if (test_bit(fd, fdt->full_fds_bits))
> + __clear_bit(fd, fdt->full_fds_bits);
> }
>
> static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 3/3] fs/file.c: remove sanity_check from alloc_fd()
2024-06-22 15:49 ` [PATCH v2 3/3] fs/file.c: remove sanity_check from alloc_fd() Yu Ma
@ 2024-06-25 12:08 ` Jan Kara
2024-06-25 13:09 ` Mateusz Guzik
0 siblings, 1 reply; 103+ messages in thread
From: Jan Kara @ 2024-06-25 12:08 UTC (permalink / raw)
To: Yu Ma
Cc: viro, brauner, jack, mjguzik, edumazet, linux-fsdevel,
linux-kernel, pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Sat 22-06-24 11:49:04, Yu Ma wrote:
> alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> allocated fd is NULL. Remove this sanity check since it can be assured by
> exisitng zero initilization and NULL set when recycling fd.
^^^ existing ^^^ initialization
Well, since this is a sanity check, it is expected it never hits. Yet
searching the web shows it has hit a few times in the past :). So would
wrapping this with unlikely() give a similar performance gain while keeping
debugability? If unlikely() does not help, I agree we can remove this since
fd_install() actually has the same check:
BUG_ON(fdt->fd[fd] != NULL);
and there we need the cacheline anyway so performance impact is minimal.
Now, this condition in alloc_fd() is nice that it does not take the kernel
down so perhaps we could change the BUG_ON to WARN() dumping similar kind
of info as alloc_fd()?
Honza
> Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> 32%, write improved by 17% on Intel ICX 160 cores configuration with v6.10-rc4.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
> fs/file.c | 7 -------
> 1 file changed, 7 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index b4d25f6d4c19..1153b0b7ba3d 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -555,13 +555,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> else
> __clear_close_on_exec(fd, fdt);
> error = fd;
> -#if 1
> - /* Sanity check */
> - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> - rcu_assign_pointer(fdt->fd[fd], NULL);
> - }
> -#endif
>
> out:
> spin_unlock(&files->file_lock);
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-25 11:52 ` Jan Kara
@ 2024-06-25 12:53 ` Jan Kara
2024-06-25 15:33 ` Ma, Yu
0 siblings, 1 reply; 103+ messages in thread
From: Jan Kara @ 2024-06-25 12:53 UTC (permalink / raw)
To: Yu Ma
Cc: viro, brauner, jack, mjguzik, edumazet, linux-fsdevel,
linux-kernel, pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Tue 25-06-24 13:52:57, Jan Kara wrote:
> On Sat 22-06-24 11:49:02, Yu Ma wrote:
> > There is available fd in the lower 64 bits of open_fds bitmap for most cases
> > when we look for an available fd slot. Skip 2-levels searching via
> > find_next_zero_bit() for this common fast path.
> >
> > Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> > free slot is available there, as:
> > (1) The fd allocation algorithm would always allocate fd from small to large.
> > Lower bits in open_fds bitmap would be used much more frequently than higher
> > bits.
> > (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> > it would never be shrunk. The search size increases but there are few open fds
> > available here.
> > (3) find_next_zero_bit() itself has a fast path inside to speed up searching
> > when size<=64.
> >
> > Besides, "!start" is added to fast path condition to ensure the allocated fd is
> > greater than start (i.e. >=0), given alloc_fd() is only called in two scenarios:
> > (1) Allocating a new fd (the most common usage scenario) via
> > get_unused_fd_flags() to find fd start from bit 0 in fdt (i.e. start==0).
> > (2) Duplicating a fd (less common usage) via dup_fd() to find a fd start from
> > old_fd's index in fdt, which is only called by syscall fcntl.
> >
> > With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved
> > by 17% and write by 9% on Intel ICX 160 cores configuration with v6.10-rc4.
> >
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> > fs/file.c | 35 +++++++++++++++++++++--------------
> > 1 file changed, 21 insertions(+), 14 deletions(-)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index a3b72aa64f11..50e900a47107 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > if (fd < files->next_fd)
> > fd = files->next_fd;
> >
> > - if (fd < fdt->max_fds)
> > + error = -EMFILE;
> > + if (likely(fd < fdt->max_fds)) {
> > + if (~fdt->open_fds[0] && !start) {
> > + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
>
> So I don't think this is quite correct. If files->next_fd is set, we could
> end up calling find_next_zero_bit() starting from quite high offset causing
> a regression? Also because we don't expand in this case, we could cause access
> beyond end of fdtable?
OK, I've misunderstood the next_fd logic. next_fd is the lowest 0-bit in
the open_fds bitmap so if next_fd is big, the ~fdt->open_fds[0] should
be false. As such the above condition could be rewritten as:
if (!start && files->next_fd < BITS_PER_LONG)
to avoid loading the first bitmap long if we know it is full? Or we could
maybe go as far as:
if (!start && fd < BITS_PER_LONG && !test_bit(fd, fdt->open_fds))
goto fastreturn;
because AFAIU this should work in exactly the same cases as your code?
Honza
> > + goto fastreturn;
> > + }
> > fd = find_next_fd(fdt, fd);
> > + }
> > +
> > + if (unlikely(fd >= fdt->max_fds)) {
> > + error = expand_files(files, fd);
> > + if (error < 0)
> > + goto out;
> > + /*
> > + * If we needed to expand the fs array we
> > + * might have blocked - try again.
> > + */
> > + if (error)
> > + goto repeat;
> > + }
> >
> > +fastreturn:
> > /*
> > * N.B. For clone tasks sharing a files structure, this test
> > * will limit the total number of files that can be opened.
> > */
> > - error = -EMFILE;
> > - if (fd >= end)
> > + if (unlikely(fd >= end))
> > goto out;
> >
> > - error = expand_files(files, fd);
> > - if (error < 0)
> > - goto out;
> > -
> > - /*
> > - * If we needed to expand the fs array we
> > - * might have blocked - try again.
> > - */
> > - if (error)
> > - goto repeat;
> > -
> > if (start <= files->next_fd)
> > files->next_fd = fd + 1;
> >
> > --
> > 2.43.0
> >
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 3/3] fs/file.c: remove sanity_check from alloc_fd()
2024-06-25 12:08 ` Jan Kara
@ 2024-06-25 13:09 ` Mateusz Guzik
2024-06-25 13:11 ` Mateusz Guzik
0 siblings, 1 reply; 103+ messages in thread
From: Mateusz Guzik @ 2024-06-25 13:09 UTC (permalink / raw)
To: Jan Kara
Cc: Yu Ma, viro, brauner, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Tue, Jun 25, 2024 at 2:08 PM Jan Kara <jack@suse.cz> wrote:
>
> On Sat 22-06-24 11:49:04, Yu Ma wrote:
> > alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> > allocated fd is NULL. Remove this sanity check since it can be assured by
> > exisitng zero initilization and NULL set when recycling fd.
> ^^^ existing ^^^ initialization
>
> Well, since this is a sanity check, it is expected it never hits. Yet
> searching the web shows it has hit a few times in the past :). So would
> wrapping this with unlikely() give a similar performance gain while keeping
> debugability? If unlikely() does not help, I agree we can remove this since
> fd_install() actually has the same check:
>
> BUG_ON(fdt->fd[fd] != NULL);
>
> and there we need the cacheline anyway so performance impact is minimal.
> Now, this condition in alloc_fd() is nice that it does not take the kernel
> down so perhaps we could change the BUG_ON to WARN() dumping similar kind
> of info as alloc_fd()?
>
Christian suggested just removing it.
To my understanding the problem is not the branch per se, but the
cacheline bounce of the fd array induced by reading the status.
Note the thing also nullifies the pointer, kind of defeating the
BUG_ON in fd_install.
I'm guessing it's not going to hurt to branch on it after releasing
the lock and forego nullifying, more or less:
diff --git a/fs/file.c b/fs/file.c
index a3b72aa64f11..d22b867db246 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -524,11 +524,11 @@ static int alloc_fd(unsigned start, unsigned
end, unsigned flags)
*/
error = -EMFILE;
if (fd >= end)
- goto out;
+ goto out_locked;
error = expand_files(files, fd);
if (error < 0)
- goto out;
+ goto out_locked;
/*
* If we needed to expand the fs array we
@@ -546,15 +546,15 @@ static int alloc_fd(unsigned start, unsigned
end, unsigned flags)
else
__clear_close_on_exec(fd, fdt);
error = fd;
-#if 1
- /* Sanity check */
- if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
+ spin_unlock(&files->file_lock);
+
+ if (unlikely(rcu_access_pointer(fdt->fd[fd]) != NULL)) {
printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
- rcu_assign_pointer(fdt->fd[fd], NULL);
}
-#endif
-out:
+ return error;
+
+out_locked:
spin_unlock(&files->file_lock);
return error;
}
> Honza
>
> > Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> > 32%, write improved by 17% on Intel ICX 160 cores configuration with v6.10-rc4.
> >
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> > fs/file.c | 7 -------
> > 1 file changed, 7 deletions(-)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index b4d25f6d4c19..1153b0b7ba3d 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -555,13 +555,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > else
> > __clear_close_on_exec(fd, fdt);
> > error = fd;
> > -#if 1
> > - /* Sanity check */
> > - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> > - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> > - rcu_assign_pointer(fdt->fd[fd], NULL);
> > - }
> > -#endif
> >
> > out:
> > spin_unlock(&files->file_lock);
> > --
> > 2.43.0
> >
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply related [flat|nested] 103+ messages in thread
* Re: [PATCH v2 3/3] fs/file.c: remove sanity_check from alloc_fd()
2024-06-25 13:09 ` Mateusz Guzik
@ 2024-06-25 13:11 ` Mateusz Guzik
2024-06-25 13:30 ` Jan Kara
0 siblings, 1 reply; 103+ messages in thread
From: Mateusz Guzik @ 2024-06-25 13:11 UTC (permalink / raw)
To: Jan Kara
Cc: Yu Ma, viro, brauner, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Tue, Jun 25, 2024 at 3:09 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Tue, Jun 25, 2024 at 2:08 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Sat 22-06-24 11:49:04, Yu Ma wrote:
> > > alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> > > allocated fd is NULL. Remove this sanity check since it can be assured by
> > > exisitng zero initilization and NULL set when recycling fd.
> > ^^^ existing ^^^ initialization
> >
> > Well, since this is a sanity check, it is expected it never hits. Yet
> > searching the web shows it has hit a few times in the past :). So would
> > wrapping this with unlikely() give a similar performance gain while keeping
> > debugability? If unlikely() does not help, I agree we can remove this since
> > fd_install() actually has the same check:
> >
> > BUG_ON(fdt->fd[fd] != NULL);
> >
> > and there we need the cacheline anyway so performance impact is minimal.
> > Now, this condition in alloc_fd() is nice that it does not take the kernel
> > down so perhaps we could change the BUG_ON to WARN() dumping similar kind
> > of info as alloc_fd()?
> >
>
> Christian suggested just removing it.
>
> To my understanding the problem is not the branch per se, but the the
> cacheline bounce of the fd array induced by reading the status.
>
> Note the thing also nullifies the pointer, kind of defeating the
> BUG_ON in fd_install.
>
> I'm guessing it's not going to hurt to branch on it after releasing
> the lock and forego nullifying, more or less:
> diff --git a/fs/file.c b/fs/file.c
> index a3b72aa64f11..d22b867db246 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -524,11 +524,11 @@ static int alloc_fd(unsigned start, unsigned
> end, unsigned flags)
> */
> error = -EMFILE;
> if (fd >= end)
> - goto out;
> + goto out_locked;
>
> error = expand_files(files, fd);
> if (error < 0)
> - goto out;
> + goto out_locked;
>
> /*
> * If we needed to expand the fs array we
> @@ -546,15 +546,15 @@ static int alloc_fd(unsigned start, unsigned
> end, unsigned flags)
> else
> __clear_close_on_exec(fd, fdt);
> error = fd;
> -#if 1
> - /* Sanity check */
> - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> + spin_unlock(&files->file_lock);
> +
> + if (unlikely(rcu_access_pointer(fdt->fd[fd]) != NULL)) {
> printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> - rcu_assign_pointer(fdt->fd[fd], NULL);
> }
> -#endif
Now that I sent it, it is of course not safe to deref without
protection from either RCU or the lock, so this would have to be
wrapped in rcu_read_lock(), which makes it even less appealing.
Whacking the thing as in the submitted patch seems like the best way
forward here. :)
>
> -out:
> + return error;
> +
> +out_locked:
> spin_unlock(&files->file_lock);
> return error;
> }
>
>
>
> > Honza
> >
> > > Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> > > 32%, write improved by 17% on Intel ICX 160 cores configuration with v6.10-rc4.
> > >
> > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > > ---
> > > fs/file.c | 7 -------
> > > 1 file changed, 7 deletions(-)
> > >
> > > diff --git a/fs/file.c b/fs/file.c
> > > index b4d25f6d4c19..1153b0b7ba3d 100644
> > > --- a/fs/file.c
> > > +++ b/fs/file.c
> > > @@ -555,13 +555,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > > else
> > > __clear_close_on_exec(fd, fdt);
> > > error = fd;
> > > -#if 1
> > > - /* Sanity check */
> > > - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> > > - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> > > - rcu_assign_pointer(fdt->fd[fd], NULL);
> > > - }
> > > -#endif
> > >
> > > out:
> > > spin_unlock(&files->file_lock);
> > > --
> > > 2.43.0
> > >
> > --
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
>
>
>
> --
> Mateusz Guzik <mjguzik gmail.com>
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 3/3] fs/file.c: remove sanity_check from alloc_fd()
2024-06-25 13:11 ` Mateusz Guzik
@ 2024-06-25 13:30 ` Jan Kara
2024-06-26 13:10 ` Christian Brauner
0 siblings, 1 reply; 103+ messages in thread
From: Jan Kara @ 2024-06-25 13:30 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Jan Kara, Yu Ma, viro, brauner, edumazet, linux-fsdevel,
linux-kernel, pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Tue 25-06-24 15:11:23, Mateusz Guzik wrote:
> On Tue, Jun 25, 2024 at 3:09 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > On Tue, Jun 25, 2024 at 2:08 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Sat 22-06-24 11:49:04, Yu Ma wrote:
> > > > alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> > > > allocated fd is NULL. Remove this sanity check since it can be assured by
> > > > exisitng zero initilization and NULL set when recycling fd.
> > > ^^^ existing ^^^ initialization
> > >
> > > Well, since this is a sanity check, it is expected it never hits. Yet
> > > searching the web shows it has hit a few times in the past :). So would
> > > wrapping this with unlikely() give a similar performance gain while keeping
> > > debuggability? If unlikely() does not help, I agree we can remove this since
> > > fd_install() actually has the same check:
> > >
> > > BUG_ON(fdt->fd[fd] != NULL);
> > >
> > > and there we need the cacheline anyway so performance impact is minimal.
> > > Now, this condition in alloc_fd() is nice in that it does not take the kernel
> > > down so perhaps we could change the BUG_ON to WARN() dumping similar kind
> > > of info as alloc_fd()?
> > >
> >
> > Christian suggested just removing it.
> >
> > To my understanding the problem is not the branch per se, but the
> > cacheline bounce of the fd array induced by reading the status.
> >
> > Note the thing also nullifies the pointer, kind of defeating the
> > BUG_ON in fd_install.
> >
> > I'm guessing it's not going to hurt to branch on it after releasing
> > the lock and forego nullifying, more or less:
> > diff --git a/fs/file.c b/fs/file.c
> > index a3b72aa64f11..d22b867db246 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -524,11 +524,11 @@ static int alloc_fd(unsigned start, unsigned
> > end, unsigned flags)
> > */
> > error = -EMFILE;
> > if (fd >= end)
> > - goto out;
> > + goto out_locked;
> >
> > error = expand_files(files, fd);
> > if (error < 0)
> > - goto out;
> > + goto out_locked;
> >
> > /*
> > * If we needed to expand the fs array we
> > @@ -546,15 +546,15 @@ static int alloc_fd(unsigned start, unsigned
> > end, unsigned flags)
> > else
> > __clear_close_on_exec(fd, fdt);
> > error = fd;
> > -#if 1
> > - /* Sanity check */
> > - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> > + spin_unlock(&files->file_lock);
> > +
> > + if (unlikely(rcu_access_pointer(fdt->fd[fd]) != NULL)) {
> > printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> > - rcu_assign_pointer(fdt->fd[fd], NULL);
> > }
> > -#endif
>
> Now that I sent it, it is of course not safe to deref without
> protection from either rcu or the lock, so this would have to be
> wrapped with rcu_read_lock, which makes it even less appealing.
>
> Whacking the thing as in the submitted patch seems like the best way
> forward here. :)
Yeah, as I wrote, I'm fine removing it, in particular if Christian is of
the same opinion. I was more musing about whether we should make the check
in fd_install() less aggressive since it is now more likely to trigger...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-25 12:53 ` Jan Kara
@ 2024-06-25 15:33 ` Ma, Yu
2024-06-26 11:54 ` Jan Kara
0 siblings, 1 reply; 103+ messages in thread
From: Ma, Yu @ 2024-06-25 15:33 UTC (permalink / raw)
To: Jan Kara
Cc: viro, brauner, mjguzik, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen, yu.ma
On 6/25/2024 8:53 PM, Jan Kara wrote:
> On Tue 25-06-24 13:52:57, Jan Kara wrote:
>> On Sat 22-06-24 11:49:02, Yu Ma wrote:
>>> In most cases there is an available fd in the lower 64 bits of the open_fds
>>> bitmap when we look for an available fd slot. Skip the 2-level search via
>>> find_next_zero_bit() for this common fast path.
>>>
>>> Look directly for a free bit in the lower 64 bits of the open_fds bitmap when
>>> a free slot is available there, because:
>>> (1) The fd allocation algorithm always allocates fds from small to large, so
>>> lower bits in the open_fds bitmap are used much more frequently than higher
>>> bits.
>>> (2) After the fdt is expanded (the bitmap size doubles on each expansion), it
>>> is never shrunk. The search size increases, but there are few open fds
>>> available there.
>>> (3) find_next_zero_bit() itself has an internal fast path to speed up searches
>>> when size <= 64.
>>>
>>> Besides, "!start" is added to the fast path condition to ensure the allocated
>>> fd is not less than start (i.e. start == 0), given alloc_fd() is only called
>>> in two scenarios:
>>> (1) Allocating a new fd (the most common usage scenario) via
>>> get_unused_fd_flags(), which searches from bit 0 in the fdt (i.e. start == 0).
>>> (2) Duplicating an fd (less common usage) via dup_fd(), which searches from
>>> old_fd's index in the fdt and is only called by the fcntl syscall.
>>>
>>> With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved
>>> by 17% and write by 9% on an Intel ICX 160-core configuration with v6.10-rc4.
>>>
>>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>>> ---
>>> fs/file.c | 35 +++++++++++++++++++++--------------
>>> 1 file changed, 21 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/fs/file.c b/fs/file.c
>>> index a3b72aa64f11..50e900a47107 100644
>>> --- a/fs/file.c
>>> +++ b/fs/file.c
>>> @@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>>> if (fd < files->next_fd)
>>> fd = files->next_fd;
>>>
>>> - if (fd < fdt->max_fds)
>>> + error = -EMFILE;
>>> + if (likely(fd < fdt->max_fds)) {
>>> + if (~fdt->open_fds[0] && !start) {
>>> + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
>> So I don't think this is quite correct. If files->next_fd is set, we could
>> end up calling find_next_zero_bit() starting from quite high offset causing
>> a regression? Also because we don't expand in this case, we could cause access
>> beyond end of fdtable?
> OK, I've misunderstood the next_fd logic. next_fd is the lowest 0-bit in
> the open_fds bitmap so if next_fd is big, the ~fdt->open_fds[0] should
> be false. As such the above condition could be rewritten as:
>
> if (!start && files->next_fd < BITS_PER_LONG)
>
> to avoid loading the first bitmap long if we know it is full? Or we could
> maybe go as far as:
>
> if (!start && fd < BITS_PER_LONG && !test_bit(fd, fdt->open_fds))
> goto fastreturn;
>
> because AFAIU this should work in exactly the same cases as your code?
>
> Honza
Thanks Honza for the good concern and suggestions here, but neither of the
above conditions is enough to ensure that there is an available fd in the
first 64 bits of open_fds. next_fd only means there is no available fd
before next_fd. Imagine fds 0 to 66 are already occupied and fd 3 is then
released: next_fd is set to 3 by the fd recycling logic (i.e. in
__put_unused_fd()). The next time alloc_fd() is called, it returns fd 3 to
the caller and sets next_fd to 4. The time after that, next_fd == 4, but
fd 4 is actually still occupied, so find_next_zero_bit() is needed to find
the real 0 bit anyway. The condition should either stay as it is in the
patch, or become if (!start && !test_bit(0, fdt->full_fds_bits)). The
latter also has the bitmap loading cost, but another point is that one bit
in full_fds_bits represents 64 bits of open_fds, so no matter whether
fd > 64 or not, full_fds_bits has to be loaded anyway; maybe we can modify
the condition to use full_fds_bits?
>>> + goto fastreturn;
>>> + }
>>> fd = find_next_fd(fdt, fd);
>>> + }
>>> +
>>> + if (unlikely(fd >= fdt->max_fds)) {
>>> + error = expand_files(files, fd);
>>> + if (error < 0)
>>> + goto out;
>>> + /*
>>> + * If we needed to expand the fs array we
>>> + * might have blocked - try again.
>>> + */
>>> + if (error)
>>> + goto repeat;
>>> + }
>>>
>>> +fastreturn:
>>> /*
>>> * N.B. For clone tasks sharing a files structure, this test
>>> * will limit the total number of files that can be opened.
>>> */
>>> - error = -EMFILE;
>>> - if (fd >= end)
>>> + if (unlikely(fd >= end))
>>> goto out;
>>>
>>> - error = expand_files(files, fd);
>>> - if (error < 0)
>>> - goto out;
>>> -
>>> - /*
>>> - * If we needed to expand the fs array we
>>> - * might have blocked - try again.
>>> - */
>>> - if (error)
>>> - goto repeat;
>>> -
>>> if (start <= files->next_fd)
>>> files->next_fd = fd + 1;
>>>
>>> --
>>> 2.43.0
>>>
>> --
>> Jan Kara <jack@suse.com>
>> SUSE Labs, CR
>>
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 2/3] fs/file.c: conditionally clear full_fds
2024-06-25 11:54 ` Jan Kara
@ 2024-06-25 15:41 ` Ma, Yu
0 siblings, 0 replies; 103+ messages in thread
From: Ma, Yu @ 2024-06-25 15:41 UTC (permalink / raw)
To: Jan Kara
Cc: viro, brauner, mjguzik, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen, yu.ma
On 6/25/2024 7:54 PM, Jan Kara wrote:
> On Sat 22-06-24 11:49:03, Yu Ma wrote:
>> Each group of 64 bits in open_fds is mapped to one bit in full_fds_bits. It
>> is very likely that the corresponding bit in full_fds_bits has already been
>> cleared by a previous __clear_open_fd() call. Check whether the bit is set
>> before clearing it, to avoid an unnecessary write and the resulting cache
>> bouncing. See commit fc90888d07b8 ("vfs: conditionally clear close-on-exec
>> flag") for a similar optimization. Together with patch 1, this improves
>> pts/blogbench-1.1.0 read by 27% and write by 14% on an Intel ICX 160-core
>> configuration with v6.10-rc4.
>>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
> Nice. Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
> Honza
Copy that, thanks Honza :)
>> ---
>> fs/file.c | 4 +++-
>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/file.c b/fs/file.c
>> index 50e900a47107..b4d25f6d4c19 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -268,7 +268,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
>> static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
>> {
>> __clear_bit(fd, fdt->open_fds);
>> - __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
>> + fd /= BITS_PER_LONG;
>> + if (test_bit(fd, fdt->full_fds_bits))
>> + __clear_bit(fd, fdt->full_fds_bits);
>> }
>>
>> static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
>> --
>> 2.43.0
>>
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-25 15:33 ` Ma, Yu
@ 2024-06-26 11:54 ` Jan Kara
2024-06-26 16:43 ` Tim Chen
2024-06-26 19:13 ` Mateusz Guzik
0 siblings, 2 replies; 103+ messages in thread
From: Jan Kara @ 2024-06-26 11:54 UTC (permalink / raw)
To: Ma, Yu
Cc: Jan Kara, viro, brauner, mjguzik, edumazet, linux-fsdevel,
linux-kernel, pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Tue 25-06-24 23:33:34, Ma, Yu wrote:
> On 6/25/2024 8:53 PM, Jan Kara wrote:
> > On Tue 25-06-24 13:52:57, Jan Kara wrote:
> > > On Sat 22-06-24 11:49:02, Yu Ma wrote:
> > > > There is available fd in the lower 64 bits of open_fds bitmap for most cases
> > > > when we look for an available fd slot. Skip 2-levels searching via
> > > > find_next_zero_bit() for this common fast path.
> > > >
> > > > Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> > > > free slot is available there, as:
> > > > (1) The fd allocation algorithm would always allocate fd from small to large.
> > > > Lower bits in open_fds bitmap would be used much more frequently than higher
> > > > bits.
> > > > (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> > > > it would never be shrunk. The search size increases but there are few open fds
> > > > available here.
> > > > (3) find_next_zero_bit() itself has a fast path inside to speed up searching
> > > > when size<=64.
> > > >
> > > > Besides, "!start" is added to fast path condition to ensure the allocated fd is
> > > > greater than start (i.e. >=0), given alloc_fd() is only called in two scenarios:
> > > > (1) Allocating a new fd (the most common usage scenario) via
> > > > get_unused_fd_flags() to find fd start from bit 0 in fdt (i.e. start==0).
> > > > (2) Duplicating a fd (less common usage) via dup_fd() to find a fd start from
> > > > old_fd's index in fdt, which is only called by syscall fcntl.
> > > >
> > > > With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved
> > > > by 17% and write by 9% on Intel ICX 160 cores configuration with v6.10-rc4.
> > > >
> > > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > > > ---
> > > > fs/file.c | 35 +++++++++++++++++++++--------------
> > > > 1 file changed, 21 insertions(+), 14 deletions(-)
> > > >
> > > > diff --git a/fs/file.c b/fs/file.c
> > > > index a3b72aa64f11..50e900a47107 100644
> > > > --- a/fs/file.c
> > > > +++ b/fs/file.c
> > > > @@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > > > if (fd < files->next_fd)
> > > > fd = files->next_fd;
> > > > - if (fd < fdt->max_fds)
> > > > + error = -EMFILE;
> > > > + if (likely(fd < fdt->max_fds)) {
> > > > + if (~fdt->open_fds[0] && !start) {
> > > > + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
> > > So I don't think this is quite correct. If files->next_fd is set, we could
> > > end up calling find_next_zero_bit() starting from quite high offset causing
> > > a regression? Also because we don't expand in this case, we could cause access
> > > beyond end of fdtable?
> > OK, I've misunderstood the next_fd logic. next_fd is the lowest 0-bit in
> > the open_fds bitmap so if next_fd is big, the ~fdt->open_fds[0] should
> > be false. As such the above condition could be rewritten as:
> >
> > if (!start && files->next_fd < BITS_PER_LONG)
> >
> > to avoid loading the first bitmap long if we know it is full? Or we could
> > maybe go as far as:
> >
> > if (!start && fd < BITS_PER_LONG && !test_bit(fd, fdt->open_fds))
> > goto fastreturn;
> >
> > because AFAIU this should work in exactly the same cases as your code?
> >
> > Honza
>
> Thanks Honza for the good concern and suggestions here, while both above
> conditions are not enough to ensure that there is available fd in the first
> 64 bits of open_fds. As next_fd just means there is no available fd before
> next_fd, just imagine that fd from 0 to 66 are already occupied, now fd=3 is
> returned back, then next_fd would be set as 3 per fd recycling logic (i.e.
> in __put_unused_fd()), next time when alloc_fd() being called, it would
> return fd=3 to the caller and set next_fd=4. Then next time when alloc_fd()
> being called again, next_fd==4, but actually it's already been occupied. So
> find_next_zero_bit() is needed to find the real 0 bit anyway.
Indeed, thanks for correcting me! next_fd is just a lower bound for the
first free fd.
> The conditions
> should either be like it is in patch or if (!start && !test_bit(0,
> fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> but another point is that a bit in full_fds_bits represents 64 bits in
> open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> maybe we can modify the condition to use full_fds_bits ?
So maybe I'm wrong but I think the biggest benefit of your code compared to
plain find_next_fd() is exactly in that we don't have to load full_fds_bits
into cache. So I'm afraid that using full_fds_bits in the condition would
destroy your performance gains. Thinking about this with a fresh head, how
about implementing your optimization like:
--- a/fs/file.c
+++ b/fs/file.c
@@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
unsigned int maxbit = maxfd / BITS_PER_LONG;
unsigned int bitbit = start / BITS_PER_LONG;
+ /*
+ * Optimistically search the first long of the open_fds bitmap. It
+ * saves us from loading full_fds_bits into cache in the common case
+ * and because BITS_PER_LONG > start >= files->next_fd, we have quite
+ * a good chance there's a bit free in there.
+ */
+ if (start < BITS_PER_LONG) {
+ unsigned int bit;
+
+ bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
+ if (bit < BITS_PER_LONG)
+ return bit;
+ }
+
bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
if (bitbit >= maxfd)
return maxfd;
Plus your optimizations with likely / unlikely. This way the code flow in
alloc_fd() stays more readable, we avoid loading the first open_fds long
into cache if it is full, and we should get all the performance benefits?
Honza
> > > > + goto fastreturn;
> > > > + }
> > > > fd = find_next_fd(fdt, fd);
> > > > + }
> > > > +
> > > > + if (unlikely(fd >= fdt->max_fds)) {
> > > > + error = expand_files(files, fd);
> > > > + if (error < 0)
> > > > + goto out;
> > > > + /*
> > > > + * If we needed to expand the fs array we
> > > > + * might have blocked - try again.
> > > > + */
> > > > + if (error)
> > > > + goto repeat;
> > > > + }
> > > > +fastreturn:
> > > > /*
> > > > * N.B. For clone tasks sharing a files structure, this test
> > > > * will limit the total number of files that can be opened.
> > > > */
> > > > - error = -EMFILE;
> > > > - if (fd >= end)
> > > > + if (unlikely(fd >= end))
> > > > goto out;
> > > > - error = expand_files(files, fd);
> > > > - if (error < 0)
> > > > - goto out;
> > > > -
> > > > - /*
> > > > - * If we needed to expand the fs array we
> > > > - * might have blocked - try again.
> > > > - */
> > > > - if (error)
> > > > - goto repeat;
> > > > -
> > > > if (start <= files->next_fd)
> > > > files->next_fd = fd + 1;
> > > > --
> > > > 2.43.0
> > > >
> > > --
> > > Jan Kara <jack@suse.com>
> > > SUSE Labs, CR
> > >
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 3/3] fs/file.c: remove sanity_check from alloc_fd()
2024-06-25 13:30 ` Jan Kara
@ 2024-06-26 13:10 ` Christian Brauner
0 siblings, 0 replies; 103+ messages in thread
From: Christian Brauner @ 2024-06-26 13:10 UTC (permalink / raw)
To: Jan Kara
Cc: Mateusz Guzik, Yu Ma, viro, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Tue, Jun 25, 2024 at 03:30:31PM GMT, Jan Kara wrote:
> On Tue 25-06-24 15:11:23, Mateusz Guzik wrote:
> > On Tue, Jun 25, 2024 at 3:09 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >
> > > On Tue, Jun 25, 2024 at 2:08 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Sat 22-06-24 11:49:04, Yu Ma wrote:
> > > > > alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> > > > > allocated fd is NULL. Remove this sanity check since it can be assured by
> > > > > exisitng zero initilization and NULL set when recycling fd.
> > > > ^^^ existing ^^^ initialization
> > > >
> > > > Well, since this is a sanity check, it is expected it never hits. Yet
> > > > searching the web shows it has hit a few times in the past :). So would
> > > > wrapping this with unlikely() give a similar performance gain while keeping
> > > > debugability? If unlikely() does not help, I agree we can remove this since
> > > > fd_install() actually has the same check:
> > > >
> > > > BUG_ON(fdt->fd[fd] != NULL);
> > > >
> > > > and there we need the cacheline anyway so performance impact is minimal.
> > > > Now, this condition in alloc_fd() is nice in that it does not take the kernel
> > > > down so perhaps we could change the BUG_ON to WARN() dumping similar kind
> > > > of info as alloc_fd()?
> > > >
> > >
> > > Christian suggested just removing it.
> > >
> > > To my understanding the problem is not the branch per se, but the
> > > cacheline bounce of the fd array induced by reading the status.
> > >
> > > Note the thing also nullifies the pointer, kind of defeating the
> > > BUG_ON in fd_install.
> > >
> > > I'm guessing it's not going to hurt to branch on it after releasing
> > > the lock and forego nullifying, more or less:
> > > diff --git a/fs/file.c b/fs/file.c
> > > index a3b72aa64f11..d22b867db246 100644
> > > --- a/fs/file.c
> > > +++ b/fs/file.c
> > > @@ -524,11 +524,11 @@ static int alloc_fd(unsigned start, unsigned
> > > end, unsigned flags)
> > > */
> > > error = -EMFILE;
> > > if (fd >= end)
> > > - goto out;
> > > + goto out_locked;
> > >
> > > error = expand_files(files, fd);
> > > if (error < 0)
> > > - goto out;
> > > + goto out_locked;
> > >
> > > /*
> > > * If we needed to expand the fs array we
> > > @@ -546,15 +546,15 @@ static int alloc_fd(unsigned start, unsigned
> > > end, unsigned flags)
> > > else
> > > __clear_close_on_exec(fd, fdt);
> > > error = fd;
> > > -#if 1
> > > - /* Sanity check */
> > > - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> > > + spin_unlock(&files->file_lock);
> > > +
> > > + if (unlikely(rcu_access_pointer(fdt->fd[fd]) != NULL)) {
> > > printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> > > - rcu_assign_pointer(fdt->fd[fd], NULL);
> > > }
> > > -#endif
> >
> > Now that I sent it, it is of course not safe to deref without
> > protection from either rcu or the lock, so this would have to be
> > wrapped with rcu_read_lock, which makes it even less appealing.
> >
> > Whacking the thing as in the submitted patch seems like the best way
> > forward here. :)
>
> Yeah, as I wrote, I'm fine removing it, in particular if Christian is of
> the same opinion. I was more musing about whether we should make the check
> in fd_install() less aggressive since it is now more likely to trigger...
We could change it to WARN_ON() and then people can get BUG_ON()
behavior when they turn WARN into BUG which apparently is a thing that
we support.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-26 11:54 ` Jan Kara
@ 2024-06-26 16:43 ` Tim Chen
2024-06-26 16:52 ` Tim Chen
2024-06-26 19:13 ` Mateusz Guzik
1 sibling, 1 reply; 103+ messages in thread
From: Tim Chen @ 2024-06-26 16:43 UTC (permalink / raw)
To: Jan Kara, Ma, Yu
Cc: viro, brauner, mjguzik, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen
On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote:
>
>
> Indeed, thanks for correcting me! next_fd is just a lower bound for the
> first free fd.
>
> > The conditions
> > should either be like it is in patch or if (!start && !test_bit(0,
> > fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> > but another point is that a bit in full_fds_bits represents 64 bits in
> > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> > maybe we can modify the condition to use full_fds_bits ?
>
> So maybe I'm wrong but I think the biggest benefit of your code compared to
> plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> into cache. So I'm afraid that using full_fds_bits in the condition would
> destroy your performance gains. Thinking about this with a fresh head, how
> about implementing your optimization like:
>
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> unsigned int maxbit = maxfd / BITS_PER_LONG;
> unsigned int bitbit = start / BITS_PER_LONG;
>
> + /*
> + * Optimistically search the first long of the open_fds bitmap. It
> + * saves us from loading full_fds_bits into cache in the common case
> + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> + * a good chance there's a bit free in there.
> + */
> + if (start < BITS_PER_LONG) {
> + unsigned int bit;
> +
> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
Say start is 31 (< BITS_PER_LONG).
The bit found here could be 32, which is greater than start. Do we care if we return a bit > start?
Tim
> + if (bit < BITS_PER_LONG)
> + return bit;
> + }
> +
> bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> if (bitbit >= maxfd)
> return maxfd;
>
> Plus your optimizations with likely / unlikely. This way the code flow in
> alloc_fd() stays more readable, we avoid loading the first open_fds long
> into cache if it is full, and we should get all the performance benefits?
>
> Honza
>
>
> > > > > + goto fastreturn;
> > > > > + }
> > > > > fd = find_next_fd(fdt, fd);
> > > > > + }
> > > > > +
> > > > > + if (unlikely(fd >= fdt->max_fds)) {
> > > > > + error = expand_files(files, fd);
> > > > > + if (error < 0)
> > > > > + goto out;
> > > > > + /*
> > > > > + * If we needed to expand the fs array we
> > > > > + * might have blocked - try again.
> > > > > + */
> > > > > + if (error)
> > > > > + goto repeat;
> > > > > + }
> > > > > +fastreturn:
> > > > > /*
> > > > > * N.B. For clone tasks sharing a files structure, this test
> > > > > * will limit the total number of files that can be opened.
> > > > > */
> > > > > - error = -EMFILE;
> > > > > - if (fd >= end)
> > > > > + if (unlikely(fd >= end))
> > > > > goto out;
> > > > > - error = expand_files(files, fd);
> > > > > - if (error < 0)
> > > > > - goto out;
> > > > > -
> > > > > - /*
> > > > > - * If we needed to expand the fs array we
> > > > > - * might have blocked - try again.
> > > > > - */
> > > > > - if (error)
> > > > > - goto repeat;
> > > > > -
> > > > > if (start <= files->next_fd)
> > > > > files->next_fd = fd + 1;
> > > > > --
> > > > > 2.43.0
> > > > >
> > > > --
> > > > Jan Kara <jack@suse.com>
> > > > SUSE Labs, CR
> > > >
> >
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-26 16:43 ` Tim Chen
@ 2024-06-26 16:52 ` Tim Chen
2024-06-27 12:09 ` Jan Kara
0 siblings, 1 reply; 103+ messages in thread
From: Tim Chen @ 2024-06-26 16:52 UTC (permalink / raw)
To: Jan Kara, Ma, Yu
Cc: viro, brauner, mjguzik, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen
On Wed, 2024-06-26 at 09:43 -0700, Tim Chen wrote:
> On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote:
> >
> >
> > Indeed, thanks for correcting me! next_fd is just a lower bound for the
> > first free fd.
> >
> > > The conditions
> > > should either be like it is in patch or if (!start && !test_bit(0,
> > > fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> > > but another point is that a bit in full_fds_bits represents 64 bits in
> > > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> > > maybe we can modify the condition to use full_fds_bits ?
> >
> > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > into cache. So I'm afraid that using full_fds_bits in the condition would
> > destroy your performance gains. Thinking about this with a fresh head, how
> > about implementing your optimization like:
> >
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > unsigned int bitbit = start / BITS_PER_LONG;
> >
> > + /*
> > + * Optimistically search the first long of the open_fds bitmap. It
> > + * saves us from loading full_fds_bits into cache in the common case
> > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > + * a good chance there's a bit free in there.
> > + */
> > + if (start < BITS_PER_LONG) {
> > + unsigned int bit;
> > +
> > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
>
> Say start is 31 (< BITS_PER_LONG)
> bit found here could be 32 and greater than start. Do we care if we return bit > start?
Sorry, I meant to say that we could find a bit like 30 that is less than
start, not the other way round.
Tim
>
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-26 11:54 ` Jan Kara
2024-06-26 16:43 ` Tim Chen
@ 2024-06-26 19:13 ` Mateusz Guzik
2024-06-27 14:03 ` Jan Kara
2024-06-27 15:33 ` Christian Brauner
1 sibling, 2 replies; 103+ messages in thread
From: Mateusz Guzik @ 2024-06-26 19:13 UTC (permalink / raw)
To: Jan Kara
Cc: Ma, Yu, viro, brauner, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@suse.cz> wrote:
> So maybe I'm wrong but I think the biggest benefit of your code compared to
> plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> into cache. So I'm afraid that using full_fds_bits in the condition would
> destroy your performance gains. Thinking about this with a fresh head, how
> about implementing your optimization like:
>
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> unsigned int maxbit = maxfd / BITS_PER_LONG;
> unsigned int bitbit = start / BITS_PER_LONG;
>
> + /*
> + * Optimistically search the first long of the open_fds bitmap. It
> + * saves us from loading full_fds_bits into cache in the common case
> + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> + * a good chance there's a bit free in there.
> + */
> + if (start < BITS_PER_LONG) {
> + unsigned int bit;
> +
> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> + if (bit < BITS_PER_LONG)
> + return bit;
> + }
> +
> bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> if (bitbit >= maxfd)
> return maxfd;
>
> Plus your optimizations with likely / unlikely. This way the code flow in
> alloc_fd() stays more readable, we avoid loading the first open_fds long
> into cache if it is full, and we should get all the performance benefits?
>
Huh.
So when I read the patch previously I assumed it was testing the bitmap
word containing next_fd (whatever word that is), avoiding a look at the
higher-level bitmap and inlining the op (instead of calling the fully
fledged function for bit scans).
I did not mentally register that it in fact only checks the beginning of
the entire range. So apologies from my end: based on my feedback some
work was done, and now I'm going to ask for it to be redone further.
blogbench spawns 100 or so workers, so say the total fd count hovers
just above 100, and say this lines up with half or more of the cases in
practice for that benchmark.
Even so, that's a benchmark-specific optimization. A busy web server
can have literally tens of thousands of fds open (and this is a pretty
mundane case), making the 0-63 range not particularly interesting.
That aside I think the patchset is in the wrong order -- first patch
tries to not look at the higher level bitmap, while second reduces
stores made there. This makes it quite unclear how much it is worth to
reduce looking there if atomics are conditional.
So here is what I propose in terms of the patches:
1. NULL check removal, sprinkling of likely/unlikely and expand_files
call avoidance; no measurements done vs stock kernel for some effort
saving, just denote in the commit message there is less work under the
lock and treat it as baseline
2. conditional higher level bitmap clear as submitted; benchmarked against 1
3. open_fds check within the range containing fd, avoiding higher
level bitmap if a free slot is found. this should not result in any
func calls if successful; benchmarked against the above
Optionally the bitmap routines can grow variants which always inline
and are used here. If so that would probably land between 1 and 2 on
the list.
You noted you know about blogbench bugs and have them fixed. Would be
good to post a link to a pull request or some other spot for a
reference.
It'll be best if the vfs folks comment on what they want here.
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-26 16:52 ` Tim Chen
@ 2024-06-27 12:09 ` Jan Kara
2024-06-27 12:20 ` Mateusz Guzik
2024-06-27 16:21 ` Tim Chen
0 siblings, 2 replies; 103+ messages in thread
From: Jan Kara @ 2024-06-27 12:09 UTC (permalink / raw)
To: Tim Chen
Cc: Jan Kara, Ma, Yu, viro, brauner, mjguzik, edumazet, linux-fsdevel,
linux-kernel, pan.deng, tianyou.li, tim.c.chen
On Wed 26-06-24 09:52:50, Tim Chen wrote:
> On Wed, 2024-06-26 at 09:43 -0700, Tim Chen wrote:
> > On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote:
> > >
> > >
> > > Indeed, thanks for correcting me! next_fd is just a lower bound for the
> > > first free fd.
> > >
> > > > The conditions
> > > > should either be like it is in patch or if (!start && !test_bit(0,
> > > > fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> > > > but another point is that a bit in full_fds_bits represents 64 bits in
> > > > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> > > > maybe we can modify the condition to use full_fds_bits ?
> > >
> > > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > > into cache. So I'm afraid that using full_fds_bits in the condition would
> > > destroy your performance gains. Thinking about this with a fresh head how
> > > about implementing your optimization like:
> > >
> > > --- a/fs/file.c
> > > +++ b/fs/file.c
> > > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > > unsigned int bitbit = start / BITS_PER_LONG;
> > >
> > > + /*
> > > + * Optimistically search the first long of the open_fds bitmap. It
> > > + * saves us from loading full_fds_bits into cache in the common case
> > > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > > + * a good chance there's a bit free in there.
> > > + */
> > > + if (start < BITS_PER_LONG) {
> > > + unsigned int bit;
> > > +
> > > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> >
> > Say start is 31 (< BITS_PER_LONG)
> > bit found here could be 32 and greater than start. Do we care if we return bit > start?
>
> Sorry, I mean to say that we could find a bit like 30 that is less than
> start instead of the other way round.
Well, I propose calling find_next_zero_bit() with offset set to 'start' so
it cannot possibly happen that the returned bit number is smaller than
start... But maybe I'm missing something?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
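[Editor's note: to make the offset semantics Jan describes above concrete, here is a naive userspace model of the find_next_zero_bit() contract. This is an illustrative sketch only; the real kernel helper lives in lib/find_bit.c and scans a word at a time.]

```c
/* Naive userspace model of the find_next_zero_bit() contract under
 * discussion: return the index of the first clear bit in 'addr' at or
 * above 'offset', or 'size' if none exists.  The real helper in
 * lib/find_bit.c scans a word at a time; the bit loop here only
 * illustrates that the result can never be below 'offset'. */
static unsigned int find_next_zero_bit_model(const unsigned long *addr,
					     unsigned int size,
					     unsigned int offset)
{
	const unsigned int bpl = 8 * sizeof(unsigned long);
	unsigned int bit;

	for (bit = offset; bit < size; bit++)
		if (!(addr[bit / bpl] & (1UL << (bit % bpl))))
			return bit;
	return size;
}
```

Since the scan starts at 'offset', a clear bit at a lower index (e.g. bit 30 with offset 31) is simply never visited, which is Jan's point.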
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-27 12:09 ` Jan Kara
@ 2024-06-27 12:20 ` Mateusz Guzik
2024-06-27 16:21 ` Tim Chen
1 sibling, 0 replies; 103+ messages in thread
From: Mateusz Guzik @ 2024-06-27 12:20 UTC (permalink / raw)
To: Jan Kara
Cc: Tim Chen, Ma, Yu, viro, brauner, edumazet, linux-fsdevel,
linux-kernel, pan.deng, tianyou.li, tim.c.chen
On Thu, Jun 27, 2024 at 2:09 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 26-06-24 09:52:50, Tim Chen wrote:
> > On Wed, 2024-06-26 at 09:43 -0700, Tim Chen wrote:
> > > On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote:
> > > >
> > > >
> > > > Indeed, thanks for correcting me! next_fd is just a lower bound for the
> > > > first free fd.
> > > >
> > > > > The conditions
> > > > > should either be like it is in patch or if (!start && !test_bit(0,
> > > > > fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> > > > > but another point is that a bit in full_fds_bits represents 64 bits in
> > > > > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> > > > > maybe we can modify the condition to use full_fds_bits ?
> > > >
> > > > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > > > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > > > into cache. So I'm afraid that using full_fds_bits in the condition would
> > > > destroy your performance gains. Thinking about this with a fresh head how
> > > > about implementing your optimization like:
> > > >
> > > > --- a/fs/file.c
> > > > +++ b/fs/file.c
> > > > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > > > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > > > unsigned int bitbit = start / BITS_PER_LONG;
> > > >
> > > > + /*
> > > > + * Optimistically search the first long of the open_fds bitmap. It
> > > > + * saves us from loading full_fds_bits into cache in the common case
> > > > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > > > + * a good chance there's a bit free in there.
> > > > + */
> > > > + if (start < BITS_PER_LONG) {
> > > > + unsigned int bit;
> > > > +
> > > > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > >
> > > Say start is 31 (< BITS_PER_LONG)
> > > bit found here could be 32 and greater than start. Do we care if we return bit > start?
> >
> > Sorry, I mean to say that we could find a bit like 30 that is less than
> > start instead of the other way round.
>
> Well, I propose calling find_next_zero_bit() with offset set to 'start' so
> it cannot possibly happen that the returned bit number is smaller than
> start... But maybe I'm missing something?
>
You gate it with " if (start < BITS_PER_LONG)" which only covers the
small initial range, while I'm arguing this should work for any fd.
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-26 19:13 ` Mateusz Guzik
@ 2024-06-27 14:03 ` Jan Kara
2024-06-27 15:33 ` Christian Brauner
1 sibling, 0 replies; 103+ messages in thread
From: Jan Kara @ 2024-06-27 14:03 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Jan Kara, Ma, Yu, viro, brauner, edumazet, linux-fsdevel,
linux-kernel, pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Wed 26-06-24 21:13:07, Mateusz Guzik wrote:
> On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@suse.cz> wrote:
> > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > into cache. So I'm afraid that using full_fds_bits in the condition would
> > destroy your performance gains. Thinking about this with a fresh head how
> > about implementing your optimization like:
> >
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > unsigned int bitbit = start / BITS_PER_LONG;
> >
> > + /*
> > + * Optimistically search the first long of the open_fds bitmap. It
> > + * saves us from loading full_fds_bits into cache in the common case
> > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > + * a good chance there's a bit free in there.
> > + */
> > + if (start < BITS_PER_LONG) {
> > + unsigned int bit;
> > +
> > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > + if (bit < BITS_PER_LONG)
> > + return bit;
> > + }
> > +
> > bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> > if (bitbit >= maxfd)
> > return maxfd;
> >
> > Plus your optimizations with likely / unlikely. This way the code flow in
> > alloc_fd() stays more readable, we avoid loading the first open_fds long
> > into cache if it is full, and we should get all the performance benefits?
> >
>
> Huh.
>
> So when I read the patch previously I assumed this is testing the bit
> word for the map containing next_fd (whatever it is), avoiding looking
> at the higher level bitmap and inlining the op (instead of calling the
> fully fledged func for bit scans).
>
> I did not mentally register this is in fact only checking for the
> beginning of the range of the entire thing. So apologies from my end
> as based on my feedback some work was done and I'm going to ask to
> further redo it.
Well, just confirming the fact that the way the code was written was
somewhat confusing ;)
> blogbench spawns 100 or so workers, say total fd count hovers just
> above 100. Say this lines up with about half or more of the cases in practice
> for that benchmark.
>
> Even so, that's a benchmark-specific optimization. A busy web server
> can have literally tens of thousands of fds open (and this is a pretty
> mundane case), making the 0-63 range not particularly interesting.
I agree this optimization only helps processes with a low number of open fds.
On the other hand that is usually the majority of the processes on the
system so the optimization makes sense to me. That being said your idea of
searching the word with next_fd makes sense as well...
> That aside I think the patchset is in the wrong order -- first patch
> tries to not look at the higher level bitmap, while second reduces
> stores made there. This makes it quite unclear how much it is worth to
> reduce looking there if atomics are conditional.
>
> So here is what I propose in terms of the patches:
> 1. NULL check removal, sprinkling of likely/unlikely and expand_files
> call avoidance; no measurements done vs stock kernel for some effort
> saving, just denote in the commit message there is less work under the
> lock and treat it as baseline
> 2. conditional higher level bitmap clear as submitted; benchmarked against 1
> 3. open_fds check within the range containing fd, avoiding higher
> level bitmap if a free slot is found. this should not result in any
> func calls if successful; benchmarked against the above
Yeah, I guess this ordering is the most obvious -> the least obvious so it
makes sense to me.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-26 19:13 ` Mateusz Guzik
2024-06-27 14:03 ` Jan Kara
@ 2024-06-27 15:33 ` Christian Brauner
2024-06-27 18:27 ` Ma, Yu
1 sibling, 1 reply; 103+ messages in thread
From: Christian Brauner @ 2024-06-27 15:33 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Jan Kara, Ma, Yu, viro, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Wed, Jun 26, 2024 at 09:13:07PM GMT, Mateusz Guzik wrote:
> On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@suse.cz> wrote:
> > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > into cache. So I'm afraid that using full_fds_bits in the condition would
> > destroy your performance gains. Thinking about this with a fresh head how
> > about implementing your optimization like:
> >
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > unsigned int bitbit = start / BITS_PER_LONG;
> >
> > + /*
> > + * Optimistically search the first long of the open_fds bitmap. It
> > + * saves us from loading full_fds_bits into cache in the common case
> > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > + * a good chance there's a bit free in there.
> > + */
> > + if (start < BITS_PER_LONG) {
> > + unsigned int bit;
> > +
> > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > + if (bit < BITS_PER_LONG)
> > + return bit;
> > + }
> > +
> > bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> > if (bitbit >= maxfd)
> > return maxfd;
> >
> > Plus your optimizations with likely / unlikely. This way the code flow in
> > alloc_fd() stays more readable, we avoid loading the first open_fds long
> > into cache if it is full, and we should get all the performance benefits?
> >
>
> Huh.
>
> So when I read the patch previously I assumed this is testing the bit
> word for the map containing next_fd (whatever it is), avoiding looking
> at the higher level bitmap and inlining the op (instead of calling the
> fully fledged func for bit scans).
>
> I did not mentally register this is in fact only checking for the
> beginning of the range of the entire thing. So apologies from my end
> as based on my feedback some work was done and I'm going to ask to
> further redo it.
>
> blogbench spawns 100 or so workers, say total fd count hovers just
> above 100. Say this lines up with about half or more of the cases in practice
> for that benchmark.
>
> Even so, that's a benchmark-specific optimization. A busy web server
> can have literally tens of thousands of fds open (and this is a pretty
> mundane case), making the 0-63 range not particularly interesting.
>
> That aside I think the patchset is in the wrong order -- first patch
> tries to not look at the higher level bitmap, while second reduces
> stores made there. This makes it quite unclear how much it is worth to
> reduce looking there if atomics are conditional.
>
> So here is what I propose in terms of the patches:
> 1. NULL check removal, sprinkling of likely/unlikely and expand_files
> call avoidance; no measurements done vs stock kernel for some effort
> saving, just denote in the commit message there is less work under the
> lock and treat it as baseline
> 2. conditional higher level bitmap clear as submitted; benchmarked against 1
> 3. open_fds check within the range containing fd, avoiding higher
> level bitmap if a free slot is found. this should not result in any
> func calls if successful; benchmarked against the above
>
> Optionally the bitmap routines can grow variants which always inline
> and are used here. If so that would probably land between 1 and 2 on
> the list.
>
> You noted you know about blogbench bugs and have them fixed. Would be
> good to post a link to a pull request or some other spot for a
> reference.
>
> It'll be best if the vfs folks comment on what they want here.
Optimizing only the < BITS_PER_LONG case seems less desirable than making it
work for an arbitrary next_fd. Imho, it'll also be easier to follow if
everything follows the same logic.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-27 12:09 ` Jan Kara
2024-06-27 12:20 ` Mateusz Guzik
@ 2024-06-27 16:21 ` Tim Chen
1 sibling, 0 replies; 103+ messages in thread
From: Tim Chen @ 2024-06-27 16:21 UTC (permalink / raw)
To: Jan Kara
Cc: Ma, Yu, viro, brauner, mjguzik, edumazet, linux-fsdevel,
linux-kernel, pan.deng, tianyou.li, tim.c.chen
On Thu, 2024-06-27 at 14:09 +0200, Jan Kara wrote:
> On Wed 26-06-24 09:52:50, Tim Chen wrote:
> > On Wed, 2024-06-26 at 09:43 -0700, Tim Chen wrote:
> > > On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote:
> > > >
> > > >
> > > > Indeed, thanks for correcting me! next_fd is just a lower bound for the
> > > > first free fd.
> > > >
> > > > > The conditions
> > > > > should either be like it is in patch or if (!start && !test_bit(0,
> > > > > fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> > > > > but another point is that a bit in full_fds_bits represents 64 bits in
> > > > > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> > > > > maybe we can modify the condition to use full_fds_bits ?
> > > >
> > > > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > > > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > > > into cache. So I'm afraid that using full_fds_bits in the condition would
> > > > destroy your performance gains. Thinking about this with a fresh head how
> > > > about implementing your optimization like:
> > > >
> > > > --- a/fs/file.c
> > > > +++ b/fs/file.c
> > > > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > > > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > > > unsigned int bitbit = start / BITS_PER_LONG;
> > > >
> > > > + /*
> > > > + * Optimistically search the first long of the open_fds bitmap. It
> > > > + * saves us from loading full_fds_bits into cache in the common case
> > > > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > > > + * a good chance there's a bit free in there.
> > > > + */
> > > > + if (start < BITS_PER_LONG) {
> > > > + unsigned int bit;
> > > > +
> > > > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > >
> > > Say start is 31 (< BITS_PER_LONG)
> > > bit found here could be 32 and greater than start. Do we care if we return bit > start?
> >
> > Sorry, I mean to say that we could find a bit like 30 that is less than
> > start instead of the other way round.
>
> Well, I propose calling find_next_zero_bit() with offset set to 'start' so
> it cannot possibly happen that the returned bit number is smaller than
> start... But maybe I'm missing something?
Yes, you're right. The search begins at the start bit, so it is okay.
Tim
>
> Honza
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-27 15:33 ` Christian Brauner
@ 2024-06-27 18:27 ` Ma, Yu
2024-06-27 19:59 ` Mateusz Guzik
0 siblings, 1 reply; 103+ messages in thread
From: Ma, Yu @ 2024-06-27 18:27 UTC (permalink / raw)
To: Christian Brauner, Mateusz Guzik
Cc: Jan Kara, viro, edumazet, linux-fsdevel, linux-kernel, pan.deng,
tianyou.li, tim.c.chen, tim.c.chen, yu.ma
On 6/27/2024 11:33 PM, Christian Brauner wrote:
> On Wed, Jun 26, 2024 at 09:13:07PM GMT, Mateusz Guzik wrote:
>> On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@suse.cz> wrote:
>>> So maybe I'm wrong but I think the biggest benefit of your code compared to
>>> plain find_next_fd() is exactly in that we don't have to load full_fds_bits
>>> into cache. So I'm afraid that using full_fds_bits in the condition would
>>> destroy your performance gains. Thinking about this with a fresh head how
>>> about implementing your optimization like:
>>>
>>> --- a/fs/file.c
>>> +++ b/fs/file.c
>>> @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>>> unsigned int bitbit = start / BITS_PER_LONG;
>>>
>>> + /*
>>> + * Optimistically search the first long of the open_fds bitmap. It
>>> + * saves us from loading full_fds_bits into cache in the common case
>>> + * and because BITS_PER_LONG > start >= files->next_fd, we have quite
>>> + * a good chance there's a bit free in there.
>>> + */
>>> + if (start < BITS_PER_LONG) {
>>> + unsigned int bit;
>>> +
>>> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
>>> + if (bit < BITS_PER_LONG)
>>> + return bit;
>>> + }
>>> +
>>> bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
>>> if (bitbit >= maxfd)
>>> return maxfd;
>>>
>>> Plus your optimizations with likely / unlikely. This way the code flow in
>>> alloc_fd() stays more readable, we avoid loading the first open_fds long
>>> into cache if it is full, and we should get all the performance benefits?
>>>
>> Huh.
>>
>> So when I read the patch previously I assumed this is testing the bit
>> word for the map containing next_fd (whatever it is), avoiding looking
>> at the higher level bitmap and inlining the op (instead of calling the
>> fully fledged func for bit scans).
>>
>> I did not mentally register this is in fact only checking for the
>> beginning of the range of the entire thing. So apologies from my end
>> as based on my feedback some work was done and I'm going to ask to
>> further redo it.
>>
>> blogbench spawns 100 or so workers, say total fd count hovers just
>> above 100. Say this lines up with about half or more of the cases in practice
>> for that benchmark.
>>
>> Even so, that's a benchmark-specific optimization. A busy web server
>> can have literally tens of thousands of fds open (and this is a pretty
>> mundane case), making the 0-63 range not particularly interesting.
>>
>> That aside I think the patchset is in the wrong order -- first patch
>> tries to not look at the higher level bitmap, while second reduces
>> stores made there. This makes it quite unclear how much it is worth to
>> reduce looking there if atomics are conditional.
>>
>> So here is what I propose in terms of the patches:
>> 1. NULL check removal, sprinkling of likely/unlikely and expand_files
>> call avoidance; no measurements done vs stock kernel for some effort
>> saving, just denote in the commit message there is less work under the
>> lock and treat it as baseline
>> 2. conditional higher level bitmap clear as submitted; benchmarked against 1
>> 3. open_fds check within the range containing fd, avoiding higher
>> level bitmap if a free slot is found. this should not result in any
>> func calls if successful; benchmarked against the above
>>
>> Optionally the bitmap routines can grow variants which always inline
>> and are used here. If so that would probably land between 1 and 2 on
>> the list.
>>
>> You noted you know about blogbench bugs and have them fixed. Would be
>> good to post a link to a pull request or some other spot for a
>> reference.
>>
>> It'll be best if the vfs folks comment on what they want here.
> Optimizing only the < BITS_PER_LONG case seems less desirable than making it
> work for an arbitrary next_fd. Imho, it'll also be easier to follow if
> everything follows the same logic.
Sorry that this message is a bit long. Thanks for your time reviewing it.
First, sincere thanks to all for the lively discussion and the expert
suggestions to make the patch set better. We have already reached
agreement on removing the sanity check and adding the conditional clear
(p.s. I'll revise the BUG_ON to WARN_ON in fd_install() as agreed). I
fully agree with Guzik's suggestion to reorder the patches. Since the
remaining discussion centers on the fast path, I suggest we submit
patches 1 & 2 (after reordering) for upstreaming first (with data
re-measured on the latest kernel version accordingly), and then focus
the discussion on the fast path.
For the fast path idea, here is a summary of why I still think it is
valuable, for further discussion:
1. The original intention of the fast path is to reduce function calls
and avoid unnecessary loads/stores on the members sharing the same
cacheline (such as file_lock, next_fd and the 3 bitmaps; BTW, we tried
adding __cacheline_aligned_in_smp for next_fd and the fd array, with no
improvement observed). Especially, yes, especially, all of these
operations are inside the critical section of file_lock.
2. For the fast path implementation, the essential and simple point is
to directly return an available bit if there is a free bit in [0-63].
I'd emphasize that it does not only help processes with a low number of
open fds (even though that is the majority case on a system, as Honza
agreed), but also the cases where lots of fds are opened/closed
frequently by short tasks (as per the algorithm, lower bits are
prioritized for allocation after being recycled). This covers not only
blogbench, a synthetic benchmark, but also the realistic scenario
described in f3f86e33dc3d ("vfs: Fix pathological performance case for
__alloc_fd()"), which literally introduced the 2-level bitmap searching
algorithm to vfs as we see it now. We may ask Eric for help to see
whether it's possible to get some data on it. Besides, for workloads
where lots of fds are allocated and held for a long time, the lock
contention would be much lower than in the scenarios we're talking
about here, so the impact of the change would be much smaller.
3. Now for the extra cost of the fast path based on the patch we
submitted. To be honest, before the fast path, we first found that
alloc_fd() is only called in two scenarios, as I mentioned in the
commit: (1) called by __get_unused_fd_flags() to find an available fd
starting from bit 0, which is the most common usage. It means
start == 0 for alloc_fd(), and with this premise the alloc_fd() logic
can be simpler: the two branches comparing against next_fd can be
removed; (2) duping an fd via dup_fd() to find an fd starting from
old_fd, which means "start" is not zero, but that is only called by
fcntl. So our first impression was: why sacrifice the performance of
the absolute majority of cases for this specific dup_fd? So we revised
__get_unused_fd_flags() to not call alloc_fd() directly, but to use the
same logic as alloc_fd() with the branches related to "start" omitted.
Based on that version, we then found the fast path could well be more
efficient than the stock 2-level bitmap search, for the reasons stated
in item 2 and in the commit message of this thread. Leaving aside the
benefit, the extra cost is one conditional branch, but with the 2
branches related to "start" removed it is still profitable, not to
mention that the cost can be alleviated by the branch predictor.
However, with that draft version the code of __get_unused_fd_flags()
duplicated a lot of alloc_fd(), so we changed to the current version
for conciseness. What I want to say is that there is room to make it
faster at a lower cost than stock. As for whether to use open_fds[0]
as the condition for the fast path, we think it's OK, as all the
bitmaps are almost always on the same cacheline, and we need to write
open_fds to allocate the bit anyway.
4. With the patch order suggested by Guzik, we've re-measured the data
on the latest kernel 6.10-rc5: removing the sanity check and adding
likely/unlikely gives a 6% gain for read and 2% for write. Combined
with the conditional clear, it gives a 14% gain for read and 8% for
write. With the fast path, there might be another ~15% gain for read
(we have not re-measured this one yet due to time, and will follow up
soon).
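[Editor's note: the "conditional clear" agreed on above (patch 2) can be sketched roughly as below. The names and layout are illustrative only; the actual patch operates on fdt->full_fds_bits with the kernel bitops under file_lock.]

```c
#define BPL ((unsigned int)(8 * sizeof(unsigned long)))

/* Illustrative model of the "conditional clear": when an fd is
 * released, the second-level bit covering its open_fds word only needs
 * a store if it is currently set.  Testing before clearing skips
 * dirtying the cacheline in the common case where the word was never
 * full. */
static void clear_full_fds_bit_cond(unsigned long *full_fds_bits,
				    unsigned int fd)
{
	unsigned int idx = fd / BPL;	/* which open_fds word fd lives in */
	unsigned long mask = 1UL << (idx % BPL);
	unsigned long *word = &full_fds_bits[idx / BPL];

	if (*word & mask)		/* write only when actually set */
		*word &= ~mask;
}
```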
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-27 18:27 ` Ma, Yu
@ 2024-06-27 19:59 ` Mateusz Guzik
2024-06-28 9:12 ` Jan Kara
2024-06-29 14:23 ` Ma, Yu
0 siblings, 2 replies; 103+ messages in thread
From: Mateusz Guzik @ 2024-06-27 19:59 UTC (permalink / raw)
To: Ma, Yu
Cc: Christian Brauner, Jan Kara, viro, edumazet, linux-fsdevel,
linux-kernel, pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote:
> 2. For fast path implementation, the essential and simple point is to
> directly return an available bit if there is free bit in [0-63]. I'd
> emphasize that it does not only improve low number of open fds (even it
> is the majority case on system as Honza agreed), but also improve the
> cases that lots of fds open/close frequently with short task (as per the
> algorithm, lower bits will be prioritized to allocate after being
> recycled). Not only blogbench, a synthetic benchmark, but also the
> realistic scenario as claimed in f3f86e33dc3d("vfs: Fix pathological
> performance case for __alloc_fd()"), which literally introduced this
> 2-levels bitmap searching algorithm to vfs as we see now.
I don't understand how using next_fd instead is supposed to be inferior.
Maybe I should clarify that by API contract the kernel must return the
lowest free fd it can find. To that end it maintains the next_fd field
as a hint to hopefully avoid some of the search work.
In the stock kernel the first thing done in alloc_fd is setting it as
a starting point:
fdt = files_fdtable(files);
fd = start;
if (fd < files->next_fd)
fd = files->next_fd;
that is all the calls which come here with 0 start their search from
next_fd position.
Suppose you implemented the patch as suggested by me and next_fd fits
the range of 0-63. Then you get the benefit of lower level bitmap
check just like in the patch you submitted, but without having to
first branch on whether you happen to be in that range.
Suppose next_fd is somewhere higher up, say 80. With your general
approach the optimization won't be done at all, or it will be
attempted on the 0-63 range when it is an invariant that it finds no
free fds.
With what I'm suggesting the general idea of taking a peek at the
lower level bitmap can be applied across the entire fd space. Some
manual mucking will be needed to make sure this never pulls more than
one cacheline, easiest way out I see would be to align next_fd to
BITS_PER_LONG for the bitmap search purposes.
Outside of the scope of this patchset, but definitely worth
considering, is an observation that this still pulls an entire
cacheline worth of a bitmap (assuming it grew). If one could assume
that the size is always a multiple of 64 bytes (which it is after the
first expansion), the 64-byte scan could be entirely inlined -- there
are quite a few fd bits in this range we may as well scan in hopes
of avoiding looking at the higher level bitmap, after all we already
paid for fetching it. This would take the optimization to its logical
conclusion.
Perhaps it would be ok to special-case the lower bitmap to start with
64 bytes so that there would be no need to branch on it.
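[Editor's note: a rough standalone model of the generalization Mateusz describes, probing only the open_fds word that contains next_fd so at most one bitmap word is touched. This is a sketch of the idea, not the fs/file.c implementation; in practice the word-local scan would be an inlined bitop.]

```c
#define BPL ((unsigned int)(8 * sizeof(unsigned long)))

/* Rough model of the generalized fast path: scan only the open_fds
 * word that contains 'start' (e.g. next_fd).  Returns a free fd
 * number, or 'maxfd' when the word has nothing free at or above
 * 'start' and the caller must fall back to the two-level search. */
static unsigned int peek_word_for_fd(const unsigned long *open_fds,
				     unsigned int maxfd, unsigned int start)
{
	unsigned int base = (start / BPL) * BPL;
	unsigned long word = open_fds[start / BPL];
	unsigned int bit;

	for (bit = start - base; bit < BPL && base + bit < maxfd; bit++)
		if (!(word & (1UL << bit)))
			return base + bit;	/* free slot, same word */
	return maxfd;
}
```

The point of the generalization is visible in the signature: unlike a [0-63]-only check, this works for any value of 'start', while still loading a single word of the bitmap.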
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-27 19:59 ` Mateusz Guzik
@ 2024-06-28 9:12 ` Jan Kara
2024-06-29 15:41 ` Ma, Yu
2024-06-29 14:23 ` Ma, Yu
1 sibling, 1 reply; 103+ messages in thread
From: Jan Kara @ 2024-06-28 9:12 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Ma, Yu, Christian Brauner, Jan Kara, viro, edumazet,
linux-fsdevel, linux-kernel, pan.deng, tianyou.li, tim.c.chen,
tim.c.chen
On Thu 27-06-24 21:59:12, Mateusz Guzik wrote:
> On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote:
> > 2. For fast path implementation, the essential and simple point is to
> > directly return an available bit if there is free bit in [0-63]. I'd
> > emphasize that it does not only improve low number of open fds (even it
> > is the majority case on system as Honza agreed), but also improve the
> > cases that lots of fds open/close frequently with short task (as per the
> > algorithm, lower bits will be prioritized to allocate after being
> > recycled). Not only blogbench, a synthetic benchmark, but also the
> > realistic scenario as claimed in f3f86e33dc3d("vfs: Fix pathological
> > performance case for __alloc_fd()"), which literally introduced this
> > 2-levels bitmap searching algorithm to vfs as we see now.
>
> I don't understand how using next_fd instead is supposed to be inferior.
>
> Maybe I should clarify that by API contract the kernel must return the
> lowest free fd it can find. To that end it maintains the next_fd field
> as a hint to hopefully avoid some of the search work.
>
> In the stock kernel the first thing done in alloc_fd is setting it as
> a starting point:
> fdt = files_fdtable(files);
> fd = start;
> if (fd < files->next_fd)
> fd = files->next_fd;
>
> that is all the calls which come here with 0 start their search from
> next_fd position.
Yup.
> Suppose you implemented the patch as suggested by me and next_fd fits
> the range of 0-63. Then you get the benefit of lower level bitmap
> check just like in the patch you submitted, but without having to
> first branch on whether you happen to be in that range.
>
> Suppose next_fd is somewhere higher up, say 80. With your general
> approach the optimization won't be done at all, or it will be
> attempted on the 0-63 range when it is an invariant that it finds no
> free fds.
>
> With what I'm suggesting the general idea of taking a peek at the
> lower level bitmap can be applied across the entire fd space. Some
> manual mucking will be needed to make sure this never pulls more than
> one cacheline, easiest way out I see would be to align next_fd to
> BITS_PER_LONG for the bitmap search purposes.
Well, all you need to do is to call:
bit = find_next_zero_bit(&fdt->open_fds[start / BITS_PER_LONG],
BITS_PER_LONG, start & (BITS_PER_LONG-1));
if (bit < BITS_PER_LONG)
return bit + (start & ~(BITS_PER_LONG - 1));
in find_next_fd(). Not sure if this is what you meant by aligning next_fd
to BITS_PER_LONG...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-27 19:59 ` Mateusz Guzik
2024-06-28 9:12 ` Jan Kara
@ 2024-06-29 14:23 ` Ma, Yu
1 sibling, 0 replies; 103+ messages in thread
From: Ma, Yu @ 2024-06-29 14:23 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Christian Brauner, Jan Kara, viro, edumazet, linux-fsdevel,
linux-kernel, pan.deng, tianyou.li, tim.c.chen, tim.c.chen, yu.ma
On 6/28/2024 3:59 AM, Mateusz Guzik wrote:
> On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote:
>> 2. For the fast path implementation, the essential and simple point is to
>> directly return an available bit if there is a free bit in [0-63]. I'd
>> emphasize that it does not only improve the case of a low number of open
>> fds (even though that is the majority case on systems, as Honza agreed),
>> but also the case where many fds are opened/closed frequently by
>> short-lived tasks (as per the algorithm, lower bits are prioritized for
>> allocation after being recycled). This covers not only blogbench, a
>> synthetic benchmark, but also the realistic scenario described in
>> f3f86e33dc3d ("vfs: Fix pathological performance case for __alloc_fd()"),
>> which literally introduced the 2-level bitmap search algorithm into vfs
>> as we see it now.
> I don't understand how using next_fd instead is supposed to be inferior.
>
> Maybe I should clarify that by API contract the kernel must return the
> lowest free fd it can find. To that end it maintains the next_fd field
> as a hint to hopefully avoid some of the search work.
>
> In the stock kernel the first thing done in alloc_fd is setting it as
> a starting point:
> fdt = files_fdtable(files);
> fd = start;
> if (fd < files->next_fd)
> fd = files->next_fd;
>
> that is all the calls which come here with 0 start their search from
> next_fd position.
>
> Suppose you implemented the patch as suggested by me and next_fd fits
> the range of 0-63. Then you get the benefit of lower level bitmap
> check just like in the patch you submitted, but without having to
> first branch on whether you happen to be in that range.
>
> Suppose next_fd is somewhere higher up, say 80. With your general
> approach the optimization won't be done whatsoever, or it will be
> attempted on the 0-63 range when it is an invariant that it finds no
> free fds.
>
> With what I'm suggesting the general idea of taking a peek at the
> lower level bitmap can be applied across the entire fd space. Some
> manual mucking will be needed to make sure this never pulls more than
> one cacheline, easiest way out I see would be to align next_fd to
> BITS_PER_LONG for the bitmap search purposes.
Some misunderstanding here, Guzik: from your previous feedback I thought
you felt the fast path was not so worthwhile, so that whole message was
just meant to say that we still think the original idea is reasonable.
Back to the point here, implementing it in find_next_fd() by searching
the word containing next_fd makes sense and is OK to me. It's efficient,
concise and should bring us the expected benefits. I'll re-measure the
data for reference based on the code proposed by you and Honza.
> Outside of the scope of this patchset, but definitely worth
> considering, is an observation that this still pulls an entire
> cacheline worth of a bitmap (assuming it grew). If one could assume
> that the size is always a multiple of 64 bytes (which it is after the
> first expansion), the 64-byte scan could be entirely inlined -- there
> are quite a few fd fields in this range we may as well scan in hopes
> of avoiding looking at the higher-level bitmap; after all, we already
> paid for fetching it. This would take the optimization to its logical
> conclusion.
>
> Perhaps it would be ok to special-case the lower bitmap to start with
> 64 bytes so that there would be no need to branch on it.
* Re: [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-28 9:12 ` Jan Kara
@ 2024-06-29 15:41 ` Ma, Yu
2024-06-29 15:46 ` Mateusz Guzik
0 siblings, 1 reply; 103+ messages in thread
From: Ma, Yu @ 2024-06-29 15:41 UTC (permalink / raw)
To: Jan Kara, Mateusz Guzik
Cc: Christian Brauner, viro, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen, yu.ma
On 6/28/2024 5:12 PM, Jan Kara wrote:
> On Thu 27-06-24 21:59:12, Mateusz Guzik wrote:
>> On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote:
>>> 2. For the fast path implementation, the essential and simple point is to
>>> directly return an available bit if there is a free bit in [0-63]. I'd
>>> emphasize that it does not only improve the case of a low number of open
>>> fds (even though that is the majority case on systems, as Honza agreed),
>>> but also the case where many fds are opened/closed frequently by
>>> short-lived tasks (as per the algorithm, lower bits are prioritized for
>>> allocation after being recycled). This covers not only blogbench, a
>>> synthetic benchmark, but also the realistic scenario described in
>>> f3f86e33dc3d ("vfs: Fix pathological performance case for __alloc_fd()"),
>>> which literally introduced the 2-level bitmap search algorithm into vfs
>>> as we see it now.
>> I don't understand how using next_fd instead is supposed to be inferior.
>>
>> Maybe I should clarify that by API contract the kernel must return the
>> lowest free fd it can find. To that end it maintains the next_fd field
>> as a hint to hopefully avoid some of the search work.
>>
>> In the stock kernel the first thing done in alloc_fd is setting it as
>> a starting point:
>> fdt = files_fdtable(files);
>> fd = start;
>> if (fd < files->next_fd)
>> fd = files->next_fd;
>>
>> that is all the calls which come here with 0 start their search from
>> next_fd position.
> Yup.
>
>> Suppose you implemented the patch as suggested by me and next_fd fits
>> the range of 0-63. Then you get the benefit of lower level bitmap
>> check just like in the patch you submitted, but without having to
>> first branch on whether you happen to be in that range.
>>
>> Suppose next_fd is somewhere higher up, say 80. With your general
>> approach the optimization won't be done whatsoever, or it will be
>> attempted on the 0-63 range when it is an invariant that it finds no
>> free fds.
>>
>> With what I'm suggesting the general idea of taking a peek at the
>> lower level bitmap can be applied across the entire fd space. Some
>> manual mucking will be needed to make sure this never pulls more than
>> one cacheline, easiest way out I see would be to align next_fd to
>> BITS_PER_LONG for the bitmap search purposes.
> Well, all you need to do is to call:
>
> bit = find_next_zero_bit(fdt->open_fds[start / BITS_PER_LONG],
> BITS_PER_LONG, start & (BITS_PER_LONG-1));
> if (bit < BITS_PER_LONG)
> return bit + (start & ~(BITS_PER_LONG - 1));
>
>
> in find_next_fd(). Not sure if this is what you meant by aligning next_fd
> to BITS_PER_LONG...
>
> Honza
So the idea here is to take a peek at the word that contains next_fd and
return directly if a free bit is found there. I am not sure about the
efficiency if fds are opened/closed frequently and next_fd is
distributed very randomly. I gave this part of the code a quick try,
but the kernel failed to boot up😳, which was unexpected ...
* Re: [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd()
2024-06-29 15:41 ` Ma, Yu
@ 2024-06-29 15:46 ` Mateusz Guzik
0 siblings, 0 replies; 103+ messages in thread
From: Mateusz Guzik @ 2024-06-29 15:46 UTC (permalink / raw)
To: Ma, Yu
Cc: Jan Kara, Christian Brauner, viro, edumazet, linux-fsdevel,
linux-kernel, pan.deng, tianyou.li, tim.c.chen, tim.c.chen
How about you post a working version of the patchset, even if it only
looks at 0-63 and we are going to massage it afterwards.
On Sat, Jun 29, 2024 at 5:41 PM Ma, Yu <yu.ma@intel.com> wrote:
>
>
> On 6/28/2024 5:12 PM, Jan Kara wrote:
> > On Thu 27-06-24 21:59:12, Mateusz Guzik wrote:
> >> On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote:
> >>> 2. For the fast path implementation, the essential and simple point is to
> >>> directly return an available bit if there is a free bit in [0-63]. I'd
> >>> emphasize that it does not only improve the case of a low number of open
> >>> fds (even though that is the majority case on systems, as Honza agreed),
> >>> but also the case where many fds are opened/closed frequently by
> >>> short-lived tasks (as per the algorithm, lower bits are prioritized for
> >>> allocation after being recycled). This covers not only blogbench, a
> >>> synthetic benchmark, but also the realistic scenario described in
> >>> f3f86e33dc3d ("vfs: Fix pathological performance case for __alloc_fd()"),
> >>> which literally introduced the 2-level bitmap search algorithm into vfs
> >>> as we see it now.
> >> I don't understand how using next_fd instead is supposed to be inferior.
> >>
> >> Maybe I should clarify that by API contract the kernel must return the
> >> lowest free fd it can find. To that end it maintains the next_fd field
> >> as a hint to hopefully avoid some of the search work.
> >>
> >> In the stock kernel the first thing done in alloc_fd is setting it as
> >> a starting point:
> >> fdt = files_fdtable(files);
> >> fd = start;
> >> if (fd < files->next_fd)
> >> fd = files->next_fd;
> >>
> >> that is all the calls which come here with 0 start their search from
> >> next_fd position.
> > Yup.
> >
> >> Suppose you implemented the patch as suggested by me and next_fd fits
> >> the range of 0-63. Then you get the benefit of lower level bitmap
> >> check just like in the patch you submitted, but without having to
> >> first branch on whether you happen to be in that range.
> >>
> >> Suppose next_fd is somewhere higher up, say 80. With your general
> >> approach the optimization won't be done whatsoever, or it will be
> >> attempted on the 0-63 range when it is an invariant that it finds no
> >> free fds.
> >>
> >> With what I'm suggesting the general idea of taking a peek at the
> >> lower level bitmap can be applied across the entire fd space. Some
> >> manual mucking will be needed to make sure this never pulls more than
> >> one cacheline, easiest way out I see would be to align next_fd to
> >> BITS_PER_LONG for the bitmap search purposes.
> > Well, all you need to do is to call:
> >
> > bit = find_next_zero_bit(fdt->open_fds[start / BITS_PER_LONG],
> > BITS_PER_LONG, start & (BITS_PER_LONG-1));
> > if (bit < BITS_PER_LONG)
> > return bit + (start & ~(BITS_PER_LONG - 1));
> >
> >
> > in find_next_fd(). Not sure if this is what you meant by aligning next_fd
> > to BITS_PER_LONG...
> >
> > Honza
>
> So the idea here is to take a peek at the word that contains next_fd and
> return directly if a free bit is found there. I am not sure about the
> efficiency if fds are opened/closed frequently and next_fd is
> distributed very randomly. I gave this part of the code a quick try,
> but the kernel failed to boot up😳, which was unexpected ...
>
--
Mateusz Guzik <mjguzik gmail.com>
* Re: [PATCH v3 3/3] fs/file.c: add fast path in find_next_fd()
2024-07-03 14:33 ` [PATCH v3 3/3] fs/file.c: add fast path in find_next_fd() Yu Ma
@ 2024-07-03 14:17 ` Mateusz Guzik
2024-07-03 14:28 ` Ma, Yu
2024-07-04 10:07 ` Jan Kara
2024-07-04 10:03 ` Jan Kara
2024-07-04 17:44 ` Mateusz Guzik
2 siblings, 2 replies; 103+ messages in thread
From: Mateusz Guzik @ 2024-07-03 14:17 UTC (permalink / raw)
To: Yu Ma
Cc: viro, brauner, jack, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Wed, Jul 3, 2024 at 4:07 PM Yu Ma <yu.ma@intel.com> wrote:
>
> In most cases there is a free fd available in the lower 64 bits of the
> open_fds bitmap when we look for a free slot. Skip the 2-level search
> via find_next_zero_bit() for this common fast path.
>
> Look directly for a free bit in the lower 64 bits of the open_fds bitmap
> when a free slot is available there, as:
> (1) The fd allocation algorithm always allocates fds from small to
> large. Lower bits in the open_fds bitmap are used much more frequently
> than higher bits.
> (2) After fdt is expanded (the bitmap size doubles on each expansion),
> it is never shrunk. The search size increases, but fds are rarely
> allocated in the higher range.
> (3) find_next_zero_bit() has an internal fast path when size <= 64 that
> speeds up the search.
>
> As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara
> <jack@suse.cz>, move the fast path from alloc_fd() to find_next_fd().
> With this, on top of patches 1 and 2, pts/blogbench-1.1.0 read is
> improved by 13% and write by 7% on an Intel ICX 160-core configuration
> with v6.10-rc6.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
> fs/file.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/fs/file.c b/fs/file.c
> index a15317db3119..f25eca311f51 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -488,6 +488,11 @@ struct files_struct init_files = {
>
> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> {
> + unsigned int bit;
> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> + if (bit < BITS_PER_LONG)
> + return bit;
> +
The rest of the patchset looks good on a cursory read.
As for this one, the suggestion was to make it work across the entire range.
Today I won't have time to write and test what we proposed, but will
probably find some time tomorrow. Perhaps Jan will do the needful(tm)
in the meantime.
That said, please stay tuned for a patch. :)
> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> unsigned int maxbit = maxfd / BITS_PER_LONG;
> unsigned int bitbit = start / BITS_PER_LONG;
> --
> 2.43.0
>
--
Mateusz Guzik <mjguzik gmail.com>
* Re: [PATCH v3 3/3] fs/file.c: add fast path in find_next_fd()
2024-07-03 14:17 ` Mateusz Guzik
@ 2024-07-03 14:28 ` Ma, Yu
2024-07-04 10:07 ` Jan Kara
1 sibling, 0 replies; 103+ messages in thread
From: Ma, Yu @ 2024-07-03 14:28 UTC (permalink / raw)
To: Mateusz Guzik
Cc: viro, brauner, jack, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen, yu.ma
On 7/3/2024 10:17 PM, Mateusz Guzik wrote:
> On Wed, Jul 3, 2024 at 4:07 PM Yu Ma <yu.ma@intel.com> wrote:
>> In most cases there is a free fd available in the lower 64 bits of the
>> open_fds bitmap when we look for a free slot. Skip the 2-level search
>> via find_next_zero_bit() for this common fast path.
>>
>> Look directly for a free bit in the lower 64 bits of the open_fds bitmap
>> when a free slot is available there, as:
>> (1) The fd allocation algorithm always allocates fds from small to
>> large. Lower bits in the open_fds bitmap are used much more frequently
>> than higher bits.
>> (2) After fdt is expanded (the bitmap size doubles on each expansion),
>> it is never shrunk. The search size increases, but fds are rarely
>> allocated in the higher range.
>> (3) find_next_zero_bit() has an internal fast path when size <= 64 that
>> speeds up the search.
>>
>> As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara
>> <jack@suse.cz>, move the fast path from alloc_fd() to find_next_fd().
>> With this, on top of patches 1 and 2, pts/blogbench-1.1.0 read is
>> improved by 13% and write by 7% on an Intel ICX 160-core configuration
>> with v6.10-rc6.
>>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>> ---
>> fs/file.c | 5 +++++
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/fs/file.c b/fs/file.c
>> index a15317db3119..f25eca311f51 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -488,6 +488,11 @@ struct files_struct init_files = {
>>
>> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>> {
>> + unsigned int bit;
>> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
>> + if (bit < BITS_PER_LONG)
>> + return bit;
>> +
> The rest of the patchset looks good on a cursory read.
>
> As for this one, the suggestion was to make it work across the entire range.
>
> Today I won't have time to write and test what we proposed, but will
> probably find some time tomorrow. Perhaps Jan will do the needful(tm)
> in the meantime.
>
> That said, please stay tuned for a patch. :)
Sure, understood, Guzik. Thanks for the quick feedback and for looking at
ways to make it better and more versatile. I'll also take another look to
double-check the previous proposal covering the entire fd range.
>> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>> unsigned int bitbit = start / BITS_PER_LONG;
>> --
>> 2.43.0
>>
>
* [PATCH v3 0/3] fs/file.c: optimize the critical section of file_lock in
2024-06-14 16:34 [PATCH 0/3] fs/file.c: optimize the critical section of Yu Ma
` (3 preceding siblings ...)
2024-06-22 15:49 ` [PATCH v2 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
@ 2024-07-03 14:33 ` Yu Ma
2024-07-03 14:33 ` [PATCH v3 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Yu Ma
` (2 more replies)
2024-07-13 2:39 ` [PATCH v4 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
2024-07-17 14:50 ` [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
6 siblings, 3 replies; 103+ messages in thread
From: Yu Ma @ 2024-07-03 14:33 UTC (permalink / raw)
To: viro, brauner, jack, mjguzik, edumazet
Cc: yu.ma, linux-fsdevel, linux-kernel, pan.deng, tianyou.li,
tim.c.chen, tim.c.chen
pts/blogbench-1.1.0 is a benchmark designed to replicate the load of a
real-world busy file server with multiple threads doing random reads,
writes, and rewrites. When running the default configuration with
multiple parallel threads, hot spinlock contention on file_lock is
observed in alloc_fd(), file_close_fd() and put_unused_fd(). These 3
patches reduce the critical section of file_lock in alloc_fd() and
close_fd(). As a result, on top of patch 1, pts/blogbench-1.1.0 is
improved by 28% for read and 12% for write on an Intel ICX 160-core
configuration with v6.10-rc6.
v2 -> v3:
1. Rebased the patch set to latest v6.10-rc6 and updated the performance results
2. Reordered the patches as suggested by Mateusz Guzik <mjguzik gmail.com>
3. Updated the fast path from alloc_fd() to find_next_fd() as suggested by
Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>, it is efficient
and more concise than v2.
v1 -> v2:
1. Rebased the patch set to latest v6.10-rc4 and updated the performance results
2. Fixed the bug identified by Mateusz Guzik in patch 1 with adding rlimit check
for fast path
3. Updated patch 3 to remove sanity_check directly per the alignment with
maintainer
Yu Ma (3):
fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
fs/file.c: conditionally clear full_fds
fs/file.c: add fast path in find_next_fd()
fs/file.c | 47 ++++++++++++++++++++++++-----------------------
1 file changed, 24 insertions(+), 23 deletions(-)
--
2.43.0
* [PATCH v3 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
2024-07-03 14:33 ` [PATCH v3 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
@ 2024-07-03 14:33 ` Yu Ma
2024-07-03 14:34 ` Christian Brauner
2024-07-03 14:33 ` [PATCH v3 2/3] fs/file.c: conditionally clear full_fds Yu Ma
2024-07-03 14:33 ` [PATCH v3 3/3] fs/file.c: add fast path in find_next_fd() Yu Ma
2 siblings, 1 reply; 103+ messages in thread
From: Yu Ma @ 2024-07-03 14:33 UTC (permalink / raw)
To: viro, brauner, jack, mjguzik, edumazet
Cc: yu.ma, linux-fsdevel, linux-kernel, pan.deng, tianyou.li,
tim.c.chen, tim.c.chen
alloc_fd() has a sanity check inside to make sure the struct file mapped
to the allocated fd is NULL. Remove this sanity check since it is already
assured by the existing zero initialization and the NULL set when
recycling fds. Meanwhile, add likely/unlikely annotations and avoid
unnecessary expand_files() calls to reduce the work done under file_lock.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 38 ++++++++++++++++----------------------
1 file changed, 16 insertions(+), 22 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index a3b72aa64f11..5178b246e54b 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -515,28 +515,29 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
if (fd < files->next_fd)
fd = files->next_fd;
- if (fd < fdt->max_fds)
+ if (likely(fd < fdt->max_fds))
fd = find_next_fd(fdt, fd);
+ error = -EMFILE;
+ if (unlikely(fd >= fdt->max_fds)) {
+ error = expand_files(files, fd);
+ if (error < 0)
+ goto out;
+ /*
+ * If we needed to expand the fs array we
+ * might have blocked - try again.
+ */
+ if (error)
+ goto repeat;
+ }
+
/*
* N.B. For clone tasks sharing a files structure, this test
* will limit the total number of files that can be opened.
*/
- error = -EMFILE;
- if (fd >= end)
- goto out;
-
- error = expand_files(files, fd);
- if (error < 0)
+ if (unlikely(fd >= end))
goto out;
- /*
- * If we needed to expand the fs array we
- * might have blocked - try again.
- */
- if (error)
- goto repeat;
-
if (start <= files->next_fd)
files->next_fd = fd + 1;
@@ -546,13 +547,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
else
__clear_close_on_exec(fd, fdt);
error = fd;
-#if 1
- /* Sanity check */
- if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
- printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
- rcu_assign_pointer(fdt->fd[fd], NULL);
- }
-#endif
out:
spin_unlock(&files->file_lock);
@@ -618,7 +612,7 @@ void fd_install(unsigned int fd, struct file *file)
rcu_read_unlock_sched();
spin_lock(&files->file_lock);
fdt = files_fdtable(files);
- BUG_ON(fdt->fd[fd] != NULL);
+ WARN_ON(fdt->fd[fd] != NULL);
rcu_assign_pointer(fdt->fd[fd], file);
spin_unlock(&files->file_lock);
return;
--
2.43.0
* [PATCH v3 2/3] fs/file.c: conditionally clear full_fds
2024-07-03 14:33 ` [PATCH v3 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
2024-07-03 14:33 ` [PATCH v3 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Yu Ma
@ 2024-07-03 14:33 ` Yu Ma
2024-07-03 14:33 ` [PATCH v3 3/3] fs/file.c: add fast path in find_next_fd() Yu Ma
2 siblings, 0 replies; 103+ messages in thread
From: Yu Ma @ 2024-07-03 14:33 UTC (permalink / raw)
To: viro, brauner, jack, mjguzik, edumazet
Cc: yu.ma, linux-fsdevel, linux-kernel, pan.deng, tianyou.li,
tim.c.chen, tim.c.chen
Each group of 64 bits in open_fds maps to a common bit in full_fds_bits.
It is very likely that a bit in full_fds_bits has already been cleared
by an earlier __clear_open_fd() call. Check the bit in full_fds_bits
before clearing it to avoid an unnecessary write and the cache bouncing
it causes. See commit fc90888d07b8 ("vfs: conditionally clear
close-on-exec flag") for a similar optimization.
Taking the stock kernel with patch 1 as baseline, this improves
pts/blogbench-1.1.0 read by 13% and write by 5% on an Intel ICX 160-core
configuration with v6.10-rc6.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/file.c b/fs/file.c
index 5178b246e54b..a15317db3119 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -268,7 +268,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
{
__clear_bit(fd, fdt->open_fds);
- __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
+ fd /= BITS_PER_LONG;
+ if (test_bit(fd, fdt->full_fds_bits))
+ __clear_bit(fd, fdt->full_fds_bits);
}
static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
--
2.43.0
* [PATCH v3 3/3] fs/file.c: add fast path in find_next_fd()
2024-07-03 14:33 ` [PATCH v3 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
2024-07-03 14:33 ` [PATCH v3 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Yu Ma
2024-07-03 14:33 ` [PATCH v3 2/3] fs/file.c: conditionally clear full_fds Yu Ma
@ 2024-07-03 14:33 ` Yu Ma
2024-07-03 14:17 ` Mateusz Guzik
` (2 more replies)
2 siblings, 3 replies; 103+ messages in thread
From: Yu Ma @ 2024-07-03 14:33 UTC (permalink / raw)
To: viro, brauner, jack, mjguzik, edumazet
Cc: yu.ma, linux-fsdevel, linux-kernel, pan.deng, tianyou.li,
tim.c.chen, tim.c.chen
In most cases there is a free fd available in the lower 64 bits of the
open_fds bitmap when we look for a free slot. Skip the 2-level search
via find_next_zero_bit() for this common fast path.
Look directly for a free bit in the lower 64 bits of the open_fds bitmap
when a free slot is available there, as:
(1) The fd allocation algorithm always allocates fds from small to
large. Lower bits in the open_fds bitmap are used much more frequently
than higher bits.
(2) After fdt is expanded (the bitmap size doubles on each expansion),
it is never shrunk. The search size increases, but fds are rarely
allocated in the higher range.
(3) find_next_zero_bit() has an internal fast path when size <= 64 that
speeds up the search.
As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara
<jack@suse.cz>, move the fast path from alloc_fd() to find_next_fd().
With this, on top of patches 1 and 2, pts/blogbench-1.1.0 read is
improved by 13% and write by 7% on an Intel ICX 160-core configuration
with v6.10-rc6.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/fs/file.c b/fs/file.c
index a15317db3119..f25eca311f51 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -488,6 +488,11 @@ struct files_struct init_files = {
static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
{
+ unsigned int bit;
+ bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
+ if (bit < BITS_PER_LONG)
+ return bit;
+
unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
unsigned int maxbit = maxfd / BITS_PER_LONG;
unsigned int bitbit = start / BITS_PER_LONG;
--
2.43.0
* Re: [PATCH v3 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
2024-07-03 14:33 ` [PATCH v3 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Yu Ma
@ 2024-07-03 14:34 ` Christian Brauner
2024-07-03 14:46 ` Ma, Yu
2024-07-04 10:11 ` Jan Kara
0 siblings, 2 replies; 103+ messages in thread
From: Christian Brauner @ 2024-07-03 14:34 UTC (permalink / raw)
To: Yu Ma
Cc: viro, jack, mjguzik, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Wed, Jul 03, 2024 at 10:33:09AM GMT, Yu Ma wrote:
> alloc_fd() has a sanity check inside to make sure the struct file mapped
> to the allocated fd is NULL. Remove this sanity check since it is
> already assured by the existing zero initialization and the NULL set
> when recycling fds. Meanwhile, add likely/unlikely annotations and avoid
> unnecessary expand_files() calls to reduce the work done under file_lock.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
> fs/file.c | 38 ++++++++++++++++----------------------
> 1 file changed, 16 insertions(+), 22 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index a3b72aa64f11..5178b246e54b 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -515,28 +515,29 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> if (fd < files->next_fd)
> fd = files->next_fd;
>
> - if (fd < fdt->max_fds)
> + if (likely(fd < fdt->max_fds))
> fd = find_next_fd(fdt, fd);
>
> + error = -EMFILE;
> + if (unlikely(fd >= fdt->max_fds)) {
> + error = expand_files(files, fd);
> + if (error < 0)
> + goto out;
> + /*
> + * If we needed to expand the fs array we
> + * might have blocked - try again.
> + */
> + if (error)
> + goto repeat;
> + }
So this ends up removing the expand_files() above the fd >= end check
which means that you can end up expanding the files_struct even though
the request fd is past the provided end. That seems odd. What's the
reason for that reordering?
> +
> /*
> * N.B. For clone tasks sharing a files structure, this test
> * will limit the total number of files that can be opened.
> */
> - error = -EMFILE;
> - if (fd >= end)
> - goto out;
> -
> - error = expand_files(files, fd);
> - if (error < 0)
> + if (unlikely(fd >= end))
> goto out;
>
> - /*
> - * If we needed to expand the fs array we
> - * might have blocked - try again.
> - */
> - if (error)
> - goto repeat;
> -
> if (start <= files->next_fd)
> files->next_fd = fd + 1;
>
> @@ -546,13 +547,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> else
> __clear_close_on_exec(fd, fdt);
> error = fd;
> -#if 1
> - /* Sanity check */
> - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> - rcu_assign_pointer(fdt->fd[fd], NULL);
> - }
> -#endif
>
> out:
> spin_unlock(&files->file_lock);
> @@ -618,7 +612,7 @@ void fd_install(unsigned int fd, struct file *file)
> rcu_read_unlock_sched();
> spin_lock(&files->file_lock);
> fdt = files_fdtable(files);
> - BUG_ON(fdt->fd[fd] != NULL);
> + WARN_ON(fdt->fd[fd] != NULL);
> rcu_assign_pointer(fdt->fd[fd], file);
> spin_unlock(&files->file_lock);
> return;
> --
> 2.43.0
>
* Re: [PATCH v3 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
2024-07-03 14:34 ` Christian Brauner
@ 2024-07-03 14:46 ` Ma, Yu
2024-07-04 10:11 ` Jan Kara
1 sibling, 0 replies; 103+ messages in thread
From: Ma, Yu @ 2024-07-03 14:46 UTC (permalink / raw)
To: Christian Brauner
Cc: viro, jack, mjguzik, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen, yu.ma
On 7/3/2024 10:34 PM, Christian Brauner wrote:
> On Wed, Jul 03, 2024 at 10:33:09AM GMT, Yu Ma wrote:
>> alloc_fd() has a sanity check inside to make sure the struct file mapped
>> to the allocated fd is NULL. Remove this sanity check since it is
>> already assured by the existing zero initialization and the NULL set
>> when recycling fds. Meanwhile, add likely/unlikely annotations and avoid
>> unnecessary expand_files() calls to reduce the work done under file_lock.
>>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>> ---
>> fs/file.c | 38 ++++++++++++++++----------------------
>> 1 file changed, 16 insertions(+), 22 deletions(-)
>>
>> diff --git a/fs/file.c b/fs/file.c
>> index a3b72aa64f11..5178b246e54b 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -515,28 +515,29 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>> if (fd < files->next_fd)
>> fd = files->next_fd;
>>
>> - if (fd < fdt->max_fds)
>> + if (likely(fd < fdt->max_fds))
>> fd = find_next_fd(fdt, fd);
>>
>> + error = -EMFILE;
>> + if (unlikely(fd >= fdt->max_fds)) {
>> + error = expand_files(files, fd);
>> + if (error < 0)
>> + goto out;
>> + /*
>> + * If we needed to expand the fs array we
>> + * might have blocked - try again.
>> + */
>> + if (error)
>> + goto repeat;
>> + }
> So this ends up removing the expand_files() above the fd >= end check
> which means that you can end up expanding the files_struct even though
> the request fd is past the provided end. That seems odd. What's the
> reason for that reordering?
Yes, you are right, thanks Christian. The incorrect reordering here is
left over from earlier versions that had the fast path inside alloc_fd().
I'll restore the original order.
>> +
>> /*
>> * N.B. For clone tasks sharing a files structure, this test
>> * will limit the total number of files that can be opened.
>> */
>> - error = -EMFILE;
>> - if (fd >= end)
>> - goto out;
>> -
>> - error = expand_files(files, fd);
>> - if (error < 0)
>> + if (unlikely(fd >= end))
>> goto out;
>>
>> - /*
>> - * If we needed to expand the fs array we
>> - * might have blocked - try again.
>> - */
>> - if (error)
>> - goto repeat;
>> -
>> if (start <= files->next_fd)
>> files->next_fd = fd + 1;
>>
>> @@ -546,13 +547,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>> else
>> __clear_close_on_exec(fd, fdt);
>> error = fd;
>> -#if 1
>> - /* Sanity check */
>> - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
>> - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
>> - rcu_assign_pointer(fdt->fd[fd], NULL);
>> - }
>> -#endif
>>
>> out:
>> spin_unlock(&files->file_lock);
>> @@ -618,7 +612,7 @@ void fd_install(unsigned int fd, struct file *file)
>> rcu_read_unlock_sched();
>> spin_lock(&files->file_lock);
>> fdt = files_fdtable(files);
>> - BUG_ON(fdt->fd[fd] != NULL);
>> + WARN_ON(fdt->fd[fd] != NULL);
>> rcu_assign_pointer(fdt->fd[fd], file);
>> spin_unlock(&files->file_lock);
>> return;
>> --
>> 2.43.0
>>
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v3 3/3] fs/file.c: add fast path in find_next_fd()
2024-07-03 14:33 ` [PATCH v3 3/3] fs/file.c: add fast path in find_next_fd() Yu Ma
2024-07-03 14:17 ` Mateusz Guzik
@ 2024-07-04 10:03 ` Jan Kara
2024-07-04 14:50 ` Ma, Yu
2024-07-04 17:44 ` Mateusz Guzik
2 siblings, 1 reply; 103+ messages in thread
From: Jan Kara @ 2024-07-04 10:03 UTC (permalink / raw)
To: Yu Ma
Cc: viro, brauner, jack, mjguzik, edumazet, linux-fsdevel,
linux-kernel, pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Wed 03-07-24 10:33:11, Yu Ma wrote:
> There is available fd in the lower 64 bits of open_fds bitmap for most cases
> when we look for an available fd slot. Skip 2-levels searching via
> find_next_zero_bit() for this common fast path.
>
> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> free slot is available there, as:
> (1) The fd allocation algorithm would always allocate fd from small to large.
> Lower bits in open_fds bitmap would be used much more frequently than higher
> bits.
> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> searching.
>
> As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
> update the fast path from alloc_fd() to find_next_fd(). With which, on top of
> patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
> Intel ICX 160 cores configuration with v6.10-rc6.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
Nice! The patch looks good to me. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
One style nit below:
> diff --git a/fs/file.c b/fs/file.c
> index a15317db3119..f25eca311f51 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -488,6 +488,11 @@ struct files_struct init_files = {
>
> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> {
> + unsigned int bit;
Empty line here please to separate variable declaration and code...
> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> + if (bit < BITS_PER_LONG)
> + return bit;
> +
> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> unsigned int maxbit = maxfd / BITS_PER_LONG;
> unsigned int bitbit = start / BITS_PER_LONG;
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v3 3/3] fs/file.c: add fast path in find_next_fd()
2024-07-03 14:17 ` Mateusz Guzik
2024-07-03 14:28 ` Ma, Yu
@ 2024-07-04 10:07 ` Jan Kara
1 sibling, 0 replies; 103+ messages in thread
From: Jan Kara @ 2024-07-04 10:07 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Yu Ma, viro, brauner, jack, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Wed 03-07-24 16:17:01, Mateusz Guzik wrote:
> On Wed, Jul 3, 2024 at 4:07 PM Yu Ma <yu.ma@intel.com> wrote:
> >
> > There is available fd in the lower 64 bits of open_fds bitmap for most cases
> > when we look for an available fd slot. Skip 2-levels searching via
> > find_next_zero_bit() for this common fast path.
> >
> > Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> > free slot is available there, as:
> > (1) The fd allocation algorithm would always allocate fd from small to large.
> > Lower bits in open_fds bitmap would be used much more frequently than higher
> > bits.
> > (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> > it would never be shrunk. The search size increases but there are few open fds
> > available here.
> > (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> > searching.
> >
> > As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
> > update the fast path from alloc_fd() to find_next_fd(). With which, on top of
> > patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
> > Intel ICX 160 cores configuration with v6.10-rc6.
> >
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> > fs/file.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index a15317db3119..f25eca311f51 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -488,6 +488,11 @@ struct files_struct init_files = {
> >
> > static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > {
> > + unsigned int bit;
> > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > + if (bit < BITS_PER_LONG)
> > + return bit;
> > +
>
> The rest of the patchset looks good on cursory read.
>
> As for this one, the suggestion was to make it work across the entire range.
I'm not sure what exactly you mean by "make it work across the entire
range" because what Ma has implemented is exactly what I originally had in
mind - i.e., search the first word of open_fds starting from next_fd (note
that 'start' in this function is already set to max(start, next_fd)), if
that fails, go through the two level bitmap.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v3 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
2024-07-03 14:34 ` Christian Brauner
2024-07-03 14:46 ` Ma, Yu
@ 2024-07-04 10:11 ` Jan Kara
2024-07-04 14:45 ` Ma, Yu
1 sibling, 1 reply; 103+ messages in thread
From: Jan Kara @ 2024-07-04 10:11 UTC (permalink / raw)
To: Christian Brauner
Cc: Yu Ma, viro, jack, mjguzik, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Wed 03-07-24 16:34:49, Christian Brauner wrote:
> On Wed, Jul 03, 2024 at 10:33:09AM GMT, Yu Ma wrote:
> > alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> > allocated fd is NULL. Remove this sanity check since it can be assured by
> > existing zero initialization and NULL set when recycling fd. Meanwhile, add
> > likely/unlikely and expand_files() call avoidance to reduce the work under
> > file_lock.
> >
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> > fs/file.c | 38 ++++++++++++++++----------------------
> > 1 file changed, 16 insertions(+), 22 deletions(-)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index a3b72aa64f11..5178b246e54b 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -515,28 +515,29 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > if (fd < files->next_fd)
> > fd = files->next_fd;
> >
> > - if (fd < fdt->max_fds)
> > + if (likely(fd < fdt->max_fds))
> > fd = find_next_fd(fdt, fd);
> >
> > + error = -EMFILE;
> > + if (unlikely(fd >= fdt->max_fds)) {
> > + error = expand_files(files, fd);
> > + if (error < 0)
> > + goto out;
> > + /*
> > + * If we needed to expand the fs array we
> > + * might have blocked - try again.
> > + */
> > + if (error)
> > + goto repeat;
> > + }
>
> So this ends up removing the expand_files() above the fd >= end check
> which means that you can end up expanding the files_struct even though
> the request fd is past the provided end. That seems odd. What's the
> reason for that reordering?
Yeah, not only that but also:
> > /*
> > * N.B. For clone tasks sharing a files structure, this test
> > * will limit the total number of files that can be opened.
> > */
> > - error = -EMFILE;
> > - if (fd >= end)
> > - goto out;
> > -
> > - error = expand_files(files, fd);
> > - if (error < 0)
> > + if (unlikely(fd >= end))
> > goto out;
We could then exit here with error set to 0 instead of -EMFILE. So this
needs a bit of work. But otherwise the patch looks good to me.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v3 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
2024-07-04 10:11 ` Jan Kara
@ 2024-07-04 14:45 ` Ma, Yu
2024-07-04 15:41 ` Jan Kara
0 siblings, 1 reply; 103+ messages in thread
From: Ma, Yu @ 2024-07-04 14:45 UTC (permalink / raw)
To: Jan Kara, Christian Brauner
Cc: viro, mjguzik, edumazet, linux-fsdevel, linux-kernel, pan.deng,
tianyou.li, tim.c.chen, tim.c.chen, yu.ma
On 7/4/2024 6:11 PM, Jan Kara wrote:
> On Wed 03-07-24 16:34:49, Christian Brauner wrote:
>> On Wed, Jul 03, 2024 at 10:33:09AM GMT, Yu Ma wrote:
>>> alloc_fd() has a sanity check inside to make sure the struct file mapping to the
>>> allocated fd is NULL. Remove this sanity check since it can be assured by
>>> existing zero initialization and NULL set when recycling fd. Meanwhile, add
>>> likely/unlikely and expand_files() call avoidance to reduce the work under
>>> file_lock.
>>>
>>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>>> ---
>>> fs/file.c | 38 ++++++++++++++++----------------------
>>> 1 file changed, 16 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/fs/file.c b/fs/file.c
>>> index a3b72aa64f11..5178b246e54b 100644
>>> --- a/fs/file.c
>>> +++ b/fs/file.c
>>> @@ -515,28 +515,29 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>>> if (fd < files->next_fd)
>>> fd = files->next_fd;
>>>
>>> - if (fd < fdt->max_fds)
>>> + if (likely(fd < fdt->max_fds))
>>> fd = find_next_fd(fdt, fd);
>>>
>>> + error = -EMFILE;
>>> + if (unlikely(fd >= fdt->max_fds)) {
>>> + error = expand_files(files, fd);
>>> + if (error < 0)
>>> + goto out;
>>> + /*
>>> + * If we needed to expand the fs array we
>>> + * might have blocked - try again.
>>> + */
>>> + if (error)
>>> + goto repeat;
>>> + }
>> So this ends up removing the expand_files() above the fd >= end check
>> which means that you can end up expanding the files_struct even though
>> the request fd is past the provided end. That seems odd. What's the
>> reason for that reordering?
> Yeah, not only that but also:
>
>>> /*
>>> * N.B. For clone tasks sharing a files structure, this test
>>> * will limit the total number of files that can be opened.
>>> */
>>> - error = -EMFILE;
>>> - if (fd >= end)
>>> - goto out;
>>> -
>>> - error = expand_files(files, fd);
>>> - if (error < 0)
>>> + if (unlikely(fd >= end))
>>> goto out;
> We could then exit here with error set to 0 instead of -EMFILE. So this
> needs a bit of work. But otherwise the patch looks good to me.
>
> Honza
Do you mean that we return 0 here if fd >= end? I'm afraid that might
break the original design of this function. All of its callers use
ret < 0 for error handling; if ret == 0, that would mean the allocated
fd is 0 ...
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v3 3/3] fs/file.c: add fast path in find_next_fd()
2024-07-04 10:03 ` Jan Kara
@ 2024-07-04 14:50 ` Ma, Yu
0 siblings, 0 replies; 103+ messages in thread
From: Ma, Yu @ 2024-07-04 14:50 UTC (permalink / raw)
To: Jan Kara
Cc: viro, brauner, mjguzik, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen, yu.ma
On 7/4/2024 6:03 PM, Jan Kara wrote:
> On Wed 03-07-24 10:33:11, Yu Ma wrote:
>> There is available fd in the lower 64 bits of open_fds bitmap for most cases
>> when we look for an available fd slot. Skip 2-levels searching via
>> find_next_zero_bit() for this common fast path.
>>
>> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
>> free slot is available there, as:
>> (1) The fd allocation algorithm would always allocate fd from small to large.
>> Lower bits in open_fds bitmap would be used much more frequently than higher
>> bits.
>> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
>> it would never be shrunk. The search size increases but there are few open fds
>> available here.
>> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
>> searching.
>>
>> As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
>> update the fast path from alloc_fd() to find_next_fd(). With which, on top of
>> patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
>> Intel ICX 160 cores configuration with v6.10-rc6.
>>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
> Nice! The patch looks good to me. Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
> One style nit below:
>
>> diff --git a/fs/file.c b/fs/file.c
>> index a15317db3119..f25eca311f51 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -488,6 +488,11 @@ struct files_struct init_files = {
>>
>> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>> {
>> + unsigned int bit;
> Empty line here please to separate variable declaration and code...
Thanks Honza, copy that :)
>
>> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
>> + if (bit < BITS_PER_LONG)
>> + return bit;
>> +
>> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>> unsigned int bitbit = start / BITS_PER_LONG;
> Honza
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v3 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
2024-07-04 14:45 ` Ma, Yu
@ 2024-07-04 15:41 ` Jan Kara
0 siblings, 0 replies; 103+ messages in thread
From: Jan Kara @ 2024-07-04 15:41 UTC (permalink / raw)
To: Ma, Yu
Cc: Jan Kara, Christian Brauner, viro, mjguzik, edumazet,
linux-fsdevel, linux-kernel, pan.deng, tianyou.li, tim.c.chen,
tim.c.chen
On Thu 04-07-24 22:45:32, Ma, Yu wrote:
>
> On 7/4/2024 6:11 PM, Jan Kara wrote:
> > On Wed 03-07-24 16:34:49, Christian Brauner wrote:
> > > On Wed, Jul 03, 2024 at 10:33:09AM GMT, Yu Ma wrote:
> > > > alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> > > > allocated fd is NULL. Remove this sanity check since it can be assured by
> > > > existing zero initialization and NULL set when recycling fd. Meanwhile, add
> > > > likely/unlikely and expand_files() call avoidance to reduce the work under
> > > > file_lock.
> > > >
> > > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > > > ---
> > > > fs/file.c | 38 ++++++++++++++++----------------------
> > > > 1 file changed, 16 insertions(+), 22 deletions(-)
> > > >
> > > > diff --git a/fs/file.c b/fs/file.c
> > > > index a3b72aa64f11..5178b246e54b 100644
> > > > --- a/fs/file.c
> > > > +++ b/fs/file.c
> > > > @@ -515,28 +515,29 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > > > if (fd < files->next_fd)
> > > > fd = files->next_fd;
> > > > - if (fd < fdt->max_fds)
> > > > + if (likely(fd < fdt->max_fds))
> > > > fd = find_next_fd(fdt, fd);
> > > > + error = -EMFILE;
> > > > + if (unlikely(fd >= fdt->max_fds)) {
> > > > + error = expand_files(files, fd);
> > > > + if (error < 0)
> > > > + goto out;
> > > > + /*
> > > > + * If we needed to expand the fs array we
> > > > + * might have blocked - try again.
> > > > + */
> > > > + if (error)
> > > > + goto repeat;
> > > > + }
> > > So this ends up removing the expand_files() above the fd >= end check
> > > which means that you can end up expanding the files_struct even though
> > > the request fd is past the provided end. That seems odd. What's the
> > > reason for that reordering?
> > Yeah, not only that but also:
> >
> > > > /*
> > > > * N.B. For clone tasks sharing a files structure, this test
> > > > * will limit the total number of files that can be opened.
> > > > */
> > > > - error = -EMFILE;
> > > > - if (fd >= end)
> > > > - goto out;
> > > > -
> > > > - error = expand_files(files, fd);
> > > > - if (error < 0)
> > > > + if (unlikely(fd >= end))
> > > > goto out;
> > We could then exit here with error set to 0 instead of -EMFILE. So this
> > needs a bit of work. But otherwise the patch looks good to me.
> >
> > Honza
>
> Do you mean that we return 0 here if fd >= end? I'm afraid that might break
> the original design of this function. All of its callers use
> ret < 0 for error handling; if ret == 0, that would mean the allocated fd is 0
> ...
What I meant is that after your changes alloc_fd() could return 0 in the
fd >= end case. It could happen if we went through expand_files() which then
returned 0. Anyway, please just fix the ordering of checks and we should be
fine.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v3 3/3] fs/file.c: add fast path in find_next_fd()
2024-07-03 14:33 ` [PATCH v3 3/3] fs/file.c: add fast path in find_next_fd() Yu Ma
2024-07-03 14:17 ` Mateusz Guzik
2024-07-04 10:03 ` Jan Kara
@ 2024-07-04 17:44 ` Mateusz Guzik
2024-07-04 21:55 ` Jan Kara
2 siblings, 1 reply; 103+ messages in thread
From: Mateusz Guzik @ 2024-07-04 17:44 UTC (permalink / raw)
To: Yu Ma
Cc: viro, brauner, jack, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Wed, Jul 3, 2024 at 4:07 PM Yu Ma <yu.ma@intel.com> wrote:
>
> There is available fd in the lower 64 bits of open_fds bitmap for most cases
> when we look for an available fd slot. Skip 2-levels searching via
> find_next_zero_bit() for this common fast path.
>
> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> free slot is available there, as:
> (1) The fd allocation algorithm would always allocate fd from small to large.
> Lower bits in open_fds bitmap would be used much more frequently than higher
> bits.
> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> searching.
>
> As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
> update the fast path from alloc_fd() to find_next_fd(). With which, on top of
> patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
> Intel ICX 160 cores configuration with v6.10-rc6.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
> fs/file.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/fs/file.c b/fs/file.c
> index a15317db3119..f25eca311f51 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -488,6 +488,11 @@ struct files_struct init_files = {
>
> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> {
> + unsigned int bit;
> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> + if (bit < BITS_PER_LONG)
> + return bit;
> +
> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> unsigned int maxbit = maxfd / BITS_PER_LONG;
> unsigned int bitbit = start / BITS_PER_LONG;
> --
> 2.43.0
>
I had something like this in mind:
diff --git a/fs/file.c b/fs/file.c
index a3b72aa64f11..4d3307e39db7 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -489,6 +489,16 @@ static unsigned int find_next_fd(struct fdtable
*fdt, unsigned int start)
unsigned int maxfd = fdt->max_fds; /* always multiple of
BITS_PER_LONG */
unsigned int maxbit = maxfd / BITS_PER_LONG;
unsigned int bitbit = start / BITS_PER_LONG;
+ unsigned int bit;
+
+ /*
+ * Try to avoid looking at the second level map.
+ */
+ bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
+ start & (BITS_PER_LONG - 1));
+ if (bit < BITS_PER_LONG) {
+ return bit + bitbit * BITS_PER_LONG;
+ }
bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit,
bitbit) * BITS_PER_LONG;
if (bitbit >= maxfd)
can you please test it out. I expect it to provide a tiny improvement
over your patch.
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply related [flat|nested] 103+ messages in thread
* Re: [PATCH v3 3/3] fs/file.c: add fast path in find_next_fd()
2024-07-04 17:44 ` Mateusz Guzik
@ 2024-07-04 21:55 ` Jan Kara
2024-07-05 7:56 ` Ma, Yu
0 siblings, 1 reply; 103+ messages in thread
From: Jan Kara @ 2024-07-04 21:55 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Yu Ma, viro, brauner, jack, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Thu 04-07-24 19:44:10, Mateusz Guzik wrote:
> On Wed, Jul 3, 2024 at 4:07 PM Yu Ma <yu.ma@intel.com> wrote:
> >
> > There is available fd in the lower 64 bits of open_fds bitmap for most cases
> > when we look for an available fd slot. Skip 2-levels searching via
> > find_next_zero_bit() for this common fast path.
> >
> > Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> > free slot is available there, as:
> > (1) The fd allocation algorithm would always allocate fd from small to large.
> > Lower bits in open_fds bitmap would be used much more frequently than higher
> > bits.
> > (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> > it would never be shrunk. The search size increases but there are few open fds
> > available here.
> > (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> > searching.
> >
> > As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
> > update the fast path from alloc_fd() to find_next_fd(). With which, on top of
> > patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
> > Intel ICX 160 cores configuration with v6.10-rc6.
> >
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> > fs/file.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index a15317db3119..f25eca311f51 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -488,6 +488,11 @@ struct files_struct init_files = {
> >
> > static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > {
> > + unsigned int bit;
> > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > + if (bit < BITS_PER_LONG)
> > + return bit;
> > +
> > unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > unsigned int bitbit = start / BITS_PER_LONG;
> > --
> > 2.43.0
> >
>
> I had something like this in mind:
> diff --git a/fs/file.c b/fs/file.c
> index a3b72aa64f11..4d3307e39db7 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -489,6 +489,16 @@ static unsigned int find_next_fd(struct fdtable
> *fdt, unsigned int start)
> unsigned int maxfd = fdt->max_fds; /* always multiple of
> BITS_PER_LONG */
> unsigned int maxbit = maxfd / BITS_PER_LONG;
> unsigned int bitbit = start / BITS_PER_LONG;
> + unsigned int bit;
> +
> + /*
> + * Try to avoid looking at the second level map.
> + */
> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> + start & (BITS_PER_LONG - 1));
> + if (bit < BITS_PER_LONG) {
> + return bit + bitbit * BITS_PER_LONG;
> + }
Drat, you're right. I missed that Ma did not add the proper offset to
open_fds. *This* is what I meant :)
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v3 3/3] fs/file.c: add fast path in find_next_fd()
2024-07-04 21:55 ` Jan Kara
@ 2024-07-05 7:56 ` Ma, Yu
2024-07-09 8:32 ` Ma, Yu
0 siblings, 1 reply; 103+ messages in thread
From: Ma, Yu @ 2024-07-05 7:56 UTC (permalink / raw)
To: Jan Kara, Mateusz Guzik
Cc: viro, brauner, edumazet, linux-fsdevel, linux-kernel, pan.deng,
tianyou.li, tim.c.chen, tim.c.chen, yu.ma
On 7/5/2024 5:55 AM, Jan Kara wrote:
> On Thu 04-07-24 19:44:10, Mateusz Guzik wrote:
>> On Wed, Jul 3, 2024 at 4:07 PM Yu Ma <yu.ma@intel.com> wrote:
>>> There is available fd in the lower 64 bits of open_fds bitmap for most cases
>>> when we look for an available fd slot. Skip 2-levels searching via
>>> find_next_zero_bit() for this common fast path.
>>>
>>> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
>>> free slot is available there, as:
>>> (1) The fd allocation algorithm would always allocate fd from small to large.
>>> Lower bits in open_fds bitmap would be used much more frequently than higher
>>> bits.
>>> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
>>> it would never be shrunk. The search size increases but there are few open fds
>>> available here.
>>> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
>>> searching.
>>>
>>> As suggested by Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>,
>>> update the fast path from alloc_fd() to find_next_fd(). With which, on top of
>>> patch 1 and 2, pts/blogbench-1.1.0 read is improved by 13% and write by 7% on
>>> Intel ICX 160 cores configuration with v6.10-rc6.
>>>
>>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>>> ---
>>> fs/file.c | 5 +++++
>>> 1 file changed, 5 insertions(+)
>>>
>>> diff --git a/fs/file.c b/fs/file.c
>>> index a15317db3119..f25eca311f51 100644
>>> --- a/fs/file.c
>>> +++ b/fs/file.c
>>> @@ -488,6 +488,11 @@ struct files_struct init_files = {
>>>
>>> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>>> {
>>> + unsigned int bit;
>>> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
>>> + if (bit < BITS_PER_LONG)
>>> + return bit;
>>> +
>>> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
>>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>>> unsigned int bitbit = start / BITS_PER_LONG;
>>> --
>>> 2.43.0
>>>
>> I had something like this in mind:
>> diff --git a/fs/file.c b/fs/file.c
>> index a3b72aa64f11..4d3307e39db7 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -489,6 +489,16 @@ static unsigned int find_next_fd(struct fdtable
>> *fdt, unsigned int start)
>> unsigned int maxfd = fdt->max_fds; /* always multiple of
>> BITS_PER_LONG */
>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>> unsigned int bitbit = start / BITS_PER_LONG;
>> + unsigned int bit;
>> +
>> + /*
>> + * Try to avoid looking at the second level map.
>> + */
>> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
>> + start & (BITS_PER_LONG - 1));
>> + if (bit < BITS_PER_LONG) {
>> + return bit + bitbit * BITS_PER_LONG;
>> + }
> Drat, you're right. I missed that Ma did not add the proper offset to
> open_fds. *This* is what I meant :)
>
> Honza
Just tried this on v6.10-rc6: on top of patch 1 and patch 2, the
improvement is 7% for read and 3% for write, less than just checking the
first word.
Per my understanding, this variant performs better when a free bit can be
found in the same word as next_fd with high probability, but next_fd only
represents the lowest possibly-free bit. If fds are opened and closed
frequently and randomly, that is not always the case and next_fd may be
distributed randomly. For example: fds 0-65 are occupied and fd 3 is
returned (closed), so next_fd is set to 3; the next allocation hands out
3 and sets next_fd to 4, while the actual first free bit is 66. When 66
is then allocated and fd 5 is returned, the same process repeats.
Yu
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v3 3/3] fs/file.c: add fast path in find_next_fd()
2024-07-05 7:56 ` Ma, Yu
@ 2024-07-09 8:32 ` Ma, Yu
2024-07-09 10:17 ` Mateusz Guzik
0 siblings, 1 reply; 103+ messages in thread
From: Ma, Yu @ 2024-07-09 8:32 UTC (permalink / raw)
To: Jan Kara, Mateusz Guzik
Cc: viro, brauner, edumazet, linux-fsdevel, linux-kernel, pan.deng,
tianyou.li, tim.c.chen, tim.c.chen, yu.ma
On 7/5/2024 3:56 PM, Ma, Yu wrote:
> I had something like this in mind:
>>> diff --git a/fs/file.c b/fs/file.c
>>> index a3b72aa64f11..4d3307e39db7 100644
>>> --- a/fs/file.c
>>> +++ b/fs/file.c
>>> @@ -489,6 +489,16 @@ static unsigned int find_next_fd(struct fdtable
>>> *fdt, unsigned int start)
>>> unsigned int maxfd = fdt->max_fds; /* always multiple of
>>> BITS_PER_LONG */
>>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>>> unsigned int bitbit = start / BITS_PER_LONG;
>>> + unsigned int bit;
>>> +
>>> + /*
>>> + * Try to avoid looking at the second level map.
>>> + */
>>> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
>>> + start & (BITS_PER_LONG - 1));
>>> + if (bit < BITS_PER_LONG) {
>>> + return bit + bitbit * BITS_PER_LONG;
>>> + }
>> Drat, you're right. I missed that Ma did not add the proper offset to
>> open_fds. *This* is what I meant :)
>>
>> Honza
>
> Just tried this on v6.10-rc6, the improvement on top of patch 1 and
> patch 2 is 7% for read and 3% for write, less than just checking the first word.
>
> Per my understanding, its performance would be better if we can find
> free bit in the same word of next_fd with high possibility, but
> next_fd just represents the lowest possible free bit. If fds are
> open/close frequently and randomly, that might not always be the case,
> next_fd may be distributed randomly, for example, 0-65 are occupied,
> fd=3 is returned, next_fd will be set to 3, next time when 3 is
> allocated, next_fd will be set to 4, while the actual first free bit
> is 66 , when 66 is allocated, and fd=5 is returned, then the above
> process would be gone through again.
>
> Yu
>
Hi Guzik, Honza,
Do you have any more comments or ideas regarding the fast path? Thanks
for your time and any feedback :)
Regards
Yu
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v3 3/3] fs/file.c: add fast path in find_next_fd()
2024-07-09 8:32 ` Ma, Yu
@ 2024-07-09 10:17 ` Mateusz Guzik
2024-07-10 23:40 ` Tim Chen
0 siblings, 1 reply; 103+ messages in thread
From: Mateusz Guzik @ 2024-07-09 10:17 UTC (permalink / raw)
To: Ma, Yu
Cc: Jan Kara, viro, brauner, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen
Right, forgot to respond.
I suspect the different result is either because of mere variance
between reboots or blogbench using significantly less than 100 fds at
any given time -- I don't have an easy way to test at your scale at
the moment. You could probably test that by benching both approaches
while switching them at runtime with a static_branch. However, I don't
know if that effort is warranted atm.
So happens I'm busy with other stuff and it is not my call to either
block or let this in, so I'm buggering off.
On Tue, Jul 9, 2024 at 10:32 AM Ma, Yu <yu.ma@intel.com> wrote:
>
>
> On 7/5/2024 3:56 PM, Ma, Yu wrote:
> > I had something like this in mind:
> >>> diff --git a/fs/file.c b/fs/file.c
> >>> index a3b72aa64f11..4d3307e39db7 100644
> >>> --- a/fs/file.c
> >>> +++ b/fs/file.c
> >>> @@ -489,6 +489,16 @@ static unsigned int find_next_fd(struct fdtable
> >>> *fdt, unsigned int start)
> >>> unsigned int maxfd = fdt->max_fds; /* always multiple of
> >>> BITS_PER_LONG */
> >>> unsigned int maxbit = maxfd / BITS_PER_LONG;
> >>> unsigned int bitbit = start / BITS_PER_LONG;
> >>> + unsigned int bit;
> >>> +
> >>> + /*
> >>> + * Try to avoid looking at the second level map.
> >>> + */
> >>> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> >>> + start & (BITS_PER_LONG - 1));
> >>> + if (bit < BITS_PER_LONG) {
> >>> + return bit + bitbit * BITS_PER_LONG;
> >>> + }
> >> Drat, you're right. I missed that Ma did not add the proper offset to
> >> open_fds. *This* is what I meant :)
> >>
> >> Honza
> >
> > Just tried this on v6.10-rc6, the improvement on top of patch 1 and
> > patch 2 is 7% for read and 3% for write, less than just check first word.
> >
> > Per my understanding, its performance would be better if we could find a
> > free bit in the same word as next_fd with high probability, but
> > next_fd only represents the lowest possible free bit. If fds are
> > opened and closed frequently and randomly, that might not always be the
> > case; next_fd may end up anywhere. For example: 0-65 are occupied,
> > fd 3 is returned (freed), so next_fd is set to 3; the next time 3 is
> > allocated, next_fd is set to 4, while the actual first free bit
> > is 66. When 66 is allocated and fd 5 is returned, the above
> > process is gone through again.
> >
> > Yu
> >
> Hi Guzik, Honza,
>
> Do we have any more comments or ideas regarding the fast path? Thanks
> for your time and any feedback :)
>
>
> Regards
>
> Yu
>
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v3 3/3] fs/file.c: add fast path in find_next_fd()
2024-07-09 10:17 ` Mateusz Guzik
@ 2024-07-10 23:40 ` Tim Chen
2024-07-11 9:27 ` Ma, Yu
0 siblings, 1 reply; 103+ messages in thread
From: Tim Chen @ 2024-07-10 23:40 UTC (permalink / raw)
To: Mateusz Guzik, Ma, Yu
Cc: Jan Kara, viro, brauner, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen
On Tue, 2024-07-09 at 12:17 +0200, Mateusz Guzik wrote:
> Right, forgot to respond.
>
> I suspect the different result is either because of mere variance
> between reboots or blogbench using significantly less than 100 fds at
> any given time -- I don't have an easy way to test at your scale at
> the moment. You could probably test that by benching both approaches
> while switching them at runtime with a static_branch. However, I don't
> know if that effort is warranted atm.
>
> So happens I'm busy with other stuff and it is not my call to either
> block or let this in, so I'm buggering off.
>
> On Tue, Jul 9, 2024 at 10:32 AM Ma, Yu <yu.ma@intel.com> wrote:
> >
> >
> > On 7/5/2024 3:56 PM, Ma, Yu wrote:
> > > I had something like this in mind:
> > > > > diff --git a/fs/file.c b/fs/file.c
> > > > > index a3b72aa64f11..4d3307e39db7 100644
> > > > > --- a/fs/file.c
> > > > > +++ b/fs/file.c
> > > > > @@ -489,6 +489,16 @@ static unsigned int find_next_fd(struct fdtable
> > > > > *fdt, unsigned int start)
> > > > > unsigned int maxfd = fdt->max_fds; /* always multiple of
> > > > > BITS_PER_LONG */
> > > > > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > > > > unsigned int bitbit = start / BITS_PER_LONG;
> > > > > + unsigned int bit;
> > > > > +
> > > > > + /*
> > > > > + * Try to avoid looking at the second level map.
> > > > > + */
> > > > > + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> > > > > + start & (BITS_PER_LONG - 1));
> > > > > + if (bit < BITS_PER_LONG) {
> > > > > + return bit + bitbit * BITS_PER_LONG;
> > > > > + }
I think this approach based on a quick check around next_fd is more generic
and scalable.
It just happens that for blogbench, checking only the first 64 bits allows a
quicker skip to the two-level search, whereas with this approach next_fd may
be left in a 64-bit word that actually has no free bits, and we would do a
useless search in find_next_zero_bit(). Perhaps we should check full_fds_bits
to make sure there are empty slots before taking the find_next_zero_bit()
fast path. Something like:
if (!test_bit(bitbit, fdt->full_fds_bits)) {
bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
start & (BITS_PER_LONG - 1));
if (bit < BITS_PER_LONG)
return bit + bitbit * BITS_PER_LONG;
}
Tim
> > > > Drat, you're right. I missed that Ma did not add the proper offset to
> > > > open_fds. *This* is what I meant :)
> > > >
> > > > Honza
> > >
> > > Just tried this on v6.10-rc6, the improvement on top of patch 1 and
> > > patch 2 is 7% for read and 3% for write, less than just check first word.
> > >
> > > Per my understanding, its performance would be better if we could find a
> > > free bit in the same word as next_fd with high probability, but
> > > next_fd only represents the lowest possible free bit. If fds are
> > > opened and closed frequently and randomly, that might not always be the
> > > case; next_fd may end up anywhere. For example: 0-65 are occupied,
> > > fd 3 is returned (freed), so next_fd is set to 3; the next time 3 is
> > > allocated, next_fd is set to 4, while the actual first free bit
> > > is 66. When 66 is allocated and fd 5 is returned, the above
> > > process is gone through again.
> > >
> > > Yu
> > >
> > Hi Guzik, Honza,
> >
> > Do we have any more comments or ideas regarding the fast path? Thanks
> > for your time and any feedback :)
> >
> >
> > Regards
> >
> > Yu
> >
>
>
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v3 3/3] fs/file.c: add fast path in find_next_fd()
2024-07-10 23:40 ` Tim Chen
@ 2024-07-11 9:27 ` Ma, Yu
0 siblings, 0 replies; 103+ messages in thread
From: Ma, Yu @ 2024-07-11 9:27 UTC (permalink / raw)
To: Tim Chen, Mateusz Guzik
Cc: Jan Kara, viro, brauner, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, yu.ma
On 7/11/2024 7:40 AM, Tim Chen wrote:
> On Tue, 2024-07-09 at 12:17 +0200, Mateusz Guzik wrote:
>> Right, forgot to respond.
>>
>> I suspect the different result is either because of mere variance
>> between reboots or blogbench using significantly less than 100 fds at
>> any given time -- I don't have an easy way to test at your scale at
>> the moment. You could probably test that by benching both approaches
>> while switching them at runtime with a static_branch. However, I don't
>> know if that effort is warranted atm.
>>
>> So happens I'm busy with other stuff and it is not my call to either
>> block or let this in, so I'm buggering off.
>>
>> On Tue, Jul 9, 2024 at 10:32 AM Ma, Yu <yu.ma@intel.com> wrote:
>>>
>>> On 7/5/2024 3:56 PM, Ma, Yu wrote:
>>>> I had something like this in mind:
>>>>>> diff --git a/fs/file.c b/fs/file.c
>>>>>> index a3b72aa64f11..4d3307e39db7 100644
>>>>>> --- a/fs/file.c
>>>>>> +++ b/fs/file.c
>>>>>> @@ -489,6 +489,16 @@ static unsigned int find_next_fd(struct fdtable
>>>>>> *fdt, unsigned int start)
>>>>>> unsigned int maxfd = fdt->max_fds; /* always multiple of
>>>>>> BITS_PER_LONG */
>>>>>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>>>>>> unsigned int bitbit = start / BITS_PER_LONG;
>>>>>> + unsigned int bit;
>>>>>> +
>>>>>> + /*
>>>>>> + * Try to avoid looking at the second level map.
>>>>>> + */
>>>>>> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
>>>>>> + start & (BITS_PER_LONG - 1));
>>>>>> + if (bit < BITS_PER_LONG) {
>>>>>> + return bit + bitbit * BITS_PER_LONG;
>>>>>> + }
> I think this approach based on a quick check around next_fd is more generic
> and scalable.
>
> It just happens that for blogbench, checking only the first 64 bits allows a
> quicker skip to the two-level search, whereas with this approach next_fd may
> be left in a 64-bit word that actually has no free bits, and we would do a
> useless search in find_next_zero_bit(). Perhaps we should check full_fds_bits
> to make sure there are empty slots before taking the find_next_zero_bit()
> fast path. Something like:
>
> if (!test_bit(bitbit, fdt->full_fds_bits)) {
> bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> start & (BITS_PER_LONG - 1));
> if (bit < BITS_PER_LONG)
> return bit + bitbit * BITS_PER_LONG;
> }
> Tim
Yes, agreed that it scales better. I'll update v4 with a fast path for the
word containing next_fd and send it out for review soon.
>>>>> Drat, you're right. I missed that Ma did not add the proper offset to
>>>>> open_fds. *This* is what I meant :)
>>>>>
>>>>> Honza
>>>> Just tried this on v6.10-rc6, the improvement on top of patch 1 and
>>>> patch 2 is 7% for read and 3% for write, less than just check first word.
>>>>
>>>> Per my understanding, its performance would be better if we could find a
>>>> free bit in the same word as next_fd with high probability, but
>>>> next_fd only represents the lowest possible free bit. If fds are
>>>> opened and closed frequently and randomly, that might not always be the
>>>> case; next_fd may end up anywhere. For example: 0-65 are occupied,
>>>> fd 3 is returned (freed), so next_fd is set to 3; the next time 3 is
>>>> allocated, next_fd is set to 4, while the actual first free bit
>>>> is 66. When 66 is allocated and fd 5 is returned, the above
>>>> process is gone through again.
>>>>
>>>> Yu
>>>>
>>> Hi Guzik, Honza,
>>>
>>> Do we have any more comments or ideas regarding the fast path? Thanks
>>> for your time and any feedback :)
>>>
>>>
>>> Regards
>>>
>>> Yu
>>>
>>
^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCH v4 0/3] fs/file.c: optimize the critical section of file_lock in
2024-06-14 16:34 [PATCH 0/3] fs/file.c: optimize the critical section of Yu Ma
` (4 preceding siblings ...)
2024-07-03 14:33 ` [PATCH v3 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
@ 2024-07-13 2:39 ` Yu Ma
2024-07-13 2:39 ` [PATCH v4 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Yu Ma
` (2 more replies)
2024-07-17 14:50 ` [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
6 siblings, 3 replies; 103+ messages in thread
From: Yu Ma @ 2024-07-13 2:39 UTC (permalink / raw)
To: viro, brauner, jack, mjguzik, edumazet
Cc: yu.ma, linux-fsdevel, linux-kernel, pan.deng, tianyou.li,
tim.c.chen, tim.c.chen
pts/blogbench-1.1.0 is a benchmark designed to replicate the
load of a real-world busy file server by multiple threads of
random reads, writes, and rewrites. When running default configuration
with multiple parallel threads, hot spin lock contention is observed
from alloc_fd(), file_closed_fd() and put_unused_fd() around file_lock.
These 3 patches are created to reduce the critical section of file_lock
in alloc_fd() and close_fd(). As a result, on top of patch 1,
pts/blogbench-1.1.0 has been improved by 22% for read and 8% for write
on Intel ICX 160 cores configuration with v6.10-rc7.
v3 -> v4:
1. Rebased the patch set to v6.10-rc7 and updated performance results.
2. Updated patch 1 to revise the order of fd checking.
3. As proposed by Mateusz Guzik <mjguzik gmail.com>, and agreed by Jan Kara
<jack@suse.cz> and Tim Chen <tim.c.chen@linux.intel.com>, updated patch 3 with
a more generic and scalable fast path for the word contains next_fd.
v2 -> v3:
1. Rebased the patch set to latest v6.10-rc6 and updated the performance results
2. Reordered the patches as suggested by Mateusz Guzik <mjguzik gmail.com>
3. Updated the fast path from alloc_fd() to find_next_fd() as suggested by
Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>, it is efficient
and more concise than v2.
v1 -> v2:
1. Rebased the patch set to latest v6.10-rc4 and updated the performance results
2. Fixed the bug identified by Mateusz Guzik in patch 1 with adding rlimit check
for fast path
3. Updated patch 3 to remove sanity_check directly per the alignment with
maintainer
Yu Ma (3):
fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
fs/file.c: conditionally clear full_fds
fs/file.c: add fast path in find_next_fd()
fs/file.c | 50 +++++++++++++++++++++++++++++---------------------
1 file changed, 29 insertions(+), 21 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCH v4 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
2024-07-13 2:39 ` [PATCH v4 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
@ 2024-07-13 2:39 ` Yu Ma
2024-07-16 11:11 ` Jan Kara
2024-07-13 2:39 ` [PATCH v4 2/3] fs/file.c: conditionally clear full_fds Yu Ma
2024-07-13 2:39 ` [PATCH v4 3/3] fs/file.c: add fast path in find_next_fd() Yu Ma
2 siblings, 1 reply; 103+ messages in thread
From: Yu Ma @ 2024-07-13 2:39 UTC (permalink / raw)
To: viro, brauner, jack, mjguzik, edumazet
Cc: yu.ma, linux-fsdevel, linux-kernel, pan.deng, tianyou.li,
tim.c.chen, tim.c.chen
alloc_fd() has a sanity check inside to make sure the struct file mapped to
the allocated fd is NULL. Remove this sanity check since it is already
assured by the existing zero initialization and by setting the pointer to
NULL when recycling an fd. Meanwhile, add likely/unlikely annotations and
avoid the expand_files() call when possible to reduce the work done under
file_lock.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 33 ++++++++++++++-------------------
1 file changed, 14 insertions(+), 19 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index a3b72aa64f11..e1b9d6df7941 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -515,7 +515,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
if (fd < files->next_fd)
fd = files->next_fd;
- if (fd < fdt->max_fds)
+ if (likely(fd < fdt->max_fds))
fd = find_next_fd(fdt, fd);
/*
@@ -523,19 +523,21 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
* will limit the total number of files that can be opened.
*/
error = -EMFILE;
- if (fd >= end)
+ if (unlikely(fd >= end))
goto out;
- error = expand_files(files, fd);
- if (error < 0)
- goto out;
+ if (unlikely(fd >= fdt->max_fds)) {
+ error = expand_files(files, fd);
+ if (error < 0)
+ goto out;
- /*
- * If we needed to expand the fs array we
- * might have blocked - try again.
- */
- if (error)
- goto repeat;
+ /*
+ * If we needed to expand the fs array we
+ * might have blocked - try again.
+ */
+ if (error)
+ goto repeat;
+ }
if (start <= files->next_fd)
files->next_fd = fd + 1;
@@ -546,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
else
__clear_close_on_exec(fd, fdt);
error = fd;
-#if 1
- /* Sanity check */
- if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
- printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
- rcu_assign_pointer(fdt->fd[fd], NULL);
- }
-#endif
out:
spin_unlock(&files->file_lock);
@@ -618,7 +613,7 @@ void fd_install(unsigned int fd, struct file *file)
rcu_read_unlock_sched();
spin_lock(&files->file_lock);
fdt = files_fdtable(files);
- BUG_ON(fdt->fd[fd] != NULL);
+ WARN_ON(fdt->fd[fd] != NULL);
rcu_assign_pointer(fdt->fd[fd], file);
spin_unlock(&files->file_lock);
return;
--
2.43.0
^ permalink raw reply related [flat|nested] 103+ messages in thread
* [PATCH v4 2/3] fs/file.c: conditionally clear full_fds
2024-07-13 2:39 ` [PATCH v4 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
2024-07-13 2:39 ` [PATCH v4 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Yu Ma
@ 2024-07-13 2:39 ` Yu Ma
2024-07-13 2:39 ` [PATCH v4 3/3] fs/file.c: add fast path in find_next_fd() Yu Ma
2 siblings, 0 replies; 103+ messages in thread
From: Yu Ma @ 2024-07-13 2:39 UTC (permalink / raw)
To: viro, brauner, jack, mjguzik, edumazet
Cc: yu.ma, linux-fsdevel, linux-kernel, pan.deng, tianyou.li,
tim.c.chen, tim.c.chen
64 bits in open_fds are mapped to a common bit in full_fds_bits. It is very
likely that a bit in full_fds_bits has already been cleared by a previous
__clear_open_fd() operation. Check whether the bit in full_fds_bits is
already clear before clearing it, to avoid an unnecessary write and cache
line bouncing. See commit fc90888d07b8 ("vfs: conditionally clear
close-on-exec flag") for a similar optimization.
Taking the stock kernel with patch 1 as the baseline, it improves
pts/blogbench-1.1.0 read by 13% and write by 5% on the Intel ICX 160-core
configuration with v6.10-rc7.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/file.c b/fs/file.c
index e1b9d6df7941..1be2a5bcc7c4 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -268,7 +268,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
{
__clear_bit(fd, fdt->open_fds);
- __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
+ fd /= BITS_PER_LONG;
+ if (test_bit(fd, fdt->full_fds_bits))
+ __clear_bit(fd, fdt->full_fds_bits);
}
static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
--
2.43.0
^ permalink raw reply related [flat|nested] 103+ messages in thread
* [PATCH v4 3/3] fs/file.c: add fast path in find_next_fd()
2024-07-13 2:39 ` [PATCH v4 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
2024-07-13 2:39 ` [PATCH v4 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Yu Ma
2024-07-13 2:39 ` [PATCH v4 2/3] fs/file.c: conditionally clear full_fds Yu Ma
@ 2024-07-13 2:39 ` Yu Ma
2024-07-16 11:19 ` Jan Kara
2 siblings, 1 reply; 103+ messages in thread
From: Yu Ma @ 2024-07-13 2:39 UTC (permalink / raw)
To: viro, brauner, jack, mjguzik, edumazet
Cc: yu.ma, linux-fsdevel, linux-kernel, pan.deng, tianyou.li,
tim.c.chen, tim.c.chen
Skip the two-level search via find_next_zero_bit() when there is a free slot
in the word containing next_fd, as:
(1) next_fd indicates the lower bound for the first free fd.
(2) There is a fast path inside find_next_zero_bit() when size <= 64 to
speed up the search.
(3) After the fdt is expanded (the bitmap size doubles on each expansion),
it is never shrunk. The search size increases but few open fds are
available there.
This fast path was proposed by Mateusz Guzik <mjguzik@gmail.com> and agreed
to by Jan Kara <jack@suse.cz>; it is more generic and scalable than previous
versions. On top of patches 1 and 2, it improves pts/blogbench-1.1.0 read by
8% and write by 4% on the Intel ICX 160-core configuration with v6.10-rc7.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/fs/file.c b/fs/file.c
index 1be2a5bcc7c4..a3ce6ba30c8c 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -488,9 +488,20 @@ struct files_struct init_files = {
static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
{
+ unsigned int bitbit = start / BITS_PER_LONG;
+ unsigned int bit;
+
+ /*
+ * Try to avoid looking at the second level bitmap
+ */
+ bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
+ start & (BITS_PER_LONG -1));
+ if (bit < BITS_PER_LONG) {
+ return bit + bitbit * BITS_PER_LONG;
+ }
+
unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
unsigned int maxbit = maxfd / BITS_PER_LONG;
- unsigned int bitbit = start / BITS_PER_LONG;
bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
if (bitbit >= maxfd)
--
2.43.0
^ permalink raw reply related [flat|nested] 103+ messages in thread
* Re: [PATCH v4 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
2024-07-13 2:39 ` [PATCH v4 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Yu Ma
@ 2024-07-16 11:11 ` Jan Kara
0 siblings, 0 replies; 103+ messages in thread
From: Jan Kara @ 2024-07-16 11:11 UTC (permalink / raw)
To: Yu Ma
Cc: viro, brauner, jack, mjguzik, edumazet, linux-fsdevel,
linux-kernel, pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Fri 12-07-24 22:39:15, Yu Ma wrote:
> alloc_fd() has a sanity check inside to make sure the struct file mapped to
> the allocated fd is NULL. Remove this sanity check since it is already
> assured by the existing zero initialization and by setting the pointer to
> NULL when recycling an fd. Meanwhile, add likely/unlikely annotations and
> avoid the expand_files() call when possible to reduce the work done under
> file_lock.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
Looks good to me. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/file.c | 33 ++++++++++++++-------------------
> 1 file changed, 14 insertions(+), 19 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index a3b72aa64f11..e1b9d6df7941 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -515,7 +515,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> if (fd < files->next_fd)
> fd = files->next_fd;
>
> - if (fd < fdt->max_fds)
> + if (likely(fd < fdt->max_fds))
> fd = find_next_fd(fdt, fd);
>
> /*
> @@ -523,19 +523,21 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> * will limit the total number of files that can be opened.
> */
> error = -EMFILE;
> - if (fd >= end)
> + if (unlikely(fd >= end))
> goto out;
>
> - error = expand_files(files, fd);
> - if (error < 0)
> - goto out;
> + if (unlikely(fd >= fdt->max_fds)) {
> + error = expand_files(files, fd);
> + if (error < 0)
> + goto out;
>
> - /*
> - * If we needed to expand the fs array we
> - * might have blocked - try again.
> - */
> - if (error)
> - goto repeat;
> + /*
> + * If we needed to expand the fs array we
> + * might have blocked - try again.
> + */
> + if (error)
> + goto repeat;
> + }
>
> if (start <= files->next_fd)
> files->next_fd = fd + 1;
> @@ -546,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> else
> __clear_close_on_exec(fd, fdt);
> error = fd;
> -#if 1
> - /* Sanity check */
> - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> - rcu_assign_pointer(fdt->fd[fd], NULL);
> - }
> -#endif
>
> out:
> spin_unlock(&files->file_lock);
> @@ -618,7 +613,7 @@ void fd_install(unsigned int fd, struct file *file)
> rcu_read_unlock_sched();
> spin_lock(&files->file_lock);
> fdt = files_fdtable(files);
> - BUG_ON(fdt->fd[fd] != NULL);
> + WARN_ON(fdt->fd[fd] != NULL);
> rcu_assign_pointer(fdt->fd[fd], file);
> spin_unlock(&files->file_lock);
> return;
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v4 3/3] fs/file.c: add fast path in find_next_fd()
2024-07-13 2:39 ` [PATCH v4 3/3] fs/file.c: add fast path in find_next_fd() Yu Ma
@ 2024-07-16 11:19 ` Jan Kara
2024-07-16 12:37 ` Ma, Yu
0 siblings, 1 reply; 103+ messages in thread
From: Jan Kara @ 2024-07-16 11:19 UTC (permalink / raw)
To: Yu Ma
Cc: viro, brauner, jack, mjguzik, edumazet, linux-fsdevel,
linux-kernel, pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Fri 12-07-24 22:39:17, Yu Ma wrote:
> Skip the two-level search via find_next_zero_bit() when there is a free slot
> in the word containing next_fd, as:
> (1) next_fd indicates the lower bound for the first free fd.
> (2) There is a fast path inside find_next_zero_bit() when size <= 64 to
> speed up the search.
> (3) After the fdt is expanded (the bitmap size doubles on each expansion),
> it is never shrunk. The search size increases but few open fds are
> available there.
>
> This fast path was proposed by Mateusz Guzik <mjguzik@gmail.com> and agreed
> to by Jan Kara <jack@suse.cz>; it is more generic and scalable than previous
> versions. On top of patches 1 and 2, it improves pts/blogbench-1.1.0 read by
> 8% and write by 4% on the Intel ICX 160-core configuration with v6.10-rc7.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
Looks good. Just some code style nits below.
> diff --git a/fs/file.c b/fs/file.c
> index 1be2a5bcc7c4..a3ce6ba30c8c 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -488,9 +488,20 @@ struct files_struct init_files = {
>
> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> {
> + unsigned int bitbit = start / BITS_PER_LONG;
> + unsigned int bit;
> +
> + /*
> + * Try to avoid looking at the second level bitmap
> + */
> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> + start & (BITS_PER_LONG -1));
^^ Either
(BITS_PER_LONG-1) or (BITS_PER_LONG - 1) please. Your combination looks
particularly weird :)
> + if (bit < BITS_PER_LONG) {
> + return bit + bitbit * BITS_PER_LONG;
> + }
No need for braces around the above block.
> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> unsigned int maxbit = maxfd / BITS_PER_LONG;
We keep declarations at the beginning of the block. Usually it keeps the
code more readable and the compiler should be clever enough to perform the
loads & arithmetics only when needed.
After fixing these style nits feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v4 3/3] fs/file.c: add fast path in find_next_fd()
2024-07-16 11:19 ` Jan Kara
@ 2024-07-16 12:37 ` Ma, Yu
0 siblings, 0 replies; 103+ messages in thread
From: Ma, Yu @ 2024-07-16 12:37 UTC (permalink / raw)
To: Jan Kara
Cc: viro, brauner, mjguzik, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen, yu.ma
On 7/16/2024 7:19 PM, Jan Kara wrote:
> On Fri 12-07-24 22:39:17, Yu Ma wrote:
>> Skip the two-level search via find_next_zero_bit() when there is a free slot
>> in the word containing next_fd, as:
>> (1) next_fd indicates the lower bound for the first free fd.
>> (2) There is a fast path inside find_next_zero_bit() when size <= 64 to
>> speed up the search.
>> (3) After the fdt is expanded (the bitmap size doubles on each expansion),
>> it is never shrunk. The search size increases but few open fds are
>> available there.
>>
>> This fast path was proposed by Mateusz Guzik <mjguzik@gmail.com> and agreed
>> to by Jan Kara <jack@suse.cz>; it is more generic and scalable than previous
>> versions. On top of patches 1 and 2, it improves pts/blogbench-1.1.0 read by
>> 8% and write by 4% on the Intel ICX 160-core configuration with v6.10-rc7.
>>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
> Looks good. Just some code style nits below.
Copy that, thanks Honza. I'll revise and send out an updated version soon.
>
>> diff --git a/fs/file.c b/fs/file.c
>> index 1be2a5bcc7c4..a3ce6ba30c8c 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -488,9 +488,20 @@ struct files_struct init_files = {
>>
>> static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>> {
>> + unsigned int bitbit = start / BITS_PER_LONG;
>> + unsigned int bit;
>> +
>> + /*
>> + * Try to avoid looking at the second level bitmap
>> + */
>> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
>> + start & (BITS_PER_LONG -1));
> ^^ Either
> (BITS_PER_LONG-1) or (BITS_PER_LONG - 1) please. Your combination looks
> particularly weird :)
>
>> + if (bit < BITS_PER_LONG) {
>> + return bit + bitbit * BITS_PER_LONG;
>> + }
> No need for braces around the above block.
>
>> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
>> unsigned int maxbit = maxfd / BITS_PER_LONG;
> We keep declarations at the beginning of the block. Usually it keeps the
> code more readable and the compiler should be clever enough to perform the
> loads & arithmetics only when needed.
>
> After fixing these style nits feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
> Honza
Yes, I'll polish this part of the code accordingly, thanks for all the
comments here :)
^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in
2024-06-14 16:34 [PATCH 0/3] fs/file.c: optimize the critical section of Yu Ma
` (5 preceding siblings ...)
2024-07-13 2:39 ` [PATCH v4 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
@ 2024-07-17 14:50 ` Yu Ma
2024-07-17 14:50 ` [PATCH v5 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Yu Ma
` (3 more replies)
6 siblings, 4 replies; 103+ messages in thread
From: Yu Ma @ 2024-07-17 14:50 UTC (permalink / raw)
To: brauner, jack, mjguzik, edumazet
Cc: yu.ma, linux-fsdevel, linux-kernel, pan.deng, tianyou.li,
tim.c.chen, tim.c.chen, viro
pts/blogbench-1.1.0 is a benchmark designed to replicate the
load of a real-world busy file server by multiple threads of
random reads, writes, and rewrites. When running default configuration
with multiple parallel threads, hot spin lock contention is observed
from alloc_fd(), file_closed_fd() and put_unused_fd() around file_lock.
These 3 patches are created to reduce the critical section of file_lock
in alloc_fd() and close_fd(). As a result, on top of patch 1,
pts/blogbench-1.1.0 has been improved by 22% for read and 8% for write
on Intel ICX 160 cores configuration with v6.10-rc7.
v4 -> v5:
Revised patch 3 for some code style issues.
v3 -> v4:
1. Rebased the patch set to v6.10-rc7 and updated performance results.
2. Updated patch 1 to revise the order of fd checking.
3. As proposed by Mateusz Guzik <mjguzik gmail.com>, and agreed by Jan Kara
<jack@suse.cz> and Tim Chen <tim.c.chen@linux.intel.com>, updated patch 3 with
a more generic and scalable fast path for the word contains next_fd.
v2 -> v3:
1. Rebased the patch set to latest v6.10-rc6 and updated the performance results
2. Reordered the patches as suggested by Mateusz Guzik <mjguzik gmail.com>
3. Updated the fast path from alloc_fd() to find_next_fd() as suggested by
Mateusz Guzik <mjguzik gmail.com> and Jan Kara <jack@suse.cz>, it is efficient
and more concise than v2.
v1 -> v2:
1. Rebased the patch set to latest v6.10-rc4 and updated the performance results
2. Fixed the bug identified by Mateusz Guzik in patch 1 with adding rlimit check
for fast path
3. Updated patch 3 to remove sanity_check directly per the alignment with
maintainer
Yu Ma (3):
fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
fs/file.c: conditionally clear full_fds
fs/file.c: add fast path in find_next_fd()
fs/file.c | 46 ++++++++++++++++++++++++++--------------------
1 file changed, 26 insertions(+), 20 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 103+ messages in thread
* [PATCH v5 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
2024-07-17 14:50 ` [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
@ 2024-07-17 14:50 ` Yu Ma
2024-08-06 13:44 ` kernel test robot
2024-08-14 21:38 ` Al Viro
2024-07-17 14:50 ` [PATCH v5 2/3] fs/file.c: conditionally clear full_fds Yu Ma
` (2 subsequent siblings)
3 siblings, 2 replies; 103+ messages in thread
From: Yu Ma @ 2024-07-17 14:50 UTC (permalink / raw)
To: brauner, jack, mjguzik, edumazet
Cc: yu.ma, linux-fsdevel, linux-kernel, pan.deng, tianyou.li,
tim.c.chen, tim.c.chen, viro
alloc_fd() has a sanity check inside to make sure the struct file mapped to
the allocated fd is NULL. Remove this sanity check since it is already
assured by the existing zero initialization and by setting the pointer to
NULL when recycling an fd. Meanwhile, add likely/unlikely annotations and
avoid the expand_files() call when possible to reduce the work done under
file_lock.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 33 ++++++++++++++-------------------
1 file changed, 14 insertions(+), 19 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index a3b72aa64f11..e1b9d6df7941 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -515,7 +515,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
if (fd < files->next_fd)
fd = files->next_fd;
- if (fd < fdt->max_fds)
+ if (likely(fd < fdt->max_fds))
fd = find_next_fd(fdt, fd);
/*
@@ -523,19 +523,21 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
* will limit the total number of files that can be opened.
*/
error = -EMFILE;
- if (fd >= end)
+ if (unlikely(fd >= end))
goto out;
- error = expand_files(files, fd);
- if (error < 0)
- goto out;
+ if (unlikely(fd >= fdt->max_fds)) {
+ error = expand_files(files, fd);
+ if (error < 0)
+ goto out;
- /*
- * If we needed to expand the fs array we
- * might have blocked - try again.
- */
- if (error)
- goto repeat;
+ /*
+ * If we needed to expand the fs array we
+ * might have blocked - try again.
+ */
+ if (error)
+ goto repeat;
+ }
if (start <= files->next_fd)
files->next_fd = fd + 1;
@@ -546,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
else
__clear_close_on_exec(fd, fdt);
error = fd;
-#if 1
- /* Sanity check */
- if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
- printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
- rcu_assign_pointer(fdt->fd[fd], NULL);
- }
-#endif
out:
spin_unlock(&files->file_lock);
@@ -618,7 +613,7 @@ void fd_install(unsigned int fd, struct file *file)
rcu_read_unlock_sched();
spin_lock(&files->file_lock);
fdt = files_fdtable(files);
- BUG_ON(fdt->fd[fd] != NULL);
+ WARN_ON(fdt->fd[fd] != NULL);
rcu_assign_pointer(fdt->fd[fd], file);
spin_unlock(&files->file_lock);
return;
--
2.43.0
^ permalink raw reply related [flat|nested] 103+ messages in thread
* [PATCH v5 2/3] fs/file.c: conditionally clear full_fds
2024-07-17 14:50 ` [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
2024-07-17 14:50 ` [PATCH v5 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Yu Ma
@ 2024-07-17 14:50 ` Yu Ma
2024-07-17 14:50 ` [PATCH v5 3/3] fs/file.c: add fast path in find_next_fd() Yu Ma
2024-07-22 15:02 ` [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in Christian Brauner
3 siblings, 0 replies; 103+ messages in thread
From: Yu Ma @ 2024-07-17 14:50 UTC (permalink / raw)
To: brauner, jack, mjguzik, edumazet
Cc: yu.ma, linux-fsdevel, linux-kernel, pan.deng, tianyou.li,
tim.c.chen, tim.c.chen, viro
Every 64 bits in open_fds are mapped to a common bit in full_fds_bits. It is
very likely that the corresponding bit in full_fds_bits has already been cleared
by a previous __clear_open_fd() call. Check whether the bit is set before
clearing it, to avoid an unnecessary write and the resulting cache bouncing. See
commit fc90888d07b8 ("vfs: conditionally clear close-on-exec flag") for a
similar optimization. Taking the stock kernel with patch 1 as baseline, it
improves pts/blogbench-1.1.0 read by 13% and write by 5% on an Intel ICX
160-core configuration with v6.10-rc7.
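The test-before-write pattern can be demonstrated with a small userspace sketch.
The names below (clear_full_fds_bit(), the stores counter) are illustrative
assumptions, not the kernel's code; the point is only that when the bit is
already clear, no store to the shared word happens at all.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

static unsigned long full_fds_bits[1]; /* one bit per 64 fds, model-sized */
static int stores;                     /* counts writes to the shared word */

/* Illustrative model of the patched __clear_open_fd() tail: only
 * write full_fds_bits when the bit is actually set, since most
 * closes find it already clear. */
static void clear_full_fds_bit(unsigned int fd)
{
    unsigned int bit = fd / BITS_PER_LONG;  /* bit index in full_fds_bits */
    unsigned long mask = 1UL << (bit % BITS_PER_LONG);

    if (full_fds_bits[bit / BITS_PER_LONG] & mask) {
        full_fds_bits[bit / BITS_PER_LONG] &= ~mask;
        stores++;                           /* write only on a set bit */
    }
}
```

Closing two fds from the same 64-fd word dirties the shared word once, not
twice; under the unconditional __clear_bit() it would be written both times,
bouncing the cache line between CPUs.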
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/file.c b/fs/file.c
index e1b9d6df7941..1be2a5bcc7c4 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -268,7 +268,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
{
__clear_bit(fd, fdt->open_fds);
- __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
+ fd /= BITS_PER_LONG;
+ if (test_bit(fd, fdt->full_fds_bits))
+ __clear_bit(fd, fdt->full_fds_bits);
}
static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
--
2.43.0
^ permalink raw reply related [flat|nested] 103+ messages in thread
* [PATCH v5 3/3] fs/file.c: add fast path in find_next_fd()
2024-07-17 14:50 ` [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
2024-07-17 14:50 ` [PATCH v5 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Yu Ma
2024-07-17 14:50 ` [PATCH v5 2/3] fs/file.c: conditionally clear full_fds Yu Ma
@ 2024-07-17 14:50 ` Yu Ma
2024-07-19 17:53 ` Mateusz Guzik
2024-08-06 13:48 ` kernel test robot
2024-07-22 15:02 ` [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in Christian Brauner
3 siblings, 2 replies; 103+ messages in thread
From: Yu Ma @ 2024-07-17 14:50 UTC (permalink / raw)
To: brauner, jack, mjguzik, edumazet
Cc: yu.ma, linux-fsdevel, linux-kernel, pan.deng, tianyou.li,
tim.c.chen, tim.c.chen, viro
Skip the 2-level search via find_next_zero_bit() when there is a free slot in
the word containing next_fd, as:
(1) next_fd indicates the lower bound for the first free fd.
(2) There is a fast path inside find_next_zero_bit() for size <= 64 that speeds
up the search.
(3) After the fdt is expanded (the bitmap size doubles on each expansion), it is
never shrunk. The search size increases, but few free fds are typically
available there.
This fast path was proposed by Mateusz Guzik <mjguzik@gmail.com> and agreed to
by Jan Kara <jack@suse.cz>; it is more generic and scalable than previous
versions. On top of patches 1 and 2, it improves pts/blogbench-1.1.0 read by 8%
and write by 4% on an Intel ICX 160-core configuration with v6.10-rc7.
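The single-word fast path can be sketched in userspace C. The helper names
below (first_zero_from(), find_next_fd_fast()) and the UINT_MAX sentinel are
illustrative assumptions, not kernel identifiers; the fallback into the
full_fds_bits second level is elided.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Find the first zero bit in `word` at or after `offset`; returns
 * BITS_PER_LONG when none. Mirrors find_next_zero_bit() on one word. */
static unsigned int first_zero_from(unsigned long word, unsigned int offset)
{
    for (unsigned int b = offset; b < BITS_PER_LONG; b++)
        if (!(word & (1UL << b)))
            return b;
    return BITS_PER_LONG;
}

/* Fast path: look only at the word containing `start` before falling
 * back to the two-level full_fds_bits scan (fallback elided here,
 * signalled by UINT_MAX). */
static unsigned int find_next_fd_fast(const unsigned long *open_fds,
                                      unsigned int start)
{
    unsigned int bitbit = start / BITS_PER_LONG;
    unsigned int bit = first_zero_from(open_fds[bitbit],
                                       start % BITS_PER_LONG);
    if (bit < BITS_PER_LONG)
        return bitbit * BITS_PER_LONG + bit;
    return UINT_MAX; /* would continue into the second-level scan */
}

static unsigned long sample_partial[1] = { 0x7UL }; /* fds 0..2 open */
static unsigned long sample_full[1]    = { ~0UL };  /* word is full */
```

With fds 0-2 open, the word scan returns fd 3 without ever reading
full_fds_bits; only a completely full word falls through to the slow path.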
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
fs/file.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/fs/file.c b/fs/file.c
index 1be2a5bcc7c4..729c07a4fc28 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -491,6 +491,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
unsigned int maxbit = maxfd / BITS_PER_LONG;
unsigned int bitbit = start / BITS_PER_LONG;
+ unsigned int bit;
+
+ /*
+ * Try to avoid looking at the second level bitmap
+ */
+ bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
+ start & (BITS_PER_LONG - 1));
+ if (bit < BITS_PER_LONG)
+ return bit + bitbit * BITS_PER_LONG;
bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
if (bitbit >= maxfd)
--
2.43.0
^ permalink raw reply related [flat|nested] 103+ messages in thread
* Re: [PATCH v5 3/3] fs/file.c: add fast path in find_next_fd()
2024-07-17 14:50 ` [PATCH v5 3/3] fs/file.c: add fast path in find_next_fd() Yu Ma
@ 2024-07-19 17:53 ` Mateusz Guzik
2024-07-20 12:57 ` Ma, Yu
2024-08-06 13:48 ` kernel test robot
1 sibling, 1 reply; 103+ messages in thread
From: Mateusz Guzik @ 2024-07-19 17:53 UTC (permalink / raw)
To: Yu Ma
Cc: brauner, jack, edumazet, linux-fsdevel, linux-kernel, pan.deng,
tianyou.li, tim.c.chen, tim.c.chen, viro
On Wed, Jul 17, 2024 at 4:24 PM Yu Ma <yu.ma@intel.com> wrote:
>
> Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
> word contains next_fd, as:
> (1) next_fd indicates the lower bound for the first free fd.
> (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> searching.
this is stale -- now the fast path searches up to 64 fds in the lower bitmap
> (3) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
>
> This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
> Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
> versions.
I think this paragraph is droppable. You already got an ack from Jan
below, so stating he agrees with the patch is redundant. As for me I
don't think this warrants mentioning. Just remove it, perhaps
Christian will be willing to massage it by himself to avoid another
series posting.
> And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
> 8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.
>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
> fs/file.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/fs/file.c b/fs/file.c
> index 1be2a5bcc7c4..729c07a4fc28 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -491,6 +491,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> unsigned int maxbit = maxfd / BITS_PER_LONG;
> unsigned int bitbit = start / BITS_PER_LONG;
> + unsigned int bit;
> +
> + /*
> + * Try to avoid looking at the second level bitmap
> + */
> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> + start & (BITS_PER_LONG - 1));
> + if (bit < BITS_PER_LONG)
> + return bit + bitbit * BITS_PER_LONG;
>
> bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> if (bitbit >= maxfd)
> --
> 2.43.0
>
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 3/3] fs/file.c: add fast path in find_next_fd()
2024-07-19 17:53 ` Mateusz Guzik
@ 2024-07-20 12:57 ` Ma, Yu
2024-07-20 14:22 ` Mateusz Guzik
0 siblings, 1 reply; 103+ messages in thread
From: Ma, Yu @ 2024-07-20 12:57 UTC (permalink / raw)
To: Mateusz Guzik
Cc: brauner, jack, edumazet, linux-fsdevel, linux-kernel, pan.deng,
tianyou.li, tim.c.chen, tim.c.chen, viro, yu.ma
On 7/20/2024 1:53 AM, Mateusz Guzik wrote:
> On Wed, Jul 17, 2024 at 4:24 PM Yu Ma <yu.ma@intel.com> wrote:
>> Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
>> word contains next_fd, as:
>> (1) next_fd indicates the lower bound for the first free fd.
>> (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
>> searching.
> this is stale -- now the fast path searches up to 64 fds in the lower bitmap
Nope, this is still valid: since the search size of the fast path inside
find_next_fd() is always 64, it always executes the fast path inside
find_next_zero_bit().
>
>> (3) After fdt is expanded (the bitmap size doubled for each time of expansion),
>> it would never be shrunk. The search size increases but there are few open fds
>> available here.
>>
>> This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
>> Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
>> versions.
> I think this paragraph is droppable. You already got an ack from Jan
> below, so stating he agrees with the patch is redundant. As for me I
> don't think this warrants mentioning. Just remove it, perhaps
> Christian will be willing to massage it by himself to avoid another
> series posting.
The idea of the fast path for the word containing next_fd is from you;
although this patch is small, I think it is reasonable to record that here
out of respect. I appreciate your guidance and comments on this patch;
I've learned a lot along the way while resolving these problems :)
Regards
Yu
>> And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
>> 8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.
>>
>> Reviewed-by: Jan Kara <jack@suse.cz>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>> ---
>> fs/file.c | 9 +++++++++
>> 1 file changed, 9 insertions(+)
>>
>> diff --git a/fs/file.c b/fs/file.c
>> index 1be2a5bcc7c4..729c07a4fc28 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -491,6 +491,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
>> unsigned int maxbit = maxfd / BITS_PER_LONG;
>> unsigned int bitbit = start / BITS_PER_LONG;
>> + unsigned int bit;
>> +
>> + /*
>> + * Try to avoid looking at the second level bitmap
>> + */
>> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
>> + start & (BITS_PER_LONG - 1));
>> + if (bit < BITS_PER_LONG)
>> + return bit + bitbit * BITS_PER_LONG;
>>
>> bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
>> if (bitbit >= maxfd)
>> --
>> 2.43.0
>>
>
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 3/3] fs/file.c: add fast path in find_next_fd()
2024-07-20 12:57 ` Ma, Yu
@ 2024-07-20 14:22 ` Mateusz Guzik
0 siblings, 0 replies; 103+ messages in thread
From: Mateusz Guzik @ 2024-07-20 14:22 UTC (permalink / raw)
To: Ma, Yu
Cc: brauner, jack, edumazet, linux-fsdevel, linux-kernel, pan.deng,
tianyou.li, tim.c.chen, tim.c.chen, viro
I think this is getting too much fluff traffic at this point, which is
partially my fault.
I'm buggering off.
Overall the patchset looks good, I don't see any technical reasons to
avoid merging it.
On Sat, Jul 20, 2024 at 2:57 PM Ma, Yu <yu.ma@intel.com> wrote:
>
>
> On 7/20/2024 1:53 AM, Mateusz Guzik wrote:
> > On Wed, Jul 17, 2024 at 4:24 PM Yu Ma <yu.ma@intel.com> wrote:
> >> Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
> >> word contains next_fd, as:
> >> (1) next_fd indicates the lower bound for the first free fd.
> >> (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> >> searching.
> > this is stale -- now the fast path searches up to 64 fds in the lower bitmap
>
> Nope, this is still valid, as the searching size of the fast path inside
> of find_next_fd() is always 64, it will execute the fast path inside of
> find_next_zero_bit().
>
>
> >
> >> (3) After fdt is expanded (the bitmap size doubled for each time of expansion),
> >> it would never be shrunk. The search size increases but there are few open fds
> >> available here.
> >>
> >> This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
> >> Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
> >> versions.
> > I think this paragraph is droppable. You already got an ack from Jan
> > below, so stating he agrees with the patch is redundant. As for me I
> > don't think this warrants mentioning. Just remove it, perhaps
> > Christian will be willing to massage it by himself to avoid another
> > series posting.
>
> The idea of fast path for the word contains next_fd is from you,
> although this patch is small, I think it is reasonable to record here
> out of my respect. Appreciate for your guide and comments on this patch,
> I've learned a lot on the way of resolving problems :)
>
>
> Regards
>
> Yu
>
> >> And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
> >> 8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.
> >>
> >> Reviewed-by: Jan Kara <jack@suse.cz>
> >> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> >> Signed-off-by: Yu Ma <yu.ma@intel.com>
> >> ---
> >> fs/file.c | 9 +++++++++
> >> 1 file changed, 9 insertions(+)
> >>
> >> diff --git a/fs/file.c b/fs/file.c
> >> index 1be2a5bcc7c4..729c07a4fc28 100644
> >> --- a/fs/file.c
> >> +++ b/fs/file.c
> >> @@ -491,6 +491,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> >> unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> >> unsigned int maxbit = maxfd / BITS_PER_LONG;
> >> unsigned int bitbit = start / BITS_PER_LONG;
> >> + unsigned int bit;
> >> +
> >> + /*
> >> + * Try to avoid looking at the second level bitmap
> >> + */
> >> + bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> >> + start & (BITS_PER_LONG - 1));
> >> + if (bit < BITS_PER_LONG)
> >> + return bit + bitbit * BITS_PER_LONG;
> >>
> >> bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> >> if (bitbit >= maxfd)
> >> --
> >> 2.43.0
> >>
> >
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in
2024-07-17 14:50 ` [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
` (2 preceding siblings ...)
2024-07-17 14:50 ` [PATCH v5 3/3] fs/file.c: add fast path in find_next_fd() Yu Ma
@ 2024-07-22 15:02 ` Christian Brauner
2024-08-01 19:13 ` Al Viro
3 siblings, 1 reply; 103+ messages in thread
From: Christian Brauner @ 2024-07-22 15:02 UTC (permalink / raw)
To: jack, mjguzik, edumazet, Yu Ma
Cc: Christian Brauner, linux-fsdevel, linux-kernel, pan.deng,
tianyou.li, tim.c.chen, tim.c.chen, viro
On Wed, 17 Jul 2024 10:50:15 -0400, Yu Ma wrote:
> pts/blogbench-1.1.0 is a benchmark designed to replicate the
> load of a real-world busy file server by multiple threads of
> random reads, writes, and rewrites. When running default configuration
> with multiple parallel threads, hot spin lock contention is observed
> from alloc_fd(), file_closed_fd() and put_unused_fd() around file_lock.
>
> These 3 patches are created to reduce the critical section of file_lock
> in alloc_fd() and close_fd(). As a result, on top of patch 1,
> pts/blogbench-1.1.0 has been improved by 22% for read and 8% for write
> on Intel ICX 160 cores configuration with v6.10-rc7.
>
> [...]
Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc
[1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
https://git.kernel.org/vfs/vfs/c/19f4a3f5712a
[2/3] fs/file.c: conditionally clear full_fds
https://git.kernel.org/vfs/vfs/c/b483266658a8
[3/3] fs/file.c: add fast path in find_next_fd()
https://git.kernel.org/vfs/vfs/c/3603a42c8c03
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in
2024-07-22 15:02 ` [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in Christian Brauner
@ 2024-08-01 19:13 ` Al Viro
2024-08-02 11:04 ` Christian Brauner
0 siblings, 1 reply; 103+ messages in thread
From: Al Viro @ 2024-08-01 19:13 UTC (permalink / raw)
To: Christian Brauner
Cc: jack, mjguzik, edumazet, Yu Ma, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Mon, Jul 22, 2024 at 05:02:04PM +0200, Christian Brauner wrote:
> Applied to the vfs.misc branch of the vfs/vfs.git tree.
> Patches in the vfs.misc branch should appear in linux-next soon.
>
> Please report any outstanding bugs that were missed during review in a
> new review to the original patch series allowing us to drop it.
>
> It's encouraged to provide Acked-bys and Reviewed-bys even though the
> patch has now been applied. If possible patch trailers will be updated.
>
> Note that commit hashes shown below are subject to change due to rebase,
> trailer updates or similar. If in doubt, please check the listed branch.
>
> tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> branch: vfs.misc
>
> [1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
> https://git.kernel.org/vfs/vfs/c/19f4a3f5712a
> [2/3] fs/file.c: conditionally clear full_fds
> https://git.kernel.org/vfs/vfs/c/b483266658a8
> [3/3] fs/file.c: add fast path in find_next_fd()
> https://git.kernel.org/vfs/vfs/c/3603a42c8c03
Hmm... Something fishy's going on - those are not reachable by any branches.
As a matter of fact, none of the branches contain _anything_ recent in
fs/file.c or include/linux/fdtable.h; the only thing I see for include/linux/file.h
is (dubious, IMO) DEFINE_FREE(fput,...) - this IS_ERR_OR_NULL in there needs
a review of potential users.
I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
and I'm going to apply those 3 unless anyone objects.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in
2024-08-01 19:13 ` Al Viro
@ 2024-08-02 11:04 ` Christian Brauner
2024-08-02 14:22 ` Al Viro
0 siblings, 1 reply; 103+ messages in thread
From: Christian Brauner @ 2024-08-02 11:04 UTC (permalink / raw)
To: Al Viro
Cc: jack, mjguzik, edumazet, Yu Ma, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen
> Hmm... Something fishy's going on - those are not reachable by any branches.
Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
to play around with --onto.
> I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
> and I'm going to apply those 3 unless anyone objects.
Fine since they aren't in that branch. Otherwise I generally prefer to
just merge a common branch. But that's rarely needed since Linus handles
merge conflicts anyway and doesn't mind.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in
2024-08-02 11:04 ` Christian Brauner
@ 2024-08-02 14:22 ` Al Viro
2024-08-05 6:56 ` Christian Brauner
0 siblings, 1 reply; 103+ messages in thread
From: Al Viro @ 2024-08-02 14:22 UTC (permalink / raw)
To: Christian Brauner
Cc: jack, mjguzik, edumazet, Yu Ma, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Fri, Aug 02, 2024 at 01:04:44PM +0200, Christian Brauner wrote:
> > Hmm... Something fishy's going on - those are not reachable by any branches.
>
> Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
> to play around with --onto.
>
> > I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
> > and I'm going to apply those 3 unless anyone objects.
>
> Fine since they aren't in that branch. Otherwise I generally prefer to
> just merge a common branch.
If it's going to be rebased anyway, I don't see much difference from cherry-pick,
TBH...
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in
2024-08-02 14:22 ` Al Viro
@ 2024-08-05 6:56 ` Christian Brauner
2024-08-12 1:31 ` Ma, Yu
0 siblings, 1 reply; 103+ messages in thread
From: Christian Brauner @ 2024-08-05 6:56 UTC (permalink / raw)
To: Al Viro
Cc: jack, mjguzik, edumazet, Yu Ma, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Fri, Aug 02, 2024 at 03:22:48PM GMT, Al Viro wrote:
> On Fri, Aug 02, 2024 at 01:04:44PM +0200, Christian Brauner wrote:
> > > Hmm... Something fishy's going on - those are not reachable by any branches.
> >
> > Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
> > to play around with --onto.
> >
> > > I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
> > > and I'm going to apply those 3 unless anyone objects.
> >
> > Fine since they aren't in that branch. Otherwise I generally prefer to
> > just merge a common branch.
>
> If it's going to be rebased anyway, I don't see much difference from cherry-pick,
> TBH...
Yeah, but I generally don't rebase after -rc1 anymore unless there are
really annoying conflicts.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
2024-07-17 14:50 ` [PATCH v5 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Yu Ma
@ 2024-08-06 13:44 ` kernel test robot
2024-08-14 21:38 ` Al Viro
1 sibling, 0 replies; 103+ messages in thread
From: kernel test robot @ 2024-08-06 13:44 UTC (permalink / raw)
To: Yu Ma
Cc: oe-lkp, lkp, Jan Kara, Tim Chen, linux-fsdevel, ying.huang,
feng.tang, fengwei.yin, brauner, mjguzik, edumazet, yu.ma,
linux-kernel, pan.deng, tianyou.li, tim.c.chen, viro, oliver.sang
Hello,
kernel test robot noticed a 1.2% improvement of will-it-scale.per_process_ops on:
commit: f1139c8e66d5c618aad04a93a2378ad9586464f9 ("[PATCH v5 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()")
url: https://github.com/intel-lab-lkp/linux/commits/Yu-Ma/fs-file-c-remove-sanity_check-and-add-likely-unlikely-in-alloc_fd/20240717-224830
base: https://git.kernel.org/cgit/linux/kernel/git/vfs/vfs.git vfs.all
patch link: https://lore.kernel.org/all/20240717145018.3972922-2-yu.ma@intel.com/
patch subject: [PATCH v5 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
testcase: will-it-scale
test machine: 256 threads 2 sockets GENUINE INTEL(R) XEON(R) (Sierra Forest) with 128G memory
parameters:
nr_task: 100%
mode: process
test: dup1
cpufreq_governor: performance
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240806/202408062146.832faa23-oliver.sang@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-13/performance/x86_64-rhel-8.3/process/100%/debian-12-x86_64-20240206.cgz/lkp-srf-2sp1/dup1/will-it-scale
commit:
5f30e082ab ("Merge branch 'vfs.iomap' into vfs.all")
f1139c8e66 ("fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()")
5f30e082ab8b3431 f1139c8e66d5c618aad04a93a23
---------------- ---------------------------
%stddev %change %stddev
\ | \
377983 ± 69% +74.1% 658036 ± 17% numa-meminfo.node0.AnonPages
18.17 ± 10% -48.6% 9.33 ± 35% perf-c2c.DRAM.local
8.796e+08 +1.2% 8.903e+08 will-it-scale.256.processes
3436082 +1.2% 3477810 will-it-scale.per_process_ops
8.796e+08 +1.2% 8.903e+08 will-it-scale.workload
1.517e+11 -4.3% 1.452e+11 perf-stat.i.branch-instructions
0.03 ± 8% +0.0 0.04 ± 36% perf-stat.i.branch-miss-rate%
0.93 +3.9% 0.96 perf-stat.i.cpi
7.13e+11 -3.5% 6.88e+11 perf-stat.i.instructions
1.08 -3.4% 1.04 perf-stat.i.ipc
0.93 +3.4% 0.96 perf-stat.overall.cpi
1.08 -3.3% 1.04 perf-stat.overall.ipc
245130 -4.4% 234451 perf-stat.overall.path-length
1.512e+11 -4.3% 1.447e+11 perf-stat.ps.branch-instructions
7.106e+11 -3.5% 6.857e+11 perf-stat.ps.instructions
2.156e+14 -3.2% 2.087e+14 perf-stat.total.instructions
14.90 -0.7 14.20 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.dup
12.01 -0.7 11.32 perf-profile.calltrace.cycles-pp.__x64_sys_dup.do_syscall_64.entry_SYSCALL_64_after_hwframe.dup
16.54 -0.7 15.88 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.dup
6.44 -0.6 5.89 perf-profile.calltrace.cycles-pp.alloc_fd.__x64_sys_dup.do_syscall_64.entry_SYSCALL_64_after_hwframe.dup
2.86 -0.0 2.82 perf-profile.calltrace.cycles-pp.entry_SYSRETQ_unsafe_stack.__close
8.94 -0.0 8.90 perf-profile.calltrace.cycles-pp.filp_flush.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
7.76 -0.0 7.72 perf-profile.calltrace.cycles-pp.locks_remove_posix.filp_flush.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.58 -0.0 2.54 perf-profile.calltrace.cycles-pp.__fput_sync.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
1.11 -0.0 1.10 perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
1.33 +0.0 1.35 perf-profile.calltrace.cycles-pp.testcase
0.54 +0.0 0.56 perf-profile.calltrace.cycles-pp.x64_sys_call.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
0.79 +0.0 0.82 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_safe_stack.dup
1.33 +0.0 1.37 perf-profile.calltrace.cycles-pp.close@plt
2.73 +0.1 2.78 perf-profile.calltrace.cycles-pp._raw_spin_lock.file_close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.05 +0.1 1.11 perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.dup
4.35 +0.1 4.42 perf-profile.calltrace.cycles-pp.file_close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
22.18 +0.3 22.51 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.__close
21.50 ± 2% +1.5 23.02 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.dup
12.10 -0.7 11.39 perf-profile.children.cycles-pp.__x64_sys_dup
34.79 -0.7 34.12 perf-profile.children.cycles-pp.do_syscall_64
38.04 -0.6 37.42 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
6.48 -0.6 5.90 perf-profile.children.cycles-pp.alloc_fd
1.86 -0.5 1.41 perf-profile.children.cycles-pp.syscall_return_via_sysret
0.57 -0.1 0.47 perf-profile.children.cycles-pp.fd_install
9.11 -0.0 9.07 perf-profile.children.cycles-pp.filp_flush
7.93 -0.0 7.89 perf-profile.children.cycles-pp.locks_remove_posix
2.61 -0.0 2.58 perf-profile.children.cycles-pp.__fput_sync
1.16 +0.0 1.18 perf-profile.children.cycles-pp.x64_sys_call
0.05 +0.0 0.07 ± 13% perf-profile.children.cycles-pp.clockevents_program_event
0.51 +0.0 0.53 perf-profile.children.cycles-pp.syscall_exit_to_user_mode_prepare
2.17 +0.0 2.20 perf-profile.children.cycles-pp.syscall_exit_to_user_mode
5.72 +0.0 5.75 perf-profile.children.cycles-pp._raw_spin_lock
2.10 +0.0 2.13 perf-profile.children.cycles-pp.testcase
2.02 +0.0 2.06 perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
0.13 ± 2% +0.0 0.17 perf-profile.children.cycles-pp.dup@plt
4.38 +0.1 4.46 perf-profile.children.cycles-pp.file_close_fd
23.00 +0.1 23.11 perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
59.27 +0.5 59.73 perf-profile.children.cycles-pp.__close
28.73 +1.1 29.80 perf-profile.children.cycles-pp.entry_SYSCALL_64
1.86 -0.5 1.41 perf-profile.self.cycles-pp.syscall_return_via_sysret
2.28 -0.2 2.12 perf-profile.self.cycles-pp.alloc_fd
0.54 -0.1 0.43 perf-profile.self.cycles-pp.fd_install
7.87 -0.0 7.83 perf-profile.self.cycles-pp.locks_remove_posix
2.47 -0.0 2.44 perf-profile.self.cycles-pp.__fput_sync
1.23 +0.0 1.24 perf-profile.self.cycles-pp.file_close_fd_locked
1.09 +0.0 1.11 perf-profile.self.cycles-pp.x64_sys_call
0.51 +0.0 0.53 perf-profile.self.cycles-pp.syscall_exit_to_user_mode_prepare
1.29 +0.0 1.32 perf-profile.self.cycles-pp.testcase
5.66 +0.0 5.69 perf-profile.self.cycles-pp._raw_spin_lock
1.95 +0.0 1.99 perf-profile.self.cycles-pp.entry_SYSCALL_64_safe_stack
2.85 +0.0 2.90 perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.02 ±141% +0.0 0.06 ± 13% perf-profile.self.cycles-pp.ktime_get
0.00 +0.1 0.07 perf-profile.self.cycles-pp.dup@plt
22.93 +0.1 23.05 perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
10.11 +0.2 10.34 perf-profile.self.cycles-pp.dup
13.70 +0.3 13.98 perf-profile.self.cycles-pp.entry_SYSCALL_64
9.84 ± 3% +0.7 10.51 perf-profile.self.cycles-pp.__close
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v5 3/3] fs/file.c: add fast path in find_next_fd()
2024-07-17 14:50 ` [PATCH v5 3/3] fs/file.c: add fast path in find_next_fd() Yu Ma
2024-07-19 17:53 ` Mateusz Guzik
@ 2024-08-06 13:48 ` kernel test robot
1 sibling, 0 replies; 103+ messages in thread
From: kernel test robot @ 2024-08-06 13:48 UTC (permalink / raw)
To: Yu Ma
Cc: oe-lkp, lkp, Jan Kara, Tim Chen, linux-fsdevel, ying.huang,
feng.tang, fengwei.yin, brauner, mjguzik, edumazet, yu.ma,
linux-kernel, pan.deng, tianyou.li, tim.c.chen, viro, oliver.sang
Hello,
kernel test robot noticed a 6.3% improvement of will-it-scale.per_thread_ops on:
commit: b8decf0015a8b1ff02cdac61c0aa54355d8e73d7 ("[PATCH v5 3/3] fs/file.c: add fast path in find_next_fd()")
url: https://github.com/intel-lab-lkp/linux/commits/Yu-Ma/fs-file-c-remove-sanity_check-and-add-likely-unlikely-in-alloc_fd/20240717-224830
base: https://git.kernel.org/cgit/linux/kernel/git/vfs/vfs.git vfs.all
patch link: https://lore.kernel.org/all/20240717145018.3972922-4-yu.ma@intel.com/
patch subject: [PATCH v5 3/3] fs/file.c: add fast path in find_next_fd()
testcase: will-it-scale
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory
parameters:
nr_task: 100%
mode: thread
test: open3
cpufreq_governor: performance
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240806/202408062152.7e5b5d6d-oliver.sang@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-13/performance/x86_64-rhel-8.3/thread/100%/debian-12-x86_64-20240206.cgz/lkp-spr-2sp4/open3/will-it-scale
commit:
5bb3423bf9 ("fs/file.c: conditionally clear full_fds")
b8decf0015 ("fs/file.c: add fast path in find_next_fd()")
5bb3423bf9f9d91e b8decf0015a8b1ff02cdac61c0a
---------------- ---------------------------
%stddev %change %stddev
\ | \
848151 +6.2% 901119 ± 2% will-it-scale.224.threads
3785 +6.3% 4022 ± 2% will-it-scale.per_thread_ops
848151 +6.2% 901119 ± 2% will-it-scale.workload
0.28 ± 4% +13.3% 0.32 ± 3% perf-stat.i.MPKI
31.31 ± 3% +2.0 33.28 perf-stat.i.cache-miss-rate%
14955855 ± 4% +13.6% 16995785 ± 4% perf-stat.i.cache-misses
49676581 +6.7% 53009444 ± 3% perf-stat.i.cache-references
43955 ± 4% -12.3% 38549 ± 4% perf-stat.i.cycles-between-cache-misses
0.28 ± 4% +13.4% 0.32 ± 4% perf-stat.overall.MPKI
29.84 ± 3% +1.9 31.78 ± 2% perf-stat.overall.cache-miss-rate%
43445 ± 4% -12.1% 38200 ± 4% perf-stat.overall.cycles-between-cache-misses
19005976 -5.4% 17972604 ± 2% perf-stat.overall.path-length
14869677 ± 4% +13.6% 16898438 ± 4% perf-stat.ps.cache-misses
49821402 +6.7% 53168235 ± 3% perf-stat.ps.cache-references
49.42 -0.1 49.34 perf-profile.calltrace.cycles-pp.alloc_fd.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
49.40 -0.1 49.32 perf-profile.calltrace.cycles-pp.file_close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
49.25 -0.1 49.18 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.file_close_fd.__x64_sys_close.do_syscall_64
49.20 -0.1 49.13 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_fd.do_sys_openat2.__x64_sys_openat
49.33 -0.1 49.26 perf-profile.calltrace.cycles-pp._raw_spin_lock.file_close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
49.28 -0.1 49.22 perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_fd.do_sys_openat2.__x64_sys_openat.do_syscall_64
50.14 +0.0 50.18 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
50.17 +0.0 50.21 perf-profile.calltrace.cycles-pp.open64
0.64 ± 5% +0.1 0.75 ± 6% perf-profile.calltrace.cycles-pp.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.62 ± 5% +0.1 0.74 ± 6% perf-profile.calltrace.cycles-pp.path_openat.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64
49.42 -0.1 49.34 perf-profile.children.cycles-pp.alloc_fd
49.40 -0.1 49.32 perf-profile.children.cycles-pp.file_close_fd
0.06 -0.0 0.05 perf-profile.children.cycles-pp.file_close_fd_locked
0.15 ± 5% +0.0 0.17 ± 4% perf-profile.children.cycles-pp.init_file
0.22 ± 3% +0.0 0.25 ± 3% perf-profile.children.cycles-pp.alloc_empty_file
0.18 ± 6% +0.0 0.22 ± 6% perf-profile.children.cycles-pp.__fput
50.14 +0.0 50.18 perf-profile.children.cycles-pp.__x64_sys_openat
50.18 +0.0 50.22 perf-profile.children.cycles-pp.open64
0.18 ± 14% +0.0 0.23 ± 7% perf-profile.children.cycles-pp.do_dentry_open
0.30 ± 8% +0.1 0.36 ± 8% perf-profile.children.cycles-pp.do_open
0.64 ± 5% +0.1 0.75 ± 6% perf-profile.children.cycles-pp.do_filp_open
0.63 ± 5% +0.1 0.75 ± 6% perf-profile.children.cycles-pp.path_openat
0.06 -0.0 0.05 perf-profile.self.cycles-pp.file_close_fd_locked
0.16 ± 2% +0.0 0.18 ± 2% perf-profile.self.cycles-pp._raw_spin_lock
0.08 ± 12% +0.0 0.10 ± 4% perf-profile.self.cycles-pp.__fput
0.05 ± 7% +0.1 0.10 ± 4% perf-profile.self.cycles-pp.alloc_fd
* Re: [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in
2024-08-05 6:56 ` Christian Brauner
@ 2024-08-12 1:31 ` Ma, Yu
2024-08-12 2:40 ` Al Viro
0 siblings, 1 reply; 103+ messages in thread
From: Ma, Yu @ 2024-08-12 1:31 UTC (permalink / raw)
To: Christian Brauner, Al Viro
Cc: jack, mjguzik, edumazet, linux-fsdevel, linux-kernel, pan.deng,
tianyou.li, tim.c.chen, tim.c.chen, yu.ma
On 8/5/2024 2:56 PM, Christian Brauner wrote:
> On Fri, Aug 02, 2024 at 03:22:48PM GMT, Al Viro wrote:
>> On Fri, Aug 02, 2024 at 01:04:44PM +0200, Christian Brauner wrote:
>>>> Hmm... Something fishy's going on - those are not reachable by any branches.
>>> Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
>>> to play around with --onto.
>>>
>>>> I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
>>>> and I'm going to apply those 3 unless anyone objects.
>>> Fine since they aren't in that branch. Otherwise I generally prefer to
>>> just merge a common branch.
>> If it's going to be rebased anyway, I don't see much difference from cherry-pick,
>> TBH...
> Yeah, but I generally don't rebase after -rc1 anymore unless there's
> really annoying conflicts.
Thanks Christian and Al for your time and efforts. I'm not familiar with
the merging process; may I know when these patches will appear in master?
Regards
Yu
* Re: [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in
2024-08-12 1:31 ` Ma, Yu
@ 2024-08-12 2:40 ` Al Viro
2024-08-12 15:09 ` Ma, Yu
2024-11-06 17:44 ` Jan Kara
0 siblings, 2 replies; 103+ messages in thread
From: Al Viro @ 2024-08-12 2:40 UTC (permalink / raw)
To: Ma, Yu
Cc: Christian Brauner, jack, mjguzik, edumazet, linux-fsdevel,
linux-kernel, pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Mon, Aug 12, 2024 at 09:31:17AM +0800, Ma, Yu wrote:
>
> On 8/5/2024 2:56 PM, Christian Brauner wrote:
> > On Fri, Aug 02, 2024 at 03:22:48PM GMT, Al Viro wrote:
> > > On Fri, Aug 02, 2024 at 01:04:44PM +0200, Christian Brauner wrote:
> > > > > Hmm... Something fishy's going on - those are not reachable by any branches.
> > > > Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
> > > > to play around with --onto.
> > > >
> > > > > I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
> > > > > and I'm going to apply those 3 unless anyone objects.
> > > > Fine since they aren't in that branch. Otherwise I generally prefer to
> > > > just merge a common branch.
> > > If it's going to be rebased anyway, I don't see much difference from cherry-pick,
> > > TBH...
> > Yeah, but I generally don't rebase after -rc1 anymore unless there's
> > really annoying conflicts.
>
> Thanks Christian and Al for your time and efforts. I'm not familiar with the
> merging process; may I know when these patches will appear in master
It's in work.fdtable in my tree, will post that series tonight and add to #for-next
* Re: [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in
2024-08-12 2:40 ` Al Viro
@ 2024-08-12 15:09 ` Ma, Yu
2024-11-06 17:44 ` Jan Kara
1 sibling, 0 replies; 103+ messages in thread
From: Ma, Yu @ 2024-08-12 15:09 UTC (permalink / raw)
To: Al Viro
Cc: Christian Brauner, jack, mjguzik, edumazet, linux-fsdevel,
linux-kernel, pan.deng, tianyou.li, tim.c.chen, tim.c.chen, yu.ma
On 8/12/2024 10:40 AM, Al Viro wrote:
> On Mon, Aug 12, 2024 at 09:31:17AM +0800, Ma, Yu wrote:
>> On 8/5/2024 2:56 PM, Christian Brauner wrote:
>>> On Fri, Aug 02, 2024 at 03:22:48PM GMT, Al Viro wrote:
>>>> On Fri, Aug 02, 2024 at 01:04:44PM +0200, Christian Brauner wrote:
>>>>>> Hmm... Something fishy's going on - those are not reachable by any branches.
>>>>> Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
>>>>> to play around with --onto.
>>>>>
>>>>>> I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
>>>>>> and I'm going to apply those 3 unless anyone objects.
>>>>> Fine since they aren't in that branch. Otherwise I generally prefer to
>>>>> just merge a common branch.
>>>> If it's going to be rebased anyway, I don't see much difference from cherry-pick,
>>>> TBH...
>>> Yeah, but I generally don't rebase after -rc1 anymore unless there's
>>> really annoying conflicts.
>> Thanks Christian and Al for your time and efforts. I'm not familiar with the
>> merging process; may I know when these patches will appear in master
> It's in work.fdtable in my tree, will post that series tonight and add to #for-next
Thanks Al for confirmation :)
* Re: [PATCH v5 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
2024-07-17 14:50 ` [PATCH v5 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Yu Ma
2024-08-06 13:44 ` kernel test robot
@ 2024-08-14 21:38 ` Al Viro
2024-08-15 2:49 ` Ma, Yu
1 sibling, 1 reply; 103+ messages in thread
From: Al Viro @ 2024-08-14 21:38 UTC (permalink / raw)
To: Yu Ma
Cc: brauner, jack, mjguzik, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Wed, Jul 17, 2024 at 10:50:16AM -0400, Yu Ma wrote:
> alloc_fd() has a sanity check inside to make sure the struct file mapping to the
> allocated fd is NULL. Remove this sanity check since it is assured by the
> existing zero initialization and the NULL set when recycling an fd. Meanwhile, add
> likely/unlikely annotations and avoid the expand_files() call to reduce the work under
> file_lock.
> + if (unlikely(fd >= fdt->max_fds)) {
> + error = expand_files(files, fd);
> + if (error < 0)
> + goto out;
>
> - /*
> - * If we needed to expand the fs array we
> - * might have blocked - try again.
> - */
> - if (error)
> - goto repeat;
> + /*
> + * If we needed to expand the fs array we
> + * might have blocked - try again.
> + */
> + if (error)
> + goto repeat;
With that change you can't get 0 from expand_files() here, so the
last goto should be unconditional. The only case when expand_files()
returns 0 is when it has found the descriptor already being covered
by fdt; since fdt->max_fds is stabilized by ->files_lock we are
holding here, comparison in expand_files() will give the same
result as it just had.
IOW, that goto repeat should be unconditional. The fun part here is
that this was the only caller that distinguished between 0 and 1...
* Re: [PATCH v5 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
2024-08-14 21:38 ` Al Viro
@ 2024-08-15 2:49 ` Ma, Yu
2024-08-15 3:45 ` Al Viro
0 siblings, 1 reply; 103+ messages in thread
From: Ma, Yu @ 2024-08-15 2:49 UTC (permalink / raw)
To: Al Viro
Cc: brauner, jack, mjguzik, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen, yu.ma
On 8/15/2024 5:38 AM, Al Viro wrote:
> On Wed, Jul 17, 2024 at 10:50:16AM -0400, Yu Ma wrote:
>> alloc_fd() has a sanity check inside to make sure the struct file mapping to the
>> allocated fd is NULL. Remove this sanity check since it is assured by the
>> existing zero initialization and the NULL set when recycling an fd. Meanwhile, add
>> likely/unlikely annotations and avoid the expand_files() call to reduce the work under
>> file_lock.
>> + if (unlikely(fd >= fdt->max_fds)) {
>> + error = expand_files(files, fd);
>> + if (error < 0)
>> + goto out;
>>
>> - /*
>> - * If we needed to expand the fs array we
>> - * might have blocked - try again.
>> - */
>> - if (error)
>> - goto repeat;
>> + /*
>> + * If we needed to expand the fs array we
>> + * might have blocked - try again.
>> + */
>> + if (error)
>> + goto repeat;
> With that change you can't get 0 from expand_files() here, so the
> last goto should be unconditional. The only case when expand_files()
> returns 0 is when it has found the descriptor already being covered
> by fdt; since fdt->max_fds is stabilized by ->files_lock we are
> holding here, comparison in expand_files() will give the same
> result as it just had.
>
> IOW, that goto repeat should be unconditional. The fun part here is
> that this was the only caller that distinguished between 0 and 1...
Yes, thanks Al, fully agree with you. The if (error) could be removed
here, as the above unlikely makes sure no 0 is returned. Should I
submit another version of the patch set to update it, or could you update
it directly during the merge? I'm not very familiar with the rules, so
please let me know if I should update it and I'll take action soon. Thanks
again for your careful check here.
* Re: [PATCH v5 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
2024-08-15 2:49 ` Ma, Yu
@ 2024-08-15 3:45 ` Al Viro
2024-08-15 8:34 ` Ma, Yu
2024-10-31 7:42 ` Mateusz Guzik
0 siblings, 2 replies; 103+ messages in thread
From: Al Viro @ 2024-08-15 3:45 UTC (permalink / raw)
To: Ma, Yu
Cc: brauner, jack, mjguzik, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Thu, Aug 15, 2024 at 10:49:40AM +0800, Ma, Yu wrote:
> Yes, thanks Al, fully agree with you. The if (error) could be removed here,
> as the above unlikely makes sure no 0 is returned. Should I submit
> another version of the patch set to update it, or could you update it
> directly during the merge? I'm not very familiar with the rules, please let me
> know if I should update it and I'll take action soon. Thanks again for your careful
> check here.
See the current work.fdtable...
* Re: [PATCH v5 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
2024-08-15 3:45 ` Al Viro
@ 2024-08-15 8:34 ` Ma, Yu
2024-10-31 7:42 ` Mateusz Guzik
1 sibling, 0 replies; 103+ messages in thread
From: Ma, Yu @ 2024-08-15 8:34 UTC (permalink / raw)
To: Al Viro
Cc: brauner, jack, mjguzik, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen, yu.ma
On 8/15/2024 11:45 AM, Al Viro wrote:
> On Thu, Aug 15, 2024 at 10:49:40AM +0800, Ma, Yu wrote:
>
>> Yes, thanks Al, fully agree with you. The if (error) could be removed here,
>> as the above unlikely makes sure no 0 is returned. Should I submit
>> another version of the patch set to update it, or could you update it
>> directly during the merge? I'm not very familiar with the rules, please let me
>> know if I should update it and I'll take action soon. Thanks again for your careful
>> check here.
> See the current work.fdtable...
Saw it, thanks :)
* Re: [PATCH v5 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
2024-08-15 3:45 ` Al Viro
2024-08-15 8:34 ` Ma, Yu
@ 2024-10-31 7:42 ` Mateusz Guzik
2024-10-31 10:14 ` Christian Brauner
1 sibling, 1 reply; 103+ messages in thread
From: Mateusz Guzik @ 2024-10-31 7:42 UTC (permalink / raw)
To: Al Viro
Cc: Ma, Yu, brauner, jack, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Thu, Aug 15, 2024 at 5:45 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Thu, Aug 15, 2024 at 10:49:40AM +0800, Ma, Yu wrote:
>
> > > Yes, thanks Al, fully agree with you. The if (error) could be removed here,
> > > as the above unlikely makes sure no 0 is returned. Should I submit
> > > another version of the patch set to update it, or could you update it
> > > directly during the merge? I'm not very familiar with the rules, please let me
> > > know if I should update it and I'll take action soon. Thanks again for your careful
> > > check here.
>
> See the current work.fdtable...
this patchset seems to have fallen through the cracks. anything
holding up the merge?
--
Mateusz Guzik <mjguzik gmail.com>
* Re: [PATCH v5 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
2024-10-31 7:42 ` Mateusz Guzik
@ 2024-10-31 10:14 ` Christian Brauner
0 siblings, 0 replies; 103+ messages in thread
From: Christian Brauner @ 2024-10-31 10:14 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Al Viro, Ma, Yu, jack, edumazet, linux-fsdevel, linux-kernel,
pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Thu, Oct 31, 2024 at 08:42:18AM +0100, Mateusz Guzik wrote:
> On Thu, Aug 15, 2024 at 5:45 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Thu, Aug 15, 2024 at 10:49:40AM +0800, Ma, Yu wrote:
> >
> > > Yes, thanks Al, fully agree with you. The if (error) could be removed here,
> > > as the above unlikely makes sure no 0 is returned. Should I submit
> > > another version of the patch set to update it, or could you update it
> > > directly during the merge? I'm not very familiar with the rules, please let me
> > > know if I should update it and I'll take action soon. Thanks again for your careful
> > > check here.
> >
> > See the current work.fdtable...
>
> this patchset seems to have fallen through the cracks. anything
> holding up the merge?
It's in next for this merge window.
* Re: [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in
2024-08-12 2:40 ` Al Viro
2024-08-12 15:09 ` Ma, Yu
@ 2024-11-06 17:44 ` Jan Kara
2024-11-06 17:59 ` Al Viro
1 sibling, 1 reply; 103+ messages in thread
From: Jan Kara @ 2024-11-06 17:44 UTC (permalink / raw)
To: Al Viro
Cc: Ma, Yu, Christian Brauner, jack, mjguzik, edumazet, linux-fsdevel,
linux-kernel, pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Mon 12-08-24 03:40:44, Al Viro wrote:
> On Mon, Aug 12, 2024 at 09:31:17AM +0800, Ma, Yu wrote:
> >
> > On 8/5/2024 2:56 PM, Christian Brauner wrote:
> > > On Fri, Aug 02, 2024 at 03:22:48PM GMT, Al Viro wrote:
> > > > On Fri, Aug 02, 2024 at 01:04:44PM +0200, Christian Brauner wrote:
> > > > > > Hmm... Something fishy's going on - those are not reachable by any branches.
> > > > > Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
> > > > > to play around with --onto.
> > > > >
> > > > > > I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
> > > > > > and I'm going to apply those 3 unless anyone objects.
> > > > > Fine since they aren't in that branch. Otherwise I generally prefer to
> > > > > just merge a common branch.
> > > > If it's going to be rebased anyway, I don't see much difference from cherry-pick,
> > > > TBH...
> > > Yeah, but I generally don't rebase after -rc1 anymore unless there's
> > > really annoying conflicts.
> >
> > Thanks Christian and Al for your time and efforts. I'm not familiar with the
> > merging process; may I know when these patches will appear in master
>
> It's in work.fdtable in my tree, will post that series tonight and add to #for-next
Al, it seems you didn't push the patches to Linus during the last merge
window. Do you plan to push them during the coming one?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in
2024-11-06 17:44 ` Jan Kara
@ 2024-11-06 17:59 ` Al Viro
0 siblings, 0 replies; 103+ messages in thread
From: Al Viro @ 2024-11-06 17:59 UTC (permalink / raw)
To: Jan Kara
Cc: Ma, Yu, Christian Brauner, mjguzik, edumazet, linux-fsdevel,
linux-kernel, pan.deng, tianyou.li, tim.c.chen, tim.c.chen
On Wed, Nov 06, 2024 at 06:44:23PM +0100, Jan Kara wrote:
> On Mon 12-08-24 03:40:44, Al Viro wrote:
> > On Mon, Aug 12, 2024 at 09:31:17AM +0800, Ma, Yu wrote:
> > >
> > > On 8/5/2024 2:56 PM, Christian Brauner wrote:
> > > > On Fri, Aug 02, 2024 at 03:22:48PM GMT, Al Viro wrote:
> > > > > On Fri, Aug 02, 2024 at 01:04:44PM +0200, Christian Brauner wrote:
> > > > > > > Hmm... Something fishy's going on - those are not reachable by any branches.
> > > > > > Hm, they probably got dropped when rebasing to v6.11-rc1 and I did have
> > > > > > to play around with --onto.
> > > > > >
> > > > > > > I'm putting together (in viro/vfs.git) a branch for that area (#work.fdtable)
> > > > > > > and I'm going to apply those 3 unless anyone objects.
> > > > > > Fine since they aren't in that branch. Otherwise I generally prefer to
> > > > > > just merge a common branch.
> > > > > If it's going to be rebased anyway, I don't see much difference from cherry-pick,
> > > > > TBH...
> > > > Yeah, but I generally don't rebase after -rc1 anymore unless there's
> > > > really annoying conflicts.
> > >
> > > Thanks Christian and Al for your time and efforts. I'm not familiar with the
> > > merging process; may I know when these patches will appear in master
> >
> > It's in work.fdtable in my tree, will post that series tonight and add to #for-next
>
> Al, it seems you didn't push the patches to Linus during the last merge
> window. Do you plan to push them during the coming one?
Yes, I do. I could merge that branch into viro/vfs.git#for-next, but that would be
redundant - note that vfs/vfs.git#work.fdtable is identical to it (same sha1 of
head), vfs/vfs.git#vfs.file contains a merge from it, and vfs/vfs.git#vfs.all
merges #vfs.file. So it's already in linux-next in its current form, and until
something else gets added to the branch I see no point.
Thread overview: 103+ messages
2024-06-14 16:34 [PATCH 0/3] fs/file.c: optimize the critical section of Yu Ma
2024-06-14 16:34 ` [PATCH 1/3] fs/file.c: add fast path in alloc_fd() Yu Ma
2024-06-15 6:31 ` Mateusz Guzik
2024-06-16 4:01 ` Ma, Yu
2024-06-17 17:49 ` Tim Chen
2024-06-19 10:36 ` David Laight
2024-06-19 17:09 ` Ma, Yu
2024-06-14 16:34 ` [PATCH 2/3] fs/file.c: conditionally clear full_fds Yu Ma
2024-06-14 16:34 ` [PATCH 3/3] fs/file.c: move sanity_check from alloc_fd() to put_unused_fd() Yu Ma
2024-06-15 4:41 ` Mateusz Guzik
2024-06-15 5:07 ` Mateusz Guzik
2024-06-17 17:55 ` Tim Chen
2024-06-17 17:59 ` Mateusz Guzik
2024-06-17 18:04 ` Tim Chen
2024-06-18 8:35 ` Michal Hocko
2024-06-18 9:06 ` Mateusz Guzik
2024-06-18 20:40 ` Tim Chen
2024-06-16 3:47 ` Ma, Yu
2024-06-17 11:23 ` Mateusz Guzik
2024-06-17 17:22 ` Ma, Yu
2024-06-17 8:36 ` Christian Brauner
2024-06-22 15:49 ` [PATCH v2 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
2024-06-22 15:49 ` [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd() Yu Ma
2024-06-25 11:52 ` Jan Kara
2024-06-25 12:53 ` Jan Kara
2024-06-25 15:33 ` Ma, Yu
2024-06-26 11:54 ` Jan Kara
2024-06-26 16:43 ` Tim Chen
2024-06-26 16:52 ` Tim Chen
2024-06-27 12:09 ` Jan Kara
2024-06-27 12:20 ` Mateusz Guzik
2024-06-27 16:21 ` Tim Chen
2024-06-26 19:13 ` Mateusz Guzik
2024-06-27 14:03 ` Jan Kara
2024-06-27 15:33 ` Christian Brauner
2024-06-27 18:27 ` Ma, Yu
2024-06-27 19:59 ` Mateusz Guzik
2024-06-28 9:12 ` Jan Kara
2024-06-29 15:41 ` Ma, Yu
2024-06-29 15:46 ` Mateusz Guzik
2024-06-29 14:23 ` Ma, Yu
2024-06-22 15:49 ` [PATCH v2 2/3] fs/file.c: conditionally clear full_fds Yu Ma
2024-06-25 11:54 ` Jan Kara
2024-06-25 15:41 ` Ma, Yu
2024-06-22 15:49 ` [PATCH v2 3/3] fs/file.c: remove sanity_check from alloc_fd() Yu Ma
2024-06-25 12:08 ` Jan Kara
2024-06-25 13:09 ` Mateusz Guzik
2024-06-25 13:11 ` Mateusz Guzik
2024-06-25 13:30 ` Jan Kara
2024-06-26 13:10 ` Christian Brauner
2024-07-03 14:33 ` [PATCH v3 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
2024-07-03 14:33 ` [PATCH v3 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Yu Ma
2024-07-03 14:34 ` Christian Brauner
2024-07-03 14:46 ` Ma, Yu
2024-07-04 10:11 ` Jan Kara
2024-07-04 14:45 ` Ma, Yu
2024-07-04 15:41 ` Jan Kara
2024-07-03 14:33 ` [PATCH v3 2/3] fs/file.c: conditionally clear full_fds Yu Ma
2024-07-03 14:33 ` [PATCH v3 3/3] fs/file.c: add fast path in find_next_fd() Yu Ma
2024-07-03 14:17 ` Mateusz Guzik
2024-07-03 14:28 ` Ma, Yu
2024-07-04 10:07 ` Jan Kara
2024-07-04 10:03 ` Jan Kara
2024-07-04 14:50 ` Ma, Yu
2024-07-04 17:44 ` Mateusz Guzik
2024-07-04 21:55 ` Jan Kara
2024-07-05 7:56 ` Ma, Yu
2024-07-09 8:32 ` Ma, Yu
2024-07-09 10:17 ` Mateusz Guzik
2024-07-10 23:40 ` Tim Chen
2024-07-11 9:27 ` Ma, Yu
2024-07-13 2:39 ` [PATCH v4 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
2024-07-13 2:39 ` [PATCH v4 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Yu Ma
2024-07-16 11:11 ` Jan Kara
2024-07-13 2:39 ` [PATCH v4 2/3] fs/file.c: conditionally clear full_fds Yu Ma
2024-07-13 2:39 ` [PATCH v4 3/3] fs/file.c: add fast path in find_next_fd() Yu Ma
2024-07-16 11:19 ` Jan Kara
2024-07-16 12:37 ` Ma, Yu
2024-07-17 14:50 ` [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in Yu Ma
2024-07-17 14:50 ` [PATCH v5 1/3] fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() Yu Ma
2024-08-06 13:44 ` kernel test robot
2024-08-14 21:38 ` Al Viro
2024-08-15 2:49 ` Ma, Yu
2024-08-15 3:45 ` Al Viro
2024-08-15 8:34 ` Ma, Yu
2024-10-31 7:42 ` Mateusz Guzik
2024-10-31 10:14 ` Christian Brauner
2024-07-17 14:50 ` [PATCH v5 2/3] fs/file.c: conditionally clear full_fds Yu Ma
2024-07-17 14:50 ` [PATCH v5 3/3] fs/file.c: add fast path in find_next_fd() Yu Ma
2024-07-19 17:53 ` Mateusz Guzik
2024-07-20 12:57 ` Ma, Yu
2024-07-20 14:22 ` Mateusz Guzik
2024-08-06 13:48 ` kernel test robot
2024-07-22 15:02 ` [PATCH v5 0/3] fs/file.c: optimize the critical section of file_lock in Christian Brauner
2024-08-01 19:13 ` Al Viro
2024-08-02 11:04 ` Christian Brauner
2024-08-02 14:22 ` Al Viro
2024-08-05 6:56 ` Christian Brauner
2024-08-12 1:31 ` Ma, Yu
2024-08-12 2:40 ` Al Viro
2024-08-12 15:09 ` Ma, Yu
2024-11-06 17:44 ` Jan Kara
2024-11-06 17:59 ` Al Viro