* [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
@ 2024-12-25 9:42 WangYuli
2024-12-25 13:30 ` Andy Shevchenko
` (2 more replies)
0 siblings, 3 replies; 37+ messages in thread
From: WangYuli @ 2024-12-25 9:42 UTC (permalink / raw)
To: viro, brauner, jack
Cc: linux-fsdevel, linux-kernel, yushengjin, zhangdandan, guanwentao,
zhanjun, oliver.sang, ebiederm, colin.king, josh, penberg,
manfred, mingo, jes, hch, aia21, arjan, jgarzik, neukum, oliver,
dada1, axboe, axboe, nickpiggin, dhowells, nathans, rolandd,
tytso, bunk, pbadari, ak, ak, davem, jsipek, jens.axboe, ramsdell,
hch, torvalds, akpm, randy.dunlap, efault, rdunlap, haveblue,
drepper, dm.n9107, jblunck, davidel, mtk.manpages, linux-arch,
vda.linux, jmorris, serue, hca, rth, lethal, tony.luck,
heiko.carstens, oleg, andi, corbet, crquan, mszeredi, miklos,
peterz, a.p.zijlstra, earl_chew, npiggin, npiggin, julia, jaxboe,
nikai, dchinner, davej, npiggin, eric.dumazet, tim.c.chen, xemul,
tj, serge.hallyn, gorcunov, levinsasha928, penberg, amwang, bcrl,
muthu.lkml, muthur, mjt, alan, raven, thomas, will.deacon, will,
josef, anatol.pomozov, koverstreet, zab, balbi, gregkh, mfasheh,
jlbec, rusty, asamymuthupa, smani, sbradshaw, jmoyer, sim, ia,
dmonakhov, ebiggers3, socketpair, penguin-kernel, w,
kirill.shutemov, mhocko, vdavydov.dev, vdavydov, hannes, mhocko,
minchan, deepa.kernel, arnd, balbi, swhiteho, konishi.ryusuke,
dsterba, vegard.nossum, axboe, pombredanne, tglx, joe.lawrence,
mpatocka, mcgrof, keescook, linux, jannh, shakeelb, guro, willy,
khlebnikov, kirr, stern, elver, parri.andrea, paulmck, rasibley,
jstancek, avagin, cai, josef, hare, colyli, johannes, sspatil,
alex_y_xu, mgorman, gor, jhubbard, andriy.shevchenko, crope,
yzaikin, bfields, jlayton, kernel, steve, nixiaoming, 0x7f454c46,
kuniyu, alexander.h.duyck, kuni1840, soheil, sridhar.samudrala,
Vincenzo.Frascino, chuck.lever, Kevin.Brodsky, Szabolcs.Nagy,
David.Laight, Mark.Rutland, linux-morello, Luca.Vizzarro,
max.kellermann, adobriyan, lukas, j.granados, djwong,
kent.overstreet, linux, kstewart, WangYuli
When a user calls the read/write system call and passes a pipe
descriptor, the pipe_read/pipe_write functions are invoked:
1. pipe_read():
1). Checks if the pipe is valid and if there is any data in the
pipe buffer.
2). Waits for data:
*If there is no data in the pipe and the write end is still open,
the current process enters a sleep state (wait_event()) until data
is written.
*If the write end is closed, returns 0.
3). Reads data:
*The woken process copies data from the pipe's memory
buffer to user space.
*When the buffer is full, the writing process will go to sleep,
waiting for the pipe state to change to be awakened (using the
wake_up_interruptible_sync_poll() mechanism). Once data is read
from the buffer, the writing process can continue writing, and the
reading process can continue reading new data.
4). Returns the number of bytes read upon successful read.
2. pipe_write():
1). Checks if the pipe is valid and if there is any available
space in the pipe buffer.
2). Waits for buffer space:
*If the pipe buffer is full and the reading process has not
read any data, pipe_write() may put the current process to sleep
until there is space in the buffer.
*If the read end of the pipe is closed (no process is waiting
to read), an error code -EPIPE is returned, and a SIGPIPE signal may
be sent to the process.
3). Writes data:
*If there is enough space in the pipe buffer, pipe_write() copies
data from the user space buffer to the kernel buffer of the pipe
(using copy_from_user()).
*If the amount of data the user requests to write is larger than
the available space in the buffer, multiple writes may be required,
or the process may wait for new space to be freed.
4). Wakes up waiting reading processes:
*After the data is successfully written, pipe_write() wakes up
any processes that may be waiting to read data (using the
wake_up_interruptible_sync_poll() mechanism).
5). Returns the number of bytes successfully written.
Introduce wq_has_sleeper() to check whether any process is waiting
on the wait queue before waking up processes for pipe read/write
operations.
If no processes are waiting, there's no need to execute
wake_up_interruptible_sync_poll(), thus avoiding unnecessary wake-ups.
Unnecessary wake-ups can lead to context switches, where a process
is woken up to handle I/O events even when there is no immediate
need.
By checking with wq_has_sleeper(), wake-ups are issued only when
processes are actually waiting, which reduces context switches and
system overhead.
Additionally, by reducing unnecessary synchronization and wake-up
operations, wq_has_sleeper() can decrease system resource waste and
lock contention, improving overall system performance.
For pipe read/write operations, this eliminates ineffective scheduling
and enhances concurrency.
It's important to note that enabling this option means invoking
wq_has_sleeper() to check for sleeping processes in the wait queue
for every read or write operation.
While this is a lightweight operation, it still incurs some overhead.
In low-load or single-task scenarios, this overhead may not yield
significant benefits and could even introduce minor performance
degradation.
UnixBench Pipe benchmark results on Zhaoxin KX-U6780A processor:
With the option disabled: Single-core: 841.8, Multi-core (8): 4621.6
With the option enabled: Single-core: 877.8, Multi-core (8): 4854.7
Single-core performance improved by about 4.3%, multi-core
performance improved by about 5.0%.
Co-developed-by: Shengjin Yu <yushengjin@uniontech.com>
Signed-off-by: Shengjin Yu <yushengjin@uniontech.com>
Co-developed-by: Dandan Zhang <zhangdandan@uniontech.com>
Signed-off-by: Dandan Zhang <zhangdandan@uniontech.com>
Tested-by: Dandan Zhang <zhangdandan@uniontech.com>
Signed-off-by: WangYuli <wangyuli@uniontech.com>
---
fs/Kconfig | 13 +++++++++++++
fs/pipe.c | 21 +++++++++++++++------
2 files changed, 28 insertions(+), 6 deletions(-)
diff --git a/fs/Kconfig b/fs/Kconfig
index 64d420e3c475..0dacc46a73fe 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -429,4 +429,17 @@ source "fs/unicode/Kconfig"
config IO_WQ
bool
+config PIPE_SKIP_SLEEPER
+ bool "Skip sleeping processes during pipe read/write"
+ default n
+ help
+ This option introduces a check whether the sleep queue will
+ be awakened during pipe read/write.
+
+ It often leads to a performance improvement. However, in
+ low-load or single-task scenarios, it may introduce minor
+ performance overhead.
+
+ If unsure, say N.
+
endmenu
diff --git a/fs/pipe.c b/fs/pipe.c
index 12b22c2723b7..c085333ae72c 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -247,6 +247,15 @@ static inline unsigned int pipe_update_tail(struct pipe_inode_info *pipe,
return tail;
}
+static inline bool
+pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
+{
+ if (IS_ENABLED(CONFIG_PIPE_SKIP_SLEEPER))
+ return wq_has_sleeper(wq_head);
+ else
+ return true;
+}
+
static ssize_t
pipe_read(struct kiocb *iocb, struct iov_iter *to)
{
@@ -377,7 +386,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
* _very_ unlikely case that the pipe was full, but we got
* no data.
*/
- if (unlikely(was_full))
+ if (unlikely(was_full) && pipe_check_wq_has_sleeper(&pipe->wr_wait))
wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
@@ -398,9 +407,9 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
wake_next_reader = false;
mutex_unlock(&pipe->mutex);
- if (was_full)
+ if (was_full && pipe_check_wq_has_sleeper(&pipe->wr_wait))
wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
- if (wake_next_reader)
+ if (wake_next_reader && pipe_check_wq_has_sleeper(&pipe->rd_wait))
wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
if (ret > 0)
@@ -573,7 +582,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
* become empty while we dropped the lock.
*/
mutex_unlock(&pipe->mutex);
- if (was_empty)
+ if (was_empty && pipe_check_wq_has_sleeper(&pipe->rd_wait))
wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
@@ -598,10 +607,10 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
* Epoll nonsensically wants a wakeup whether the pipe
* was already empty or not.
*/
- if (was_empty || pipe->poll_usage)
+ if ((was_empty || pipe->poll_usage) && pipe_check_wq_has_sleeper(&pipe->rd_wait))
wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
- if (wake_next_writer)
+ if (wake_next_writer && pipe_check_wq_has_sleeper(&pipe->wr_wait))
wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
int err = file_update_time(filp);
--
2.45.2
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-25 9:42 [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write WangYuli
@ 2024-12-25 13:30 ` Andy Shevchenko
2024-12-25 13:53 ` Kent Overstreet
2024-12-25 15:42 ` WangYuli
2024-12-26 16:00 ` Oleg Nesterov
2024-12-26 19:02 ` Linus Torvalds
2 siblings, 2 replies; 37+ messages in thread
From: Andy Shevchenko @ 2024-12-25 13:30 UTC (permalink / raw)
To: WangYuli
Don't you think the Cc list is a bit overloaded?
On Wed, Dec 25, 2024 at 05:42:02PM +0800, WangYuli wrote:
> When a user calls the read/write system call and passes a pipe
> descriptor, the pipe_read/pipe_write functions are invoked:
>
> 1. pipe_read():
> 1). Checks if the pipe is valid and if there is any data in the
> pipe buffer.
> 2). Waits for data:
> *If there is no data in the pipe and the write end is still open,
> the current process enters a sleep state (wait_event()) until data
> is written.
> *If the write end is closed, returns 0.
> 3). Reads data:
> *The woken process copies data from the pipe's memory
> buffer to user space.
> *When the buffer is full, the writing process will go to sleep,
> waiting for the pipe state to change to be awakened (using the
> wake_up_interruptible_sync_poll() mechanism). Once data is read
> from the buffer, the writing process can continue writing, and the
> reading process can continue reading new data.
> 4). Returns the number of bytes read upon successful read.
>
> 2. pipe_write():
> 1). Checks if the pipe is valid and if there is any available
> space in the pipe buffer.
> 2). Waits for buffer space:
> *If the pipe buffer is full and the reading process has not
> read any data, pipe_write() may put the current process to sleep
> until there is space in the buffer.
> *If the read end of the pipe is closed (no process is waiting
> to read), an error code -EPIPE is returned, and a SIGPIPE signal may
> be sent to the process.
> 3). Writes data:
> *If there is enough space in the pipe buffer, pipe_write() copies
> data from the user space buffer to the kernel buffer of the pipe
> (using copy_from_user()).
> *If the amount of data the user requests to write is larger than
> the available space in the buffer, multiple writes may be required,
> or the process may wait for new space to be freed.
> 4). Wakes up waiting reading processes:
> *After the data is successfully written, pipe_write() wakes up
> any processes that may be waiting to read data (using the
> wake_up_interruptible_sync_poll() mechanism).
> 5). Returns the number of bytes successfully written.
>
> Introduce wq_has_sleeper() to check whether any process is waiting
> on the wait queue before waking up processes for pipe read/write
> operations.
>
> If no processes are waiting, there's no need to execute
> wake_up_interruptible_sync_poll(), thus avoiding unnecessary wake-ups.
>
> Unnecessary wake-ups can lead to context switches, where a process
> is woken up to handle I/O events even when there is no immediate
> need.
>
> By checking with wq_has_sleeper(), wake-ups are issued only when
> processes are actually waiting, which reduces context switches and
> system overhead.
>
> Additionally, by reducing unnecessary synchronization and wake-up
> operations, wq_has_sleeper() can decrease system resource waste and
> lock contention, improving overall system performance.
>
> For pipe read/write operations, this eliminates ineffective scheduling
> and enhances concurrency.
>
> It's important to note that enabling this option means invoking
> wq_has_sleeper() to check for sleeping processes in the wait queue
> for every read or write operation.
>
> While this is a lightweight operation, it still incurs some overhead.
>
> In low-load or single-task scenarios, this overhead may not yield
> significant benefits and could even introduce minor performance
> degradation.
>
> UnixBench Pipe benchmark results on Zhaoxin KX-U6780A processor:
>
> With the option disabled: Single-core: 841.8, Multi-core (8): 4621.6
> With the option enabled: Single-core: 877.8, Multi-core (8): 4854.7
>
> Single-core performance improved by about 4.3%, multi-core
> performance improved by about 5.0%.
...
> +config PIPE_SKIP_SLEEPER
> + bool "Skip sleeping processes during pipe read/write"
> + default n
'n' is the default 'default', no need to have this line.
> + help
> + This option introduces a check whether the sleep queue will
> + be awakened during pipe read/write.
> +
> + It often leads to a performance improvement. However, in
> + low-load or single-task scenarios, it may introduce minor
> + performance overhead.
> + If unsure, say N.
Illogical, it's already N as you stated by putting a redundant line, but after
removing that line it will make sense.
...
> +static inline bool
Have you built this with Clang and `make W=1 ...`?
> +pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
> +{
> + if (IS_ENABLED(CONFIG_PIPE_SKIP_SLEEPER))
> + return wq_has_sleeper(wq_head);
> + else
Redundant.
> + return true;
if (!foo)
return true;
return bar(...);
> +}
--
With Best Regards,
Andy Shevchenko
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-25 13:30 ` Andy Shevchenko
@ 2024-12-25 13:53 ` Kent Overstreet
2024-12-25 16:04 ` Mateusz Guzik
2024-12-25 15:42 ` WangYuli
1 sibling, 1 reply; 37+ messages in thread
From: Kent Overstreet @ 2024-12-25 13:53 UTC (permalink / raw)
To: Andy Shevchenko
Cc: WangYuli, viro, brauner, jack, linux-fsdevel, linux-kernel,
yushengjin, zhangdandan, guanwentao, zhanjun, oliver.sang,
ebiederm, colin.king, josh, penberg, manfred, mingo, jes, hch,
aia21, arjan, jgarzik, neukum, oliver, dada1, axboe, axboe,
nickpiggin, dhowells, nathans, rolandd, tytso, bunk, pbadari, ak,
ak, davem, jsipek
On Wed, Dec 25, 2024 at 03:30:05PM +0200, Andy Shevchenko wrote:
> Don't you think the Cc list is a bit overloaded?
Indeed, my mail server doesn't let me reply-all.
> On Wed, Dec 25, 2024 at 05:42:02PM +0800, WangYuli wrote:
> > +config PIPE_SKIP_SLEEPER
> > + bool "Skip sleeping processes during pipe read/write"
> > + default n
>
> 'n' is the default 'default', no need to have this line.
Actually, I'd say to skip the kconfig option for this. Kconfig options
that affect the behaviour of core code increase our testing burden, and
are another variable to account for when chasing down bugs, and the
potential overhead looks negligible.
Also, did you look at adding this optimization to wake_up()? No-op
wakeups are very common, I think this has wider applicability.
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-25 13:30 ` Andy Shevchenko
2024-12-25 13:53 ` Kent Overstreet
@ 2024-12-25 15:42 ` WangYuli
2024-12-25 16:00 ` Willy Tarreau
1 sibling, 1 reply; 37+ messages in thread
From: WangYuli @ 2024-12-25 15:42 UTC (permalink / raw)
To: Andy Shevchenko, Kent Overstreet
On 2024/12/25 21:30, Andy Shevchenko wrote:
> Don't you think the Cc list is a bit overloaded?
Hi,
I apologize for any inconvenience this may cause.
I understand that under normal circumstances, one would simply pass the
modified code path as an argument to the kernel's
scripts/get_maintainer.pl script to determine the appropriate recipients.
However, given the vast and complex nature of the Linux kernel
community, with tens of thousands of developers worldwide, and
considering the varying "customs" of different subsystems, as well as
time zone differences and individual work habits, it's not uncommon for
patches to be sent to mailing lists and subsequently ignored or left
pending.
This patch, for example, has been submitted multiple times without
receiving any response, unfortunately.
My intention is simply to seek your review, and that of other technical
experts like yourself, but I cannot be certain, prior to your response,
which specific experts on which lists would be willing to provide feedback.
I would appreciate any other suggestions you may have.
>> UnixBench Pipe benchmark results on Zhaoxin KX-U6780A processor:
>>
>> With the option disabled: Single-core: 841.8, Multi-core (8): 4621.6
>> With the option enabled: Single-core: 877.8, Multi-core (8): 4854.7
>>
>> Single-core performance improved by about 4.3%, multi-core
>> performance improved by about 5.0%.
> ...
As you know, the kernel is extremely sensitive to performance.
Even a 1% performance improvement can lead to significant efficiency
gains and reduced carbon emissions in production environments, as long
as there is sufficient testing and theoretical analysis to prove that
the improvement is real and not due to measurement error or jitter.
>> +config PIPE_SKIP_SLEEPER
>> + bool "Skip sleeping processes during pipe read/write"
>> + default n
> 'n' is the default 'default', no need to have this line.
OK, I'll drop it. Thanks.
>
>> + help
>> + This option introduces a check whether the sleep queue will
>> + be awakened during pipe read/write.
>> +
>> + It often leads to a performance improvement. However, in
>> + low-load or single-task scenarios, it may introduce minor
>> + performance overhead.
>> + If unsure, say N.
> Illogical, it's already N as you stated by putting a redundant line, but after
> removing that line it will make sense.
>
> ...
As noted, I'll remove "default n" as it serves no purpose.
>
>> +static inline bool
> Have you built this with Clang and `make W=1 ...`?
Hmm...I've noticed a discrepancy in kernel compilation results with and
without "make W=1".
When I use x86_64_defconfig and clang-19.1.1 (Ubuntu 24.10) and run
"make", there are no warnings.
However, when I run "make W=1", the kernel generates a massive number of
errors, causing the compilation to fail prematurely.
e.g.
In file included from arch/x86/kernel/asm-offsets.c:14:
In file included from ./include/linux/suspend.h:5:
In file included from ./include/linux/swap.h:9:
In file included from ./include/linux/memcontrol.h:21:
In file included from ./include/linux/mm.h:2224:
./include/linux/vmstat.h:504:43: error: arithmetic between different
enumeration types ('enum zone_stat_item' and 'enum numa_stat_item')
[-Werror,-Wenum-enum-conversion]
504 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
| ~~~~~~~~~~~~~~~~~~~~~ ^
505 | item];
| ~~~~
./include/linux/vmstat.h:511:43: error: arithmetic between different
enumeration types ('enum zone_stat_item' and 'enum numa_stat_item')
[-Werror,-Wenum-enum-conversion]
511 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
| ~~~~~~~~~~~~~~~~~~~~~ ^
512 | NR_VM_NUMA_EVENT_ITEMS +
| ~~~~~~~~~~~~~~~~~~~~~~
./include/linux/vmstat.h:524:43: error: arithmetic between different
enumeration types ('enum zone_stat_item' and 'enum numa_stat_item')
[-Werror,-Wenum-enum-conversion]
524 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
| ~~~~~~~~~~~~~~~~~~~~~ ^
525 | NR_VM_NUMA_EVENT_ITEMS +
| ~~~~~~~~~~~~~~~~~~~~~~
3 errors generated.
And I've observed similar behavior with gcc-14.2.0.
While I'm keen on addressing as many potential compile errors and
warnings in the kernel as possible, it seems like a long-term endeavor.
Regarding this specific code, I'd appreciate your insights on how to
improve it.
>
>> +pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
>> +{
>> + if (IS_ENABLED(CONFIG_PIPE_SKIP_SLEEPER))
>> + return wq_has_sleeper(wq_head);
>> + else
> Redundant.
>
>> + return true;
> if (!foo)
> return true;
>
> return bar(...);
>
>> +}
Yes. I'll rework the code structure here. Thanks.
Thanks,
--
WangYuli
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-25 15:42 ` WangYuli
@ 2024-12-25 16:00 ` Willy Tarreau
2024-12-25 16:32 ` WangYuli
0 siblings, 1 reply; 37+ messages in thread
From: Willy Tarreau @ 2024-12-25 16:00 UTC (permalink / raw)
To: WangYuli
Cc: Andy Shevchenko, Kent Overstreet, viro, brauner, jack,
linux-fsdevel, linux-kernel
On Wed, Dec 25, 2024 at 11:42:29PM +0800, WangYuli wrote:
> On 2024/12/25 21:30, Andy Shevchenko wrote:
>
> > Don't you think the Cc list is a bit overloaded?
>
> Hi,
>
> I apologize for any inconvenience this may cause.
>
> I understand that under normal circumstances, one would simply pass the
> modified code path as an argument to the kernel's scripts/get_maintainer.pl
> script to determine the appropriate recipients.
>
> However, given the vast and complex nature of the Linux kernel community,
> with tens of thousands of developers worldwide, and considering the varying
> "customs" of different subsystems, as well as time zone differences and
> individual work habits, it's not uncommon for patches to be sent to mailing
> lists and subsequently ignored or left pending.
(...)
Sorry, but by CCing 191 random addresses like this, that's the best way
to be added to .procmailrc and be completely ignored by everyone. At some
point one should wonder whether that's a common practice or if such
behaviors will be considered offensive by the majority. get_maintainer.pl
only lists the 2 lists and 3 addresses I left in CC (after Kent and Andy
whom I left since they replied to you).
> This patch, for example, has been submitted multiple times without receiving
> any response, unfortunately.
It can happen, but sending to the right people and possibly resending if
it gets lost is usually sufficient to attract attention. Sending to that
many people makes you look like someone feeling so important they need to
shout in a loudspeaker to send orders to everyone. Please refrain from
doing this in the future.
Thanks,
Willy
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-25 13:53 ` Kent Overstreet
@ 2024-12-25 16:04 ` Mateusz Guzik
2024-12-25 16:32 ` Kent Overstreet
0 siblings, 1 reply; 37+ messages in thread
From: Mateusz Guzik @ 2024-12-25 16:04 UTC (permalink / raw)
To: Kent Overstreet
Cc: Andy Shevchenko, WangYuli, viro, brauner, jack, linux-fsdevel,
linux-kernel, yushengjin, zhangdandan, guanwentao, zhanjun,
oliver.sang, ebiederm, colin.king, josh, penberg, manfred, mingo,
jes, hch, aia21, arjan, jgarzik, neukum, oliver, dada1, axboe,
axboe, nickpiggin, dhowells, nathans, rolandd, tytso, bunk,
pbadari, ak, ak, davem, jsipek
On Wed, Dec 25, 2024 at 08:53:05AM -0500, Kent Overstreet wrote:
> On Wed, Dec 25, 2024 at 03:30:05PM +0200, Andy Shevchenko wrote:
> > Don't you think the Cc list is a bit overloaded?
>
> Indeed, my mail server doesn't let me reply-all.
>
> > On Wed, Dec 25, 2024 at 05:42:02PM +0800, WangYuli wrote:
> > > +config PIPE_SKIP_SLEEPER
> > > + bool "Skip sleeping processes during pipe read/write"
> > > + default n
> >
> > 'n' is the default 'default', no need to have this line.
>
> Actually, I'd say to skip the kconfig option for this. Kconfig options
> that affect the behaviour of core code increase our testing burden, and
> are another variable to account for when chasing down bugs, and the
> potential overhead looks negligible.
>
I agree the behavior should not be guarded by an option. However,
because of how wq_has_sleeper is implemented (see below) I would argue
this needs to show how often locking can be avoided in real workloads.
The commit message does state this comes with a slowdown for cases which
can't avoid wakeups, but as is I thought the submitter just meant an
extra branch.
> Also, did you look at adding this optimization to wake_up()? No-op
> wakeups are very common, I think this has wider applicability.
I was going to suggest it myself, but then:
static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
{
/*
* We need to be sure we are in sync with the
* add_wait_queue modifications to the wait queue.
*
* This memory barrier should be paired with one on the
* waiting side.
*/
smp_mb();
return waitqueue_active(wq_head);
}
Which means this is in fact quite expensive.
Since wakeup is a lock + an interrupt trip, it would still be
cheaper single-threaded to "merely" suffer a full fence and for cases
where the queue is empty often enough this is definitely the right thing
to do.
On the other hand, executing this when the queue is mostly *not* empty
would defeat the point.
So unfortunately embedding this in wake_up is a no-go.
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-25 16:04 ` Mateusz Guzik
@ 2024-12-25 16:32 ` Kent Overstreet
2024-12-25 17:22 ` Mateusz Guzik
0 siblings, 1 reply; 37+ messages in thread
From: Kent Overstreet @ 2024-12-25 16:32 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Andy Shevchenko, WangYuli, viro, brauner, jack, linux-fsdevel,
linux-kernel, yushengjin, zhangdandan, guanwentao, zhanjun,
oliver.sang, ebiederm, colin.king, josh, penberg, manfred, mingo,
jes, hch, aia21, arjan, jgarzik, neukum, oliver, dada1, axboe,
axboe, nickpiggin, dhowells, nathans, rolandd, tytso, bunk,
pbadari, ak, ak, davem, jsipek
On Wed, Dec 25, 2024 at 05:04:46PM +0100, Mateusz Guzik wrote:
> On Wed, Dec 25, 2024 at 08:53:05AM -0500, Kent Overstreet wrote:
> > On Wed, Dec 25, 2024 at 03:30:05PM +0200, Andy Shevchenko wrote:
> > > Don't you think the Cc list is a bit overloaded?
> >
> > Indeed, my mail server doesn't let me reply-all.
> >
> > > On Wed, Dec 25, 2024 at 05:42:02PM +0800, WangYuli wrote:
> > > > +config PIPE_SKIP_SLEEPER
> > > > + bool "Skip sleeping processes during pipe read/write"
> > > > + default n
> > >
> > > 'n' is the default 'default', no need to have this line.
> >
> > Actually, I'd say to skip the kconfig option for this. Kconfig options
> > that affect the behaviour of core code increase our testing burden, and
> > are another variable to account for when chasing down bugs, and the
> > > potential overhead looks negligible.
> >
>
> I agree the behavior should not be guarded by an option. However,
> because of how wq_has_sleeper is implemented (see below) I would argue
> this needs to show how often locking can be avoided in real workloads.
>
> The commit message does state this comes with a slowdown for cases which
> can't avoid wakeups, but as is I thought the submitter just meant an
> extra branch.
>
> > Also, did you look at adding this optimization to wake_up()? No-op
> > wakeups are very common, I think this has wider applicability.
>
> I was going to suggest it myself, but then:
>
> static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
> {
> /*
> * We need to be sure we are in sync with the
> * add_wait_queue modifications to the wait queue.
> *
> * This memory barrier should be paired with one on the
> * waiting side.
> */
> smp_mb();
> return waitqueue_active(wq_head);
> }
>
> Which means this is in fact quite expensive.
>
> Since wakeup is a lock + an interrupt trip, it would still be
> cheaper single-threaded to "merely" suffer a full fence and for cases
> where the queue is empty often enough this is definitely the right thing
> to do.
We're comparing against no-op wakeup. A real wakeup does an IPI, which
completely dwarfs the cost of a barrier.
And note that wake_up() is spin_lock_irqsave(), not spin_lock(). I
assume it's gotten better, but back when I was looking at waitqueues
nested pushf/popf was horrifically expensive.
But perhaps we can do this with just a release barrier? Similar to how
list_empty_careful() works.
> On the other hand this executing when the queue is mostly *not* empty
> would combat the point.
>
> So unfortunately embedding this in wake_up is a no-go.
You definitely can't say that without knowing how often no-op
wake_up()s occur. It wouldn't be hard to gather that (write a patch to
add a pair of percpu counters, throw it on a few machines running random
workloads) and I think the results might surprise you.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-25 16:00 ` Willy Tarreau
@ 2024-12-25 16:32 ` WangYuli
2024-12-25 16:56 ` Willy Tarreau
0 siblings, 1 reply; 37+ messages in thread
From: WangYuli @ 2024-12-25 16:32 UTC (permalink / raw)
To: Willy Tarreau
Cc: Andy Shevchenko, Kent Overstreet, viro, brauner, jack,
linux-fsdevel, linux-kernel
[-- Attachment #1.1.1: Type: text/plain, Size: 1589 bytes --]
Hi,
I've reviewed the Contributor Covenant and the Linux Kernel Contributor
Covenant Code of Conduct Interpretation, and I couldn't find anything
suggesting that CCing a large number of people is "unfriendly".
And while I don't believe my actions were malicious, I understand your
concern.
Going forward, I'll be more considerate of the recipients when sending
emails and will avoid CCing more than a hundred people at once in
similar situations.
On 2024/12/26 00:00, Willy Tarreau wrote:
> (...)
>
> Sorry, but by CCing 191 random addresses like this,
I think there may be a misunderstanding.
All of these recipients can be found in the git history of fs/pipe.c.
> that's the best way
> to be added to .procmailrc and be completely ignored by everyone. At some
> point one should wonder whether that's a common practice or if such
> behaviors will be considered offensive by the majority. get_maintainer.pl
> only lists the 2 lists and 3 addresses I left in CC (after Kent and Andy
> whom I left since they replied to you).
>
>> This patch, for example, has been submitted multiple times without receiving
>> any response, unfortunately.
> It can happen, but sending to the right people and possibly resending if
> it gets lost is usually sufficient to attract attention. Sending to that
> many people makes you look like someone feeling so important they need to
> shout in a loudspeaker to send orders to everyone.
Sincerely hope that's the case.
> Please refrain from
> doing this in the future.
As above.
Thanks,
--
WangYuli
[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 645 bytes --]
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 236 bytes --]
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-25 16:32 ` WangYuli
@ 2024-12-25 16:56 ` Willy Tarreau
0 siblings, 0 replies; 37+ messages in thread
From: Willy Tarreau @ 2024-12-25 16:56 UTC (permalink / raw)
To: WangYuli
Cc: Andy Shevchenko, Kent Overstreet, viro, brauner, jack,
linux-fsdevel, linux-kernel
On Thu, Dec 26, 2024 at 12:32:35AM +0800, WangYuli wrote:
> Hi,
>
> I've reviewed the Contributor Covenant and the Linux Kernel Contributor
> Covenant Code of Conduct Interpretation, and I couldn't find anything
> suggesting that CCing a large number of people is "unfriendly".
This is unrelated, it's a matter of basic social interactions and respect
of others' time.
> And while I don't believe my actions were malicious, I understand your
> concern.
>
> Going forward, I'll be more considerate of the recipients when sending
> emails and will avoid CCing more than a hundred people at once in similar
> situations.
"More than a hundred"? Are you serious? For what purpose? Let me explain
something about how people consume e-mails: the first thing they do if
they hesitate to process a message is to check whether someone else in the
To or Cc is more knowledgeable or legitimate than they are. If so, they
often prefer to let others deal with it. With such a large list, it's
impossible to check all the other addresses, and virtually *everyone* will
assume that surely one of the hundreds of others is more legitimate. So the
more people you add, the less likely it is that anyone will handle your request.
The common rules that apply here are not made up out of nowhere but are
based on what works best. Just send to the most relevant outputs of
get_maintainer, wait for a week, and if you get no response it's probably
just that these people were busy and forgot about you, so kindly ping
again to make sure your message was received. With such a large community
it's normal for some messages to get lost, and pinging again is normally
not considered an offence, so that's fine. If you get no response after
multiple attempts, it might mean there's something annoying in your
message (like sending to tens or hundreds of people at once).
And really, when a script tells you "send your message to these 3 people"
and you send it to 191, how can you imagine the recipients will think
anything other than "this person must feel super important to spam
everyone like this"? Why would they even respond if they feel you consider
yourself entitled to spam everyone in the hope of getting your orders
processed immediately? I'm pretty sure that the majority will just let
it rot on principle.
Good luck,
Willy
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-25 16:32 ` Kent Overstreet
@ 2024-12-25 17:22 ` Mateusz Guzik
2024-12-25 17:41 ` Kent Overstreet
0 siblings, 1 reply; 37+ messages in thread
From: Mateusz Guzik @ 2024-12-25 17:22 UTC (permalink / raw)
To: Kent Overstreet
Cc: Andy Shevchenko, WangYuli, viro, brauner, jack, linux-fsdevel,
linux-kernel, yushengjin, zhangdandan, guanwentao, zhanjun,
oliver.sang, ebiederm, colin.king, josh, penberg, manfred, mingo,
jes, hch, aia21, arjan, jgarzik, neukum, oliver, dada1, axboe,
axboe, nickpiggin, dhowells, nathans, rolandd, tytso, bunk,
pbadari, ak, ak, davem, jsipek
On Wed, Dec 25, 2024 at 5:32 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Wed, Dec 25, 2024 at 05:04:46PM +0100, Mateusz Guzik wrote:
> > On Wed, Dec 25, 2024 at 08:53:05AM -0500, Kent Overstreet wrote:
> > > On Wed, Dec 25, 2024 at 03:30:05PM +0200, Andy Shevchenko wrote:
> > > > Don't you think the Cc list is a bit overloaded?
> > >
> > > Indeed, my mail server doesn't let me reply-all.
> > >
> > > > On Wed, Dec 25, 2024 at 05:42:02PM +0800, WangYuli wrote:
> > > > > +config PIPE_SKIP_SLEEPER
> > > > > + bool "Skip sleeping processes during pipe read/write"
> > > > > + default n
> > > >
> > > > 'n' is the default 'default', no need to have this line.
> > >
> > > Actually, I'd say to skip the kconfig option for this. Kconfig options
> > > that affect the behaviour of core code increase our testing burden, and
> > > are another variable to account for when chasing down bugs, and the
> > > potential overhead looks negligible.
> > >
> >
> > I agree the behavior should not be guarded by an option. However,
> > because of how wq_has_sleeper is implemented (see below) I would argue
> > this needs to show how often locking can be avoided in real workloads.
> >
> > The commit message does state this comes with a slowdown for cases which
> > can't avoid wakeups, but as is I thought the submitter just meant an
> > extra branch.
> >
> > > Also, did you look at adding this optimization to wake_up()? No-op
> > > wakeups are very common, I think this has wider applicability.
> >
> > I was going to suggest it myself, but then:
> >
> > static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
> > {
> > /*
> > * We need to be sure we are in sync with the
> > * add_wait_queue modifications to the wait queue.
> > *
> > * This memory barrier should be paired with one on the
> > * waiting side.
> > */
> > smp_mb();
> > return waitqueue_active(wq_head);
> > }
> >
> > Which means this is in fact quite expensive.
> >
> > Since wakeup is a lock + an interrupt trip, it would still be
> > cheaper single-threaded to "merely" suffer a full fence and for cases
> > where the queue is empty often enough this is definitely the right thing
> > to do.
>
> We're comparing against no-op wakeup. A real wakeup does an IPI, which
> completely dwarfs the cost of a barrier.
>
> And note that wake_up() is spin_lock_irqsave(), not spin_lock(). I
> assume it's gotten better, but back when I was looking at waitqueues
> nested pushf/popf was horrifically expensive.
>
> But perhaps we can do this with just a release barrier? Similar to how
> list_empty_careful() works.
>
> > On the other hand this executing when the queue is mostly *not* empty
> > would combat the point.
> >
> > So unfortunately embedding this in wake_up is a no-go.
>
> You definitely can't say that without knowing how often no-op
> wake_up()s occur. It wouldn't be hard to gather that (write a patch to
> add a pair of percpu counters, throw it on a few machines running random
> workloads) and I think the results might surprise you.
There is some talking past each other here.
I explicitly noted one needs to check what happens in real workloads.
I very much expect there will be consumers where there are no waiters
almost every time and consumers which almost always do have them.
My claim is that this should be handled on a case-by-case basis.
So I whipped out a bpftrace one-liner to take a look at the kernel
build; details at the end.
In terms of the total (0 == no waiters, 1 == waiters):
[0, 1) 600191 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, ...) 457956 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
There are some funzies in the vfs layer which I'm going to sort out myself.
The kernel is tags/next-20241220
As far as pipes go:
@[
wakeprobe+5
__wake_up_common+63
__wake_up_sync_key+59
pipe_read+385
]:
[0, 1) 10629 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
So this guy literally never had any waiters when wakeup was issued.
faddr2line claims line 405, which I presume is off by one:
401 if (was_full)
402 wake_up_interruptible_sync_poll(&pipe->wr_wait,
EPOLLOUT | EPOLLWRNORM);
403 if (wake_next_reader)
404 │ wake_up_interruptible_sync_poll(&pipe->rd_wait,
EPOLLIN | EPOLLRDNORM);
405 kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
I'm guessing the real empty queue is rd_wait. Definitely a candidate
depending on other workloads, personally I would just patch it as is.
@[
wakeprobe+5
__wake_up_common+63
__wake_up+54
pipe_release+92
]:
[0, 1) 12540 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, ...) 5330 |@@@@@@@@@@@@@@@@@@@@@@ |
a wash, would not touch that no matter what
@[
wakeprobe+5
__wake_up_common+63
__wake_up+54
pipe_release+110
]:
[0, 1) 17870 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
again no waiters, line claimed is 737, again off by one:
733 /* Was that the last reader or writer, but not the other side? */
734 if (!pipe->readers != !pipe->writers) {
735 │ wake_up_interruptible_all(&pipe->rd_wait);
736 │ wake_up_interruptible_all(&pipe->wr_wait);
737 │ kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
738 │ kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
so I presume wr_wait? same comment as the entry at claimed line 405
@[
wakeprobe+5
__wake_up_common+63
__wake_up_sync_key+59
pipe_write+773
]:
[0, 1) 22237 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
again no waiters, claimed line 606
604 if (wake_next_writer)
605 │ wake_up_interruptible_sync_poll(&pipe->wr_wait,
EPOLLOUT | EPOLLWRNORM);
606 if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
again would be inclined to patch as is
@[
wakeprobe+5
__wake_up_common+63
__wake_up_sync_key+59
pipe_read+943
]:
[0, 1) 9488 |@@@@@@@@@@@@@ |
[1, ...) 35765 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
majority of the time there were waiters, would not touch regardless of
other workloads, line 403
401 if (was_full)
402 wake_up_interruptible_sync_poll(&pipe->wr_wait,
EPOLLOUT | EPOLLWRNORM);
403 if (wake_next_reader)
404 │ wake_up_interruptible_sync_poll(&pipe->rd_wait,
EPOLLIN | EPOLLRDNORM);
the wr_wait thing
@[
wakeprobe+5
__wake_up_common+63
__wake_up_sync_key+59
pipe_write+729
]:
[0, 1) 199929 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[1, ...) 376586 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
ditto concerning not touching, resolved to line 603
601 if (was_empty || pipe->poll_usage)
602 │ wake_up_interruptible_sync_poll(&pipe->rd_wait,
EPOLLIN | EPOLLRDNORM);
603 kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
604 if (wake_next_writer)
605 wake_up_interruptible_sync_poll(&pipe->wr_wait,
EPOLLOUT | EPOLLWRNORM);
That is to say, as far as this workload goes the submitted patch does
avoid some of the lock + irq trips by covering cases where there were no
waiters seen in this workload, but it also adds the smp_mb thing when it
does not help -- I would remove those spots from the submission.
While I agree a full service treatment would require a bunch of
different workloads, personally I would be inclined to justify the change
merely based on the kernel build + leaving bpftrace running on
a real-world box running random crap.
As for obtaining such info, I failed to convince bpftrace to check if
the waiter list is empty. Instead I resorted to adding a dedicated
func which grabs the bit and probing on that. The func is in a
different file because gcc decided to omit the call otherwise.
one-liner:
bpftrace -e 'kprobe:wakeprobe { @[kstack(4)] = lhist(arg0, 0, 1, 1); }'
hack:
diff --git a/fs/file.c b/fs/file.c
index d868cdb95d1e..00d6a34eb174 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1442,3 +1442,11 @@ int iterate_fd(struct files_struct *files, unsigned n,
return res;
}
EXPORT_SYMBOL(iterate_fd);
+
+
+void wakeprobe(int count);
+
+void wakeprobe(int count)
+{
+}
+EXPORT_SYMBOL(wakeprobe);
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 51e38f5f4701..8db7b0daf04b 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -57,6 +57,8 @@ void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry
}
EXPORT_SYMBOL(remove_wait_queue);
+void wakeprobe(int count);
+
/*
* The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
* wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
@@ -77,6 +79,8 @@ static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
lockdep_assert_held(&wq_head->lock);
+ wakeprobe(wq_has_sleeper(wq_head));
+
curr = list_first_entry(&wq_head->head, wait_queue_entry_t, entry);
if (&curr->entry == &wq_head->head)
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-25 17:22 ` Mateusz Guzik
@ 2024-12-25 17:41 ` Kent Overstreet
0 siblings, 0 replies; 37+ messages in thread
From: Kent Overstreet @ 2024-12-25 17:41 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Andy Shevchenko, WangYuli, viro, brauner, jack, linux-fsdevel,
linux-kernel, yushengjin, zhangdandan, guanwentao, zhanjun,
oliver.sang, ebiederm, colin.king, josh, penberg, manfred, mingo,
jes, hch, aia21, arjan, jgarzik, neukum, oliver, dada1, axboe,
axboe, nickpiggin, dhowells, nathans, rolandd, tytso, bunk,
pbadari, ak, ak, davem, jsipek
On Wed, Dec 25, 2024 at 06:22:49PM +0100, Mateusz Guzik wrote:
> On Wed, Dec 25, 2024 at 5:32 PM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Wed, Dec 25, 2024 at 05:04:46PM +0100, Mateusz Guzik wrote:
> > > On Wed, Dec 25, 2024 at 08:53:05AM -0500, Kent Overstreet wrote:
> > > > On Wed, Dec 25, 2024 at 03:30:05PM +0200, Andy Shevchenko wrote:
> > > > > Don't you think the Cc list is a bit overloaded?
> > > >
> > > > Indeed, my mail server doesn't let me reply-all.
> > > >
> > > > > On Wed, Dec 25, 2024 at 05:42:02PM +0800, WangYuli wrote:
> > > > > > +config PIPE_SKIP_SLEEPER
> > > > > > + bool "Skip sleeping processes during pipe read/write"
> > > > > > + default n
> > > > >
> > > > > 'n' is the default 'default', no need to have this line.
> > > >
> > > > Actually, I'd say to skip the kconfig option for this. Kconfig options
> > > > that affect the behaviour of core code increase our testing burden, and
> > > > are another variable to account for when chasing down bugs, and the
> > > > potential overhead looks negligible.
> > > >
> > >
> > > I agree the behavior should not be guarded by an option. However,
> > > because of how wq_has_sleeper is implemented (see below) I would argue
> > > this needs to show how often locking can be avoided in real workloads.
> > >
> > > The commit message does state this comes with a slowdown for cases which
> > > can't avoid wakeups, but as is I thought the submitter just meant an
> > > extra branch.
> > >
> > > > Also, did you look at adding this optimization to wake_up()? No-op
> > > > wakeups are very common, I think this has wider applicability.
> > >
> > > I was going to suggest it myself, but then:
> > >
> > > static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
> > > {
> > > /*
> > > * We need to be sure we are in sync with the
> > > * add_wait_queue modifications to the wait queue.
> > > *
> > > * This memory barrier should be paired with one on the
> > > * waiting side.
> > > */
> > > smp_mb();
> > > return waitqueue_active(wq_head);
> > > }
> > >
> > > Which means this is in fact quite expensive.
> > >
> > > Since wakeup is a lock + an interrupt trip, it would still be
> > > cheaper single-threaded to "merely" suffer a full fence and for cases
> > > where the queue is empty often enough this is definitely the right thing
> > > to do.
> >
> > We're comparing against no-op wakeup. A real wakeup does an IPI, which
> > completely dwarfs the cost of a barrier.
> >
> > And note that wake_up() is spin_lock_irqsave(), not spin_lock(). I
> > assume it's gotten better, but back when I was looking at waitqueues
> > nested pushf/popf was horrifically expensive.
> >
> > But perhaps we can do this with just a release barrier? Similar to how
> > list_empty_careful() works.
> >
> > > On the other hand this executing when the queue is mostly *not* empty
> > > would combat the point.
> > >
> > > So unfortunately embedding this in wake_up is a no-go.
> >
> > You definitely can't say that without knowing how often no-op
> > wake_up()s occur. It wouldn't be hard to gather that (write a patch to
> > add a pair of percpu counters, throw it on a few machines running random
> > workloads) and I think the results might surprise you.
>
> There is some talking past each other here.
>
> I explicitly noted one needs to check what happens in real workloads.
>
> I very much expect there will be consumers where there are no waiters
> almost every time and consumers which almost always do have them.
>
> My claim is that this should be handled on a case-by-case basis.
>
> So I whipped out a bpftrace one-liner to take a look at the kernel
> build; details at the end.
>
> In terms of the total (0 == no waiters, 1 == waiters):
> [0, 1) 600191 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1, ...) 457956 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>
> There are some funzies in the vfs layer which I'm going to sort out myself.
>
> The kernel is tags/next-20241220
>
> As far as pipes go:
>
> @[
> wakeprobe+5
> __wake_up_common+63
> __wake_up_sync_key+59
> pipe_read+385
> ]:
> [0, 1) 10629 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> So this guy literally never had any waiters when wakeup was issued.
> faddr2line claims line 405, which I presume is off by one:
>
> 401 if (was_full)
> 402 wake_up_interruptible_sync_poll(&pipe->wr_wait,
> EPOLLOUT | EPOLLWRNORM);
> 403 if (wake_next_reader)
> 404 │ wake_up_interruptible_sync_poll(&pipe->rd_wait,
> EPOLLIN | EPOLLRDNORM);
> 405 kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
>
> I'm guessing the real empty queue is rd_wait. Definitely a candidate
> depending on other workloads, personally I would just patch it as is.
>
> @[
> wakeprobe+5
> __wake_up_common+63
> __wake_up+54
> pipe_release+92
> ]:
> [0, 1) 12540 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1, ...) 5330 |@@@@@@@@@@@@@@@@@@@@@@ |
>
> a wash, would not touch that no matter what
>
> @[
> wakeprobe+5
> __wake_up_common+63
> __wake_up+54
> pipe_release+110
> ]:
> [0, 1) 17870 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> again no waiters, line claimed is 737, again off by one:
> 733 /* Was that the last reader or writer, but not the other side? */
> 734 if (!pipe->readers != !pipe->writers) {
> 735 │ wake_up_interruptible_all(&pipe->rd_wait);
> 736 │ wake_up_interruptible_all(&pipe->wr_wait);
> 737 │ kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> 738 │ kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
>
> so I presume wr_wait? same comment as the entry at claimed line 405
>
> @[
> wakeprobe+5
> __wake_up_common+63
> __wake_up_sync_key+59
> pipe_write+773
> ]:
> [0, 1) 22237 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> again no waiters, claimed line 606
> 604 if (wake_next_writer)
> 605 │ wake_up_interruptible_sync_poll(&pipe->wr_wait,
> EPOLLOUT | EPOLLWRNORM);
> 606 if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
>
> again would be inclined to patch as is
>
> @[
> wakeprobe+5
> __wake_up_common+63
> __wake_up_sync_key+59
> pipe_read+943
> ]:
> [0, 1) 9488 |@@@@@@@@@@@@@ |
> [1, ...) 35765 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> majority of the time there were waiters, would not touch regardless of
> other workloads, line 403
>
> 401 if (was_full)
> 402 wake_up_interruptible_sync_poll(&pipe->wr_wait,
> EPOLLOUT | EPOLLWRNORM);
> 403 if (wake_next_reader)
> 404 │ wake_up_interruptible_sync_poll(&pipe->rd_wait,
> EPOLLIN | EPOLLRDNORM);
>
> the wr_wait thing
>
> @[
> wakeprobe+5
> __wake_up_common+63
> __wake_up_sync_key+59
> pipe_write+729
> ]:
> [0, 1) 199929 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [1, ...) 376586 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> ditto concerning not touching, resolved to line 603
>
> 601 if (was_empty || pipe->poll_usage)
> 602 │ wake_up_interruptible_sync_poll(&pipe->rd_wait,
> EPOLLIN | EPOLLRDNORM);
> 603 kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> 604 if (wake_next_writer)
> 605 wake_up_interruptible_sync_poll(&pipe->wr_wait,
> EPOLLOUT | EPOLLWRNORM);
>
> That is to say, as far as this workload goes the submitted patch does
> avoid some of the lock + irq trips by covering cases where there were no
> waiters seen in this workload, but it also adds the smp_mb thing when it
> does not help -- I would remove those spots from the submission.
Neat use of bpf, although if you had to patch the kernel anyways I
would've just gone the percpu counter route... :)
Based on those numbers, even in the cases where wake-up dominates it
doesn't dominate by enough that I'd expect the patch to cause a
regression, at least if we can do it with a proper release barrier.
Why is smp_load_release() not a thing? Unusual I suppose, but it is what
we want here...
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-25 9:42 [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write WangYuli
2024-12-25 13:30 ` Andy Shevchenko
@ 2024-12-26 16:00 ` Oleg Nesterov
2024-12-26 19:02 ` Linus Torvalds
2 siblings, 0 replies; 37+ messages in thread
From: Oleg Nesterov @ 2024-12-26 16:00 UTC (permalink / raw)
To: WangYuli
Cc: viro, brauner, jack, linux-fsdevel, linux-kernel, yushengjin,
zhangdandan, guanwentao, zhanjun, oliver.sang, ebiederm,
colin.king, josh, penberg, manfred, mingo, jes, hch, aia21, arjan,
jgarzik, neukum, oliver, dada1, axboe, axboe, nickpiggin,
dhowells, nathans, rolandd, tytso, bunk, pbadari, ak, ak, davem,
jsipek, jens.axboe, ramsdell, hch, torvalds, akpm, randy.dunlap,
efault, rdunlap, haveblue, drepper, dm.n9107, jblunck, davidel,
mtk.manpages, linux-arch, vda.linux, jmorris, serue, hca, rth,
lethal, tony.luck, heiko.carstens, andi, corbet, crquan, mszeredi,
miklos, peterz, a.p.zijlstra, earl_chew, npiggin, npiggin, julia,
jaxboe, nikai, dchinner, davej, npiggin, eric.dumazet, tim.c.chen,
xemul, tj, serge.hallyn, gorcunov, levinsasha928, penberg, amwang,
bcrl, muthu.lkml, muthur, mjt, alan, raven, thomas, will.deacon,
will, josef, anatol.pomozov, koverstreet, zab, balbi, gregkh,
mfasheh, jlbec, rusty, asamymuthupa, smani, sbradshaw, jmoyer,
sim, ia, dmonakhov, ebiggers3, socketpair, penguin-kernel, w,
kirill.shutemov, mhocko, vdavydov.dev, vdavydov, hannes, mhocko,
minchan, deepa.kernel, arnd, balbi, swhiteho, konishi.ryusuke,
dsterba, vegard.nossum, axboe, pombredanne, tglx, joe.lawrence,
mpatocka, mcgrof, keescook, linux, jannh, shakeelb, guro, willy,
khlebnikov, kirr, stern, elver, parri.andrea, paulmck, rasibley,
jstancek, avagin, cai, josef, hare, colyli, johannes, sspatil,
alex_y_xu, mgorman, gor, jhubbard, andriy.shevchenko, crope,
yzaikin, bfields, jlayton, kernel, steve, nixiaoming, 0x7f454c46,
kuniyu, alexander.h.duyck, kuni1840, soheil, sridhar.samudrala,
Vincenzo.Frascino, chuck.lever, Kevin.Brodsky, Szabolcs.Nagy,
David.Laight, Mark.Rutland, linux-morello, Luca.Vizzarro,
max.kellermann, adobriyan, lukas, j.granados, djwong,
kent.overstreet, linux, kstewart
Quite possibly I missed something, but I have some concerns after
a quick glance...
On 12/25, WangYuli wrote:
>
> +config PIPE_SKIP_SLEEPER
> + bool "Skip sleeping processes during pipe read/write"
> + default n
> + help
> + This option introduces a check whether the sleep queue will
> + be awakened during pipe read/write.
> +
> + It often leads to a performance improvement. However, in
> + low-load or single-task scenarios, it may introduce minor
> + performance overhead.
> +
> + If unsure, say N.
> +
Well, IMO the new config option should be avoided for this optimization.
> +static inline bool
> +pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
> +{
> + if (IS_ENABLED(CONFIG_PIPE_SKIP_SLEEPER))
> + return wq_has_sleeper(wq_head);
> + else
> + return true;
> +}
I think that another helper makes more sense:

	static inline void pipe_wake_xxx(struct wait_queue_head *wait, __poll_t flags)
	{
		if (wq_has_sleeper(wait))
			wake_up_interruptible_sync_poll(wait, flags);
	}
-------------------------------------------------------------------------------
But either way I am not sure this optimization is 100% correct, see below.
> @@ -377,7 +386,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
> * _very_ unlikely case that the pipe was full, but we got
> * no data.
> */
> - if (unlikely(was_full))
> + if (unlikely(was_full) && pipe_check_wq_has_sleeper(&pipe->wr_wait))
> wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
OK, at first glance this can't race with wait_event_xxx(wr_wait, pipe_writable):
wq_has_sleeper() and prepare_to_wait_event() have the necessary barriers.
But what about pipe_poll() ?
To oversimplify, pipe_poll(FMODE_WRITE) does
// poll_wait()
__pollwait() -> add_wait_queue(pipe->wr_wait) -> list_add()
check pipe_full()
and I don't see the (in theory) necessary barrier between list_add()
and LOAD(pipe->head/tail).
Oleg.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-25 9:42 [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write WangYuli
2024-12-25 13:30 ` Andy Shevchenko
2024-12-26 16:00 ` Oleg Nesterov
@ 2024-12-26 19:02 ` Linus Torvalds
2024-12-26 20:11 ` Oleg Nesterov
2 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2024-12-26 19:02 UTC (permalink / raw)
To: WangYuli
Cc: viro, brauner, jack, linux-fsdevel, linux-kernel, yushengjin,
zhangdandan, guanwentao, zhanjun, oliver.sang, ebiederm,
colin.king, josh, penberg, manfred, mingo, jes, hch, aia21, arjan,
jgarzik, neukum, oliver, dada1, axboe, axboe, nickpiggin,
dhowells, nathans, rolandd, tytso, bunk, pbadari, ak, ak, davem,
jsipek, jens.axboe, ramsdell, hch, akpm, randy.dunlap, efault,
rdunlap, haveblue, drepper, dm.n9107, jblunck, davidel,
mtk.manpages, linux-arch, vda.linux, jmorris, serue, hca, rth,
lethal, tony.luck, heiko.carstens, oleg, andi, corbet, crquan,
mszeredi, miklos, peterz, a.p.zijlstra, earl_chew, npiggin,
npiggin, julia, jaxboe, nikai, dchinner, davej, npiggin,
eric.dumazet, tim.c.chen, xemul, tj, serge.hallyn, gorcunov,
levinsasha928, penberg, amwang, bcrl, muthu.lkml, muthur, mjt,
alan, raven, thomas, will.deacon, will, josef, anatol.pomozov,
koverstreet, zab, balbi, gregkh, mfasheh, jlbec, rusty,
asamymuthupa, smani, sbradshaw, jmoyer, sim, ia, dmonakhov,
ebiggers3, socketpair, penguin-kernel, w, kirill.shutemov, mhocko,
vdavydov.dev, vdavydov, hannes, mhocko, minchan, deepa.kernel,
arnd, balbi, swhiteho, konishi.ryusuke, dsterba, vegard.nossum,
axboe, pombredanne, tglx, joe.lawrence, mpatocka, mcgrof,
keescook, linux, jannh, shakeelb, guro, willy, khlebnikov, kirr,
stern, elver, parri.andrea, paulmck, rasibley, jstancek, avagin,
cai, josef, hare, colyli, johannes, sspatil, alex_y_xu, mgorman,
gor, jhubbard, andriy.shevchenko, crope, yzaikin, bfields,
jlayton, kernel, steve, nixiaoming, 0x7f454c46, kuniyu,
alexander.h.duyck, kuni1840, soheil, sridhar.samudrala,
Vincenzo.Frascino, chuck.lever, Kevin.Brodsky, Szabolcs.Nagy,
David.Laight, Mark.Rutland, linux-morello, Luca.Vizzarro,
max.kellermann, adobriyan, lukas, j.granados, djwong,
kent.overstreet, linux, kstewart
On Wed, 25 Dec 2024 at 01:44, WangYuli <wangyuli@uniontech.com> wrote:
>
>
> +config PIPE_SKIP_SLEEPER
> + bool "Skip sleeping processes during pipe read/write"
If the optimization is correct, there is no point to having a config option.
If the optimization is incorrect, there is no point to having the code.
Either way, there's no way we'd ever have a config option for this.
> +pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
> +{
> + return wq_has_sleeper(wq_head);
So generally, the reason this is buggy is that it's racy due to either
CPU memory ordering issues or simply because the sleeper is going to
sleep but hasn't *quite* added itself to the wait queue.
Which is why the wakeup path takes the wq head lock.
Which is the only thing you are actually optimizing away.
> + if (was_full && pipe_check_wq_has_sleeper(&pipe->wr_wait))
> wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
End result: you need to explain why the race cannot exist.
And I think your patch is simply buggy because the race can in fact
exist, because there is no other lock than the waitqueue lock that
enforces memory ordering and protects against the "somebody is just
about to sleep" race.
That said, *if* all the wait queue work was done under the pipe mutex,
we could use the pipe mutex for exclusion instead.
That's actually how it kind of used to work long long ago (not really
- but close: it depended on the BKL originally, and adding waiters to
the wait-queues early before all the tests for full/empty, and
avoiding the races that way)
But now waiters use "wait_event_interruptible_exclusive()" explicitly
outside the pipe mutex, so the waiters and wakers aren't actually
serialized.
And no, you are unlikely to see the races in practice (particularly
the memory ordering ones). So you have to *think* about them, not test
them.
Linus
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-26 19:02 ` Linus Torvalds
@ 2024-12-26 20:11 ` Oleg Nesterov
2024-12-26 20:29 ` Linus Torvalds
2024-12-27 18:39 ` Manfred Spraul
0 siblings, 2 replies; 37+ messages in thread
From: Oleg Nesterov @ 2024-12-26 20:11 UTC (permalink / raw)
To: Linus Torvalds
Cc: WangYuli, viro, brauner, jack, linux-fsdevel, linux-kernel, ...
I mostly agree, see my reply to this patch, but...
On 12/26, Linus Torvalds wrote:
>
> If the optimization is correct, there is no point to having a config option.
>
> If the optimization is incorrect, there is no point to having the code.
>
> Either way, there's no way we'd ever have a config option for this.
Agreed,
> > + if (was_full && pipe_check_wq_has_sleeper(&pipe->wr_wait))
> > wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
>
> End result: you need to explain why the race cannot exist.
Agreed,
> And I think your patch is simply buggy
Agreed again, but probably my reasoning was wrong,
> But now waiters use "wait_event_interruptible_exclusive()" explicitly
> outside the pipe mutex, so the waiters and wakers aren't actually
> serialized.
This is what I don't understand... Could you spell it out?
I _think_ that
wait_event_whatever(WQ, CONDITION);
vs
CONDITION = 1;
if (wq_has_sleeper(WQ))
wake_up_xxx(WQ, ...);
is fine.
Both wq_has_sleeper() and wait_event_whatever()->prepare_to_wait_event()
have the necessary barriers to serialize the waiters and wakers.
Damn I am sure I missed something ;)
> And no, you are unlikely to see the races in practice (particularly
> the memory ordering ones). So you have to *think* about them, not test
> them.
Yes! agreed again.
Oleg.
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-26 20:11 ` Oleg Nesterov
@ 2024-12-26 20:29 ` Linus Torvalds
2024-12-26 20:57 ` Oleg Nesterov
2024-12-27 18:39 ` Manfred Spraul
1 sibling, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2024-12-26 20:29 UTC (permalink / raw)
To: Oleg Nesterov, WangYuli
Cc: linux-fsdevel, Linux Kernel Mailing List, Christian Brauner
[ Ugh, removed the crazy cc list with tons of old addresses ]
On Thu, 26 Dec 2024 at 12:13, Oleg Nesterov <oleg@redhat.com> wrote:
>
> I _think_ that
>
> wait_event_whatever(WQ, CONDITION);
>
> vs
>
> CONDITION = 1;
> if (wq_has_sleeper(WQ))
> wake_up_xxx(WQ, ...);
>
> is fine.
Hmm. I guess wq_has_sleeper() does have a memory barrier, so that
worry of mine was wrong.
So the optimization may be valid (the config option definitely is
not), but I think it needs to be explained much better.
I end up being very nervous about this code because we've had bugs in
this area, exactly because people optimize this code for the unixbench
pipe benchmark.
And then very few real loads have that behavior, although there are
some cases where people really use a pipe as a kind of "token
mechanism" (ie GNU make will do that, I think a few others do too).
Linus
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-26 20:29 ` Linus Torvalds
@ 2024-12-26 20:57 ` Oleg Nesterov
2024-12-27 15:54 ` Oleg Nesterov
0 siblings, 1 reply; 37+ messages in thread
From: Oleg Nesterov @ 2024-12-26 20:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: WangYuli, linux-fsdevel, Linux Kernel Mailing List,
Christian Brauner
On 12/26, Linus Torvalds wrote:
>
> [ Ugh, removed the crazy cc list with tons of old addresses ]
thanks.
> So the optimization may be valid
I don't think so, see my initial reply.
Unlike wait_event(), __pollwait() + the head/tail checks in pipe_poll()
don't have the necessary barriers (at least in theory, afaics) between
add_wait_queue()->list_add() and the LOAD of head/tail.
> (the config option definitely is
> not), but I think it needs to be explained much better.
>
> I end up being very nervous about this code because we've had bugs in
> this area, exactly because people optimize this code for the unixbench
> pipe benchmark.
Agreed!
Oleg.
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-26 20:57 ` Oleg Nesterov
@ 2024-12-27 15:54 ` Oleg Nesterov
2024-12-27 16:43 ` Oleg Nesterov
0 siblings, 1 reply; 37+ messages in thread
From: Oleg Nesterov @ 2024-12-27 15:54 UTC (permalink / raw)
To: Linus Torvalds
Cc: WangYuli, linux-fsdevel, Linux Kernel Mailing List,
Christian Brauner
On 12/26, Oleg Nesterov wrote:
>
> On 12/26, Linus Torvalds wrote:
> >
> > So the optimization may be valid
>
> I don't think so, see my initial reply.
>
> unlike wait_event(), __pollwait() + the head/tail checks in pipe_poll()
> doesn't have the necessary barriers (at least in theory) afaics. Between
> add_wait_queue()->list_add() and LOAD(head/tail).
Hmm...
Even if we add the wq_has_sleeper() check, the "wake up" logic would
still be suboptimal. Let's forget this patch for the moment.
Consider
int main(void)
{
int fd[2], cnt;
char c;
pipe(fd);
if (!fork()) {
// wait until the parent blocks in pipe_write() ->
// wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
sleep(1);
for (cnt = 0; cnt < 4096; ++cnt)
read(fd[0], &c, 1);
return 0;
}
// parent
for (;;)
write(fd[1], &c, 1);
}
In this case the child will wake up the parent 4095 times for no reason;
pipe_writable() == !pipe_full() will still be false until the last
read(fd[0], &c, 1) does
if (!buf->len)
tail = pipe_update_tail(pipe, buf, tail);
and after that the parent can write the next char.
Or did I completely misread this code??
Oleg.
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-27 15:54 ` Oleg Nesterov
@ 2024-12-27 16:43 ` Oleg Nesterov
0 siblings, 0 replies; 37+ messages in thread
From: Oleg Nesterov @ 2024-12-27 16:43 UTC (permalink / raw)
To: Linus Torvalds
Cc: WangYuli, linux-fsdevel, Linux Kernel Mailing List,
Christian Brauner
On 12/27, Oleg Nesterov wrote:
>
> Consider
>
> int main(void)
> {
> int fd[2], cnt;
> char c;
>
> pipe(fd);
>
> if (!fork()) {
> // wait until the parent blocks in pipe_write() ->
> // wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
> sleep(1);
>
> for (cnt = 0; cnt < 4096; ++cnt)
> read(fd[0], &c, 1);
> return 0;
> }
>
> // parent
> for (;;)
> write(fd[1], &c, 1);
> }
>
> In this case the child will wakeup the parent 4095 times for no reason,
> pipe_writable() == !pipe_pull() will still be true until the last
> read(fd[0], &c, 1) does
>
> if (!buf->len)
> tail = pipe_update_tail(pipe, buf, tail);
>
> and after that the parent can write the next char.
perhaps something like below makes sense in this particular case.
Incomplete and ugly, just for illustration.
Oleg.
diff --git a/fs/pipe.c b/fs/pipe.c
index 12b22c2723b7..b8eef9e75639 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -253,7 +253,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
size_t total_len = iov_iter_count(to);
struct file *filp = iocb->ki_filp;
struct pipe_inode_info *pipe = filp->private_data;
- bool was_full, wake_next_reader = false;
+ bool was_full, xxx, wake_next_reader = false;
ssize_t ret;
/* Null read succeeds. */
@@ -277,6 +277,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
unsigned int head = smp_load_acquire(&pipe->head);
unsigned int tail = pipe->tail;
unsigned int mask = pipe->ring_size - 1;
+ xxx = false;
#ifdef CONFIG_WATCH_QUEUE
if (pipe->note_loss) {
@@ -340,8 +341,10 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
buf->len = 0;
}
- if (!buf->len)
+ if (!buf->len) {
tail = pipe_update_tail(pipe, buf, tail);
+ xxx = true;
+ }
total_len -= chars;
if (!total_len)
break; /* common path: read succeeded */
@@ -398,7 +401,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
wake_next_reader = false;
mutex_unlock(&pipe->mutex);
- if (was_full)
+ if (was_full && xxx)
wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
if (wake_next_reader)
wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-26 20:11 ` Oleg Nesterov
2024-12-26 20:29 ` Linus Torvalds
@ 2024-12-27 18:39 ` Manfred Spraul
2024-12-28 14:32 ` Oleg Nesterov
1 sibling, 1 reply; 37+ messages in thread
From: Manfred Spraul @ 2024-12-27 18:39 UTC (permalink / raw)
To: Oleg Nesterov, Linus Torvalds
Cc: viro, brauner, jack, linux-fsdevel, linux-kernel, oliver.sang, ...
[-- Attachment #1: Type: text/plain, Size: 3814 bytes --]
Hi,
(I had to remove many Cc addresses; my mail server refused to send otherwise)
On 12/26/24 9:11 PM, Oleg Nesterov wrote:
> I mostly agree, see my reply to this patch, but...
>
> On 12/26, Linus Torvalds wrote:
>> If the optimization is correct, there is no point to having a config option.
>>
>> If the optimization is incorrect, there is no point to having the code.
>>
>> Either way, there's no way we'd ever have a config option for this.
> Agreed,
>
>>> + if (was_full && pipe_check_wq_has_sleeper(&pipe->wr_wait))
>>> wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
>> End result: you need to explain why the race cannot exist.
> Agreed,
>
>> And I think your patch is simply buggy
> Agreed again, but probably my reasoning was wrong,
>
>> But now waiters use "wait_event_interruptible_exclusive()" explicitly
>> outside the pipe mutex, so the waiters and wakers aren't actually
>> serialized.
> This is what I don't understand... Could you spell ?
> I _think_ that
>
> wait_event_whatever(WQ, CONDITION);
> vs
>
> CONDITION = 1;
> if (wq_has_sleeper(WQ))
> wake_up_xxx(WQ, ...);
>
> is fine.
This pattern is documented in wait.h:
https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/wait.h#L96
Thus if there is an issue, then the documentation should be updated.
> Both wq_has_sleeper() and wait_event_whatever()->prepare_to_wait_event()
> have the necessary barriers to serialize the waiters and wakers.
>
> Damn I am sure I missed something ;)
Actually:
Does this work universally? I.e. can we add the optimization into
__wake_up()?
But I do not understand this comment (from 2.6.0)
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/kernel/fork.c?h=v2.6.0&id=e220fdf7a39b54a758f4102bdd9d0d5706aa32a7
> /*
>  * Note: we use "set_current_state()" _after_ the wait-queue add,
>  * because we need a memory barrier there on SMP, so that any
>  * wake-function that tests for the wait-queue being active
>  * will be guaranteed to see waitqueue addition _or_ subsequent
>  * tests in this thread will see the wakeup having taken place.
>  *
>  * The spin_unlock() itself is semi-permeable and only protects
>  * one way (it only protects stuff inside the critical region and
>  * stops them from bleeding out - it would still allow subsequent
>  * loads to move into the critical region).
>  */
> void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
> {
> 	unsigned long flags;
>
> 	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
> 	spin_lock_irqsave(&q->lock, flags);
> 	if (list_empty(&wait->task_list))
> 		__add_wait_queue(q, wait);
> 	set_current_state(state);
> 	spin_unlock_irqrestore(&q->lock, flags);
> }
set_current_state() now uses smp_store_mb(), which is a memory barrier
_after_ the store. Thus I do not see what enforces that the store
happens before the store for the __add_wait_queue().
--
Manfred
[-- Attachment #2: 0001-__wake_up_common_lock-Optimize-for-empty-wake-queues.patch --]
[-- Type: text/x-patch, Size: 2525 bytes --]
From 7cb8f06ab022e7fc36bacbe65f654822147ec1aa Mon Sep 17 00:00:00 2001
From: Manfred Spraul <manfred@colorfullife.com>
Date: Fri, 27 Dec 2024 18:21:41 +0100
Subject: [PATCH] __wake_up_common_lock: Optimize for empty wake queues
wake_up() must wake up every task that is in the wait queue.
But: If there is concurrency between wake_up() and add_wait_queue(),
then (as long as we are not violating the memory ordering rules) we
can just claim that wake_up() happened "first" and add_wait_queue()
happened afterwards.
From memory ordering perspective:
- If the wait_queue() is empty, then wake_up() just does
spin_lock();
list_empty()
spin_unlock();
- add_wait_queue()/prepare_to_wait() do all kind of operations,
but they may become visible only when the spin_unlock_irqrestore()
is done.
Thus, instead of pairing the memory barrier to the spinlock, and
thus writing to a potentially shared cacheline, we load-acquire
the next pointer from the list.
Risks and side effects:
- The guaranteed memory barrier of wake_up() is reduced to load_acquire.
Previously, there was always a spin_lock()/spin_unlock() pair.
- prepare_to_wait() actually does two operations under spinlock:
It adds current to the wait queue, and it calls set_current_state().
The comment above prepare_to_wait() is not clear to me, thus there
might be further side effects.
Only lightly tested.
No benchmark test done.
---
kernel/sched/wait.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 51e38f5f4701..10d02f652ab8 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -124,6 +124,23 @@ static int __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int m
int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
int nr_exclusive, void *key)
{
+ if (list_empty(&wq_head->head)) {
+ struct list_head *pn;
+
+ /*
+ * pairs with spin_unlock_irqrestore(&wq_head->lock);
+ * We actually do not need to acquire wq_head->lock, we just
+ * need to be sure that there is no prepare_to_wait() that
+ * completed on any CPU before __wake_up was called.
+ * Thus instead of load_acquiring the spinlock and dropping
+ * it again, we load_acquire the next list entry and check
+ * that the list is not empty.
+ */
+ pn = smp_load_acquire(&wq_head->head.next);
+
+ if(pn == &wq_head->head)
+ return 0;
+ }
return __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);
}
EXPORT_SYMBOL(__wake_up);
--
2.47.1
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-27 18:39 ` Manfred Spraul
@ 2024-12-28 14:32 ` Oleg Nesterov
2024-12-28 15:22 ` Oleg Nesterov
0 siblings, 1 reply; 37+ messages in thread
From: Oleg Nesterov @ 2024-12-28 14:32 UTC (permalink / raw)
To: Manfred Spraul
Cc: Linus Torvalds, viro, brauner, jack, linux-fsdevel, linux-kernel, ...
On 12/27, Manfred Spraul wrote:
>
> >I _think_ that
> >
> > wait_event_whatever(WQ, CONDITION);
> >vs
> >
> > CONDITION = 1;
> > if (wq_has_sleeper(WQ))
> > wake_up_xxx(WQ, ...);
> >
> >is fine.
>
> This pattern is documented in wait.h:
>
> https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/wait.h#L96
>
> Thus if there an issue, then the documentation should be updated.
Agreed, basically the same pattern, prepare_to_wait_event() is similar
to prepare_to_wait().
> But I do not understand this comment (from 2.6.0)
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/kernel/fork.c?h=v2.6.0&id=e220fdf7a39b54a758f4102bdd9d0d5706aa32a7
>
> > /*
> >  * Note: we use "set_current_state()" _after_ the wait-queue add,
> >  * because we need a memory barrier there on SMP, so that any
> >  * wake-function that tests for the wait-queue being active
> >  * will be guaranteed to see waitqueue addition _or_ subsequent
> >  * tests in this thread will see the wakeup having taken place.
> >  *
> >  * The spin_unlock() itself is semi-permeable and only protects
> >  * one way (it only protects stuff inside the critical region and
> >  * stops them from bleeding out - it would still allow subsequent
> >  * loads to move into the critical region).
> >  */
...
> set_current_state() now uses smp_store_mb(), which is a memory barrier
> _after_ the store.
And afaics this is what we actually need.
> Thus I do not see what enforces that the store happens
> before the store for the __add_wait_queue().
IIUC this is fine, no need to serialize list_add() and STORE(tsk->__state),
they can be reordered.
But we need mb() between __add_wait_queue + __set_current_state (in any
order) and the subsequent "if (CONDITION)" check.
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -124,6 +124,23 @@ static int __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int m
> int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
> int nr_exclusive, void *key)
> {
> + if (list_empty(&wq_head->head)) {
> + struct list_head *pn;
> +
> + /*
> + * pairs with spin_unlock_irqrestore(&wq_head->lock);
> + * We actually do not need to acquire wq_head->lock, we just
> + * need to be sure that there is no prepare_to_wait() that
> + * completed on any CPU before __wake_up was called.
> + * Thus instead of load_acquiring the spinlock and dropping
> + * it again, we load_acquire the next list entry and check
> + * that the list is not empty.
> + */
> + pn = smp_load_acquire(&wq_head->head.next);
> +
> + if(pn == &wq_head->head)
> + return 0;
> + }
Too subtle for me ;)
I have some concerns, but I need to think a bit more to (try to) actually
understand this change.
Oleg.
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-28 14:32 ` Oleg Nesterov
@ 2024-12-28 15:22 ` Oleg Nesterov
2024-12-28 16:32 ` Oleg Nesterov
2024-12-28 16:45 ` Manfred Spraul
0 siblings, 2 replies; 37+ messages in thread
From: Oleg Nesterov @ 2024-12-28 15:22 UTC (permalink / raw)
To: Manfred Spraul
Cc: Linus Torvalds, viro, brauner, jack, linux-fsdevel, linux-kernel, ...
On 12/28, Oleg Nesterov wrote:
>
> > int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
> > int nr_exclusive, void *key)
> > {
> > + if (list_empty(&wq_head->head)) {
> > + struct list_head *pn;
> > +
> > + /*
> > + * pairs with spin_unlock_irqrestore(&wq_head->lock);
> > + * We actually do not need to acquire wq_head->lock, we just
> > + * need to be sure that there is no prepare_to_wait() that
> > + * completed on any CPU before __wake_up was called.
> > + * Thus instead of load_acquiring the spinlock and dropping
> > + * it again, we load_acquire the next list entry and check
> > + * that the list is not empty.
> > + */
> > + pn = smp_load_acquire(&wq_head->head.next);
> > +
> > + if(pn == &wq_head->head)
> > + return 0;
> > + }
>
> Too subtle for me ;)
>
> I have some concerns, but I need to think a bit more to (try to) actually
> understand this change.
If nothing else, consider
int CONDITION;
wait_queue_head_t WQ;
void wake(void)
{
CONDITION = 1;
wake_up(WQ);
}
void wait(void)
{
DEFINE_WAIT_FUNC(entry, woken_wake_function);
add_wait_queue(WQ, entry);
if (!CONDITION)
wait_woken(entry, ...);
remove_wait_queue(WQ, entry);
}
this code is correct even if LOAD(CONDITION) can leak into the critical
section in add_wait_queue(), so CPU running wait() can actually do
// add_wait_queue
spin_lock(WQ->lock);
LOAD(CONDITION); // false!
list_add(entry, head);
spin_unlock(WQ->lock);
if (!false) // result of the LOAD above
wait_woken(entry, ...);
Now suppose that another CPU executes wake() between LOAD(CONDITION)
and list_add(entry, head). With your patch wait() will miss the event.
The same for __pollwait(), I think...
No?
Oleg.
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-28 15:22 ` Oleg Nesterov
@ 2024-12-28 16:32 ` Oleg Nesterov
2024-12-28 18:53 ` Manfred Spraul
2024-12-28 16:45 ` Manfred Spraul
1 sibling, 1 reply; 37+ messages in thread
From: Oleg Nesterov @ 2024-12-28 16:32 UTC (permalink / raw)
To: Manfred Spraul
Cc: Linus Torvalds, viro, brauner, jack, linux-fsdevel, linux-kernel, ...
On 12/28, Oleg Nesterov wrote:
>
> If nothing else, consider
>
> int CONDITION;
> wait_queue_head_t WQ;
>
> void wake(void)
> {
> CONDITION = 1;
> wake_up(WQ);
> }
>
> void wait(void)
> {
> DEFINE_WAIT_FUNC(entry, woken_wake_function);
>
> add_wait_queue(WQ, entry);
> if (!CONDITION)
> wait_woken(entry, ...);
> remove_wait_queue(WQ, entry);
> }
>
> this code is correct even if LOAD(CONDITION) can leak into the critical
> section in add_wait_queue(), so CPU running wait() can actually do
>
> // add_wait_queue
> spin_lock(WQ->lock);
> LOAD(CONDITION); // false!
> list_add(entry, head);
> spin_unlock(WQ->lock);
>
> if (!false) // result of the LOAD above
> wait_woken(entry, ...);
>
> Now suppose that another CPU executes wake() between LOAD(CONDITION)
> and list_add(entry, head). With your patch wait() will miss the event.
> The same for __pollwait(), I think...
>
> No?
Even simpler,
void wait(void)
{
DEFINE_WAIT(entry);
__set_current_state(XXX);
add_wait_queue(WQ, entry);
if (!CONDITION)
schedule();
remove_wait_queue(WQ, entry);
__set_current_state(TASK_RUNNING);
}
This code is ugly but currently correct unless I am totally confused.
Oleg.
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-28 15:22 ` Oleg Nesterov
2024-12-28 16:32 ` Oleg Nesterov
@ 2024-12-28 16:45 ` Manfred Spraul
2024-12-29 11:57 ` Oleg Nesterov
1 sibling, 1 reply; 37+ messages in thread
From: Manfred Spraul @ 2024-12-28 16:45 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, WangYuli, linux-fsdevel,
Linux Kernel Mailing List, Christian Brauner
Hi Oleg,
On 12/28/24 4:22 PM, Oleg Nesterov wrote:
> On 12/28, Oleg Nesterov wrote:
>>> int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
>>> int nr_exclusive, void *key)
>>> {
>>> + if (list_empty(&wq_head->head)) {
>>> + struct list_head *pn;
>>> +
>>> + /*
>>> + * pairs with spin_unlock_irqrestore(&wq_head->lock);
>>> + * We actually do not need to acquire wq_head->lock, we just
>>> + * need to be sure that there is no prepare_to_wait() that
>>> + * completed on any CPU before __wake_up was called.
>>> + * Thus instead of load_acquiring the spinlock and dropping
>>> + * it again, we load_acquire the next list entry and check
>>> + * that the list is not empty.
>>> + */
>>> + pn = smp_load_acquire(&wq_head->head.next);
>>> +
>>> + if(pn == &wq_head->head)
>>> + return 0;
>>> + }
>> Too subtle for me ;)
>>
>> I have some concerns, but I need to think a bit more to (try to) actually
>> understand this change.
> If nothing else, consider
>
> int CONDITION;
> wait_queue_head_t WQ;
>
> void wake(void)
> {
> CONDITION = 1;
> wake_up(WQ);
> }
>
> void wait(void)
> {
> DEFINE_WAIT_FUNC(entry, woken_wake_function);
>
> add_wait_queue(WQ, entry);
> if (!CONDITION)
> wait_woken(entry, ...);
> remove_wait_queue(WQ, entry);
> }
>
> this code is correct even if LOAD(CONDITION) can leak into the critical
> section in add_wait_queue(), so CPU running wait() can actually do
>
> // add_wait_queue
> spin_lock(WQ->lock);
> LOAD(CONDITION); // false!
> list_add(entry, head);
> spin_unlock(WQ->lock);
>
> if (!false) // result of the LOAD above
> wait_woken(entry, ...);
>
> Now suppose that another CPU executes wake() between LOAD(CONDITION)
> and list_add(entry, head). With your patch wait() will miss the event.
> The same for __pollwait(), I think...
>
> No?
Yes, you are right.
CONDITION = 1 is, in the worst case, made visible by the store_release()
in spin_unlock(). This pairs with the load_acquire in spin_lock(), thus
LOAD(CONDITION) is safe.
It could still work for prepare_to_wait() and thus fs/pipe, since there
the smp_mb() in set_current_state() prevents earlier execution.
--
Manfred
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-28 16:32 ` Oleg Nesterov
@ 2024-12-28 18:53 ` Manfred Spraul
2024-12-29 11:54 ` Oleg Nesterov
0 siblings, 1 reply; 37+ messages in thread
From: Manfred Spraul @ 2024-12-28 18:53 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, WangYuli, linux-fsdevel,
Linux Kernel Mailing List, Christian Brauner
Hi Oleg,
On 12/28/24 5:32 PM, Oleg Nesterov wrote:
> On 12/28, Oleg Nesterov wrote:
>> If nothing else, consider
>>
>> int CONDITION;
>> wait_queue_head_t WQ;
>>
>> void wake(void)
>> {
>> CONDITION = 1;
>> wake_up(WQ);
>> }
>>
>> void wait(void)
>> {
>> DEFINE_WAIT_FUNC(entry, woken_wake_function);
>>
>> add_wait_queue(WQ, entry);
>> if (!CONDITION)
>> wait_woken(entry, ...);
>> remove_wait_queue(WQ, entry);
>> }
>>
>> this code is correct even if LOAD(CONDITION) can leak into the critical
>> section in add_wait_queue(), so CPU running wait() can actually do
>>
>> // add_wait_queue
>> spin_lock(WQ->lock);
>> LOAD(CONDITION); // false!
>> list_add(entry, head);
>> spin_unlock(WQ->lock);
>>
>> if (!false) // result of the LOAD above
>> wait_woken(entry, ...);
>>
>> Now suppose that another CPU executes wake() between LOAD(CONDITION)
>> and list_add(entry, head). With your patch wait() will miss the event.
>> The same for __pollwait(), I think...
>>
>> No?
> Even simpler,
>
> void wait(void)
> {
> DEFINE_WAIT(entry);
>
> __set_current_state(XXX);
> add_wait_queue(WQ, entry);
>
> if (!CONDITION)
> schedule();
>
> remove_wait_queue(WQ, entry);
> __set_current_state(TASK_RUNNING);
> }
>
> This code is ugly but currently correct unless I am totally confused.
It is a change of the add_wait_queue() path, and thus impacts all callers.
With (busybox) "find /sys /proc | grep aaaabbbccc", I've seen 16385
wakeup calls with empty queue, and just 6 with an entry in the queue.
But on other workloads, the ratio was more something like 75%
empty/25%with entries.
I.e.: we would have long discussions if the change only helped some
use cases and had a negative impact on others.
And: Your proposal is in conflict with
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/kernel/fork.c?h=v2.6.0&id=e220fdf7a39b54a758f4102bdd9d0d5706aa32a7
But I do not see the issue, the worst possible scenario should be something like:
// add_wait_queue
spin_lock(WQ->lock);
LOAD(CONDITION); // false!
list_add(entry, head);
STORE(current_state)
spin_unlock(WQ->lock);
--
Manfred
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-28 18:53 ` Manfred Spraul
@ 2024-12-29 11:54 ` Oleg Nesterov
0 siblings, 0 replies; 37+ messages in thread
From: Oleg Nesterov @ 2024-12-29 11:54 UTC (permalink / raw)
To: Manfred Spraul
Cc: Linus Torvalds, WangYuli, linux-fsdevel,
Linux Kernel Mailing List, Christian Brauner
Hi Manfred,
Sorry, I don't understand, perhaps you misunderstood me too.
On 12/28, Manfred Spraul wrote:
>
> >Even simpler,
> >
> > void wait(void)
> > {
> > DEFINE_WAIT(entry);
> >
> > __set_current_state(XXX);
> > add_wait_queue(WQ, entry);
> >
> > if (!CONDITION)
> > schedule();
> >
> > remove_wait_queue(WQ, entry);
> > __set_current_state(TASK_RUNNING);
> > }
> >
> >This code is ugly but currently correct unless I am totally confused.
What I tried to say: the code above is another (simpler) example of
the currently correct (afaics) code which will be broken by your patch.
Of course, wait() assumes that
void wake(void)
{
CONDITION = 1;
wake_up(WQ);
}
calls __wake_up_common_lock() and takes WQ->lock unconditionally, and
thus wait() doesn't need the additional barriers.
> And: Your proposal is in conflict with
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/kernel/fork.c?h=v2.6.0&id=e220fdf7a39b54a758f4102bdd9d0d5706aa32a7
I proposed nothing ;) But yes sure, this code doesn't match the comment
above waitqueue_active(), and that is why the wake() paths can't check
list_empty() to avoid __wake_up_common_lock().
> But I do not see the issue, the worst possible scenario should be something like:
>
> // add_wait_queue
> spin_lock(WQ->lock);
> LOAD(CONDITION); // false!
> list_add(entry, head);
> STORE(current_state)
> spin_unlock(WQ->lock);
Again, wake() can happen between LOAD() and list_add()...
But sorry again, I guess I completely misunderstood you...
Oleg.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-28 16:45 ` Manfred Spraul
@ 2024-12-29 11:57 ` Oleg Nesterov
2024-12-29 12:41 ` Manfred Spraul
0 siblings, 1 reply; 37+ messages in thread
From: Oleg Nesterov @ 2024-12-29 11:57 UTC (permalink / raw)
To: Manfred Spraul
Cc: Linus Torvalds, WangYuli, linux-fsdevel,
Linux Kernel Mailing List, Christian Brauner
On 12/28, Manfred Spraul wrote:
>
> On 12/28/24 4:22 PM, Oleg Nesterov wrote:
> >
> >Now suppose that another CPU executes wake() between LOAD(CONDITION)
> >and list_add(entry, head). With your patch wait() will miss the event.
> >The same for __pollwait(), I think...
...
> It could still work for prepare_to_wait and thus fs/pipe, since then the
> smp_mb() in set_current_state() prevents earlier execution.
Not sure, please see the note about __pollwait() above.
I think that your patch (and the original patch from WangYuli) has the same
problem with pipe_poll()->poll_wait()->__pollwait().
Oleg.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-29 11:57 ` Oleg Nesterov
@ 2024-12-29 12:41 ` Manfred Spraul
2024-12-29 13:05 ` Oleg Nesterov
0 siblings, 1 reply; 37+ messages in thread
From: Manfred Spraul @ 2024-12-29 12:41 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, WangYuli, linux-fsdevel,
Linux Kernel Mailing List, Christian Brauner
Hi Oleg,
On 12/29/24 12:57 PM, Oleg Nesterov wrote:
> On 12/28, Manfred Spraul wrote:
>> On 12/28/24 4:22 PM, Oleg Nesterov wrote:
>>> Now suppose that another CPU executes wake() between LOAD(CONDITION)
>>> and list_add(entry, head). With your patch wait() will miss the event.
>>> The same for __pollwait(), I think...
> ...
>
>> It could still work for prepare_to_wait and thus fs/pipe, since then the
>> smp_mb() in set_current_state() prevents earlier execution.
From now on, I'll try to follow standard patterns:
every memory barrier must be paired. And adding barriers to common
functions for potentially rare situations is not allowed
(unless it is a bugfix).
And then enumerate all codepaths:
For the wait_event users: we have an smp_mb() in prepare_to_wait(); it
could pair with the barrier in wq_has_sleeper().
> Not sure, please see the note about __pollwait() above.
>
> I think that your patch (and the original patch from WangYuli) has the same
> problem with pipe_poll()->poll_wait()->__pollwait().
What is the memory barrier for pipe_poll()?
There is poll_wait()->__pollwait()->add_wait_queue()->spin_unlock(),
thus only a store-release.
And then READ_ONCE(), i.e. no memory barrier.
Thus the CPU would be free to load pipe->head and pipe->tail before
adding the entry to the poll table.
Correct?
--
Manfred
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-29 12:41 ` Manfred Spraul
@ 2024-12-29 13:05 ` Oleg Nesterov
2024-12-29 13:13 ` Oleg Nesterov
0 siblings, 1 reply; 37+ messages in thread
From: Oleg Nesterov @ 2024-12-29 13:05 UTC (permalink / raw)
To: Manfred Spraul
Cc: Linus Torvalds, WangYuli, linux-fsdevel,
Linux Kernel Mailing List, Christian Brauner
On 12/29, Manfred Spraul wrote:
>
> >I think that your patch (and the original patch from WangYuli) has the same
> >problem with pipe_poll()->poll_wait()->__pollwait().
>
> What is the memory barrier for pipe_poll()?
>
> There is poll_wait()->__pollwait()->add_wait_queue()->spin_unlock(). thus
> only store_release.
>
> And then READ_ONCE(), i.e. no memory barrier.
>
> Thus the CPU would be free to load pipe->head and pipe->tail before adding
> the entry to the poll table.
>
> Correct?
Yes, this was my thinking.
See also my initial reply to WangYuli
https://lore.kernel.org/all/20241226160007.GA11118@redhat.com/
Do you agree?
Oleg.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-29 13:05 ` Oleg Nesterov
@ 2024-12-29 13:13 ` Oleg Nesterov
2024-12-29 19:54 ` Manfred Spraul
0 siblings, 1 reply; 37+ messages in thread
From: Oleg Nesterov @ 2024-12-29 13:13 UTC (permalink / raw)
To: Manfred Spraul
Cc: Linus Torvalds, WangYuli, linux-fsdevel,
Linux Kernel Mailing List, Christian Brauner
Sorry for the noise...
and currently this is fine. But if we want to add the wq_has_sleeper()
checks into fs/pipe.c then pipe_poll() needs smp_mb() after it calls
poll_wait().
Agreed?
On 12/29, Oleg Nesterov wrote:
>
> On 12/29, Manfred Spraul wrote:
> >
> > >I think that your patch (and the original patch from WangYuli) has the same
> > >problem with pipe_poll()->poll_wait()->__pollwait().
> >
> > What is the memory barrier for pipe_poll()?
> >
> > There is poll_wait()->__pollwait()->add_wait_queue()->spin_unlock(). thus
> > only store_release.
> >
> > And then READ_ONCE(), i.e. no memory barrier.
> >
> > Thus the CPU would be free to load pipe->head and pipe->tail before adding
> > the entry to the poll table.
> >
> > Correct?
>
> Yes, this was my thinking.
>
> See also my initial reply to WangYuli
> https://lore.kernel.org/all/20241226160007.GA11118@redhat.com/
>
> Do you agree?
>
> Oleg.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-29 13:13 ` Oleg Nesterov
@ 2024-12-29 19:54 ` Manfred Spraul
2024-12-30 15:38 ` Oleg Nesterov
0 siblings, 1 reply; 37+ messages in thread
From: Manfred Spraul @ 2024-12-29 19:54 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, WangYuli, linux-fsdevel,
Linux Kernel Mailing List, Christian Brauner
Hi Oleg,
On 12/29/24 2:13 PM, Oleg Nesterov wrote:
> Sorry for the noise...
>
> and currently this is fine. But if we want to add the wq_has_sleeper()
> checks into fs/pipe.c then pipe_poll() needs smp_mb() after it calls
> poll_wait().
>
> Agreed?
Yes, agreed.
Just the comment in pipe_poll() was a bit tricky for me.
https://elixir.bootlin.com/linux/v6.12.6/source/fs/pipe.c#L670
If I understand it right, from a memory-ordering perspective, it is
undefined whether pipe->head/tail is written before or after updating the
linked list in the poll table entry.
But: reading cannot happen before taking the spinlock.
CPU1:
pipe_poll()-> poll_wait()->__pollwait()->add_wait_queue()->spin_lock()
LOAD(pipe->head, pipe->tail)
<<<<< insert here CPU2
<add to wait queue>
pipe_poll()-> poll_wait()->__pollwait()->add_wait_queue()->spin_unlock()
<use pipe->head, pipe->tail>
CPU2:
pipe_update_tail()
mutex_unlock()
wake_up_interruptible_sync_poll
spin_lock() >>>>> this closes the potential race
perhaps:
/*
* .. and only then can you do the racy tests. That way,
* if something changes and you got it wrong, the poll
* table entry will wake you up and fix it.
+ * The memory barrier for this READ_ONCE is provided by
+ * poll_wait()->__pollwait()->add_wait_queue()->spin_lock()
+ * It pairs with the spin_lock() in wake_up
*/
head = READ_ONCE(pipe->head);
> On 12/29, Oleg Nesterov wrote:
>> On 12/29, Manfred Spraul wrote:
>>>> I think that your patch (and the original patch from WangYuli) has the same
>>>> problem with pipe_poll()->poll_wait()->__pollwait().
>>> What is the memory barrier for pipe_poll()?
>>>
>>> There is poll_wait()->__pollwait()->add_wait_queue()->spin_unlock(). thus
>>> only store_release.
>>>
>>> And then READ_ONCE(), i.e. no memory barrier.
>>>
>>> Thus the CPU would be free to load pipe->head and pipe->tail before adding
>>> the entry to the poll table.
>>>
>>> Correct?
>> Yes, this was my thinking.
>>
>> See also my initial reply to WangYuli
>> https://lore.kernel.org/all/20241226160007.GA11118@redhat.com/
>>
>> Do you agree?
>>
>> Oleg.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-29 19:54 ` Manfred Spraul
@ 2024-12-30 15:38 ` Oleg Nesterov
2024-12-31 11:14 ` Manfred Spraul
2025-01-04 21:15 ` RFC: Checkpatch: Introduce list of functions that need memory barriers Manfred Spraul
0 siblings, 2 replies; 37+ messages in thread
From: Oleg Nesterov @ 2024-12-30 15:38 UTC (permalink / raw)
To: Manfred Spraul
Cc: Linus Torvalds, WangYuli, linux-fsdevel,
Linux Kernel Mailing List, Christian Brauner
Hi Manfred,
On 12/29, Manfred Spraul wrote:
>
> Hi Oleg,
>
> On 12/29/24 2:13 PM, Oleg Nesterov wrote:
> >Sorry for the noise...
> >
> >and currently this is fine.
Heh, please see below.
> But if we want to add the wq_has_sleeper()
> >checks into fs/pipe.c then pipe_poll() needs smp_mb() after it calls
> >poll_wait().
> >
> >Agreed?
>
> Yes, agreed.
>
> Just the comment in pipe_poll() was a bit tricky for me.
Well yes, but... It turns out I didn't grep enough.
See fs/splice.c and wakeup_pipe_readers/writers (which should use
wq_has_sleeper() for grep's sake). And I don't understand why these helpers
use key == NULL...
So it seems that pipe_poll() already needs smp_mb() to fix the current code,
at least in theory. I'll recheck and send the patch(es).
Oleg.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-30 15:38 ` Oleg Nesterov
@ 2024-12-31 11:14 ` Manfred Spraul
2024-12-31 19:38 ` Linus Torvalds
2025-01-04 21:15 ` RFC: Checkpatch: Introduce list of functions that need memory barriers Manfred Spraul
1 sibling, 1 reply; 37+ messages in thread
From: Manfred Spraul @ 2024-12-31 11:14 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, WangYuli, linux-fsdevel,
Linux Kernel Mailing List, Christian Brauner, 1vier1,
Manfred Spraul
Hi Oleg,
just FYI, I did some quick tests with:
- your changes to fs/pipe.c
- my change, to skip locking in wake-up (and some smp_mb())
- statistics, to check how often wake_up is called and how often the
wait queue is actually empty
Known issue: statistics printing every 10 seconds doesn't work, it prints
at erratic times. And the comment in __wake_up is still wrong; the
memory barrier would pair with the smp_mb() after updating wq_head->head.
Result: (all with a 2 core 4 thread i3, fully idle system)
- your change has no impact on 'find /proc /sys | grep doesnotexist'
(using busybox)
- Running your test app for around 100 seconds
- 3 wakeups with non-empty queue
- 26 wakeup with empty queue
- 2107 __add_wait_queue
- find|grep produces insane numbers of wakeups. I've seen 20k, and I've
now seen 50k wakeup calls, with just around 2k __add_wait_queue calls,
...
Thus, at least for pipe:
Should we add the missing memory barriers and switch to
waitqueue_active() in front of all wakeup calls?
---
fs/pipe.c | 13 +++++++------
include/linux/wait.h | 5 +++++
kernel/sched/wait.c | 45 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 57 insertions(+), 6 deletions(-)
diff --git a/fs/pipe.c b/fs/pipe.c
index 12b22c2723b7..27ffb650f131 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -253,7 +253,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
size_t total_len = iov_iter_count(to);
struct file *filp = iocb->ki_filp;
struct pipe_inode_info *pipe = filp->private_data;
- bool was_full, wake_next_reader = false;
+ bool wake_writer = false, wake_next_reader = false;
ssize_t ret;
/* Null read succeeds. */
@@ -271,7 +271,6 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
* (WF_SYNC), because we want them to get going and generate more
* data for us.
*/
- was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
for (;;) {
/* Read ->head with a barrier vs post_one_notification() */
unsigned int head = smp_load_acquire(&pipe->head);
@@ -340,8 +339,10 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
buf->len = 0;
}
- if (!buf->len)
+ if (!buf->len) {
+ wake_writer |= pipe_full(head, tail, pipe->max_usage);
tail = pipe_update_tail(pipe, buf, tail);
+ }
total_len -= chars;
if (!total_len)
break; /* common path: read succeeded */
@@ -377,7 +378,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
* _very_ unlikely case that the pipe was full, but we got
* no data.
*/
- if (unlikely(was_full))
+ if (unlikely(wake_writer))
wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
@@ -391,14 +392,14 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
return -ERESTARTSYS;
mutex_lock(&pipe->mutex);
- was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
wake_next_reader = true;
+ wake_writer = false;
}
if (pipe_empty(pipe->head, pipe->tail))
wake_next_reader = false;
mutex_unlock(&pipe->mutex);
- if (was_full)
+ if (wake_writer)
wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
if (wake_next_reader)
wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 6d90ad974408..0fdad3c3c513 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -166,6 +166,7 @@ extern void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wai
extern void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
extern void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
+extern atomic_t g_add_count;
static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
{
struct list_head *head = &wq_head->head;
@@ -177,6 +178,8 @@ static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait
head = &wq->entry;
}
list_add(&wq_entry->entry, head);
+ smp_mb();
+ atomic_inc(&g_add_count);
}
/*
@@ -192,6 +195,8 @@ __add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_en
static inline void __add_wait_queue_entry_tail(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
{
list_add_tail(&wq_entry->entry, &wq_head->head);
+ smp_mb();
+ atomic_inc(&g_add_count);
}
static inline void
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 51e38f5f4701..07487429dddf 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -110,6 +110,10 @@ static int __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int m
return nr_exclusive - remaining;
}
+#if 1
+atomic_t g_add_count = ATOMIC_INIT(0);
+#endif
+
/**
* __wake_up - wake up threads blocked on a waitqueue.
* @wq_head: the waitqueue
@@ -124,6 +128,47 @@ static int __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int m
int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
int nr_exclusive, void *key)
{
+#if 1
+static atomic_t g_slow = ATOMIC_INIT(0);
+static atomic_t g_mid = ATOMIC_INIT(0);
+static atomic_t g_fast = ATOMIC_INIT(0);
+static u64 printtime = 10*HZ;
+#endif
+ if (list_empty(&wq_head->head)) {
+ struct list_head *pn;
+
+ /*
+ * pairs with spin_unlock_irqrestore(&wq_head->lock);
+ * We actually do not need to acquire wq_head->lock, we just
+ * need to be sure that there is no prepare_to_wait() that
+ * completed on any CPU before __wake_up was called.
+ * Thus instead of load_acquiring the spinlock and dropping
+ * it again, we load_acquire the next list entry and check
+ * that the list is not empty.
+ */
+ pn = smp_load_acquire(&wq_head->head.next);
+
+ if(pn == &wq_head->head) {
+#if 1
+ atomic_inc(&g_fast);
+#endif
+ return 0;
+ } else {
+#if 1
+ atomic_inc(&g_mid);
+#endif
+ }
+ } else {
+#if 1
+ atomic_inc(&g_slow);
+#endif
+ }
+#if 1
+ if (get_jiffies_64() > printtime) {
+ printtime = get_jiffies_64() + 10*HZ;
+ pr_info("__wakeup: slow/obvious: %d, mid/nearly raced: %d, fast: %d, add: %d.\n", atomic_read(&g_slow), atomic_read(&g_mid), atomic_read(&g_fast), atomic_read(&g_add_count));
+ }
+#endif
return __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);
}
EXPORT_SYMBOL(__wake_up);
--
2.47.1
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-31 11:14 ` Manfred Spraul
@ 2024-12-31 19:38 ` Linus Torvalds
2024-12-31 20:24 ` Oleg Nesterov
0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2024-12-31 19:38 UTC (permalink / raw)
To: Manfred Spraul
Cc: Oleg Nesterov, WangYuli, linux-fsdevel, Linux Kernel Mailing List,
Christian Brauner, 1vier1
On Tue, 31 Dec 2024 at 03:14, Manfred Spraul <manfred@colorfullife.com> wrote:
>
> Should we add the missing memory barriers and switch to
> waitqueue_active() in front of all wakeup calls?
If we *really* want to optimize this, we could even get rid of the
memory barrier at least on x86, because
(a) mutex_unlock() is a full memory barrier on x86 (it involves a
locked cmpxchg)
(b) the condition is always set inside the locked region
(c) the wakeup is after releasing the lock
but this is architecture-specific (i.e. "mutex_unlock()" is not
*guaranteed* to be a full memory barrier; on other architectures it
might be only a release barrier).
We have "smp_mb__after_atomic()" and "smp_mb__after_spinlock()", but
we don't have a "smp_mb__after_mutex_unlock()".
So we'd have to add a new helper or config option.
Anyway, I'm perfectly happy to get these optimizations, but because of
historical trouble in this area, I want any patches to be very clearly
documented.
Oleg's patch to only wake up writers when readers have actually opened
up a slot may not make any actual difference (because readers in
*practice* always do big reads), but I like it because it feels
obviously correct and doesn't have any locking or memory ordering
subtleties (and actually makes the logic more logical and
straightforward).
Linus
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-31 19:38 ` Linus Torvalds
@ 2024-12-31 20:24 ` Oleg Nesterov
2024-12-31 22:31 ` Linus Torvalds
0 siblings, 1 reply; 37+ messages in thread
From: Oleg Nesterov @ 2024-12-31 20:24 UTC (permalink / raw)
To: Linus Torvalds
Cc: Manfred Spraul, WangYuli, linux-fsdevel,
Linux Kernel Mailing List, Christian Brauner, 1vier1
On 12/31, Linus Torvalds wrote:
>
> Oleg's patch to only wake up writers when readers have actually opened
> up a slot may not make any actual difference (because readers in
> *practice* always do big reads),
Yes, yes, that patch is mostly cleanup/simplification. I'll write the
changelog and send it after the holidays. Plus probably another one to fix
the theoretical problem (I need to recheck) in wakeup_pipe_readers/writers.
But let me ask another question right now. what do you think about another
minor change below?
Again, mostly to make this logic more understandable. Although I am not
sure I really understand it...
Oleg.
diff --git a/fs/pipe.c b/fs/pipe.c
index 82fede0f2111..ac3e7584726a 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -661,8 +661,11 @@ pipe_poll(struct file *filp, poll_table *wait)
struct pipe_inode_info *pipe = filp->private_data;
unsigned int head, tail;
+#ifdef CONFIG_EPOLL
/* Epoll has some historical nasty semantics, this enables them */
- WRITE_ONCE(pipe->poll_usage, true);
+ if ((filp->f_mode & FMODE_READ) && filp->f_ep)
+ WRITE_ONCE(pipe->poll_usage, true);
+#endif
/*
* Reading pipe state only -- no need for acquiring the semaphore.
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-31 20:24 ` Oleg Nesterov
@ 2024-12-31 22:31 ` Linus Torvalds
2025-01-02 13:57 ` Oleg Nesterov
0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2024-12-31 22:31 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Manfred Spraul, WangYuli, linux-fsdevel,
Linux Kernel Mailing List, Christian Brauner, 1vier1
On Tue, 31 Dec 2024 at 12:25, Oleg Nesterov <oleg@redhat.com> wrote:
>
> But let me ask another question right now. what do you think about another
> minor change below?
Probably ok. Although I'm not convinced it makes things any more readable.
> Again, mostly to make this logic more understandable. Although I am not
> sure I really understand it...
So see commit fe67f4dd8daa ("pipe: do FASYNC notifications for every
pipe IO, not just state changes") on why that crazy poll_usage thing
exists.
Basically, epoll (and FASYNC) create *events*, not "state
transitions". And people have sadly depended on getting an event for
each pipe write, rather than getting an event on state transition.
So epoll and FASYNC can't depend on "users only care about the pipe
becoming readable". Which is why we have that poll_usage thing, and
why the SIGIO is unconditional.
The
#ifdef CONFIG_EPOLL
addition is straightforward enough and matches the existing comment.
But you adding the FMODE_READ test should probably get a new comment
about how we only do this for epoll readability, not for writability..
Technically epoll might notice "read done" vs "became writable", but
nobody ever reported that, so we never did that poll_usage hack for
the pipe_read() side. So "poll_usage" really is very much a hack for
"user space depended on _this_ particular thing being an event, not
that other thing".
Linus
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-31 22:31 ` Linus Torvalds
@ 2025-01-02 13:57 ` Oleg Nesterov
0 siblings, 0 replies; 37+ messages in thread
From: Oleg Nesterov @ 2025-01-02 13:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Manfred Spraul, WangYuli, linux-fsdevel,
Linux Kernel Mailing List, Christian Brauner, 1vier1
On 12/31, Linus Torvalds wrote:
>
> On Tue, 31 Dec 2024 at 12:25, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > But let me ask another question right now. what do you think about another
> > minor change below?
>
> Probably ok. Although I'm not convinced it makes things any more readable.
OK, let's forget it for now.
> > Again, mostly to make this logic more understandable. Although I am not
> > sure I really understand it...
>
> So see commit fe67f4dd8daa ("pipe: do FASYNC notifications for every
> pipe IO, not just state changes") on why that crazy poll_usage thing
> exists.
Ah. Yes, yes, thanks, I have already read this commit/changelog, because
I was confused by the unconditional kill_fasync()'s in pipe_read/write.
So I guess I mostly understand the "edge-triggered" issues.
As for epoll, I even wrote the stupid test-case:
#include <unistd.h>
#include <assert.h>
#include <sys/epoll.h>

int main(void)
{
	int pfd[2], efd;
	struct epoll_event evt = { .events = EPOLLIN | EPOLLET };

	pipe(pfd);
	efd = epoll_create1(0);
	epoll_ctl(efd, EPOLL_CTL_ADD, pfd[0], &evt);

	for (int i = 0; i < 2; ++i) {
		write(pfd[1], "", 1);
		assert(epoll_wait(efd, &evt, 1, 0) == 1);
	}

	return 0;
}
Without the pipe->poll_usage check in pipe_write(), the assert() fails on
the 2nd iteration. BTW, I think pipe_write() needs
READ_ONCE(pipe->poll_usage); KCSAN can complain.
> The
>
> #ifdef CONFIG_EPOLL
>
> addition is straightforward enough and matches the existing comment.
>
> But you adding the FMODE_READ test should probably get a new comment
> about how we only do this for epoll readability, not for writability..
Agreed. perhaps I'll try to make V2 later...
The unconditional WRITE_ONCE(pipe->poll_usage) in pipe_poll() may hide
some subtle race between pipe_write() and the "normal" select/poll, that
is why I'd like to make ->poll_usage depend on filp->f_ep != NULL.
Thanks!
Oleg.
^ permalink raw reply [flat|nested] 37+ messages in thread
* RFC: Checkpatch: Introduce list of functions that need memory barriers.
2024-12-30 15:38 ` Oleg Nesterov
2024-12-31 11:14 ` Manfred Spraul
@ 2025-01-04 21:15 ` Manfred Spraul
1 sibling, 0 replies; 37+ messages in thread
From: Manfred Spraul @ 2025-01-04 21:15 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Joe Perches, Linus Torvalds, WangYuli, linux-fsdevel,
Linux Kernel Mailing List, Christian Brauner, 1vier1,
Manfred Spraul
(Depending on the outcome of the discussion in
https://lore.kernel.org/lkml/20250102163320.GA17691@redhat.com/T/#u
it does not make sense to change it now, but I do not see a reason
to single out waitqueue_active().
The code is copy&paste, and it seems to work.
There is already a recommendation for spin_is_locked().)
2nd spinoff from the fs/pipe discussion, and orthogonal to both
the initial topic (use waitqueue_active()) and the 2nd topic
(do not wake up writers if the pipe is not writable)
Memory barriers must be paired to be effective, thus it is
mandatory to add comments that explain the pairing.
Several functions depend on the caller to take care of this.
There is already a request to add a comment for waitqueue_active(),
but there are further comparable functions:
- wq_has_sleeper(): no barrier is needed if the function
is paired with prepare_to_wait(). With add_wait_queue(),
a barrier is needed after the add_wait_queue() call.
- spin_is_locked(): the ACQUIRE barrier from spin_lock()
is on the load, not on the store. Thus spin_is_locked()
may return false even though the lock is already taken.
Avoid to use it outside of debug code.
(and, for completeness)
- waitqueue_active(): Usually, a memory barrier before
the call is needed, and if add_wait_queue() is used, also
a barrier after add_wait_queue. See wait.h for details.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
scripts/checkpatch.pl | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 9eed3683ad76..8bf5849ee108 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -6685,11 +6685,17 @@ sub process {
"__smp memory barriers shouldn't be used outside barrier.h and asm-generic\n" . $herecurr);
}
-# check for waitqueue_active without a comment.
- if ($line =~ /\bwaitqueue_active\s*\(/) {
+# check for functions that are only safe with memory barriers without a comment.
+ my $need_barriers = qr{
+ waitqueue_active|
+ wq_has_sleeper|
+ spin_is_locked
+ }x;
+
+ if ($line =~ /\b(?:$need_barriers)\s*\(/) {
if (!ctx_has_comment($first_line, $linenr)) {
- WARN("WAITQUEUE_ACTIVE",
- "waitqueue_active without comment\n" . $herecurr);
+ WARN("NEED_MEMORY_BARRIERS",
+ "function that usually depends on manual memory barriers without comment\n" . $herecurr);
}
}
--
2.47.1
^ permalink raw reply related [flat|nested] 37+ messages in thread
end of thread, other threads:[~2025-01-04 21:16 UTC | newest]
Thread overview: 37+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-25 9:42 [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write WangYuli
2024-12-25 13:30 ` Andy Shevchenko
2024-12-25 13:53 ` Kent Overstreet
2024-12-25 16:04 ` Mateusz Guzik
2024-12-25 16:32 ` Kent Overstreet
2024-12-25 17:22 ` Mateusz Guzik
2024-12-25 17:41 ` Kent Overstreet
2024-12-25 15:42 ` WangYuli
2024-12-25 16:00 ` Willy Tarreau
2024-12-25 16:32 ` WangYuli
2024-12-25 16:56 ` Willy Tarreau
2024-12-26 16:00 ` Oleg Nesterov
2024-12-26 19:02 ` Linus Torvalds
2024-12-26 20:11 ` Oleg Nesterov
2024-12-26 20:29 ` Linus Torvalds
2024-12-26 20:57 ` Oleg Nesterov
2024-12-27 15:54 ` Oleg Nesterov
2024-12-27 16:43 ` Oleg Nesterov
2024-12-27 18:39 ` Manfred Spraul
2024-12-28 14:32 ` Oleg Nesterov
2024-12-28 15:22 ` Oleg Nesterov
2024-12-28 16:32 ` Oleg Nesterov
2024-12-28 18:53 ` Manfred Spraul
2024-12-29 11:54 ` Oleg Nesterov
2024-12-28 16:45 ` Manfred Spraul
2024-12-29 11:57 ` Oleg Nesterov
2024-12-29 12:41 ` Manfred Spraul
2024-12-29 13:05 ` Oleg Nesterov
2024-12-29 13:13 ` Oleg Nesterov
2024-12-29 19:54 ` Manfred Spraul
2024-12-30 15:38 ` Oleg Nesterov
2024-12-31 11:14 ` Manfred Spraul
2024-12-31 19:38 ` Linus Torvalds
2024-12-31 20:24 ` Oleg Nesterov
2024-12-31 22:31 ` Linus Torvalds
2025-01-02 13:57 ` Oleg Nesterov
2025-01-04 21:15 ` RFC: Checkpatch: Introduce list of functions that need memory barriers Manfred Spraul