* [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
@ 2024-12-25 9:42 WangYuli
2024-12-25 13:30 ` Andy Shevchenko
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: WangYuli @ 2024-12-25 9:42 UTC (permalink / raw)
To: viro, brauner, jack
Cc: linux-fsdevel, linux-kernel, yushengjin, zhangdandan, guanwentao,
zhanjun, oliver.sang, ebiederm, colin.king, josh, penberg,
manfred, mingo, jes, hch, aia21, arjan, jgarzik, neukum, oliver,
dada1, axboe, axboe, nickpiggin, dhowells, nathans, rolandd,
tytso, bunk, pbadari, ak, ak, davem, jsipek, jens.axboe, ramsdell,
hch, torvalds, akpm, randy.dunlap, efault, rdunlap, haveblue,
drepper, dm.n9107, jblunck, davidel, mtk.manpages, linux-arch,
vda.linux, jmorris, serue, hca, rth, lethal, tony.luck,
heiko.carstens, oleg, andi, corbet, crquan, mszeredi, miklos,
peterz, a.p.zijlstra, earl_chew, npiggin, npiggin, julia, jaxboe,
nikai, dchinner, davej, npiggin, eric.dumazet, tim.c.chen, xemul,
tj, serge.hallyn, gorcunov, levinsasha928, penberg, amwang, bcrl,
muthu.lkml, muthur, mjt, alan, raven, thomas, will.deacon, will,
josef, anatol.pomozov, koverstreet, zab, balbi, gregkh, mfasheh,
jlbec, rusty, asamymuthupa, smani, sbradshaw, jmoyer, sim, ia,
dmonakhov, ebiggers3, socketpair, penguin-kernel, w,
kirill.shutemov, mhocko, vdavydov.dev, vdavydov, hannes, mhocko,
minchan, deepa.kernel, arnd, balbi, swhiteho, konishi.ryusuke,
dsterba, vegard.nossum, axboe, pombredanne, tglx, joe.lawrence,
mpatocka, mcgrof, keescook, linux, jannh, shakeelb, guro, willy,
khlebnikov, kirr, stern, elver, parri.andrea, paulmck, rasibley,
jstancek, avagin, cai, josef, hare, colyli, johannes, sspatil,
alex_y_xu, mgorman, gor, jhubbard, andriy.shevchenko, crope,
yzaikin, bfields, jlayton, kernel, steve, nixiaoming, 0x7f454c46,
kuniyu, alexander.h.duyck, kuni1840, soheil, sridhar.samudrala,
Vincenzo.Frascino, chuck.lever, Kevin.Brodsky, Szabolcs.Nagy,
David.Laight, Mark.Rutland, linux-morello, Luca.Vizzarro,
max.kellermann, adobriyan, lukas, j.granados, djwong,
kent.overstreet, linux, kstewart, WangYuli
When a user calls the read/write system call and passes a pipe
descriptor, the pipe_read/pipe_write functions are invoked:
1. pipe_read():
1). Checks if the pipe is valid and if there is any data in the
pipe buffer.
2). Waits for data:
*If there is no data in the pipe and the write end is still open,
the current process enters a sleep state (wait_event()) until data
is written.
*If the write end is closed, return 0.
3). Reads data:
*Wakes up the process and copies data from the pipe's memory
buffer to user space.
*When the buffer is full, the writing process will go to sleep,
waiting for the pipe state to change to be awakened (using the
wake_up_interruptible_sync_poll() mechanism). Once data is read
from the buffer, the writing process can continue writing, and the
reading process can continue reading new data.
4). Returns the number of bytes read upon successful read.
2. pipe_write():
1). Checks if the pipe is valid and if there is any available
space in the pipe buffer.
2). Waits for buffer space:
*If the pipe buffer is full and the reading process has not
read any data, pipe_write() may put the current process to sleep
until there is space in the buffer.
*If the read end of the pipe is closed (no process is waiting
to read), an error code -EPIPE is returned, and a SIGPIPE signal may
be sent to the process.
3). Writes data:
*If there is enough space in the pipe buffer, pipe_write() copies
data from the user space buffer to the kernel buffer of the pipe
(using copy_from_user()).
*If the amount of data the user requests to write is larger than
the available space in the buffer, multiple writes may be required,
or the process may wait for new space to be freed.
4). Wakes up waiting reading processes:
*After the data is successfully written, pipe_write() wakes up
any processes that may be waiting to read data (using the
wake_up_interruptible_sync_poll() mechanism).
5). Returns the number of bytes successfully written.
Check if there are any waiting processes in the process wait queue
by introducing wq_has_sleeper() when waking up processes for pipe
read/write operations.
If no processes are waiting, there's no need to execute
wake_up_interruptible_sync_poll(), thus avoiding unnecessary wake-ups.
Unnecessary wake-ups can lead to context switches, where a process
is woken up to handle I/O events even when there is no immediate
need.
Only wake up processes when there are actually waiting processes to
reduce context switches and system overhead by checking
with wq_has_sleeper().
Additionally, by reducing unnecessary synchronization and wake-up
operations, wq_has_sleeper() can decrease system resource waste and
lock contention, improving overall system performance.
For pipe read/write operations, this eliminates ineffective scheduling
and enhances concurrency.
It's important to note that enabling this option means invoking
wq_has_sleeper() to check for sleeping processes in the wait queue
for every read or write operation.
While this is a lightweight operation, it still incurs some overhead.
In low-load or single-task scenarios, this overhead may not yield
significant benefits and could even introduce minor performance
degradation.
UnixBench Pipe benchmark results on Zhaoxin KX-U6780A processor:
With the option disabled: Single-core: 841.8, Multi-core (8): 4621.6
With the option enabled: Single-core: 877.8, Multi-core (8): 4854.7
Single-core performance improved by 4.1%, multi-core performance
improved by 4.8%.
Co-developed-by: Shengjin Yu <yushengjin@uniontech.com>
Signed-off-by: Shengjin Yu <yushengjin@uniontech.com>
Co-developed-by: Dandan Zhang <zhangdandan@uniontech.com>
Signed-off-by: Dandan Zhang <zhangdandan@uniontech.com>
Tested-by: Dandan Zhang <zhangdandan@uniontech.com>
Signed-off-by: WangYuli <wangyuli@uniontech.com>
---
fs/Kconfig | 13 +++++++++++++
fs/pipe.c | 21 +++++++++++++++------
2 files changed, 28 insertions(+), 6 deletions(-)
diff --git a/fs/Kconfig b/fs/Kconfig
index 64d420e3c475..0dacc46a73fe 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -429,4 +429,17 @@ source "fs/unicode/Kconfig"
config IO_WQ
bool
+config PIPE_SKIP_SLEEPER
+ bool "Skip sleeping processes during pipe read/write"
+ default n
+ help
+ This option introduces a check whether the sleep queue will
+ be awakened during pipe read/write.
+
+ It often leads to a performance improvement. However, in
+ low-load or single-task scenarios, it may introduce minor
+ performance overhead.
+
+ If unsure, say N.
+
endmenu
diff --git a/fs/pipe.c b/fs/pipe.c
index 12b22c2723b7..c085333ae72c 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -247,6 +247,15 @@ static inline unsigned int pipe_update_tail(struct pipe_inode_info *pipe,
return tail;
}
+static inline bool
+pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
+{
+ if (IS_ENABLED(CONFIG_PIPE_SKIP_SLEEPER))
+ return wq_has_sleeper(wq_head);
+ else
+ return true;
+}
+
static ssize_t
pipe_read(struct kiocb *iocb, struct iov_iter *to)
{
@@ -377,7 +386,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
* _very_ unlikely case that the pipe was full, but we got
* no data.
*/
- if (unlikely(was_full))
+ if (unlikely(was_full) && pipe_check_wq_has_sleeper(&pipe->wr_wait))
wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
@@ -398,9 +407,9 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
wake_next_reader = false;
mutex_unlock(&pipe->mutex);
- if (was_full)
+ if (was_full && pipe_check_wq_has_sleeper(&pipe->wr_wait))
wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
- if (wake_next_reader)
+ if (wake_next_reader && pipe_check_wq_has_sleeper(&pipe->rd_wait))
wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
if (ret > 0)
@@ -573,7 +582,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
* become empty while we dropped the lock.
*/
mutex_unlock(&pipe->mutex);
- if (was_empty)
+ if (was_empty && pipe_check_wq_has_sleeper(&pipe->rd_wait))
wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
@@ -598,10 +607,10 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
* Epoll nonsensically wants a wakeup whether the pipe
* was already empty or not.
*/
- if (was_empty || pipe->poll_usage)
+ if ((was_empty || pipe->poll_usage) && pipe_check_wq_has_sleeper(&pipe->rd_wait))
wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
- if (wake_next_writer)
+ if (wake_next_writer && pipe_check_wq_has_sleeper(&pipe->wr_wait))
wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
int err = file_update_time(filp);
--
2.45.2
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-25 9:42 [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write WangYuli
@ 2024-12-25 13:30 ` Andy Shevchenko
2024-12-25 15:42 ` WangYuli
2024-12-26 16:00 ` Oleg Nesterov
2024-12-26 19:02 ` Linus Torvalds
2 siblings, 1 reply; 10+ messages in thread
From: Andy Shevchenko @ 2024-12-25 13:30 UTC (permalink / raw)
To: WangYuli
Cc: viro, brauner, jack, linux-fsdevel, linux-kernel, yushengjin,
zhangdandan, guanwentao, zhanjun, oliver.sang, ebiederm,
colin.king, josh, penberg, manfred, mingo, jes, hch, aia21, arjan,
jgarzik, neukum, oliver, dada1, axboe, axboe, nickpiggin,
dhowells, nathans, rolandd, tytso, bunk, pbadari, ak, ak, davem,
jsipek, jens.axboe, ramsdell, hch, torvalds, akpm, randy.dunlap,
efault, rdunlap, haveblue, drepper, dm.n9107, jblunck, davidel,
mtk.manpages, linux-arch, vda.linux, jmorris, serue, hca, rth,
lethal, tony.luck, heiko.carstens, oleg, andi, corbet, crquan,
mszeredi, miklos, peterz, a.p.zijlstra, earl_chew, npiggin,
npiggin, julia, jaxboe, nikai, dchinner, davej, npiggin,
eric.dumazet, tim.c.chen, xemul, tj, serge.hallyn, gorcunov,
levinsasha928, penberg, amwang, bcrl, muthu.lkml, muthur, mjt,
alan, raven, thomas, will.deacon, will, josef, anatol.pomozov,
koverstreet, zab, balbi, gregkh, mfasheh, jlbec, rusty,
asamymuthupa, smani, sbradshaw, jmoyer, sim, ia, dmonakhov,
ebiggers3, socketpair, penguin-kernel, w, kirill.shutemov, mhocko,
vdavydov.dev, vdavydov, hannes, mhocko, minchan, deepa.kernel,
arnd, balbi, swhiteho, konishi.ryusuke, dsterba, vegard.nossum,
axboe, pombredanne, tglx, joe.lawrence, mpatocka, mcgrof,
keescook, linux, jannh, shakeelb, guro, willy, khlebnikov, kirr,
stern, elver, parri.andrea, paulmck, rasibley, jstancek, avagin,
cai, josef, hare, colyli, johannes, sspatil, alex_y_xu, mgorman,
gor, jhubbard, crope, yzaikin, bfields, jlayton, kernel, steve,
nixiaoming, 0x7f454c46, kuniyu, alexander.h.duyck, kuni1840,
soheil, sridhar.samudrala, Vincenzo.Frascino, chuck.lever,
Kevin.Brodsky, Szabolcs.Nagy, David.Laight, Mark.Rutland,
linux-morello, Luca.Vizzarro, max.kellermann, adobriyan, lukas,
j.granados, djwong, kent.overstreet, linux, kstewart
Don't you think the Cc list is a bit overloaded?
On Wed, Dec 25, 2024 at 05:42:02PM +0800, WangYuli wrote:
> When a user calls the read/write system call and passes a pipe
> descriptor, the pipe_read/pipe_write functions are invoked:
>
> 1. pipe_read():
> 1). Checks if the pipe is valid and if there is any data in the
> pipe buffer.
> 2). Waits for data:
> *If there is no data in the pipe and the write end is still open,
> the current process enters a sleep state (wait_event()) until data
> is written.
> *If the write end is closed, return 0.
> 3). Reads data:
> *Wakes up the process and copies data from the pipe's memory
> buffer to user space.
> *When the buffer is full, the writing process will go to sleep,
> waiting for the pipe state to change to be awakened (using the
> wake_up_interruptible_sync_poll() mechanism). Once data is read
> from the buffer, the writing process can continue writing, and the
> reading process can continue reading new data.
> 4). Returns the number of bytes read upon successful read.
>
> 2. pipe_write():
> 1). Checks if the pipe is valid and if there is any available
> space in the pipe buffer.
> 2). Waits for buffer space:
> *If the pipe buffer is full and the reading process has not
> read any data, pipe_write() may put the current process to sleep
> until there is space in the buffer.
> *If the read end of the pipe is closed (no process is waiting
> to read), an error code -EPIPE is returned, and a SIGPIPE signal may
> be sent to the process.
> 3). Writes data:
> *If there is enough space in the pipe buffer, pipe_write() copies
> data from the user space buffer to the kernel buffer of the pipe
> (using copy_from_user()).
> *If the amount of data the user requests to write is larger than
> the available space in the buffer, multiple writes may be required,
> or the process may wait for new space to be freed.
> 4). Wakes up waiting reading processes:
> *After the data is successfully written, pipe_write() wakes up
> any processes that may be waiting to read data (using the
> wake_up_interruptible_sync_poll() mechanism).
> 5). Returns the number of bytes successfully written.
>
> Check if there are any waiting processes in the process wait queue
> by introducing wq_has_sleeper() when waking up processes for pipe
> read/write operations.
>
> If no processes are waiting, there's no need to execute
> wake_up_interruptible_sync_poll(), thus avoiding unnecessary wake-ups.
>
> Unnecessary wake-ups can lead to context switches, where a process
> is woken up to handle I/O events even when there is no immediate
> need.
>
> Only wake up processes when there are actually waiting processes to
> reduce context switches and system overhead by checking
> with wq_has_sleeper().
>
> Additionally, by reducing unnecessary synchronization and wake-up
> operations, wq_has_sleeper() can decrease system resource waste and
> lock contention, improving overall system performance.
>
> For pipe read/write operations, this eliminates ineffective scheduling
> and enhances concurrency.
>
> It's important to note that enabling this option means invoking
> wq_has_sleeper() to check for sleeping processes in the wait queue
> for every read or write operation.
>
> While this is a lightweight operation, it still incurs some overhead.
>
> In low-load or single-task scenarios, this overhead may not yield
> significant benefits and could even introduce minor performance
> degradation.
>
> UnixBench Pipe benchmark results on Zhaoxin KX-U6780A processor:
>
> With the option disabled: Single-core: 841.8, Multi-core (8): 4621.6
> With the option enabled: Single-core: 877.8, Multi-core (8): 4854.7
>
> Single-core performance improved by 4.1%, multi-core performance
> improved by 4.8%.
...
> +config PIPE_SKIP_SLEEPER
> + bool "Skip sleeping processes during pipe read/write"
> + default n
'n' is the default 'default', no need to have this line.
> + help
> + This option introduces a check whether the sleep queue will
> + be awakened during pipe read/write.
> +
> + It often leads to a performance improvement. However, in
> + low-load or single-task scenarios, it may introduce minor
> + performance overhead.
> + If unsure, say N.
Illogical, it's already N as you stated by putting a redundant line, but after
removing that line it will make sense.
...
> +static inline bool
Have you build this with Clang and `make W=1 ...`?
> +pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
> +{
> + if (IS_ENABLED(CONFIG_PIPE_SKIP_SLEEPER))
> + return wq_has_sleeper(wq_head);
> + else
Redundant.
> + return true;
if (!foo)
return true;
return bar(...);
> +}
--
With Best Regards,
Andy Shevchenko
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-25 13:30 ` Andy Shevchenko
@ 2024-12-25 15:42 ` WangYuli
0 siblings, 0 replies; 10+ messages in thread
From: WangYuli @ 2024-12-25 15:42 UTC (permalink / raw)
To: Andy Shevchenko, Kent Overstreet
Cc: viro, brauner, jack, linux-fsdevel, linux-kernel, yushengjin,
zhangdandan, guanwentao, zhanjun, oliver.sang, ebiederm,
colin.king, josh, penberg, manfred, mingo, jes, hch, aia21, arjan,
jgarzik, neukum, oliver, dada1, axboe, axboe, nickpiggin,
dhowells, nathans, rolandd, tytso, bunk, pbadari, ak, ak, davem,
jsipek, jens.axboe, ramsdell, hch, torvalds, akpm, randy.dunlap,
efault, rdunlap, haveblue, drepper, dm.n9107, jblunck, davidel,
mtk.manpages, linux-arch, vda.linux, jmorris, serue, hca, rth,
lethal, tony.luck, heiko.carstens, oleg, andi, corbet, crquan,
mszeredi, miklos, peterz, a.p.zijlstra, earl_chew, npiggin,
npiggin, julia, jaxboe, nikai, dchinner, davej, npiggin,
eric.dumazet, tim.c.chen, xemul, tj, serge.hallyn, gorcunov,
levinsasha928, penberg, amwang, bcrl, muthu.lkml, muthur, mjt,
alan, raven, thomas, will.deacon, will, josef, anatol.pomozov,
koverstreet, zab, balbi, gregkh, mfasheh, jlbec, rusty,
asamymuthupa, smani, sbradshaw, jmoyer, sim, ia, dmonakhov,
ebiggers3, socketpair, penguin-kernel, w, kirill.shutemov, mhocko,
vdavydov.dev, vdavydov, hannes, mhocko, minchan, deepa.kernel,
arnd, balbi, swhiteho, konishi.ryusuke, dsterba, vegard.nossum,
axboe, pombredanne, tglx, joe.lawrence, mpatocka, mcgrof,
keescook, linux, jannh, shakeelb, guro, willy, khlebnikov, kirr,
stern, elver, parri.andrea, paulmck, rasibley, jstancek, avagin,
cai, josef, hare, colyli, johannes, sspatil, alex_y_xu, mgorman,
gor, jhubbard, crope, yzaikin, bfields, jlayton, kernel, steve,
nixiaoming, 0x7f454c46, kuniyu, alexander.h.duyck, kuni1840,
soheil, sridhar.samudrala, Vincenzo.Frascino, chuck.lever,
Kevin.Brodsky, Szabolcs.Nagy, David.Laight, Mark.Rutland,
linux-morello, Luca.Vizzarro, max.kellermann, adobriyan, lukas,
j.granados, djwong, kent.overstreet, linux, kstewart
[-- Attachment #1.1.1: Type: text/plain, Size: 5373 bytes --]
On 2024/12/25 21:30, Andy Shevchenko wrote:
> Don't you think the Cc list is a bit overloaded?
Hi,
I apologize for any inconvenience this may cause.
I understand that under normal circumstances, one would simply pass the
modified code path as an argument to the kernel's
scripts/get_maintainer.pl script to determine the appropriate recipients.
However, given the vast and complex nature of the Linux kernel
community, with tens of thousands of developers worldwide, and
considering the varying "customs" of different subsystems, as well as
time zone differences and individual work habits, it's not uncommon for
patches to be sent to mailing lists and subsequently ignored or left
pending.
This patch, for example, has been submitted multiple times without
receiving any response, unfortunately.
My intention is simply to seek your review, and that of other technical
experts like yourself, but I cannot be certain, prior to your response,
which specific experts on which lists would be willing to provide feedback.
I would appreciate any other suggestions you may have.
>> UnixBench Pipe benchmark results on Zhaoxin KX-U6780A processor:
>>
>> With the option disabled: Single-core: 841.8, Multi-core (8): 4621.6
>> With the option enabled: Single-core: 877.8, Multi-core (8): 4854.7
>>
>> Single-core performance improved by 4.1%, multi-core performance
>> improved by 4.8%.
> ...
As you know, the kernel is extremely sensitive to performance.
Even a 1% performance improvement can lead to significant efficiency
gains and reduced carbon emissions in production environments, as long
as there is sufficient testing and theoretical analysis to prove that
the improvement is real and not due to measurement error or jitter.
>> +config PIPE_SKIP_SLEEPER
>> + bool "Skip sleeping processes during pipe read/write"
>> + default n
> 'n' is the default 'default', no need to have this line.
OK, I'll drop it. Thanks.
>
>> + help
>> + This option introduces a check whether the sleep queue will
>> + be awakened during pipe read/write.
>> +
>> + It often leads to a performance improvement. However, in
>> + low-load or single-task scenarios, it may introduce minor
>> + performance overhead.
>> + If unsure, say N.
> Illogical, it's already N as you stated by putting a redundant line, but after
> removing that line it will make sense.
>
> ...
As noted, I'll remove "default n" as it serves no purpose.
>
>> +static inline bool
> Have you build this with Clang and `make W=1 ...`?
Hmm...I've noticed a discrepancy in kernel compilation results with and
without "make W=1".
When I use x86_64_defconfig and clang-19.1.1 (Ubuntu 24.10) and run
"make", there are no warnings.
However, when I run "make W=1", the kernel generates a massive number of
errors, causing the compilation to fail prematurely.
e.g.
In file included from arch/x86/kernel/asm-offsets.c:14:
In file included from ./include/linux/suspend.h:5:
In file included from ./include/linux/swap.h:9:
In file included from ./include/linux/memcontrol.h:21:
In file included from ./include/linux/mm.h:2224:
./include/linux/vmstat.h:504:43: error: arithmetic between different
enumeration types ('enum zone_stat_item' and 'enum numa_stat_item')
[-Werror,-Wenum-enum-conversion]
504 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
| ~~~~~~~~~~~~~~~~~~~~~ ^
505 | item];
| ~~~~
./include/linux/vmstat.h:511:43: error: arithmetic between different
enumeration types ('enum zone_stat_item' and 'enum numa_stat_item')
[-Werror,-Wenum-enum-conversion]
511 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
| ~~~~~~~~~~~~~~~~~~~~~ ^
512 | NR_VM_NUMA_EVENT_ITEMS +
| ~~~~~~~~~~~~~~~~~~~~~~
./include/linux/vmstat.h:524:43: error: arithmetic between different
enumeration types ('enum zone_stat_item' and 'enum numa_stat_item')
[-Werror,-Wenum-enum-conversion]
524 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
| ~~~~~~~~~~~~~~~~~~~~~ ^
525 | NR_VM_NUMA_EVENT_ITEMS +
| ~~~~~~~~~~~~~~~~~~~~~~
3 errors generated.
And I've observed similar behavior with gcc-14.2.0.
While I'm keen on addressing as many potential compile errors and
warnings in the kernel as possible, it seems like a long-term endeavor.
Regarding this specific code, I'd appreciate your insights on how to
improve it.
>
>> +pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
>> +{
>> + if (IS_ENABLED(CONFIG_PIPE_SKIP_SLEEPER))
>> + return wq_has_sleeper(wq_head);
>> + else
> Redundant.
>
>> + return true;
> if (!foo)
> return true;
>
> return bar(...);
>
>> +}
Yes. I'll rework the code structure here. Thanks.
Thanks,
--
WangYuli
[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 645 bytes --]
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 236 bytes --]
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-25 9:42 [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write WangYuli
2024-12-25 13:30 ` Andy Shevchenko
@ 2024-12-26 16:00 ` Oleg Nesterov
2024-12-26 19:02 ` Linus Torvalds
2 siblings, 0 replies; 10+ messages in thread
From: Oleg Nesterov @ 2024-12-26 16:00 UTC (permalink / raw)
To: WangYuli
Cc: viro, brauner, jack, linux-fsdevel, linux-kernel, yushengjin,
zhangdandan, guanwentao, zhanjun, oliver.sang, ebiederm,
colin.king, josh, penberg, manfred, mingo, jes, hch, aia21, arjan,
jgarzik, neukum, oliver, dada1, axboe, axboe, nickpiggin,
dhowells, nathans, rolandd, tytso, bunk, pbadari, ak, ak, davem,
jsipek, jens.axboe, ramsdell, hch, torvalds, akpm, randy.dunlap,
efault, rdunlap, haveblue, drepper, dm.n9107, jblunck, davidel,
mtk.manpages, linux-arch, vda.linux, jmorris, serue, hca, rth,
lethal, tony.luck, heiko.carstens, andi, corbet, crquan, mszeredi,
miklos, peterz, a.p.zijlstra, earl_chew, npiggin, npiggin, julia,
jaxboe, nikai, dchinner, davej, npiggin, eric.dumazet, tim.c.chen,
xemul, tj, serge.hallyn, gorcunov, levinsasha928, penberg, amwang,
bcrl, muthu.lkml, muthur, mjt, alan, raven, thomas, will.deacon,
will, josef, anatol.pomozov, koverstreet, zab, balbi, gregkh,
mfasheh, jlbec, rusty, asamymuthupa, smani, sbradshaw, jmoyer,
sim, ia, dmonakhov, ebiggers3, socketpair, penguin-kernel, w,
kirill.shutemov, mhocko, vdavydov.dev, vdavydov, hannes, mhocko,
minchan, deepa.kernel, arnd, balbi, swhiteho, konishi.ryusuke,
dsterba, vegard.nossum, axboe, pombredanne, tglx, joe.lawrence,
mpatocka, mcgrof, keescook, linux, jannh, shakeelb, guro, willy,
khlebnikov, kirr, stern, elver, parri.andrea, paulmck, rasibley,
jstancek, avagin, cai, josef, hare, colyli, johannes, sspatil,
alex_y_xu, mgorman, gor, jhubbard, andriy.shevchenko, crope,
yzaikin, bfields, jlayton, kernel, steve, nixiaoming, 0x7f454c46,
kuniyu, alexander.h.duyck, kuni1840, soheil, sridhar.samudrala,
Vincenzo.Frascino, chuck.lever, Kevin.Brodsky, Szabolcs.Nagy,
David.Laight, Mark.Rutland, linux-morello, Luca.Vizzarro,
max.kellermann, adobriyan, lukas, j.granados, djwong,
kent.overstreet, linux, kstewart
Quite possibly I missed something, but I have some concerns after
a quick glance...
On 12/25, WangYuli wrote:
>
> +config PIPE_SKIP_SLEEPER
> + bool "Skip sleeping processes during pipe read/write"
> + default n
> + help
> + This option introduces a check whether the sleep queue will
> + be awakened during pipe read/write.
> +
> + It often leads to a performance improvement. However, in
> + low-load or single-task scenarios, it may introduce minor
> + performance overhead.
> +
> + If unsure, say N.
> +
Well, IMO the new config option should be avoided for this optimization.
> +static inline bool
> +pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
> +{
> + if (IS_ENABLED(CONFIG_PIPE_SKIP_SLEEPER))
> + return wq_has_sleeper(wq_head);
> + else
> + return true;
> +}
I think that another helper makes more sense:
pipe_wake_xxx(wait_queue_head_t wait, flags)
{
if (wq_has_sleeper(wait))
wake_up_interruptible_sync_poll(wait, flags);
}
-------------------------------------------------------------------------------
But either way I am not sure this optimization is 100% correct, see below.
> @@ -377,7 +386,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
> * _very_ unlikely case that the pipe was full, but we got
> * no data.
> */
> - if (unlikely(was_full))
> + if (unlikely(was_full) && pipe_check_wq_has_sleeper(&pipe->wr_wait))
> wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
OK, at first glance this can't race with wait_event_xxx(wr_wait, pipe_writable),
wq_has_sleeper() and prepare_to_wait_event() have the necessary barriers.
But what about pipe_poll() ?
To oversimplify, pipe_poll(FMODE_WRITE) does
// poll_wait()
__pollwait() -> add_wait_queue(pipe->wr_wait) -> list_add()
check pipe_full()
and I don't see the (in theory) necessary barrier between list_add()
and LOAD(pipe->head/tail).
Oleg.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-25 9:42 [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write WangYuli
2024-12-25 13:30 ` Andy Shevchenko
2024-12-26 16:00 ` Oleg Nesterov
@ 2024-12-26 19:02 ` Linus Torvalds
2024-12-26 20:11 ` Oleg Nesterov
2 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2024-12-26 19:02 UTC (permalink / raw)
To: WangYuli
Cc: viro, brauner, jack, linux-fsdevel, linux-kernel, yushengjin,
zhangdandan, guanwentao, zhanjun, oliver.sang, ebiederm,
colin.king, josh, penberg, manfred, mingo, jes, hch, aia21, arjan,
jgarzik, neukum, oliver, dada1, axboe, axboe, nickpiggin,
dhowells, nathans, rolandd, tytso, bunk, pbadari, ak, ak, davem,
jsipek, jens.axboe, ramsdell, hch, akpm, randy.dunlap, efault,
rdunlap, haveblue, drepper, dm.n9107, jblunck, davidel,
mtk.manpages, linux-arch, vda.linux, jmorris, serue, hca, rth,
lethal, tony.luck, heiko.carstens, oleg, andi, corbet, crquan,
mszeredi, miklos, peterz, a.p.zijlstra, earl_chew, npiggin,
npiggin, julia, jaxboe, nikai, dchinner, davej, npiggin,
eric.dumazet, tim.c.chen, xemul, tj, serge.hallyn, gorcunov,
levinsasha928, penberg, amwang, bcrl, muthu.lkml, muthur, mjt,
alan, raven, thomas, will.deacon, will, josef, anatol.pomozov,
koverstreet, zab, balbi, gregkh, mfasheh, jlbec, rusty,
asamymuthupa, smani, sbradshaw, jmoyer, sim, ia, dmonakhov,
ebiggers3, socketpair, penguin-kernel, w, kirill.shutemov, mhocko,
vdavydov.dev, vdavydov, hannes, mhocko, minchan, deepa.kernel,
arnd, balbi, swhiteho, konishi.ryusuke, dsterba, vegard.nossum,
axboe, pombredanne, tglx, joe.lawrence, mpatocka, mcgrof,
keescook, linux, jannh, shakeelb, guro, willy, khlebnikov, kirr,
stern, elver, parri.andrea, paulmck, rasibley, jstancek, avagin,
cai, josef, hare, colyli, johannes, sspatil, alex_y_xu, mgorman,
gor, jhubbard, andriy.shevchenko, crope, yzaikin, bfields,
jlayton, kernel, steve, nixiaoming, 0x7f454c46, kuniyu,
alexander.h.duyck, kuni1840, soheil, sridhar.samudrala,
Vincenzo.Frascino, chuck.lever, Kevin.Brodsky, Szabolcs.Nagy,
David.Laight, Mark.Rutland, linux-morello, Luca.Vizzarro,
max.kellermann, adobriyan, lukas, j.granados, djwong,
kent.overstreet, linux, kstewart
On Wed, 25 Dec 2024 at 01:44, WangYuli <wangyuli@uniontech.com> wrote:
>
>
> +config PIPE_SKIP_SLEEPER
> + bool "Skip sleeping processes during pipe read/write"
If the optimization is correct, there is no point to having a config option.
If the optimization is incorrect, there is no point to having the code.
Either way, there's no way we'd ever have a config option for this.
> +pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
> +{
> + return wq_has_sleeper(wq_head);
So generally, the reason this is buggy is that it's racy due to either
CPU memory ordering issues or simply because the sleeper is going to
sleep but hasn't *quite* added itself to the wait queue.
Which is why the wakeup path takes the wq head lock.
Which is the only thing you are actually optimizing away.
> + if (was_full && pipe_check_wq_has_sleeper(&pipe->wr_wait))
> wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
End result: you need to explain why the race cannot exist.
And I think your patch is simply buggy because the race can in fact
exist, because there is no other lock than the waitqueue lock that
enforces memory ordering and protects against the "somebody is just
about to sleep" race.
That said, *if* all the wait queue work was done under the pipe mutex,
we could use the pipe mutex for exclusion instead.
That's actually how it kind of used to work long long ago (not really
- but close: it depended on the BKL originally, and adding waiters to
the wait-queues early before all the tests for full/empty, and
avoiding the races that way)
But now waiters use "wait_event_interruptible_exclusive()" explicitly
outside the pipe mutex, so the waiters and wakers aren't actually
serialized.
And no, you are unlikely to see the races in practice (particularly
the memory ordering ones). So you have to *think* about them, not test
them.
Linus
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-26 19:02 ` Linus Torvalds
@ 2024-12-26 20:11 ` Oleg Nesterov
2024-12-27 18:39 ` Manfred Spraul
0 siblings, 1 reply; 10+ messages in thread
From: Oleg Nesterov @ 2024-12-26 20:11 UTC (permalink / raw)
To: Linus Torvalds
Cc: WangYuli, viro, brauner, jack, linux-fsdevel, linux-kernel,
yushengjin, zhangdandan, guanwentao, zhanjun, oliver.sang,
ebiederm, colin.king, josh, penberg, manfred, mingo, jes, hch,
aia21, arjan, jgarzik, neukum, oliver, dada1, axboe, axboe,
nickpiggin, dhowells, nathans, rolandd, tytso, bunk, pbadari, ak,
ak, davem, jsipek, jens.axboe, ramsdell, hch, akpm, randy.dunlap,
efault, rdunlap, haveblue, drepper, dm.n9107, jblunck, davidel,
mtk.manpages, linux-arch, vda.linux, jmorris, serue, hca, rth,
lethal, tony.luck, heiko.carstens, andi, corbet, crquan, mszeredi,
miklos, peterz, a.p.zijlstra, earl_chew, npiggin, npiggin, julia,
jaxboe, nikai, dchinner, davej, npiggin, eric.dumazet, tim.c.chen,
xemul, tj, serge.hallyn, gorcunov, levinsasha928, penberg, amwang,
bcrl, muthu.lkml, muthur, mjt, alan, raven, thomas, will.deacon,
will, josef, anatol.pomozov, koverstreet, zab, balbi, gregkh,
mfasheh, jlbec, rusty, asamymuthupa, smani, sbradshaw, jmoyer,
sim, ia, dmonakhov, ebiggers3, socketpair, penguin-kernel, w,
kirill.shutemov, mhocko, vdavydov.dev, vdavydov, hannes, mhocko,
minchan, deepa.kernel, arnd, balbi, swhiteho, konishi.ryusuke,
dsterba, vegard.nossum, axboe, pombredanne, tglx, joe.lawrence,
mpatocka, mcgrof, keescook, linux, jannh, shakeelb, guro, willy,
khlebnikov, kirr, stern, elver, parri.andrea, paulmck, rasibley,
jstancek, avagin, cai, josef, hare, colyli, johannes, sspatil,
alex_y_xu, mgorman, gor, jhubbard, andriy.shevchenko, crope,
yzaikin, bfields, jlayton, kernel, steve, nixiaoming, 0x7f454c46,
kuniyu, alexander.h.duyck, kuni1840, soheil, sridhar.samudrala,
Vincenzo.Frascino, chuck.lever, Kevin.Brodsky, Szabolcs.Nagy,
David.Laight, Mark.Rutland, linux-morello, Luca.Vizzarro,
max.kellermann, adobriyan, lukas, j.granados, djwong,
kent.overstreet, linux, kstewart
I mostly agree, see my reply to this patch, but...
On 12/26, Linus Torvalds wrote:
>
> If the optimization is correct, there is no point to having a config option.
>
> If the optimization is incorrect, there is no point to having the code.
>
> Either way, there's no way we'd ever have a config option for this.
Agreed,
> > + if (was_full && pipe_check_wq_has_sleeper(&pipe->wr_wait))
> > wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
>
> End result: you need to explain why the race cannot exist.
Agreed,
> And I think your patch is simply buggy
Agreed again, but probably my reasoning was wrong,
> But now waiters use "wait_event_interruptible_exclusive()" explicitly
> outside the pipe mutex, so the waiters and wakers aren't actually
> serialized.
This is what I don't understand... Could you spell ?
I _think_ that
wait_event_whatever(WQ, CONDITION);
vs
CONDITION = 1;
if (wq_has_sleeper(WQ))
wake_up_xxx(WQ, ...);
is fine.
Both wq_has_sleeper() and wait_event_whatever()->prepare_to_wait_event()
have the necessary barriers to serialize the waiters and wakers.
Damn I am sure I missed something ;)
> And no, you are unlikely to see the races in practice (particularly
> the memory ordering ones). So you have to *think* about them, not test
> them.
Yes! agreed again.
Oleg.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-26 20:11 ` Oleg Nesterov
@ 2024-12-27 18:39 ` Manfred Spraul
2024-12-28 14:32 ` Oleg Nesterov
0 siblings, 1 reply; 10+ messages in thread
From: Manfred Spraul @ 2024-12-27 18:39 UTC (permalink / raw)
To: Oleg Nesterov, Linus Torvalds
Cc: viro, brauner, jack, linux-fsdevel, linux-kernel, oliver.sang,
ebiederm, colin.king, josh, penberg, mingo, jes, hch, aia21,
arjan, jgarzik, neukum, oliver, dada1, axboe, axboe, nickpiggin,
dhowells, nathans, rolandd, tytso, bunk, pbadari, ak, ak, davem,
jsipek, jens.axboe, ramsdell, hch, akpm, randy.dunlap, efault,
rdunlap, haveblue, drepper, dm.n9107, jblunck, davidel,
mtk.manpages, linux-arch, vda.linux, jmorris, serue, hca, rth,
lethal, tony.luck, heiko.carstens, andi, corbet, crquan, mszeredi,
miklos, peterz, a.p.zijlstra, earl_chew, npiggin, npiggin, julia,
jaxboe, nikai, dchinner, davej, npiggin, eric.dumazet, tim.c.chen,
xemul, tj, serge.hallyn, gorcunov, bcrl, alan, will.deacon, will,
zab, balbi, gregkh, rusty, socketpair, penguin-kernel, mhocko,
axboe, tglx, mcgrof, linux, willy, paulmck, kernel, linux-morello
[-- Attachment #1: Type: text/plain, Size: 3814 bytes --]
Hi,
(I had to remove many cc, my mail server rejected to send)
On 12/26/24 9:11 PM, Oleg Nesterov wrote:
> I mostly agree, see my reply to this patch, but...
>
> On 12/26, Linus Torvalds wrote:
>> If the optimization is correct, there is no point to having a config option.
>>
>> If the optimization is incorrect, there is no point to having the code.
>>
>> Either way, there's no way we'd ever have a config option for this.
> Agreed,
>
>>> + if (was_full && pipe_check_wq_has_sleeper(&pipe->wr_wait))
>>> wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
>> End result: you need to explain why the race cannot exist.
> Agreed,
>
>> And I think your patch is simply buggy
> Agreed again, but probably my reasoning was wrong,
>
>> But now waiters use "wait_event_interruptible_exclusive()" explicitly
>> outside the pipe mutex, so the waiters and wakers aren't actually
>> serialized.
> This is what I don't understand... Could you spell ?
> I _think_ that
>
> wait_event_whatever(WQ, CONDITION);
> vs
>
> CONDITION = 1;
> if (wq_has_sleeper(WQ))
> wake_up_xxx(WQ, ...);
>
> is fine.
This pattern is documented in wait.h:
https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/wait.h#L96
Thus if there an issue, then the documentation should be updated.
> Both wq_has_sleeper() and wait_event_whatever()->prepare_to_wait_event()
> have the necessary barriers to serialize the waiters and wakers.
>
> Damn I am sure I missed something ;)
Actually:
Does this work universally? I.e. can we add the optimization into
__wake_up()?
But I do not understand this comment (from 2.6.0)
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/kernel/fork.c?h=v2.6.0&id=e220fdf7a39b54a758f4102bdd9d0d5706aa32a7
> /* * Note: we use "set_current_state()" _after_ the wait-queue add, *
> because we need a memory barrier there on SMP, so that any *
> wake-function that tests for the wait-queue being active * will be
> guaranteed to see waitqueue addition _or_ subsequent * tests in this
> thread will see the wakeup having taken place. * * The spin_unlock()
> itself is semi-permeable and only protects * one way (it only protects
> stuff inside the critical region and * stops them from bleeding out -
> it would still allow subsequent * loads to move into the the critical
> region). */ voidprepare_to_wait
> <https://elixir.bootlin.com/linux/v2.6.0/C/ident/prepare_to_wait>(wait_queue_head_t
> <https://elixir.bootlin.com/linux/v2.6.0/C/ident/wait_queue_head_t>*q,wait_queue_t
> <https://elixir.bootlin.com/linux/v2.6.0/C/ident/wait_queue_t>*wait
> <https://elixir.bootlin.com/linux/v2.6.0/C/ident/wait>,intstate) {
> unsignedlongflags; wait
> <https://elixir.bootlin.com/linux/v2.6.0/C/ident/wait>->flags&=~WQ_FLAG_EXCLUSIVE
> <https://elixir.bootlin.com/linux/v2.6.0/C/ident/WQ_FLAG_EXCLUSIVE>;
> spin_lock_irqsave
> <https://elixir.bootlin.com/linux/v2.6.0/C/ident/spin_lock_irqsave>(&q->lock,flags);
> if(list_empty
> <https://elixir.bootlin.com/linux/v2.6.0/C/ident/list_empty>(&wait
> <https://elixir.bootlin.com/linux/v2.6.0/C/ident/wait>->task_list
> <https://elixir.bootlin.com/linux/v2.6.0/C/ident/task_list>))
> __add_wait_queue
> <https://elixir.bootlin.com/linux/v2.6.0/C/ident/__add_wait_queue>(q,wait
> <https://elixir.bootlin.com/linux/v2.6.0/C/ident/wait>);
> set_current_state
> <https://elixir.bootlin.com/linux/v2.6.0/C/ident/set_current_state>(state);
> spin_unlock_irqrestore
> <https://elixir.bootlin.com/linux/v2.6.0/C/ident/spin_unlock_irqrestore>(&q->lock,flags);
> }
set_current_state() now uses smp_store_mb(), which is a memory barrier
_after_ the store. Thus I do not see what enforces that the store
happens before the store for the __add_wait_queue().
--
Manfred
[-- Attachment #2: 0001-__wake_up_common_lock-Optimize-for-empty-wake-queues.patch --]
[-- Type: text/x-patch, Size: 2525 bytes --]
From 7cb8f06ab022e7fc36bacbe65f654822147ec1aa Mon Sep 17 00:00:00 2001
From: Manfred Spraul <manfred@colorfullife.com>
Date: Fri, 27 Dec 2024 18:21:41 +0100
Subject: [PATCH] __wake_up_common_lock: Optimize for empty wake queues
wake_up() must wake up every task that is in the wait queue.
But: If there is concurrency between wake_up() and add_wait_queue(),
then (as long as we are not violating the memory ordering rules) we
can just claim that wake_up() happened "first" and add_wait_queue()
happened afterwards.
From memory ordering perspective:
- If the wait_queue() is empty, then wake_up() just does
spin_lock();
list_empty()
spin_unlock();
- add_wait_queue()/prepare_to_wait() do all kind of operations,
but they may become visible only when the spin_unlock_irqrestore()
is done.
Thus, instead of pairing the memory barrier to the spinlock, and
thus writing to a potentially shared cacheline, we load-acquire
the next pointer from the list.
Risks and side effects:
- The guaranteed memory barrier of wake_up() is reduced to load_acquire.
Previously, there was always a spin_lock()/spin_unlock() pair.
- prepare_to_wait() actually does two operations under spinlock:
It adds current to the wait queue, and it calls set_current_state().
The comment above prepare_to_wait() is not clear to me, thus there
might be further side effects.
Only lightly tested.
No benchmark test done.
---
kernel/sched/wait.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 51e38f5f4701..10d02f652ab8 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -124,6 +124,23 @@ static int __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int m
int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
int nr_exclusive, void *key)
{
+ if (list_empty(&wq_head->head)) {
+ struct list_head *pn;
+
+ /*
+ * pairs with spin_unlock_irqrestore(&wq_head->lock);
+ * We actually do not need to acquire wq_head->lock, we just
+ * need to be sure that there is no prepare_to_wait() that
+ * completed on any CPU before __wake_up was called.
+ * Thus instead of load_acquiring the spinlock and dropping
+ * it again, we load_acquire the next list entry and check
+ * that the list is not empty.
+ */
+ pn = smp_load_acquire(&wq_head->head.next);
+
+ if(pn == &wq_head->head)
+ return 0;
+ }
return __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);
}
EXPORT_SYMBOL(__wake_up);
--
2.47.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-27 18:39 ` Manfred Spraul
@ 2024-12-28 14:32 ` Oleg Nesterov
2024-12-28 15:22 ` Oleg Nesterov
0 siblings, 1 reply; 10+ messages in thread
From: Oleg Nesterov @ 2024-12-28 14:32 UTC (permalink / raw)
To: Manfred Spraul
Cc: Linus Torvalds, viro, brauner, jack, linux-fsdevel, linux-kernel,
oliver.sang, ebiederm, colin.king, josh, penberg, mingo, jes, hch,
aia21, arjan, jgarzik, neukum, oliver, dada1, axboe, axboe,
nickpiggin, dhowells, nathans, rolandd, tytso, bunk, pbadari, ak,
ak, davem, jsipek, jens.axboe, ramsdell, hch, akpm, randy.dunlap,
efault, rdunlap, haveblue, drepper, dm.n9107, jblunck, davidel,
mtk.manpages, linux-arch, vda.linux, jmorris, serue, hca, rth,
lethal, tony.luck, heiko.carstens, andi, corbet, crquan, mszeredi,
miklos, peterz, a.p.zijlstra, earl_chew, npiggin, npiggin, julia,
jaxboe, nikai, dchinner, davej, npiggin, eric.dumazet, tim.c.chen,
xemul, tj, serge.hallyn, gorcunov, bcrl, alan, will.deacon, will,
zab, balbi, gregkh, rusty, socketpair, penguin-kernel, mhocko,
axboe, tglx, mcgrof, linux, willy, paulmck, kernel, linux-morello
On 12/27, Manfred Spraul wrote:
>
> >I _think_ that
> >
> > wait_event_whatever(WQ, CONDITION);
> >vs
> >
> > CONDITION = 1;
> > if (wq_has_sleeper(WQ))
> > wake_up_xxx(WQ, ...);
> >
> >is fine.
>
> This pattern is documented in wait.h:
>
> https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/wait.h#L96
>
> Thus if there an issue, then the documentation should be updated.
Agreed, basically the same pattern, prepare_to_wait_event() is similar
to prepare_to_wait().
> But I do not understand this comment (from 2.6.0)
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/kernel/fork.c?h=v2.6.0&id=e220fdf7a39b54a758f4102bdd9d0d5706aa32a7
>
> >/* * Note: we use "set_current_state()" _after_ the wait-queue add, *
> >because we need a memory barrier there on SMP, so that any * wake-function
> >that tests for the wait-queue being active * will be guaranteed to see
> >waitqueue addition _or_ subsequent * tests in this thread will see the
> >wakeup having taken place. * * The spin_unlock() itself is semi-permeable
> >and only protects * one way (it only protects stuff inside the critical
> >region and * stops them from bleeding out - it would still allow
> >subsequent * loads to move into the the critical region). */
...
> set_current_state() now uses smp_store_mb(), which is a memory barrier
> _after_ the store.
And afaics this is what we actually need.
> Thus I do not see what enforces that the store happens
> before the store for the __add_wait_queue().
IIUC this is fine, no need to serialize list_add() and STORE(tsk->__state),
they can be reordered.
But we need mb() between __add_wait_queue + __set_current_state (in any
order) and the subsequent "if (CONDITION)" check.
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -124,6 +124,23 @@ static int __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int m
> int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
> int nr_exclusive, void *key)
> {
> + if (list_empty(&wq_head->head)) {
> + struct list_head *pn;
> +
> + /*
> + * pairs with spin_unlock_irqrestore(&wq_head->lock);
> + * We actually do not need to acquire wq_head->lock, we just
> + * need to be sure that there is no prepare_to_wait() that
> + * completed on any CPU before __wake_up was called.
> + * Thus instead of load_acquiring the spinlock and dropping
> + * it again, we load_acquire the next list entry and check
> + * that the list is not empty.
> + */
> + pn = smp_load_acquire(&wq_head->head.next);
> +
> + if(pn == &wq_head->head)
> + return 0;
> + }
Too subtle for me ;)
I have some concerns, but I need to think a bit more to (try to) actually
understand this change.
Oleg.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-28 14:32 ` Oleg Nesterov
@ 2024-12-28 15:22 ` Oleg Nesterov
2024-12-28 16:32 ` Oleg Nesterov
0 siblings, 1 reply; 10+ messages in thread
From: Oleg Nesterov @ 2024-12-28 15:22 UTC (permalink / raw)
To: Manfred Spraul
Cc: Linus Torvalds, viro, brauner, jack, linux-fsdevel, linux-kernel,
oliver.sang, ebiederm, colin.king, josh, penberg, mingo, jes, hch,
aia21, arjan, jgarzik, neukum, oliver, dada1, axboe, axboe,
nickpiggin, dhowells, nathans, rolandd, tytso, bunk, pbadari, ak,
ak, davem, jsipek, jens.axboe, ramsdell, hch, akpm, randy.dunlap,
efault, rdunlap, haveblue, drepper, dm.n9107, jblunck, davidel,
mtk.manpages, linux-arch, vda.linux, jmorris, serue, hca, rth,
lethal, tony.luck, heiko.carstens, andi, corbet, crquan, mszeredi,
miklos, peterz, a.p.zijlstra, earl_chew, npiggin, npiggin, julia,
jaxboe, nikai, dchinner, davej, npiggin, eric.dumazet, tim.c.chen,
xemul, tj, serge.hallyn, gorcunov, bcrl, alan, will.deacon, will,
zab, balbi, gregkh, rusty, socketpair, penguin-kernel, mhocko,
axboe, tglx, mcgrof, linux, willy, paulmck, kernel, linux-morello
On 12/28, Oleg Nesterov wrote:
>
> > int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
> > int nr_exclusive, void *key)
> > {
> > + if (list_empty(&wq_head->head)) {
> > + struct list_head *pn;
> > +
> > + /*
> > + * pairs with spin_unlock_irqrestore(&wq_head->lock);
> > + * We actually do not need to acquire wq_head->lock, we just
> > + * need to be sure that there is no prepare_to_wait() that
> > + * completed on any CPU before __wake_up was called.
> > + * Thus instead of load_acquiring the spinlock and dropping
> > + * it again, we load_acquire the next list entry and check
> > + * that the list is not empty.
> > + */
> > + pn = smp_load_acquire(&wq_head->head.next);
> > +
> > + if(pn == &wq_head->head)
> > + return 0;
> > + }
>
> Too subtle for me ;)
>
> I have some concerns, but I need to think a bit more to (try to) actually
> understand this change.
If nothing else, consider
int CONDITION;
wait_queue_head_t WQ;
void wake(void)
{
CONDITION = 1;
wake_up(WQ);
}
void wait(void)
{
DEFINE_WAIT_FUNC(entry, woken_wake_function);
add_wait_queue(WQ, entry);
if (!CONDITION)
wait_woken(entry, ...);
remove_wait_queue(WQ, entry);
}
this code is correct even if LOAD(CONDITION) can leak into the critical
section in add_wait_queue(), so CPU running wait() can actually do
// add_wait_queue
spin_lock(WQ->lock);
LOAD(CONDITION); // false!
list_add(entry, head);
spin_unlock(WQ->lock);
if (!false) // result of the LOAD above
wait_woken(entry, ...);
Now suppose that another CPU executes wake() between LOAD(CONDITION)
and list_add(entry, head). With your patch wait() will miss the event.
The same for __pollwait(), I think...
No?
Oleg.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
2024-12-28 15:22 ` Oleg Nesterov
@ 2024-12-28 16:32 ` Oleg Nesterov
0 siblings, 0 replies; 10+ messages in thread
From: Oleg Nesterov @ 2024-12-28 16:32 UTC (permalink / raw)
To: Manfred Spraul
Cc: Linus Torvalds, viro, brauner, jack, linux-fsdevel, linux-kernel,
oliver.sang, ebiederm, colin.king, josh, penberg, mingo, jes, hch,
aia21, arjan, jgarzik, neukum, oliver, dada1, axboe, axboe,
nickpiggin, dhowells, nathans, rolandd, tytso, bunk, pbadari, ak,
ak, davem, jsipek, jens.axboe, ramsdell, hch, akpm, randy.dunlap,
efault, rdunlap, haveblue, drepper, dm.n9107, jblunck, davidel,
mtk.manpages, linux-arch, vda.linux, jmorris, serue, hca, rth,
lethal, tony.luck, heiko.carstens, andi, corbet, crquan, mszeredi,
miklos, peterz, a.p.zijlstra, earl_chew, npiggin, npiggin, julia,
jaxboe, nikai, dchinner, davej, npiggin, eric.dumazet, tim.c.chen,
xemul, tj, serge.hallyn, gorcunov, bcrl, alan, will.deacon, will,
zab, balbi, gregkh, rusty, socketpair, penguin-kernel, mhocko,
axboe, tglx, mcgrof, linux, willy, paulmck, kernel, linux-morello
On 12/28, Oleg Nesterov wrote:
>
> If nothing else, consider
>
> int CONDITION;
> wait_queue_head_t WQ;
>
> void wake(void)
> {
> CONDITION = 1;
> wake_up(WQ);
> }
>
> void wait(void)
> {
> DEFINE_WAIT_FUNC(entry, woken_wake_function);
>
> add_wait_queue(WQ, entry);
> if (!CONDITION)
> wait_woken(entry, ...);
> remove_wait_queue(WQ, entry);
> }
>
> this code is correct even if LOAD(CONDITION) can leak into the critical
> section in add_wait_queue(), so CPU running wait() can actually do
>
> // add_wait_queue
> spin_lock(WQ->lock);
> LOAD(CONDITION); // false!
> list_add(entry, head);
> spin_unlock(WQ->lock);
>
> if (!false) // result of the LOAD above
> wait_woken(entry, ...);
>
> Now suppose that another CPU executes wake() between LOAD(CONDITION)
> and list_add(entry, head). With your patch wait() will miss the event.
> The same for __pollwait(), I think...
>
> No?
Even simpler,
void wait(void)
{
DEFINE_WAIT(entry);
__set_current_state(XXX);
add_wait_queue(WQ, entry);
if (!CONDITION)
schedule();
remove_wait_queue(WQ, entry);
__set_current_state(TASK_RUNNING);
}
This code is ugly but currently correct unless I am totally confused.
Oleg.
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2024-12-28 16:33 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-25 9:42 [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write WangYuli
2024-12-25 13:30 ` Andy Shevchenko
2024-12-25 15:42 ` WangYuli
2024-12-26 16:00 ` Oleg Nesterov
2024-12-26 19:02 ` Linus Torvalds
2024-12-26 20:11 ` Oleg Nesterov
2024-12-27 18:39 ` Manfred Spraul
2024-12-28 14:32 ` Oleg Nesterov
2024-12-28 15:22 ` Oleg Nesterov
2024-12-28 16:32 ` Oleg Nesterov
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).