linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* On possible data race in pollwake() / poll_schedule_timeout()
@ 2025-06-17 11:04 Dmitry Antipov
  2025-06-18 15:20 ` Jan Kara
  0 siblings, 1 reply; 5+ messages in thread
From: Dmitry Antipov @ 2025-06-17 11:04 UTC (permalink / raw)
  To: Peter Zijlstra, Alexander Viro, Christian Brauner, Jan Kara; +Cc: linux-fsdevel

Running both syzbot reproducers and even during regular system boots,
KCSAN is likely to report the data race around using 'triggered' flag
of 'struct poll_wqueues'. Suspected race may be either read vs. write
(observed locally during system boot):

BUG: KCSAN: data-race in poll_schedule_timeout / pollwake

write to 0xffffc90004397b90 of 4 bytes by task 5619 on cpu 4:
  pollwake+0xd1/0x130
  __wake_up_common_lock+0x7f/0xd0
  sock_def_readable+0x20e/0x590
  unix_dgram_sendmsg+0xa3a/0x1050
  unix_seqpacket_sendmsg+0xdb/0x140
  __sock_sendmsg+0x151/0x190
  sock_write_iter+0x172/0x1c0
  vfs_write+0x66d/0x6f0
  ksys_write+0xe7/0x1b0
  __x64_sys_write+0x4a/0x60
  x64_sys_call+0x2f35/0x32b0
  do_syscall_64+0xfa/0x3b0
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffffc90004397b90 of 4 bytes by task 5620 on cpu 2:
  poll_schedule_timeout+0x96/0x160
  do_sys_poll+0x966/0xb30
  __se_sys_ppoll+0x1c3/0x210
  __x64_sys_ppoll+0x71/0x90
  x64_sys_call+0x3079/0x32b0
  do_syscall_64+0xfa/0x3b0
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0x00000000 -> 0x00000001

or concurrent write (example taken from
https://syzkaller.appspot.com/bug?extid=4c7af974f816af4ede2a):

BUG: KCSAN: data-race in pollwake / pollwake

write to 0xffffc90000e539e0 of 4 bytes by task 3308 on cpu 1:
  __pollwake fs/select.c:195 [inline]
  pollwake+0xb6/0x100 fs/select.c:215
  __wake_up_common kernel/sched/wait.c:89 [inline]
  __wake_up_common_lock kernel/sched/wait.c:106 [inline]
  __wake_up_sync_key+0x4f/0x80 kernel/sched/wait.c:173
  anon_pipe_write+0x8ba/0xaa0 fs/pipe.c:594
  new_sync_write fs/read_write.c:593 [inline]
  vfs_write+0x4a0/0x8e0 fs/read_write.c:686
  ksys_write+0xda/0x1a0 fs/read_write.c:738
  __do_sys_write fs/read_write.c:749 [inline]
  __se_sys_write fs/read_write.c:746 [inline]
  __x64_sys_write+0x40/0x50 fs/read_write.c:746
  x64_sys_call+0x2cdd/0x2fb0 arch/x86/include/generated/asm/syscalls_64.h:2
  do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
  do_syscall_64+0xd2/0x200 arch/x86/entry/syscall_64.c:94
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

write to 0xffffc90000e539e0 of 4 bytes by task 4163 on cpu 0:
  __pollwake fs/select.c:195 [inline]
  pollwake+0xb6/0x100 fs/select.c:215
  __wake_up_common kernel/sched/wait.c:89 [inline]
  __wake_up_common_lock kernel/sched/wait.c:106 [inline]
  __wake_up_sync_key+0x4f/0x80 kernel/sched/wait.c:173
  anon_pipe_write+0x8ba/0xaa0 fs/pipe.c:594
  new_sync_write fs/read_write.c:593 [inline]
  vfs_write+0x4a0/0x8e0 fs/read_write.c:686
  ksys_write+0xda/0x1a0 fs/read_write.c:738
  __do_sys_write fs/read_write.c:749 [inline]
  __se_sys_write fs/read_write.c:746 [inline]
  __x64_sys_write+0x40/0x50 fs/read_write.c:746
  x64_sys_call+0x2cdd/0x2fb0 arch/x86/include/generated/asm/syscalls_64.h:2
  do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
  do_syscall_64+0xd2/0x200 arch/x86/entry/syscall_64.c:94
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0x00000000 -> 0x00000001

Using _ONCE() seems makes KCSAN quiet, i.e.:

diff --git a/fs/select.c b/fs/select.c
index 9fb650d03d52..082cf60c7e23 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -192,7 +192,7 @@ static int __pollwake(wait_queue_entry_t *wait, unsigned mode, int sync, void *k
          * and is paired with smp_store_mb() in poll_schedule_timeout.
          */
         smp_wmb();
-       pwq->triggered = 1;
+       WRITE_ONCE(pwq->triggered, 1);

         /*
          * Perform the default wake up operation using a dummy
@@ -237,7 +237,7 @@ static int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
         int rc = -EINTR;

         set_current_state(state);
-       if (!pwq->triggered)
+       if (!READ_ONCE(pwq->triggered))
                 rc = schedule_hrtimeout_range(expires, slack, HRTIMER_MODE_ABS);
         __set_current_state(TASK_RUNNING);

but I'm curious whether this is a real fix for the real bug or
KCSAN is just unable to handle smp_wmb()/smp_store_mb() trick.

Dmitry





^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: On possible data race in pollwake() / poll_schedule_timeout()
  2025-06-17 11:04 On possible data race in pollwake() / poll_schedule_timeout() Dmitry Antipov
@ 2025-06-18 15:20 ` Jan Kara
  2025-06-18 17:08   ` Dmitry Antipov
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2025-06-18 15:20 UTC (permalink / raw)
  To: Dmitry Antipov
  Cc: Peter Zijlstra, Alexander Viro, Christian Brauner, Jan Kara,
	linux-fsdevel

On Tue 17-06-25 14:04:37, Dmitry Antipov wrote:
> Running both syzbot reproducers and even during regular system boots,
> KCSAN is likely to report the data race around using 'triggered' flag
> of 'struct poll_wqueues'. Suspected race may be either read vs. write
> (observed locally during system boot):
> 
> BUG: KCSAN: data-race in poll_schedule_timeout / pollwake
> 
> write to 0xffffc90004397b90 of 4 bytes by task 5619 on cpu 4:
>  pollwake+0xd1/0x130
>  __wake_up_common_lock+0x7f/0xd0
>  sock_def_readable+0x20e/0x590
>  unix_dgram_sendmsg+0xa3a/0x1050
>  unix_seqpacket_sendmsg+0xdb/0x140
>  __sock_sendmsg+0x151/0x190
>  sock_write_iter+0x172/0x1c0
>  vfs_write+0x66d/0x6f0
>  ksys_write+0xe7/0x1b0
>  __x64_sys_write+0x4a/0x60
>  x64_sys_call+0x2f35/0x32b0
>  do_syscall_64+0xfa/0x3b0
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> read to 0xffffc90004397b90 of 4 bytes by task 5620 on cpu 2:
>  poll_schedule_timeout+0x96/0x160
>  do_sys_poll+0x966/0xb30
>  __se_sys_ppoll+0x1c3/0x210
>  __x64_sys_ppoll+0x71/0x90
>  x64_sys_call+0x3079/0x32b0
>  do_syscall_64+0xfa/0x3b0
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> value changed: 0x00000000 -> 0x00000001
> 
> or concurrent write (example taken from
> https://syzkaller.appspot.com/bug?extid=4c7af974f816af4ede2a):
> 
> BUG: KCSAN: data-race in pollwake / pollwake
> 
> write to 0xffffc90000e539e0 of 4 bytes by task 3308 on cpu 1:
>  __pollwake fs/select.c:195 [inline]
>  pollwake+0xb6/0x100 fs/select.c:215
>  __wake_up_common kernel/sched/wait.c:89 [inline]
>  __wake_up_common_lock kernel/sched/wait.c:106 [inline]
>  __wake_up_sync_key+0x4f/0x80 kernel/sched/wait.c:173
>  anon_pipe_write+0x8ba/0xaa0 fs/pipe.c:594
>  new_sync_write fs/read_write.c:593 [inline]
>  vfs_write+0x4a0/0x8e0 fs/read_write.c:686
>  ksys_write+0xda/0x1a0 fs/read_write.c:738
>  __do_sys_write fs/read_write.c:749 [inline]
>  __se_sys_write fs/read_write.c:746 [inline]
>  __x64_sys_write+0x40/0x50 fs/read_write.c:746
>  x64_sys_call+0x2cdd/0x2fb0 arch/x86/include/generated/asm/syscalls_64.h:2
>  do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
>  do_syscall_64+0xd2/0x200 arch/x86/entry/syscall_64.c:94
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> write to 0xffffc90000e539e0 of 4 bytes by task 4163 on cpu 0:
>  __pollwake fs/select.c:195 [inline]
>  pollwake+0xb6/0x100 fs/select.c:215
>  __wake_up_common kernel/sched/wait.c:89 [inline]
>  __wake_up_common_lock kernel/sched/wait.c:106 [inline]
>  __wake_up_sync_key+0x4f/0x80 kernel/sched/wait.c:173
>  anon_pipe_write+0x8ba/0xaa0 fs/pipe.c:594
>  new_sync_write fs/read_write.c:593 [inline]
>  vfs_write+0x4a0/0x8e0 fs/read_write.c:686
>  ksys_write+0xda/0x1a0 fs/read_write.c:738
>  __do_sys_write fs/read_write.c:749 [inline]
>  __se_sys_write fs/read_write.c:746 [inline]
>  __x64_sys_write+0x40/0x50 fs/read_write.c:746
>  x64_sys_call+0x2cdd/0x2fb0 arch/x86/include/generated/asm/syscalls_64.h:2
>  do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
>  do_syscall_64+0xd2/0x200 arch/x86/entry/syscall_64.c:94
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> value changed: 0x00000000 -> 0x00000001
> 
> Using _ONCE() seems makes KCSAN quiet, i.e.:
> 
> diff --git a/fs/select.c b/fs/select.c
> index 9fb650d03d52..082cf60c7e23 100644
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -192,7 +192,7 @@ static int __pollwake(wait_queue_entry_t *wait, unsigned mode, int sync, void *k
>          * and is paired with smp_store_mb() in poll_schedule_timeout.
>          */
>         smp_wmb();
> -       pwq->triggered = 1;
> +       WRITE_ONCE(pwq->triggered, 1);
> 
>         /*
>          * Perform the default wake up operation using a dummy
> @@ -237,7 +237,7 @@ static int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
>         int rc = -EINTR;
> 
>         set_current_state(state);
> -       if (!pwq->triggered)
> +       if (!READ_ONCE(pwq->triggered))
>                 rc = schedule_hrtimeout_range(expires, slack, HRTIMER_MODE_ABS);
>         __set_current_state(TASK_RUNNING);
> 
> but I'm curious whether this is a real fix for the real bug or
> KCSAN is just unable to handle smp_wmb()/smp_store_mb() trick.

So KCSAN is really trigger-happy about issues like this. There's no
practical issue here because it is hard to imagine how the compiler could
compile the above code using some intermediate values stored into
'triggered' or multiple fetches from 'triggered'. But for the cleanliness
of code and silencing of KCSAN your changes make sense.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: On possible data race in pollwake() / poll_schedule_timeout()
  2025-06-18 15:20 ` Jan Kara
@ 2025-06-18 17:08   ` Dmitry Antipov
  2025-06-19  8:44     ` Jan Kara
  2025-06-19 10:05     ` Christian Brauner
  0 siblings, 2 replies; 5+ messages in thread
From: Dmitry Antipov @ 2025-06-18 17:08 UTC (permalink / raw)
  To: Jan Kara; +Cc: Peter Zijlstra, Alexander Viro, Christian Brauner, linux-fsdevel

On 6/18/25 6:20 PM, Jan Kara wrote:

> So KCSAN is really trigger-happy about issues like this. There's no
> practical issue here because it is hard to imagine how the compiler could
> compile the above code using some intermediate values stored into
> 'triggered' or multiple fetches from 'triggered'. But for the cleanliness
> of code and silencing of KCSAN your changes make sense.

Thanks. Surely I've read Documentation/memory-barriers.txt more than
once, but, just for this particual case: is _ONCE() pair from the above
expected to work in the same way as:

diff --git a/fs/select.c b/fs/select.c
index 9fb650d03d52..1a4096fd3a95 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -191,8 +191,7 @@ static int __pollwake(wait_queue_entry_t *wait, unsigned mode, int sync, void *k
          * smp_wmb() is equivalent to smp_wmb() in try_to_wake_up()
          * and is paired with smp_store_mb() in poll_schedule_timeout.
          */
-       smp_wmb();
-       pwq->triggered = 1;
+       smp_store_release(&pwq->triggered, 1);

         /*
          * Perform the default wake up operation using a dummy
@@ -237,7 +236,7 @@ static int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
         int rc = -EINTR;

         set_current_state(state);
-       if (!pwq->triggered)
+       if (!smp_load_acquire(&pwq->triggered))
                 rc = schedule_hrtimeout_range(expires, slack, HRTIMER_MODE_ABS);
         __set_current_state(TASK_RUNNING);

@@ -252,7 +251,7 @@ static int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
          * this problem doesn't exist for the first iteration as
          * add_wait_queue() has full barrier semantics.
          */
-       smp_store_mb(pwq->triggered, 0);
+       smp_store_release(&pwq->triggered, 0);

         return rc;
  }

Dmitry


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: On possible data race in pollwake() / poll_schedule_timeout()
  2025-06-18 17:08   ` Dmitry Antipov
@ 2025-06-19  8:44     ` Jan Kara
  2025-06-19 10:05     ` Christian Brauner
  1 sibling, 0 replies; 5+ messages in thread
From: Jan Kara @ 2025-06-19  8:44 UTC (permalink / raw)
  To: Dmitry Antipov
  Cc: Jan Kara, Peter Zijlstra, Alexander Viro, Christian Brauner,
	linux-fsdevel

On Wed 18-06-25 20:08:22, Dmitry Antipov wrote:
> On 6/18/25 6:20 PM, Jan Kara wrote:
> 
> > So KCSAN is really trigger-happy about issues like this. There's no
> > practical issue here because it is hard to imagine how the compiler could
> > compile the above code using some intermediate values stored into
> > 'triggered' or multiple fetches from 'triggered'. But for the cleanliness
> > of code and silencing of KCSAN your changes make sense.
> 
> Thanks. Surely I've read Documentation/memory-barriers.txt more than
> once, but, just for this particual case: is _ONCE() pair from the above
> expected to work in the same way as:

Firstly, I would not mess with this subtle code in unobvious ways unless
necessary ;)

> diff --git a/fs/select.c b/fs/select.c
> index 9fb650d03d52..1a4096fd3a95 100644
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -191,8 +191,7 @@ static int __pollwake(wait_queue_entry_t *wait, unsigned mode, int sync, void *k
>          * smp_wmb() is equivalent to smp_wmb() in try_to_wake_up()
>          * and is paired with smp_store_mb() in poll_schedule_timeout.
>          */
> -       smp_wmb();
> -       pwq->triggered = 1;
> +       smp_store_release(&pwq->triggered, 1);

Yes, this should be equivalent AFAICS.

> @@ -237,7 +236,7 @@ static int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
>         int rc = -EINTR;
> 
>         set_current_state(state);
> -       if (!pwq->triggered)
> +       if (!smp_load_acquire(&pwq->triggered))

set_current_state() already contains a full barrier. Thus
smp_load_acquire() would add unnecessary overhead here AFAICT.

>                 rc = schedule_hrtimeout_range(expires, slack, HRTIMER_MODE_ABS);
>         __set_current_state(TASK_RUNNING);
> 
> @@ -252,7 +251,7 @@ static int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
>          * this problem doesn't exist for the first iteration as
>          * add_wait_queue() has full barrier semantics.
>          */
> -       smp_store_mb(pwq->triggered, 0);
> +       smp_store_release(&pwq->triggered, 0);

This is IMO wrong and would need very good explanation why you think this
is safe to do. Full barrier is stronger than just a RELEASE operation.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: On possible data race in pollwake() / poll_schedule_timeout()
  2025-06-18 17:08   ` Dmitry Antipov
  2025-06-19  8:44     ` Jan Kara
@ 2025-06-19 10:05     ` Christian Brauner
  1 sibling, 0 replies; 5+ messages in thread
From: Christian Brauner @ 2025-06-19 10:05 UTC (permalink / raw)
  To: Dmitry Antipov; +Cc: Jan Kara, Peter Zijlstra, Alexander Viro, linux-fsdevel

On Wed, Jun 18, 2025 at 08:08:22PM +0300, Dmitry Antipov wrote:
> On 6/18/25 6:20 PM, Jan Kara wrote:
> 
> > So KCSAN is really trigger-happy about issues like this. There's no
> > practical issue here because it is hard to imagine how the compiler could
> > compile the above code using some intermediate values stored into
> > 'triggered' or multiple fetches from 'triggered'. But for the cleanliness
> > of code and silencing of KCSAN your changes make sense.
> 
> Thanks. Surely I've read Documentation/memory-barriers.txt more than
> once, but, just for this particual case: is _ONCE() pair from the above
> expected to work in the same way as:

It's fine for this case.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-06-19 10:05 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-17 11:04 On possible data race in pollwake() / poll_schedule_timeout() Dmitry Antipov
2025-06-18 15:20 ` Jan Kara
2025-06-18 17:08   ` Dmitry Antipov
2025-06-19  8:44     ` Jan Kara
2025-06-19 10:05     ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).