Netdev List
 help / color / mirror / Atom feed
* [PATCH net] vsock: keep poll shutdown state consistent
@ 2026-05-16  3:47 Ziyu Zhang
  2026-05-18 10:16 ` Stefano Garzarella
  0 siblings, 1 reply; 5+ messages in thread
From: Ziyu Zhang @ 2026-05-16  3:47 UTC (permalink / raw)
  To: Stefano Garzarella, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Andy King, George Zhang, Dmitry Torokhov,
	virtualization, netdev, linux-kernel, baijiaju1990, r33s3n6,
	gality369, zhenghaoran154, hanguidong02, zzzccc427, Ziyu Zhang

vsock_poll() reads vsk->peer_shutdown before taking the socket
lock to set EPOLLHUP and EPOLLRDHUP, then reads it again under the
lock to report EOF readability. A shutdown packet can update
peer_shutdown while poll is waiting for the lock, so one poll invocation
can report EPOLLIN without the corresponding HUP/RDHUP bits.

Keep non-connectible sockets on a single lockless READ_ONCE()
snapshot. For connectible sockets, defer shutdown-derived poll bits
until after lock_sock() and use one READ_ONCE() snapshot for both EOF
readability and HUP/RDHUP. This preserves shutdowns that arrive before
the lock is acquired and keeps all peer-shutdown-derived bits consistent
for a poll pass.

Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
Signed-off-by: Ziyu Zhang <ziyuzhang201@gmail.com>
---
 net/vmw_vsock/af_vsock.c | 40 ++++++++++++++++++++++++++--------------
 1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index adcba1b7b..bed42347b 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1122,6 +1122,25 @@ static int vsock_shutdown(struct socket *sock, int mode)
 	return err;
 }
 
+static __poll_t vsock_poll_shutdown(struct sock *sk, u32 peer_shutdown)
+{
+	__poll_t mask = 0;
+
+	/* INET sockets treat local write shutdown and peer write shutdown as a
+	 * case of EPOLLHUP set.
+	 */
+	if (sk->sk_shutdown == SHUTDOWN_MASK ||
+	    ((sk->sk_shutdown & SEND_SHUTDOWN) &&
+	     (peer_shutdown & SEND_SHUTDOWN)))
+		mask |= EPOLLHUP;
+
+	if (sk->sk_shutdown & RCV_SHUTDOWN ||
+	    peer_shutdown & SEND_SHUTDOWN)
+		mask |= EPOLLRDHUP;
+
+	return mask;
+}
+
 static __poll_t vsock_poll(struct file *file, struct socket *sock,
 			       poll_table *wait)
 {
@@ -1139,19 +1158,9 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
 		/* Signify that there has been an error on this socket. */
 		mask |= EPOLLERR;
 
-	/* INET sockets treat local write shutdown and peer write shutdown as a
-	 * case of EPOLLHUP set.
-	 */
-	if ((sk->sk_shutdown == SHUTDOWN_MASK) ||
-	    ((sk->sk_shutdown & SEND_SHUTDOWN) &&
-	     (vsk->peer_shutdown & SEND_SHUTDOWN))) {
-		mask |= EPOLLHUP;
-	}
-
-	if (sk->sk_shutdown & RCV_SHUTDOWN ||
-	    vsk->peer_shutdown & SEND_SHUTDOWN) {
-		mask |= EPOLLRDHUP;
-	}
+	if (!sock_type_connectible(sk->sk_type))
+		mask |= vsock_poll_shutdown(sk,
+					    READ_ONCE(vsk->peer_shutdown));
 
 	if (sk_is_readable(sk))
 		mask |= EPOLLIN | EPOLLRDNORM;
@@ -1171,6 +1180,7 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
 
 	} else if (sock_type_connectible(sk->sk_type)) {
 		const struct vsock_transport *transport;
+		u32 peer_shutdown;
 
 		lock_sock(sk);
 
@@ -1203,10 +1213,12 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
 		 * terminated should also be considered read, and we check the
 		 * shutdown flag for that.
 		 */
+		peer_shutdown = READ_ONCE(vsk->peer_shutdown);
 		if (sk->sk_shutdown & RCV_SHUTDOWN ||
-		    vsk->peer_shutdown & SEND_SHUTDOWN) {
+		    peer_shutdown & SEND_SHUTDOWN) {
 			mask |= EPOLLIN | EPOLLRDNORM;
 		}
+		mask |= vsock_poll_shutdown(sk, peer_shutdown);
 
 		/* Connected sockets that can produce data can be written. */
 		if (transport && sk->sk_state == TCP_ESTABLISHED) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH net] vsock: keep poll shutdown state consistent
  2026-05-16  3:47 [PATCH net] vsock: keep poll shutdown state consistent Ziyu Zhang
@ 2026-05-18 10:16 ` Stefano Garzarella
  2026-05-18 12:39   ` ziyu zhang
  0 siblings, 1 reply; 5+ messages in thread
From: Stefano Garzarella @ 2026-05-18 10:16 UTC (permalink / raw)
  To: Ziyu Zhang
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Andy King, George Zhang, Dmitry Torokhov,
	virtualization, netdev, linux-kernel, baijiaju1990, r33s3n6,
	gality369, zhenghaoran154, hanguidong02, zzzccc427

On Sat, May 16, 2026 at 11:47:45AM +0800, Ziyu Zhang wrote:
>vsock_poll() reads vsk->peer_shutdown before taking the socket
>lock to set EPOLLHUP and EPOLLRDHUP, then reads it again under the
>lock to report EOF readability. A shutdown packet can update
>peer_shutdown while poll is waiting for the lock, so one poll invocation
>can report EPOLLIN without the corresponding HUP/RDHUP bits.
>
>Keep non-connectible sockets on a single lockless READ_ONCE()

Should this be paired with WRITE_ONCE() on writers?

>snapshot. For connectible sockets, defer shutdown-derived poll bits
>until after lock_sock() and use one READ_ONCE() snapshot for both EOF
>readability and HUP/RDHUP. This preserves shutdowns that arrive before
>the lock is acquired and keeps all peer-shutdown-derived bits consistent
>for a poll pass.
>
>Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
>Signed-off-by: Ziyu Zhang <ziyuzhang201@gmail.com>
>---
> net/vmw_vsock/af_vsock.c | 40 ++++++++++++++++++++++++++--------------
> 1 file changed, 26 insertions(+), 14 deletions(-)
>
>diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>index adcba1b7b..bed42347b 100644
>--- a/net/vmw_vsock/af_vsock.c
>+++ b/net/vmw_vsock/af_vsock.c
>@@ -1122,6 +1122,25 @@ static int vsock_shutdown(struct socket *sock, int mode)
> 	return err;
> }
>
>+static __poll_t vsock_poll_shutdown(struct sock *sk, u32 peer_shutdown)
>+{
>+	__poll_t mask = 0;
>+
>+	/* INET sockets treat local write shutdown and peer write shutdown as a
>+	 * case of EPOLLHUP set.
>+	 */
>+	if (sk->sk_shutdown == SHUTDOWN_MASK ||
>+	    ((sk->sk_shutdown & SEND_SHUTDOWN) &&
>+	     (peer_shutdown & SEND_SHUTDOWN)))
>+		mask |= EPOLLHUP;
>+
>+	if (sk->sk_shutdown & RCV_SHUTDOWN ||
>+	    peer_shutdown & SEND_SHUTDOWN)
>+		mask |= EPOLLRDHUP;
>+
>+	return mask;
>+}
>+
> static __poll_t vsock_poll(struct file *file, struct socket *sock,
> 			       poll_table *wait)
> {
>@@ -1139,19 +1158,9 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
> 		/* Signify that there has been an error on this socket. */
> 		mask |= EPOLLERR;
>
>-	/* INET sockets treat local write shutdown and peer write shutdown as a
>-	 * case of EPOLLHUP set.
>-	 */
>-	if ((sk->sk_shutdown == SHUTDOWN_MASK) ||
>-	    ((sk->sk_shutdown & SEND_SHUTDOWN) &&
>-	     (vsk->peer_shutdown & SEND_SHUTDOWN))) {
>-		mask |= EPOLLHUP;
>-	}
>-
>-	if (sk->sk_shutdown & RCV_SHUTDOWN ||
>-	    vsk->peer_shutdown & SEND_SHUTDOWN) {
>-		mask |= EPOLLRDHUP;
>-	}
>+	if (!sock_type_connectible(sk->sk_type))
>+		mask |= vsock_poll_shutdown(sk,
>+					    READ_ONCE(vsk->peer_shutdown));

Can we move this in the `if (sock->type == SOCK_DGRAM)` branch ?

Not a strong opinion about that, but in any case IMO we should add a 
comment here to explain why we are doing only for not connectible 
sockets.

That said, if we use WRITE_ONCE in the writers, do we really need to 
move this after the lock_sock for the connectable ones?

>
> 	if (sk_is_readable(sk))
> 		mask |= EPOLLIN | EPOLLRDNORM;
>@@ -1171,6 +1180,7 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
>
> 	} else if (sock_type_connectible(sk->sk_type)) {
> 		const struct vsock_transport *transport;
>+		u32 peer_shutdown;
>
> 		lock_sock(sk);
>
>@@ -1203,10 +1213,12 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
> 		 * terminated should also be considered read, and we check the
> 		 * shutdown flag for that.
> 		 */
>+		peer_shutdown = READ_ONCE(vsk->peer_shutdown);
> 		if (sk->sk_shutdown & RCV_SHUTDOWN ||
>-		    vsk->peer_shutdown & SEND_SHUTDOWN) {
>+		    peer_shutdown & SEND_SHUTDOWN) {
> 			mask |= EPOLLIN | EPOLLRDNORM;
> 		}
>+		mask |= vsock_poll_shutdown(sk, peer_shutdown);

nit: to keep the order the same as before, I would move this call just 
before this `if` block, but I don't think it makes any difference in the 
end.

>
> 		/* Connected sockets that can produce data can be written. */
> 		if (transport && sk->sk_state == TCP_ESTABLISHED) {
>-- 
>2.43.0
>


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH net] vsock: keep poll shutdown state consistent
  2026-05-18 10:16 ` Stefano Garzarella
@ 2026-05-18 12:39   ` ziyu zhang
  2026-05-19  9:36     ` Stefano Garzarella
  0 siblings, 1 reply; 5+ messages in thread
From: ziyu zhang @ 2026-05-18 12:39 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Andy King, George Zhang, Dmitry Torokhov,
	virtualization, netdev, linux-kernel, baijiaju1990, r33s3n6,
	gality369, zhenghaoran154, hanguidong02, zzzccc427

On Mon, 18 May 2026 at 18:16, Stefano Garzarella <sgarzare@redhat.com> wrote:
>
> On Sat, May 16, 2026 at 11:47:45AM +0800, Ziyu Zhang wrote:
> >vsock_poll() reads vsk->peer_shutdown before taking the socket
> >lock to set EPOLLHUP and EPOLLRDHUP, then reads it again under the
> >lock to report EOF readability. A shutdown packet can update
> >peer_shutdown while poll is waiting for the lock, so one poll invocation
> >can report EPOLLIN without the corresponding HUP/RDHUP bits.
> >
> >Keep non-connectible sockets on a single lockless READ_ONCE()
>
> Should this be paired with WRITE_ONCE() on writers?

Yes, since the poll path uses lockless READ_ONCE() snapshots of
peer_shutdown, the writer side should be annotated with WRITE_ONCE() as
well. I will add that in v2.

>
> >snapshot. For connectible sockets, defer shutdown-derived poll bits
> >until after lock_sock() and use one READ_ONCE() snapshot for both EOF
> >readability and HUP/RDHUP. This preserves shutdowns that arrive before
> >the lock is acquired and keeps all peer-shutdown-derived bits consistent
> >for a poll pass.
> >
> >Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
> >Signed-off-by: Ziyu Zhang <ziyuzhang201@gmail.com>
> >---
> > net/vmw_vsock/af_vsock.c | 40 ++++++++++++++++++++++++++--------------
> > 1 file changed, 26 insertions(+), 14 deletions(-)
> >
> >diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> >index adcba1b7b..bed42347b 100644
> >--- a/net/vmw_vsock/af_vsock.c
> >+++ b/net/vmw_vsock/af_vsock.c
> >@@ -1122,6 +1122,25 @@ static int vsock_shutdown(struct socket *sock, int mode)
> >       return err;
> > }
> >
> >+static __poll_t vsock_poll_shutdown(struct sock *sk, u32 peer_shutdown)
> >+{
> >+      __poll_t mask = 0;
> >+
> >+      /* INET sockets treat local write shutdown and peer write shutdown as a
> >+       * case of EPOLLHUP set.
> >+       */
> >+      if (sk->sk_shutdown == SHUTDOWN_MASK ||
> >+          ((sk->sk_shutdown & SEND_SHUTDOWN) &&
> >+           (peer_shutdown & SEND_SHUTDOWN)))
> >+              mask |= EPOLLHUP;
> >+
> >+      if (sk->sk_shutdown & RCV_SHUTDOWN ||
> >+          peer_shutdown & SEND_SHUTDOWN)
> >+              mask |= EPOLLRDHUP;
> >+
> >+      return mask;
> >+}
> >+
> > static __poll_t vsock_poll(struct file *file, struct socket *sock,
> >                              poll_table *wait)
> > {
> >@@ -1139,19 +1158,9 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
> >               /* Signify that there has been an error on this socket. */
> >               mask |= EPOLLERR;
> >
> >-      /* INET sockets treat local write shutdown and peer write shutdown as a
> >-       * case of EPOLLHUP set.
> >-       */
> >-      if ((sk->sk_shutdown == SHUTDOWN_MASK) ||
> >-          ((sk->sk_shutdown & SEND_SHUTDOWN) &&
> >-           (vsk->peer_shutdown & SEND_SHUTDOWN))) {
> >-              mask |= EPOLLHUP;
> >-      }
> >-
> >-      if (sk->sk_shutdown & RCV_SHUTDOWN ||
> >-          vsk->peer_shutdown & SEND_SHUTDOWN) {
> >-              mask |= EPOLLRDHUP;
> >-      }
> >+      if (!sock_type_connectible(sk->sk_type))
> >+              mask |= vsock_poll_shutdown(sk,
> >+                                          READ_ONCE(vsk->peer_shutdown));
>
> Can we move this in the `if (sock->type == SOCK_DGRAM)` branch ?
>
> Not a strong opinion about that, but in any case IMO we should add a
> comment here to explain why we are doing only for not connectible
> sockets.
>
> That said, if we use WRITE_ONCE in the writers, do we really need to
> move this after the lock_sock for the connectable ones?

Yes, I will move the non-connectible handling into the SOCK_DGRAM branch
and add a comment there.

For connectible sockets, my current understanding is that
READ_ONCE()/WRITE_ONCE() make the individual lockless accesses explicit, but
they do not make two separate reads in one vsock_poll() invocation observe the
same peer_shutdown value. So I still think using one peer_shutdown snapshot
after lock_sock() is useful for keeping the returned mask internally
consistent. Please let me know if you think WRITE_ONCE() is enough for this
case.

>
> >
> >       if (sk_is_readable(sk))
> >               mask |= EPOLLIN | EPOLLRDNORM;
> >@@ -1171,6 +1180,7 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
> >
> >       } else if (sock_type_connectible(sk->sk_type)) {
> >               const struct vsock_transport *transport;
> >+              u32 peer_shutdown;
> >
> >               lock_sock(sk);
> >
> >@@ -1203,10 +1213,12 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
> >                * terminated should also be considered read, and we check the
> >                * shutdown flag for that.
> >                */
> >+              peer_shutdown = READ_ONCE(vsk->peer_shutdown);
> >               if (sk->sk_shutdown & RCV_SHUTDOWN ||
> >-                  vsk->peer_shutdown & SEND_SHUTDOWN) {
> >+                  peer_shutdown & SEND_SHUTDOWN) {
> >                       mask |= EPOLLIN | EPOLLRDNORM;
> >               }
> >+              mask |= vsock_poll_shutdown(sk, peer_shutdown);
>
> nit: to keep the order the same as before, I would move this call just
> before this `if` block, but I don't think it makes any difference in the
> end.

Sure, I will restore the previous ordering in v2.

Thanks,
Ziyu

>
> >
> >               /* Connected sockets that can produce data can be written. */
> >               if (transport && sk->sk_state == TCP_ESTABLISHED) {
> >--
> >2.43.0
> >
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH net] vsock: keep poll shutdown state consistent
  2026-05-18 12:39   ` ziyu zhang
@ 2026-05-19  9:36     ` Stefano Garzarella
  2026-05-19 15:58       ` ziyu zhang
  0 siblings, 1 reply; 5+ messages in thread
From: Stefano Garzarella @ 2026-05-19  9:36 UTC (permalink / raw)
  To: ziyu zhang
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Andy King, George Zhang, Dmitry Torokhov,
	virtualization, netdev, linux-kernel, baijiaju1990, r33s3n6,
	gality369, zhenghaoran154, hanguidong02, zzzccc427

On Mon, May 18, 2026 at 08:39:37PM +0800, ziyu zhang wrote:
>On Mon, 18 May 2026 at 18:16, Stefano Garzarella <sgarzare@redhat.com> wrote:
>>
>> On Sat, May 16, 2026 at 11:47:45AM +0800, Ziyu Zhang wrote:
>> >vsock_poll() reads vsk->peer_shutdown before taking the socket
>> >lock to set EPOLLHUP and EPOLLRDHUP, then reads it again under the
>> >lock to report EOF readability. A shutdown packet can update
>> >peer_shutdown while poll is waiting for the lock, so one poll invocation
>> >can report EPOLLIN without the corresponding HUP/RDHUP bits.
>> >
>> >Keep non-connectible sockets on a single lockless READ_ONCE()
>>
>> Should this be paired with WRITE_ONCE() on writers?
>
>Yes, since the poll path uses lockless READ_ONCE() snapshots of
>peer_shutdown, the writer side should be annotated with WRITE_ONCE() as
>well. I will add that in v2.
>
>>
>> >snapshot. For connectible sockets, defer shutdown-derived poll bits
>> >until after lock_sock() and use one READ_ONCE() snapshot for both EOF
>> >readability and HUP/RDHUP. This preserves shutdowns that arrive before
>> >the lock is acquired and keeps all peer-shutdown-derived bits consistent
>> >for a poll pass.
>> >
>> >Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
>> >Signed-off-by: Ziyu Zhang <ziyuzhang201@gmail.com>
>> >---
>> > net/vmw_vsock/af_vsock.c | 40 ++++++++++++++++++++++++++--------------
>> > 1 file changed, 26 insertions(+), 14 deletions(-)
>> >
>> >diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>> >index adcba1b7b..bed42347b 100644
>> >--- a/net/vmw_vsock/af_vsock.c
>> >+++ b/net/vmw_vsock/af_vsock.c
>> >@@ -1122,6 +1122,25 @@ static int vsock_shutdown(struct socket *sock, int mode)
>> >       return err;
>> > }
>> >
>> >+static __poll_t vsock_poll_shutdown(struct sock *sk, u32 peer_shutdown)
>> >+{
>> >+      __poll_t mask = 0;
>> >+
>> >+      /* INET sockets treat local write shutdown and peer write shutdown as a
>> >+       * case of EPOLLHUP set.
>> >+       */
>> >+      if (sk->sk_shutdown == SHUTDOWN_MASK ||
>> >+          ((sk->sk_shutdown & SEND_SHUTDOWN) &&
>> >+           (peer_shutdown & SEND_SHUTDOWN)))
>> >+              mask |= EPOLLHUP;
>> >+
>> >+      if (sk->sk_shutdown & RCV_SHUTDOWN ||
>> >+          peer_shutdown & SEND_SHUTDOWN)
>> >+              mask |= EPOLLRDHUP;
>> >+
>> >+      return mask;
>> >+}
>> >+
>> > static __poll_t vsock_poll(struct file *file, struct socket *sock,
>> >                              poll_table *wait)
>> > {
>> >@@ -1139,19 +1158,9 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
>> >               /* Signify that there has been an error on this socket. */
>> >               mask |= EPOLLERR;
>> >
>> >-      /* INET sockets treat local write shutdown and peer write shutdown as a
>> >-       * case of EPOLLHUP set.
>> >-       */
>> >-      if ((sk->sk_shutdown == SHUTDOWN_MASK) ||
>> >-          ((sk->sk_shutdown & SEND_SHUTDOWN) &&
>> >-           (vsk->peer_shutdown & SEND_SHUTDOWN))) {
>> >-              mask |= EPOLLHUP;
>> >-      }
>> >-
>> >-      if (sk->sk_shutdown & RCV_SHUTDOWN ||
>> >-          vsk->peer_shutdown & SEND_SHUTDOWN) {
>> >-              mask |= EPOLLRDHUP;
>> >-      }
>> >+      if (!sock_type_connectible(sk->sk_type))
>> >+              mask |= vsock_poll_shutdown(sk,
>> >+                                          READ_ONCE(vsk->peer_shutdown));
>>
>> Can we move this in the `if (sock->type == SOCK_DGRAM)` branch ?
>>
>> Not a strong opinion about that, but in any case IMO we should add a
>> comment here to explain why we are doing only for not connectible
>> sockets.
>>
>> That said, if we use WRITE_ONCE in the writers, do we really need to
>> move this after the lock_sock for the connectable ones?
>
>Yes, I will move the non-connectible handling into the SOCK_DGRAM branch
>and add a comment there.
>
>For connectible sockets, my current understanding is that
>READ_ONCE()/WRITE_ONCE() make the individual lockless accesses explicit, but
>they do not make two separate reads in one vsock_poll() invocation observe the
>same peer_shutdown value. So I still think using one peer_shutdown snapshot
>after lock_sock() is useful for keeping the returned mask internally
>consistent. Please let me know if you think WRITE_ONCE() is enough for this
>case.

What will be the issue of "do not make two separate reads in one 
vsock_poll() invocation observe the same peer_shutdown value." ?

In any case, I'm not against it; I just want to understand the issue 
better.

Thanks,
Stefano


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH net] vsock: keep poll shutdown state consistent
  2026-05-19  9:36     ` Stefano Garzarella
@ 2026-05-19 15:58       ` ziyu zhang
  0 siblings, 0 replies; 5+ messages in thread
From: ziyu zhang @ 2026-05-19 15:58 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Andy King, George Zhang, Dmitry Torokhov,
	virtualization, netdev, linux-kernel, baijiaju1990, r33s3n6,
	gality369, zhenghaoran154, hanguidong02, zzzccc427

On Tue, 19 May 2026 at 17:36, Stefano Garzarella <sgarzare@redhat.com> wrote:
>
> On Mon, May 18, 2026 at 08:39:37PM +0800, ziyu zhang wrote:
> >On Mon, 18 May 2026 at 18:16, Stefano Garzarella <sgarzare@redhat.com> wrote:
> >>
> >> On Sat, May 16, 2026 at 11:47:45AM +0800, Ziyu Zhang wrote:
> >> >vsock_poll() reads vsk->peer_shutdown before taking the socket
> >> >lock to set EPOLLHUP and EPOLLRDHUP, then reads it again under the
> >> >lock to report EOF readability. A shutdown packet can update
> >> >peer_shutdown while poll is waiting for the lock, so one poll invocation
> >> >can report EPOLLIN without the corresponding HUP/RDHUP bits.
> >> >
> >> >Keep non-connectible sockets on a single lockless READ_ONCE()
> >>
> >> Should this be paired with WRITE_ONCE() on writers?
> >
> >Yes, since the poll path uses lockless READ_ONCE() snapshots of
> >peer_shutdown, the writer side should be annotated with WRITE_ONCE() as
> >well. I will add that in v2.
> >
> >>
> >> >snapshot. For connectible sockets, defer shutdown-derived poll bits
> >> >until after lock_sock() and use one READ_ONCE() snapshot for both EOF
> >> >readability and HUP/RDHUP. This preserves shutdowns that arrive before
> >> >the lock is acquired and keeps all peer-shutdown-derived bits consistent
> >> >for a poll pass.
> >> >
> >> >Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
> >> >Signed-off-by: Ziyu Zhang <ziyuzhang201@gmail.com>
> >> >---
> >> > net/vmw_vsock/af_vsock.c | 40 ++++++++++++++++++++++++++--------------
> >> > 1 file changed, 26 insertions(+), 14 deletions(-)
> >> >
> >> >diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> >> >index adcba1b7b..bed42347b 100644
> >> >--- a/net/vmw_vsock/af_vsock.c
> >> >+++ b/net/vmw_vsock/af_vsock.c
> >> >@@ -1122,6 +1122,25 @@ static int vsock_shutdown(struct socket *sock, int mode)
> >> >       return err;
> >> > }
> >> >
> >> >+static __poll_t vsock_poll_shutdown(struct sock *sk, u32 peer_shutdown)
> >> >+{
> >> >+      __poll_t mask = 0;
> >> >+
> >> >+      /* INET sockets treat local write shutdown and peer write shutdown as a
> >> >+       * case of EPOLLHUP set.
> >> >+       */
> >> >+      if (sk->sk_shutdown == SHUTDOWN_MASK ||
> >> >+          ((sk->sk_shutdown & SEND_SHUTDOWN) &&
> >> >+           (peer_shutdown & SEND_SHUTDOWN)))
> >> >+              mask |= EPOLLHUP;
> >> >+
> >> >+      if (sk->sk_shutdown & RCV_SHUTDOWN ||
> >> >+          peer_shutdown & SEND_SHUTDOWN)
> >> >+              mask |= EPOLLRDHUP;
> >> >+
> >> >+      return mask;
> >> >+}
> >> >+
> >> > static __poll_t vsock_poll(struct file *file, struct socket *sock,
> >> >                              poll_table *wait)
> >> > {
> >> >@@ -1139,19 +1158,9 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
> >> >               /* Signify that there has been an error on this socket. */
> >> >               mask |= EPOLLERR;
> >> >
> >> >-      /* INET sockets treat local write shutdown and peer write shutdown as a
> >> >-       * case of EPOLLHUP set.
> >> >-       */
> >> >-      if ((sk->sk_shutdown == SHUTDOWN_MASK) ||
> >> >-          ((sk->sk_shutdown & SEND_SHUTDOWN) &&
> >> >-           (vsk->peer_shutdown & SEND_SHUTDOWN))) {
> >> >-              mask |= EPOLLHUP;
> >> >-      }
> >> >-
> >> >-      if (sk->sk_shutdown & RCV_SHUTDOWN ||
> >> >-          vsk->peer_shutdown & SEND_SHUTDOWN) {
> >> >-              mask |= EPOLLRDHUP;
> >> >-      }
> >> >+      if (!sock_type_connectible(sk->sk_type))
> >> >+              mask |= vsock_poll_shutdown(sk,
> >> >+                                          READ_ONCE(vsk->peer_shutdown));
> >>
> >> Can we move this in the `if (sock->type == SOCK_DGRAM)` branch ?
> >>
> >> Not a strong opinion about that, but in any case IMO we should add a
> >> comment here to explain why we are doing only for not connectible
> >> sockets.
> >>
> >> That said, if we use WRITE_ONCE in the writers, do we really need to
> >> move this after the lock_sock for the connectable ones?
> >
> >Yes, I will move the non-connectible handling into the SOCK_DGRAM branch
> >and add a comment there.
> >
> >For connectible sockets, my current understanding is that
> >READ_ONCE()/WRITE_ONCE() make the individual lockless accesses explicit, but
> >they do not make two separate reads in one vsock_poll() invocation observe the
> >same peer_shutdown value. So I still think using one peer_shutdown snapshot
> >after lock_sock() is useful for keeping the returned mask internally
> >consistent. Please let me know if you think WRITE_ONCE() is enough for this
> >case.
>
> What will be the issue of "do not make two separate reads in one
> vsock_poll() invocation observe the same peer_shutdown value." ?
>
> In any case, I'm not against it; I just want to understand the issue
> better.

The issue I was trying to avoid is a transiently inconsistent poll mask
from one vsock_poll() pass.

For example, the early peer_shutdown read can see 0, so the
shutdown-derived bits are not set, especially EPOLLRDHUP, and EPOLLHUP
in the cases where the existing logic would set it. Then, before
lock_sock() succeeds, the peer shutdown can be processed. The later read
after lock_sock() can see SEND_SHUTDOWN and set EPOLLIN|EPOLLRDNORM for
EOF readability.

So the returned mask can say that EOF is readable, but miss the
shutdown-derived indication from the same peer_shutdown state. If
userspace is waiting specifically for EPOLLRDHUP, that notification can
be missed or delayed for that poll pass.

I agree this is small and transient. A following read() or another poll
pass will observe the shutdown state. My intention with the single
snapshot is only to avoid mixing old and new peer_shutdown values in one
returned mask.

I will send a v2 as a new thread with WRITE_ONCE() on the writers, the
DGRAM comment, and your ordering suggestion.

Thanks,
Ziyu

>
> Thanks,
> Stefano
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-05-19 15:58 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-16  3:47 [PATCH net] vsock: keep poll shutdown state consistent Ziyu Zhang
2026-05-18 10:16 ` Stefano Garzarella
2026-05-18 12:39   ` ziyu zhang
2026-05-19  9:36     ` Stefano Garzarella
2026-05-19 15:58       ` ziyu zhang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox