* Re: [PATCH RFC net-next v4 7/8] vsock: Add lockless sendmsg() support
From: Stefano Garzarella @ 2023-06-22 16:37 UTC (permalink / raw)
To: Bobby Eshleman
Cc: Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, Xuan Zhuo,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan,
Vishnu Dasa, VMware PV-Drivers Reviewers, Dan Carpenter,
Simon Horman, Krasnov Arseniy, kvm, virtualization, netdev,
linux-kernel, linux-hyperv, bpf
In-Reply-To: <20230413-b4-vsock-dgram-v4-7-0cebbb2ae899@bytedance.com>
On Sat, Jun 10, 2023 at 12:58:34AM +0000, Bobby Eshleman wrote:
>Because the dgram sendmsg() path for AF_VSOCK acquires the socket lock
>it does not scale when many senders share a socket.
>
>Prior to this patch the socket lock is used to protect both reads and
>writes to the local_addr, remote_addr, transport, and buffer size
>variables of a vsock socket. What follows are the new protection schemes
>for these fields that ensure a race-free and usually lock-free
>multi-sender sendmsg() path for vsock dgrams.
>
>- local_addr
>local_addr changes as a result of binding a socket. The write path
>for local_addr is bind() and various vsock_auto_bind() call sites.
>After a socket has been bound via vsock_auto_bind() or bind(), subsequent
>calls to bind()/vsock_auto_bind() do not write to local_addr again. bind()
>rejects the user request and vsock_auto_bind() early exits.
>Therefore, the local addr can not change while a parallel thread is
>in sendmsg() and lock-free reads of local addr in sendmsg() are safe.
>Change: only acquire lock for auto-binding as-needed in sendmsg().
>
>- buffer size variables
>Not used by dgram, so they do not need protection. No change.
>
>- remote_addr and transport
>Because a remote_addr update may result in a changed transport, but we
>would like to be able to read these two fields lock-free but coherently
>in the vsock send path, this patch packages these two fields into a new
>struct vsock_remote_info that is referenced by an RCU-protected pointer.
>
>Writes are synchronized as usual by the socket lock. Reads only take
>place in RCU read-side critical sections. When remote_addr or transport
>is updated, a new remote info is allocated. Old readers still see the
>old coherent remote_addr/transport pair, and new readers will refer to
>the new coherent. The coherency between remote_addr and transport
>previously provided by the socket lock alone is now also preserved by
>RCU, except with the highly-scalable lock-free read-side.
>
>Helpers are introduced for accessing and updating the new pointer.
>
>The new structure is contains an rcu_head so that kfree_rcu() can be
>used. This removes the need of writers to use synchronize_rcu() after
>freeing old structures which is simply more efficient and reduces code
>churn where remote_addr/transport are already being updated inside RCU
>read-side sections.
>
>Only virtio has been tested, but updates were necessary to the VMCI and
>hyperv code. Unfortunately the author does not have access to
>VMCI/hyperv systems so those changes are untested.
@Dexuan, @Vishnu, @Bryan, can you test this?
>
>Perf Tests (results from patch v2)
>vCPUS: 16
>Threads: 16
>Payload: 4KB
>Test Runs: 5
>Type: SOCK_DGRAM
>
>Before: 245.2 MB/s
>After: 509.2 MB/s (+107%)
>
>Notably, on the same test system, vsock dgram even outperforms
>multi-threaded UDP over virtio-net with vhost and MQ support enabled.
>
>Throughput metrics for single-threaded SOCK_DGRAM and
>single/multi-threaded SOCK_STREAM showed no statistically signficant
>throughput changes (lowest p-value reaching 0.27), with the range of the
>mean difference ranging between -5% to +1%.
>
Quite nice. Did you see any improvements also on stream/seqpacket
sockets?
However this is a big change, maybe I would move it to another series,
because it takes time to be reviewed and tested properly.
WDYT?
Thanks,
Stefano
^ permalink raw reply
* RE: [EXTERNAL] Re: [PATCH v2 2/5] tools: hv: Add vmbus_bufring
From: Saurabh Singh Sengar @ 2023-06-22 17:46 UTC (permalink / raw)
To: Greg KH, Saurabh Sengar
Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
Michael Kelley (LINUX), corbet@lwn.net,
linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
linux-doc@vger.kernel.org
In-Reply-To: <2023061430-facedown-getting-d9f7@gregkh>
> -----Original Message-----
> From: Greg KH <gregkh@linuxfoundation.org>
> Sent: Thursday, June 15, 2023 2:46 AM
> To: Saurabh Sengar <ssengar@linux.microsoft.com>
> Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <decui@microsoft.com>; Michael Kelley (LINUX) <mikelley@microsoft.com>;
> corbet@lwn.net; linux-kernel@vger.kernel.org; linux-hyperv@vger.kernel.org;
> linux-doc@vger.kernel.org
> Subject: [EXTERNAL] Re: [PATCH v2 2/5] tools: hv: Add vmbus_bufring
>
> On Wed, Jun 14, 2023 at 11:15:09AM -0700, Saurabh Sengar wrote:
> > Common userspace interface for read/write from VMBus ringbuffer.
> > This implementation is open for use by any userspace driver or
> > application seeking direct control over VMBus ring buffers.
> > A significant part of this code is borrowed from DPDK.
>
> " "?
>
> Anyway, this does not explain what this is at all.
>
> And if you "borrowed" it from DPDK, that feels odd, are you sure you are
> allowed to do so?
Yes, we have confirmed this with our legal team, as the previous code is
covered under an appropriate open-source license, we are fine to reuse it here.
- Saurabh
^ permalink raw reply
* RE: [EXTERNAL] Re: [PATCH v2 2/5] tools: hv: Add vmbus_bufring
From: Saurabh Singh Sengar @ 2023-06-22 17:47 UTC (permalink / raw)
To: Greg KH
Cc: Saurabh Sengar, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
Dexuan Cui, Michael Kelley (LINUX), corbet@lwn.net,
linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
linux-doc@vger.kernel.org
In-Reply-To: <2023062020-swung-sensitive-4866@gregkh>
> -----Original Message-----
> From: Greg KH <gregkh@linuxfoundation.org>
> Sent: Tuesday, June 20, 2023 12:36 PM
> To: Saurabh Singh Sengar <ssengar@microsoft.com>
> Cc: Saurabh Sengar <ssengar@linux.microsoft.com>; KY Srinivasan
> <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>;
> wei.liu@kernel.org; Dexuan Cui <decui@microsoft.com>; Michael Kelley
> (LINUX) <mikelley@microsoft.com>; corbet@lwn.net; linux-
> kernel@vger.kernel.org; linux-hyperv@vger.kernel.org; linux-
> doc@vger.kernel.org
> Subject: Re: [EXTERNAL] Re: [PATCH v2 2/5] tools: hv: Add vmbus_bufring
>
> On Tue, Jun 20, 2023 at 05:25:06AM +0000, Saurabh Singh Sengar wrote:
> > > > diff --git a/tools/hv/vmbus_bufring.c b/tools/hv/vmbus_bufring.c
> > > > new file mode 100644 index 000000000000..d44a06d45b03
> > > > --- /dev/null
> > > > +++ b/tools/hv/vmbus_bufring.c
> > > > @@ -0,0 +1,322 @@
> > > > +// SPDX-License-Identifier: BSD-3-Clause
> > > > +/*
> > > > + * Copyright (c) 2009-2012,2016,2023 Microsoft Corp.
> > > > + * Copyright (c) 2012 NetApp Inc.
> > > > + * Copyright (c) 2012 Citrix Inc.
> > > > + * All rights reserved.
> > >
> > > No copyright for the work you did?
> >
> > I have added 2023 Microsoft Corp. Please let me know if I need to add
> > anything more.
>
> Please contact your lawyers about EXACTLY what you need to do here, that's
> up to them, not me!
I have cross verified this with our legal team and they are fine with it.
- Saurabh
>
> greg k-h
^ permalink raw reply
* Re: [EXTERNAL] Re: [PATCH v2 2/5] tools: hv: Add vmbus_bufring
From: Greg KH @ 2023-06-22 18:12 UTC (permalink / raw)
To: Saurabh Singh Sengar
Cc: Saurabh Sengar, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
Dexuan Cui, Michael Kelley (LINUX), corbet@lwn.net,
linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
linux-doc@vger.kernel.org
In-Reply-To: <SEZP153MB0742B43EC4E490E2FEACDA4ABE22A@SEZP153MB0742.APCP153.PROD.OUTLOOK.COM>
On Thu, Jun 22, 2023 at 05:47:58PM +0000, Saurabh Singh Sengar wrote:
>
>
> > -----Original Message-----
> > From: Greg KH <gregkh@linuxfoundation.org>
> > Sent: Tuesday, June 20, 2023 12:36 PM
> > To: Saurabh Singh Sengar <ssengar@microsoft.com>
> > Cc: Saurabh Sengar <ssengar@linux.microsoft.com>; KY Srinivasan
> > <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>;
> > wei.liu@kernel.org; Dexuan Cui <decui@microsoft.com>; Michael Kelley
> > (LINUX) <mikelley@microsoft.com>; corbet@lwn.net; linux-
> > kernel@vger.kernel.org; linux-hyperv@vger.kernel.org; linux-
> > doc@vger.kernel.org
> > Subject: Re: [EXTERNAL] Re: [PATCH v2 2/5] tools: hv: Add vmbus_bufring
> >
> > On Tue, Jun 20, 2023 at 05:25:06AM +0000, Saurabh Singh Sengar wrote:
> > > > > diff --git a/tools/hv/vmbus_bufring.c b/tools/hv/vmbus_bufring.c
> > > > > new file mode 100644 index 000000000000..d44a06d45b03
> > > > > --- /dev/null
> > > > > +++ b/tools/hv/vmbus_bufring.c
> > > > > @@ -0,0 +1,322 @@
> > > > > +// SPDX-License-Identifier: BSD-3-Clause
> > > > > +/*
> > > > > + * Copyright (c) 2009-2012,2016,2023 Microsoft Corp.
> > > > > + * Copyright (c) 2012 NetApp Inc.
> > > > > + * Copyright (c) 2012 Citrix Inc.
> > > > > + * All rights reserved.
> > > >
> > > > No copyright for the work you did?
> > >
> > > I have added 2023 Microsoft Corp. Please let me know if I need to add
> > > anything more.
> >
> > Please contact your lawyers about EXACTLY what you need to do here, that's
> > up to them, not me!
>
> I have cross verified this with our legal team and they are fine with it.
Wonderful, please have them sign-off on the patch as well when you
resubmit it.
thanks,
greg k-h
^ permalink raw reply
* Re: [PATCH RFC net-next v4 6/8] virtio/vsock: support dgrams
From: Arseniy Krasnov @ 2023-06-22 18:46 UTC (permalink / raw)
To: Stefano Garzarella
Cc: Bobby Eshleman, Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang,
Xuan Zhuo, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Bryan Tan, Vishnu Dasa, VMware PV-Drivers Reviewers,
Dan Carpenter, Simon Horman, kvm, virtualization, netdev,
linux-kernel, linux-hyperv, bpf
In-Reply-To: <ppx75eomyyb354knfkwbwin3il2ot7hf5cefwrt6ztpcbc3pps@q736cq5v4bdh>
On 22.06.2023 19:09, Stefano Garzarella wrote:
> On Sun, Jun 11, 2023 at 11:49:02PM +0300, Arseniy Krasnov wrote:
>> Hello Bobby!
>>
>> On 10.06.2023 03:58, Bobby Eshleman wrote:
>>> This commit adds support for datagrams over virtio/vsock.
>>>
>>> Message boundaries are preserved on a per-skb and per-vq entry basis.
>>
>> I'm a little bit confused about the following case: let vhost sends 4097 bytes
>> datagram to the guest. Guest uses 4096 RX buffers in it's virtio queue, each
>> buffer has attached empty skb to it. Vhost places first 4096 bytes to the first
>> buffer of guests RX queue, and 1 last byte to the second buffer. Now IIUC guest
>> has two skb in it rx queue, and user in guest wants to read data - does it read
>> 4097 bytes, while guest has two skb - 4096 bytes and 1 bytes? In seqpacket there is
>> special marker in header which shows where message ends, and how it works here?
>
> I think the main difference is that DGRAM is not connection-oriented, so
> we don't have a stream and we can't split the packet into 2 (maybe we
> could, but we have no guarantee that the second one for example will be
> not discarded because there is no space).
>
> So I think it is acceptable as a restriction to keep it simple.
Ah, I see, idea is that any "corruptions" of data could be considered as
"DGRAM is not reliable anyway, so that's it" :)
>
> My only doubt is, should we make the RX buffer size configurable,
> instead of always using 4k?
I guess this is useful only for DGRAM usage, when we want to tune buffers
for some specific case - may be for exact length of messages (for example if we have
4096 buffers, while senders wants to send 5000 bytes always by each 'send()' - I think it
will be really strange that reader ALWAYS dequeues 4096 and 4 bytes as two packets).
For stream types of socket I think size of rx buffers is not big deal in most of cases.
Thanks, Arseniy
>
> Thanks,
> Stefano
>
^ permalink raw reply
* Re: [PATCH RFC net-next v4 1/8] vsock/dgram: generalize recvmsg and drop transport->dgram_dequeue
From: Arseniy Krasnov @ 2023-06-22 19:23 UTC (permalink / raw)
To: Stefano Garzarella
Cc: Bobby Eshleman, Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang,
Xuan Zhuo, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Bryan Tan, Vishnu Dasa, VMware PV-Drivers Reviewers,
Dan Carpenter, Simon Horman, kvm, virtualization, netdev,
linux-kernel, linux-hyperv, bpf
In-Reply-To: <63ko2n5fwjdefot6rzcxdftfh6pilg6vmqn66v4ue5dgf4oz53@tntmdijw4ghr>
On 22.06.2023 17:51, Stefano Garzarella wrote:
> On Sun, Jun 11, 2023 at 11:43:15PM +0300, Arseniy Krasnov wrote:
>> Hello Bobby! Thanks for this patchset! Small comment below:
>>
>> On 10.06.2023 03:58, Bobby Eshleman wrote:
>>> This commit drops the transport->dgram_dequeue callback and makes
>>> vsock_dgram_recvmsg() generic. It also adds additional transport
>>> callbacks for use by the generic vsock_dgram_recvmsg(), such as for
>>> parsing skbs for CID/port which vary in format per transport.
>>>
>>> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
>>> ---
>>> drivers/vhost/vsock.c | 4 +-
>>> include/linux/virtio_vsock.h | 3 ++
>>> include/net/af_vsock.h | 13 ++++++-
>>> net/vmw_vsock/af_vsock.c | 51 ++++++++++++++++++++++++-
>>> net/vmw_vsock/hyperv_transport.c | 17 +++++++--
>>> net/vmw_vsock/virtio_transport.c | 4 +-
>>> net/vmw_vsock/virtio_transport_common.c | 18 +++++++++
>>> net/vmw_vsock/vmci_transport.c | 68 +++++++++++++--------------------
>>> net/vmw_vsock/vsock_loopback.c | 4 +-
>>> 9 files changed, 132 insertions(+), 50 deletions(-)
>>>
>>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>>> index 6578db78f0ae..c8201c070b4b 100644
>>> --- a/drivers/vhost/vsock.c
>>> +++ b/drivers/vhost/vsock.c
>>> @@ -410,9 +410,11 @@ static struct virtio_transport vhost_transport = {
>>> .cancel_pkt = vhost_transport_cancel_pkt,
>>>
>>> .dgram_enqueue = virtio_transport_dgram_enqueue,
>>> - .dgram_dequeue = virtio_transport_dgram_dequeue,
>>> .dgram_bind = virtio_transport_dgram_bind,
>>> .dgram_allow = virtio_transport_dgram_allow,
>>> + .dgram_get_cid = virtio_transport_dgram_get_cid,
>>> + .dgram_get_port = virtio_transport_dgram_get_port,
>>> + .dgram_get_length = virtio_transport_dgram_get_length,
>>>
>>> .stream_enqueue = virtio_transport_stream_enqueue,
>>> .stream_dequeue = virtio_transport_stream_dequeue,
>>> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>>> index c58453699ee9..23521a318cf0 100644
>>> --- a/include/linux/virtio_vsock.h
>>> +++ b/include/linux/virtio_vsock.h
>>> @@ -219,6 +219,9 @@ bool virtio_transport_stream_allow(u32 cid, u32 port);
>>> int virtio_transport_dgram_bind(struct vsock_sock *vsk,
>>> struct sockaddr_vm *addr);
>>> bool virtio_transport_dgram_allow(u32 cid, u32 port);
>>> +int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid);
>>> +int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port);
>>> +int virtio_transport_dgram_get_length(struct sk_buff *skb, size_t *len);
>>>
>>> int virtio_transport_connect(struct vsock_sock *vsk);
>>>
>>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>>> index 0e7504a42925..7bedb9ee7e3e 100644
>>> --- a/include/net/af_vsock.h
>>> +++ b/include/net/af_vsock.h
>>> @@ -120,11 +120,20 @@ struct vsock_transport {
>>>
>>> /* DGRAM. */
>>> int (*dgram_bind)(struct vsock_sock *, struct sockaddr_vm *);
>>> - int (*dgram_dequeue)(struct vsock_sock *vsk, struct msghdr *msg,
>>> - size_t len, int flags);
>>> int (*dgram_enqueue)(struct vsock_sock *, struct sockaddr_vm *,
>>> struct msghdr *, size_t len);
>>> bool (*dgram_allow)(u32 cid, u32 port);
>>> + int (*dgram_get_cid)(struct sk_buff *skb, unsigned int *cid);
>>> + int (*dgram_get_port)(struct sk_buff *skb, unsigned int *port);
>>> + int (*dgram_get_length)(struct sk_buff *skb, size_t *length);
>>> +
>>> + /* The number of bytes into the buffer at which the payload starts, as
>>> + * first seen by the receiving socket layer. For example, if the
>>> + * transport presets the skb pointers using skb_pull(sizeof(header))
>>> + * than this would be zero, otherwise it would be the size of the
>>> + * header.
>>> + */
>>> + const size_t dgram_payload_offset;
>>>
>>> /* STREAM. */
>>> /* TODO: stream_bind() */
>>> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>>> index efb8a0937a13..ffb4dd8b6ea7 100644
>>> --- a/net/vmw_vsock/af_vsock.c
>>> +++ b/net/vmw_vsock/af_vsock.c
>>> @@ -1271,11 +1271,15 @@ static int vsock_dgram_connect(struct socket *sock,
>>> int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
>>> size_t len, int flags)
>>> {
>>> + const struct vsock_transport *transport;
>>> #ifdef CONFIG_BPF_SYSCALL
>>> const struct proto *prot;
>>> #endif
>>> struct vsock_sock *vsk;
>>> + struct sk_buff *skb;
>>> + size_t payload_len;
>>> struct sock *sk;
>>> + int err;
>>>
>>> sk = sock->sk;
>>> vsk = vsock_sk(sk);
>>> @@ -1286,7 +1290,52 @@ int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
>>> return prot->recvmsg(sk, msg, len, flags, NULL);
>>> #endif
>>>
>>> - return vsk->transport->dgram_dequeue(vsk, msg, len, flags);
>>> + if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
>>> + return -EOPNOTSUPP;
>>> +
>>> + transport = vsk->transport;
>>> +
>>> + /* Retrieve the head sk_buff from the socket's receive queue. */
>>> + err = 0;
>>> + skb = skb_recv_datagram(sk_vsock(vsk), flags, &err);
>>> + if (!skb)
>>> + return err;
>>> +
>>> + err = transport->dgram_get_length(skb, &payload_len);
>
> What about ssize_t return value here?
>
> Or maybe a single callback that return both length and offset?
>
> .dgram_get_payload_info(skb, &payload_len, &payload_off)
Just architectural question:
May be we can avoid this callback for length? IIUC concept of skbuff is that
current level of network stack already have pointer to its data 'skb->data' and
length of the payload 'skb->len' (both are set by previous stack handler - transport in
this case), so here we can use just 'skb->len' and thats all. There is no need to ask
lower level of network stack for length of payload? I see that VMCI stores metadata
with payload in 'data' buffer, but may be it is more correctly to do 'skb_pull()'
in vmci before inserting skbuff to sockets queue? In this case field with dgram payload
offset could be removed from transport.
Thanks, Arseniy
>
>>> + if (err)
>>> + goto out;
>>> +
>>> + if (payload_len > len) {
>>> + payload_len = len;
>>> + msg->msg_flags |= MSG_TRUNC;
>>> + }
>>> +
>>> + /* Place the datagram payload in the user's iovec. */
>>> + err = skb_copy_datagram_msg(skb, transport->dgram_payload_offset, msg, payload_len);
>>> + if (err)
>>> + goto out;
>>> +
>>> + if (msg->msg_name) {
>>> + /* Provide the address of the sender. */
>>> + DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
>>> + unsigned int cid, port;
>>> +
>>> + err = transport->dgram_get_cid(skb, &cid);
>>> + if (err)
>>> + goto out;
>>> +
>>> + err = transport->dgram_get_port(skb, &port);
>>> + if (err)
>>> + goto out;
>>
>> Maybe we can merge 'dgram_get_cid' and 'dgram_get_port' to a single callback? Because I see that this is
>> the only place where both are used (correct me if i'm wrong) and logically both operates with addresses:
>> CID and port. E.g. something like that: dgram_get_cid_n_port().
>
> What about .dgram_addr_init(struct sk_buff *skb, struct sockaddr_vm *addr)
> and the transport can set cid and port?
>
>>
>> Moreover, I'm not sure, but is it good "tradeoff" here: remove transport specific callback for dgram receive
>> where we already have 'msghdr' with both data buffer and buffer for 'sockaddr_vm' and instead of it add new
>> several fields (callbacks) to transports like dgram_get_cid(), dgram_get_port()? I agree, that in each transport
>> specific callback we will have same copying logic by calling 'skb_copy_datagram_msg()' and filling address
>> by using 'vsock_addr_init()', but in this case we don't need to update transports too much. For example HyperV
>> still unchanged as it does not support SOCK_DGRAM. For VMCI You just need to add 'vsock_addr_init()' logic
>> to it's dgram dequeue callback.
>>
>> What do You think?
>
> Honestly, I'd rather avoid duplicate code than reduce changes in
> transports that don't support dgram.
>
> One thing I do agree on though is minimizing the number of callbacks
> to call to reduce the number of indirection (more performance?).
>
> Thanks,
> Stefano
>
>>
>> Thanks, Arseniy
>>
>>> +
>>> + vsock_addr_init(vm_addr, cid, port);
>>> + msg->msg_namelen = sizeof(*vm_addr);
>>> + }
>>> + err = payload_len;
>>> +
>>> +out:
>>> + skb_free_datagram(&vsk->sk, skb);
>>> + return err;
>>> }
>>> EXPORT_SYMBOL_GPL(vsock_dgram_recvmsg);
>>>
>>> diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
>>> index 7cb1a9d2cdb4..ff6e87e25fa0 100644
>>> --- a/net/vmw_vsock/hyperv_transport.c
>>> +++ b/net/vmw_vsock/hyperv_transport.c
>>> @@ -556,8 +556,17 @@ static int hvs_dgram_bind(struct vsock_sock *vsk, struct sockaddr_vm *addr)
>>> return -EOPNOTSUPP;
>>> }
>>>
>>> -static int hvs_dgram_dequeue(struct vsock_sock *vsk, struct msghdr *msg,
>>> - size_t len, int flags)
>>> +static int hvs_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
>>> +{
>>> + return -EOPNOTSUPP;
>>> +}
>>> +
>>> +static int hvs_dgram_get_port(struct sk_buff *skb, unsigned int *port)
>>> +{
>>> + return -EOPNOTSUPP;
>>> +}
>>> +
>>> +static int hvs_dgram_get_length(struct sk_buff *skb, size_t *len)
>>> {
>>> return -EOPNOTSUPP;
>>> }
>>> @@ -833,7 +842,9 @@ static struct vsock_transport hvs_transport = {
>>> .shutdown = hvs_shutdown,
>>>
>>> .dgram_bind = hvs_dgram_bind,
>>> - .dgram_dequeue = hvs_dgram_dequeue,
>>> + .dgram_get_cid = hvs_dgram_get_cid,
>>> + .dgram_get_port = hvs_dgram_get_port,
>>> + .dgram_get_length = hvs_dgram_get_length,
>>> .dgram_enqueue = hvs_dgram_enqueue,
>>> .dgram_allow = hvs_dgram_allow,
>>>
>>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>>> index e95df847176b..5763cdf13804 100644
>>> --- a/net/vmw_vsock/virtio_transport.c
>>> +++ b/net/vmw_vsock/virtio_transport.c
>>> @@ -429,9 +429,11 @@ static struct virtio_transport virtio_transport = {
>>> .cancel_pkt = virtio_transport_cancel_pkt,
>>>
>>> .dgram_bind = virtio_transport_dgram_bind,
>>> - .dgram_dequeue = virtio_transport_dgram_dequeue,
>>> .dgram_enqueue = virtio_transport_dgram_enqueue,
>>> .dgram_allow = virtio_transport_dgram_allow,
>>> + .dgram_get_cid = virtio_transport_dgram_get_cid,
>>> + .dgram_get_port = virtio_transport_dgram_get_port,
>>> + .dgram_get_length = virtio_transport_dgram_get_length,
>>>
>>> .stream_dequeue = virtio_transport_stream_dequeue,
>>> .stream_enqueue = virtio_transport_stream_enqueue,
>>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>>> index b769fc258931..e6903c719964 100644
>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>> @@ -797,6 +797,24 @@ int virtio_transport_dgram_bind(struct vsock_sock *vsk,
>>> }
>>> EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
>>>
>>> +int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
>>> +{
>>> + return -EOPNOTSUPP;
>>> +}
>>> +EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_cid);
>>> +
>>> +int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port)
>>> +{
>>> + return -EOPNOTSUPP;
>>> +}
>>> +EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_port);
>>> +
>>> +int virtio_transport_dgram_get_length(struct sk_buff *skb, size_t *len)
>>> +{
>>> + return -EOPNOTSUPP;
>>> +}
>>> +EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_length);
>>> +
>>> bool virtio_transport_dgram_allow(u32 cid, u32 port)
>>> {
>>> return false;
>>> diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
>>> index b370070194fa..bbc63826bf48 100644
>>> --- a/net/vmw_vsock/vmci_transport.c
>>> +++ b/net/vmw_vsock/vmci_transport.c
>>> @@ -1731,57 +1731,40 @@ static int vmci_transport_dgram_enqueue(
>>> return err - sizeof(*dg);
>>> }
>>>
>>> -static int vmci_transport_dgram_dequeue(struct vsock_sock *vsk,
>>> - struct msghdr *msg, size_t len,
>>> - int flags)
>>> +static int vmci_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
>>> {
>>> - int err;
>>> struct vmci_datagram *dg;
>>> - size_t payload_len;
>>> - struct sk_buff *skb;
>>>
>>> - if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
>>> - return -EOPNOTSUPP;
>>> + dg = (struct vmci_datagram *)skb->data;
>>> + if (!dg)
>>> + return -EINVAL;
>>>
>>> - /* Retrieve the head sk_buff from the socket's receive queue. */
>>> - err = 0;
>>> - skb = skb_recv_datagram(&vsk->sk, flags, &err);
>>> - if (!skb)
>>> - return err;
>>> + *cid = dg->src.context;
>>> + return 0;
>>> +}
>>> +
>>> +static int vmci_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port)
>>> +{
>>> + struct vmci_datagram *dg;
>>>
>>> dg = (struct vmci_datagram *)skb->data;
>>> if (!dg)
>>> - /* err is 0, meaning we read zero bytes. */
>>> - goto out;
>>> -
>>> - payload_len = dg->payload_size;
>>> - /* Ensure the sk_buff matches the payload size claimed in the packet. */
>>> - if (payload_len != skb->len - sizeof(*dg)) {
>>> - err = -EINVAL;
>>> - goto out;
>>> - }
>>> + return -EINVAL;
>>>
>>> - if (payload_len > len) {
>>> - payload_len = len;
>>> - msg->msg_flags |= MSG_TRUNC;
>>> - }
>>> + *port = dg->src.resource;
>>> + return 0;
>>> +}
>>>
>>> - /* Place the datagram payload in the user's iovec. */
>>> - err = skb_copy_datagram_msg(skb, sizeof(*dg), msg, payload_len);
>>> - if (err)
>>> - goto out;
>>> +static int vmci_transport_dgram_get_length(struct sk_buff *skb, size_t *len)
>>> +{
>>> + struct vmci_datagram *dg;
>>>
>>> - if (msg->msg_name) {
>>> - /* Provide the address of the sender. */
>>> - DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
>>> - vsock_addr_init(vm_addr, dg->src.context, dg->src.resource);
>>> - msg->msg_namelen = sizeof(*vm_addr);
>>> - }
>>> - err = payload_len;
>>> + dg = (struct vmci_datagram *)skb->data;
>>> + if (!dg)
>>> + return -EINVAL;
>>>
>>> -out:
>>> - skb_free_datagram(&vsk->sk, skb);
>>> - return err;
>>> + *len = dg->payload_size;
>>> + return 0;
>>> }
>>>
>>> static bool vmci_transport_dgram_allow(u32 cid, u32 port)
>>> @@ -2040,9 +2023,12 @@ static struct vsock_transport vmci_transport = {
>>> .release = vmci_transport_release,
>>> .connect = vmci_transport_connect,
>>> .dgram_bind = vmci_transport_dgram_bind,
>>> - .dgram_dequeue = vmci_transport_dgram_dequeue,
>>> .dgram_enqueue = vmci_transport_dgram_enqueue,
>>> .dgram_allow = vmci_transport_dgram_allow,
>>> + .dgram_get_cid = vmci_transport_dgram_get_cid,
>>> + .dgram_get_port = vmci_transport_dgram_get_port,
>>> + .dgram_get_length = vmci_transport_dgram_get_length,
>>> + .dgram_payload_offset = sizeof(struct vmci_datagram),
>>> .stream_dequeue = vmci_transport_stream_dequeue,
>>> .stream_enqueue = vmci_transport_stream_enqueue,
>>> .stream_has_data = vmci_transport_stream_has_data,
>>> diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
>>> index 5c6360df1f31..2f3cabc79ee5 100644
>>> --- a/net/vmw_vsock/vsock_loopback.c
>>> +++ b/net/vmw_vsock/vsock_loopback.c
>>> @@ -62,9 +62,11 @@ static struct virtio_transport loopback_transport = {
>>> .cancel_pkt = vsock_loopback_cancel_pkt,
>>>
>>> .dgram_bind = virtio_transport_dgram_bind,
>>> - .dgram_dequeue = virtio_transport_dgram_dequeue,
>>> .dgram_enqueue = virtio_transport_dgram_enqueue,
>>> .dgram_allow = virtio_transport_dgram_allow,
>>> + .dgram_get_cid = virtio_transport_dgram_get_cid,
>>> + .dgram_get_port = virtio_transport_dgram_get_port,
>>> + .dgram_get_length = virtio_transport_dgram_get_length,
>>>
>>> .stream_dequeue = virtio_transport_stream_dequeue,
>>> .stream_enqueue = virtio_transport_stream_enqueue,
>>>
>>
>
^ permalink raw reply
* [PATCH] Drivers: hv: Change hv_free_hyperv_page() to take void * argument
From: Kameron Carr @ 2023-06-22 20:07 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, arnd, linux-hyperv, linux-kernel,
linux-arch
Currently hv_free_hyperv_page() takes an unsigned long argument, which
is inconsistent with the void * return value from the corresponding
hv_alloc_hyperv_page() function and variants. This creates unnecessary
extra casting.
Change the hv_free_hyperv_page() argument type to void *.
Also remove redundant casts from invocations of
hv_alloc_hyperv_page() and variants.
---
drivers/hv/connection.c | 13 ++++++-------
drivers/hv/hv_common.c | 10 +++++-----
include/asm-generic/mshyperv.h | 2 +-
3 files changed, 12 insertions(+), 13 deletions(-)
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 5978e9d..ebf15f3 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -209,8 +209,7 @@ int vmbus_connect(void)
* Setup the vmbus event connection for channel interrupt
* abstraction stuff
*/
- vmbus_connection.int_page =
- (void *)hv_alloc_hyperv_zeroed_page();
+ vmbus_connection.int_page = hv_alloc_hyperv_zeroed_page();
if (vmbus_connection.int_page == NULL) {
ret = -ENOMEM;
goto cleanup;
@@ -225,8 +224,8 @@ int vmbus_connect(void)
* Setup the monitor notification facility. The 1st page for
* parent->child and the 2nd page for child->parent
*/
- vmbus_connection.monitor_pages[0] = (void *)hv_alloc_hyperv_page();
- vmbus_connection.monitor_pages[1] = (void *)hv_alloc_hyperv_page();
+ vmbus_connection.monitor_pages[0] = hv_alloc_hyperv_page();
+ vmbus_connection.monitor_pages[1] = hv_alloc_hyperv_page();
if ((vmbus_connection.monitor_pages[0] == NULL) ||
(vmbus_connection.monitor_pages[1] == NULL)) {
ret = -ENOMEM;
@@ -333,15 +332,15 @@ void vmbus_disconnect(void)
destroy_workqueue(vmbus_connection.work_queue);
if (vmbus_connection.int_page) {
- hv_free_hyperv_page((unsigned long)vmbus_connection.int_page);
+ hv_free_hyperv_page(vmbus_connection.int_page);
vmbus_connection.int_page = NULL;
}
set_memory_encrypted((unsigned long)vmbus_connection.monitor_pages[0], 1);
set_memory_encrypted((unsigned long)vmbus_connection.monitor_pages[1], 1);
- hv_free_hyperv_page((unsigned long)vmbus_connection.monitor_pages[0]);
- hv_free_hyperv_page((unsigned long)vmbus_connection.monitor_pages[1]);
+ hv_free_hyperv_page(vmbus_connection.monitor_pages[0]);
+ hv_free_hyperv_page(vmbus_connection.monitor_pages[1]);
vmbus_connection.monitor_pages[0] = NULL;
vmbus_connection.monitor_pages[1] = NULL;
}
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 542a1d5..6a2258f 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -115,12 +115,12 @@ void *hv_alloc_hyperv_zeroed_page(void)
}
EXPORT_SYMBOL_GPL(hv_alloc_hyperv_zeroed_page);
-void hv_free_hyperv_page(unsigned long addr)
+void hv_free_hyperv_page(void *addr)
{
if (PAGE_SIZE == HV_HYP_PAGE_SIZE)
- free_page(addr);
+ free_page((unsigned long)addr);
else
- kfree((void *)addr);
+ kfree(addr);
}
EXPORT_SYMBOL_GPL(hv_free_hyperv_page);
@@ -253,7 +253,7 @@ static void hv_kmsg_dump_unregister(void)
atomic_notifier_chain_unregister(&panic_notifier_list,
&hyperv_panic_report_block);
- hv_free_hyperv_page((unsigned long)hv_panic_page);
+ hv_free_hyperv_page(hv_panic_page);
hv_panic_page = NULL;
}
@@ -270,7 +270,7 @@ static void hv_kmsg_dump_register(void)
ret = kmsg_dump_register(&hv_kmsg_dumper);
if (ret) {
pr_err("Hyper-V: kmsg dump register error 0x%x\n", ret);
- hv_free_hyperv_page((unsigned long)hv_panic_page);
+ hv_free_hyperv_page(hv_panic_page);
hv_panic_page = NULL;
}
}
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index 402a8c1..a8f4b65 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -190,7 +190,7 @@ static inline void vmbus_signal_eom(struct hv_message *msg, u32 old_msg_type)
void *hv_alloc_hyperv_page(void);
void *hv_alloc_hyperv_zeroed_page(void);
-void hv_free_hyperv_page(unsigned long addr);
+void hv_free_hyperv_page(void *addr);
/**
* hv_cpu_number_to_vp_number() - Map CPU to VP.
--
1.8.3.1
^ permalink raw reply related
* RE: [PATCH v2] net: mana: Batch ringing RX queue doorbell on receiving packets
From: Haiyang Zhang @ 2023-06-22 21:27 UTC (permalink / raw)
To: longli@linuxonhyperv.com, Jason Gunthorpe, Leon Romanovsky,
Ajay Sharma, Dexuan Cui, KY Srinivasan, Wei Liu, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Long Li,
stable@vger.kernel.org
In-Reply-To: <1687450956-6407-1-git-send-email-longli@linuxonhyperv.com>
> -----Original Message-----
> From: longli@linuxonhyperv.com <longli@linuxonhyperv.com>
> Sent: Thursday, June 22, 2023 12:23 PM
> To: Jason Gunthorpe <jgg@ziepe.ca>; Leon Romanovsky <leon@kernel.org>;
> Ajay Sharma <sharmaajay@microsoft.com>; Dexuan Cui
> <decui@microsoft.com>; KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; Wei Liu <wei.liu@kernel.org>; David S. Miller
> <davem@davemloft.net>; Eric Dumazet <edumazet@google.com>; Jakub
> Kicinski <kuba@kernel.org>; Paolo Abeni <pabeni@redhat.com>
> Cc: linux-rdma@vger.kernel.org; linux-hyperv@vger.kernel.org;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Long Li
> <longli@microsoft.com>; stable@vger.kernel.org
> Subject: [PATCH v2] net: mana: Batch ringing RX queue doorbell on receiving
> packets
>
> From: Long Li <longli@microsoft.com>
>
> It's inefficient to ring the doorbell page every time a WQE is posted to
> the received queue.
>
> Move the code for ringing doorbell page to where after we have posted all
> WQEs to the receive queue during a callback from napi_poll().
>
> Tests showed no regression in network latency benchmarks.
>
> Cc: stable@vger.kernel.org
> Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network
> Adapter (MANA)")
> Signed-off-by: Long Li <longli@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
^ permalink raw reply
* Re: [PATCH RFC net-next v4 7/8] vsock: Add lockless sendmsg() support
From: Bobby Eshleman @ 2023-06-22 22:57 UTC (permalink / raw)
To: Stefano Garzarella
Cc: Bobby Eshleman, Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang,
Xuan Zhuo, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Bryan Tan, Vishnu Dasa, VMware PV-Drivers Reviewers,
Dan Carpenter, Simon Horman, Krasnov Arseniy, kvm, virtualization,
netdev, linux-kernel, linux-hyperv, bpf
In-Reply-To: <6aif4uoucg6fhqwg2fmx76jkt6542dt7cqsxrtnebpboihfjeb@akpxj3yd2xle>
On Thu, Jun 22, 2023 at 06:37:21PM +0200, Stefano Garzarella wrote:
> On Sat, Jun 10, 2023 at 12:58:34AM +0000, Bobby Eshleman wrote:
> > Because the dgram sendmsg() path for AF_VSOCK acquires the socket lock
> > it does not scale when many senders share a socket.
> >
> > Prior to this patch the socket lock is used to protect both reads and
> > writes to the local_addr, remote_addr, transport, and buffer size
> > variables of a vsock socket. What follows are the new protection schemes
> > for these fields that ensure a race-free and usually lock-free
> > multi-sender sendmsg() path for vsock dgrams.
> >
> > - local_addr
> > local_addr changes as a result of binding a socket. The write path
> > for local_addr is bind() and various vsock_auto_bind() call sites.
> > After a socket has been bound via vsock_auto_bind() or bind(), subsequent
> > calls to bind()/vsock_auto_bind() do not write to local_addr again. bind()
> > rejects the user request and vsock_auto_bind() early exits.
> > Therefore, the local addr can not change while a parallel thread is
> > in sendmsg() and lock-free reads of local addr in sendmsg() are safe.
> > Change: only acquire lock for auto-binding as-needed in sendmsg().
> >
> > - buffer size variables
> > Not used by dgram, so they do not need protection. No change.
> >
> > - remote_addr and transport
> > Because a remote_addr update may result in a changed transport, but we
> > would like to be able to read these two fields lock-free but coherently
> > in the vsock send path, this patch packages these two fields into a new
> > struct vsock_remote_info that is referenced by an RCU-protected pointer.
> >
> > Writes are synchronized as usual by the socket lock. Reads only take
> > place in RCU read-side critical sections. When remote_addr or transport
> > is updated, a new remote info is allocated. Old readers still see the
> > old coherent remote_addr/transport pair, and new readers will refer to
> > the new coherent. The coherency between remote_addr and transport
> > previously provided by the socket lock alone is now also preserved by
> > RCU, except with the highly-scalable lock-free read-side.
> >
> > Helpers are introduced for accessing and updating the new pointer.
> >
> > The new structure is contains an rcu_head so that kfree_rcu() can be
> > used. This removes the need of writers to use synchronize_rcu() after
> > freeing old structures which is simply more efficient and reduces code
> > churn where remote_addr/transport are already being updated inside RCU
> > read-side sections.
> >
> > Only virtio has been tested, but updates were necessary to the VMCI and
> > hyperv code. Unfortunately the author does not have access to
> > VMCI/hyperv systems so those changes are untested.
>
> @Dexuan, @Vishnu, @Bryan, can you test this?
>
> >
> > Perf Tests (results from patch v2)
> > vCPUS: 16
> > Threads: 16
> > Payload: 4KB
> > Test Runs: 5
> > Type: SOCK_DGRAM
> >
> > Before: 245.2 MB/s
> > After: 509.2 MB/s (+107%)
> >
> > Notably, on the same test system, vsock dgram even outperforms
> > multi-threaded UDP over virtio-net with vhost and MQ support enabled.
> >
> > Throughput metrics for single-threaded SOCK_DGRAM and
> > single/multi-threaded SOCK_STREAM showed no statistically signficant
> > throughput changes (lowest p-value reaching 0.27), with the range of the
> > mean difference ranging between -5% to +1%.
> >
>
> Quite nice. Did you see any improvements also on stream/seqpacket
> sockets?
>
The change seemed to be null for stream sockets. I assumed the same
would be for seqpacket too, but I'll run some numbers there too for the
next revision.
> However this is a big change, maybe I would move it to another series,
> because it takes time to be reviewed and tested properly.
>
> WDYT?
>
Sounds good to me, I'll lop it off and resend on its own.
> Thanks,
> Stefano
>
Thanks!
Bobby
^ permalink raw reply
* Re: [PATCH RFC net-next v4 7/8] vsock: Add lockless sendmsg() support
From: Bobby Eshleman @ 2023-06-22 22:59 UTC (permalink / raw)
To: Simon Horman
Cc: Bobby Eshleman, Stefan Hajnoczi, Stefano Garzarella,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
VMware PV-Drivers Reviewers, Dan Carpenter, Krasnov Arseniy, kvm,
virtualization, netdev, linux-kernel, linux-hyperv, bpf
In-Reply-To: <ZIbq/CeWowEq+nvg@corigine.com>
On Mon, Jun 12, 2023 at 11:53:00AM +0200, Simon Horman wrote:
> On Sat, Jun 10, 2023 at 12:58:34AM +0000, Bobby Eshleman wrote:
> > Because the dgram sendmsg() path for AF_VSOCK acquires the socket lock
> > it does not scale when many senders share a socket.
> >
> > Prior to this patch the socket lock is used to protect both reads and
> > writes to the local_addr, remote_addr, transport, and buffer size
> > variables of a vsock socket. What follows are the new protection schemes
> > for these fields that ensure a race-free and usually lock-free
> > multi-sender sendmsg() path for vsock dgrams.
> >
> > - local_addr
> > local_addr changes as a result of binding a socket. The write path
> > for local_addr is bind() and various vsock_auto_bind() call sites.
> > After a socket has been bound via vsock_auto_bind() or bind(), subsequent
> > calls to bind()/vsock_auto_bind() do not write to local_addr again. bind()
> > rejects the user request and vsock_auto_bind() early exits.
> > Therefore, the local addr can not change while a parallel thread is
> > in sendmsg() and lock-free reads of local addr in sendmsg() are safe.
> > Change: only acquire lock for auto-binding as-needed in sendmsg().
> >
> > - buffer size variables
> > Not used by dgram, so they do not need protection. No change.
> >
> > - remote_addr and transport
> > Because a remote_addr update may result in a changed transport, but we
> > would like to be able to read these two fields lock-free but coherently
> > in the vsock send path, this patch packages these two fields into a new
> > struct vsock_remote_info that is referenced by an RCU-protected pointer.
> >
> > Writes are synchronized as usual by the socket lock. Reads only take
> > place in RCU read-side critical sections. When remote_addr or transport
> > is updated, a new remote info is allocated. Old readers still see the
> > old coherent remote_addr/transport pair, and new readers will refer to
> > the new coherent. The coherency between remote_addr and transport
> > previously provided by the socket lock alone is now also preserved by
> > RCU, except with the highly-scalable lock-free read-side.
> >
> > Helpers are introduced for accessing and updating the new pointer.
> >
> > The new structure is contains an rcu_head so that kfree_rcu() can be
> > used. This removes the need of writers to use synchronize_rcu() after
> > freeing old structures which is simply more efficient and reduces code
> > churn where remote_addr/transport are already being updated inside RCU
> > read-side sections.
> >
> > Only virtio has been tested, but updates were necessary to the VMCI and
> > hyperv code. Unfortunately the author does not have access to
> > VMCI/hyperv systems so those changes are untested.
> >
> > Perf Tests (results from patch v2)
> > vCPUS: 16
> > Threads: 16
> > Payload: 4KB
> > Test Runs: 5
> > Type: SOCK_DGRAM
> >
> > Before: 245.2 MB/s
> > After: 509.2 MB/s (+107%)
> >
> > Notably, on the same test system, vsock dgram even outperforms
> > multi-threaded UDP over virtio-net with vhost and MQ support enabled.
> >
> > Throughput metrics for single-threaded SOCK_DGRAM and
> > single/multi-threaded SOCK_STREAM showed no statistically signficant
>
> Hi Bobby,
>
> a minor nit from checkpatch --codespell: signficant -> significant
>
Thanks! I'll fix that and add that argument to my pre-commit hook.
> > throughput changes (lowest p-value reaching 0.27), with the range of the
> > mean difference ranging between -5% to +1%.
> >
> > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
>
> ...
^ permalink raw reply
* Re: [PATCH RFC net-next v4 4/8] vsock: make vsock bind reusable
From: Bobby Eshleman @ 2023-06-22 23:00 UTC (permalink / raw)
To: Simon Horman
Cc: Bobby Eshleman, Stefan Hajnoczi, Stefano Garzarella,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
VMware PV-Drivers Reviewers, Dan Carpenter, Krasnov Arseniy, kvm,
virtualization, netdev, linux-kernel, linux-hyperv, bpf
In-Reply-To: <ZIbqRXYMbYsdq874@corigine.com>
On Mon, Jun 12, 2023 at 11:49:57AM +0200, Simon Horman wrote:
> On Sat, Jun 10, 2023 at 12:58:31AM +0000, Bobby Eshleman wrote:
> > This commit makes the bind table management functions in vsock usable
> > for different bind tables. For use by datagrams in a future patch.
> >
> > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> > ---
> > net/vmw_vsock/af_vsock.c | 33 ++++++++++++++++++++++++++-------
> > 1 file changed, 26 insertions(+), 7 deletions(-)
> >
> > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> > index ef86765f3765..7a3ca4270446 100644
> > --- a/net/vmw_vsock/af_vsock.c
> > +++ b/net/vmw_vsock/af_vsock.c
> > @@ -230,11 +230,12 @@ static void __vsock_remove_connected(struct vsock_sock *vsk)
> > sock_put(&vsk->sk);
> > }
> >
> > -static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr)
> > +struct sock *vsock_find_bound_socket_common(struct sockaddr_vm *addr,
> > + struct list_head *bind_table)
>
> Hi Bobby,
>
> This function seems to only be used in this file.
> Should it be static?
Oh good call, yep.
Thanks!
Bobby
^ permalink raw reply
* Re: [PATCH RFC net-next v4 4/8] vsock: make vsock bind reusable
From: Bobby Eshleman @ 2023-06-22 23:05 UTC (permalink / raw)
To: Stefano Garzarella
Cc: Bobby Eshleman, Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang,
Xuan Zhuo, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Bryan Tan, Vishnu Dasa, VMware PV-Drivers Reviewers,
Dan Carpenter, Simon Horman, Krasnov Arseniy, kvm, virtualization,
netdev, linux-kernel, linux-hyperv, bpf
In-Reply-To: <p2tgn3wczd3t3dodyicczetr2nqnqpwcadz6ql5hpvg2cd2dxa@phheksxhxfna>
On Thu, Jun 22, 2023 at 05:25:55PM +0200, Stefano Garzarella wrote:
> On Sat, Jun 10, 2023 at 12:58:31AM +0000, Bobby Eshleman wrote:
> > This commit makes the bind table management functions in vsock usable
> > for different bind tables. For use by datagrams in a future patch.
> >
> > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> > ---
> > net/vmw_vsock/af_vsock.c | 33 ++++++++++++++++++++++++++-------
> > 1 file changed, 26 insertions(+), 7 deletions(-)
> >
> > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> > index ef86765f3765..7a3ca4270446 100644
> > --- a/net/vmw_vsock/af_vsock.c
> > +++ b/net/vmw_vsock/af_vsock.c
> > @@ -230,11 +230,12 @@ static void __vsock_remove_connected(struct vsock_sock *vsk)
> > sock_put(&vsk->sk);
> > }
> >
> > -static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr)
> > +struct sock *vsock_find_bound_socket_common(struct sockaddr_vm *addr,
> > + struct list_head *bind_table)
> > {
> > struct vsock_sock *vsk;
> >
> > - list_for_each_entry(vsk, vsock_bound_sockets(addr), bound_table) {
> > + list_for_each_entry(vsk, bind_table, bound_table) {
> > if (vsock_addr_equals_addr(addr, &vsk->local_addr))
> > return sk_vsock(vsk);
> >
> > @@ -247,6 +248,11 @@ static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr)
> > return NULL;
> > }
> >
> > +static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr)
> > +{
> > + return vsock_find_bound_socket_common(addr, vsock_bound_sockets(addr));
> > +}
> > +
> > static struct sock *__vsock_find_connected_socket(struct sockaddr_vm *src,
> > struct sockaddr_vm *dst)
> > {
> > @@ -646,12 +652,17 @@ static void vsock_pending_work(struct work_struct *work)
> >
> > /**** SOCKET OPERATIONS ****/
> >
> > -static int __vsock_bind_connectible(struct vsock_sock *vsk,
> > - struct sockaddr_vm *addr)
> > +static int vsock_bind_common(struct vsock_sock *vsk,
> > + struct sockaddr_vm *addr,
> > + struct list_head *bind_table,
> > + size_t table_size)
> > {
> > static u32 port;
> > struct sockaddr_vm new_addr;
> >
> > + if (table_size < VSOCK_HASH_SIZE)
> > + return -1;
>
> Why we need this check now?
>
If the table_size is not at least VSOCK_HASH_SIZE then the
VSOCK_HASH(addr) used later could overflow the table.
Maybe this really deserves a WARN() and a comment?
> > +
> > if (!port)
> > port = get_random_u32_above(LAST_RESERVED_PORT);
> >
> > @@ -667,7 +678,8 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
> >
> > new_addr.svm_port = port++;
> >
> > - if (!__vsock_find_bound_socket(&new_addr)) {
> > + if (!vsock_find_bound_socket_common(&new_addr,
> > + &bind_table[VSOCK_HASH(addr)])) {
> > found = true;
> > break;
> > }
> > @@ -684,7 +696,8 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
> > return -EACCES;
> > }
> >
> > - if (__vsock_find_bound_socket(&new_addr))
> > + if (vsock_find_bound_socket_common(&new_addr,
> > + &bind_table[VSOCK_HASH(addr)]))
> > return -EADDRINUSE;
> > }
> >
> > @@ -696,11 +709,17 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
> > * by AF_UNIX.
> > */
> > __vsock_remove_bound(vsk);
> > - __vsock_insert_bound(vsock_bound_sockets(&vsk->local_addr), vsk);
> > + __vsock_insert_bound(&bind_table[VSOCK_HASH(&vsk->local_addr)], vsk);
> >
> > return 0;
> > }
> >
> > +static int __vsock_bind_connectible(struct vsock_sock *vsk,
> > + struct sockaddr_vm *addr)
> > +{
> > + return vsock_bind_common(vsk, addr, vsock_bind_table, VSOCK_HASH_SIZE + 1);
> > +}
> > +
> > static int __vsock_bind_dgram(struct vsock_sock *vsk,
> > struct sockaddr_vm *addr)
> > {
> >
> > --
> > 2.30.2
> >
>
> The rest seems okay to me, but I agree with Simon's suggestion.
>
> Stefano
>
Thanks,
Bobby
^ permalink raw reply
* Re: [PATCH RFC net-next v4 5/8] virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
From: Bobby Eshleman @ 2023-06-22 23:06 UTC (permalink / raw)
To: Stefano Garzarella
Cc: Bobby Eshleman, linux-hyperv, Stefan Hajnoczi, kvm,
Michael S. Tsirkin, VMware PV-Drivers Reviewers, Simon Horman,
virtualization, Eric Dumazet, Dan Carpenter, Xuan Zhuo, Wei Liu,
Dexuan Cui, Bryan Tan, Jakub Kicinski, Paolo Abeni, Haiyang Zhang,
Krasnov Arseniy, Vishnu Dasa, Jiang Wang, netdev, linux-kernel,
bpf, David S. Miller
In-Reply-To: <med476cdkdhkylddqa5wbhjpgyw2yiqfthvup2kics3zbb5vpb@ovzg57adewfw>
On Thu, Jun 22, 2023 at 05:29:08PM +0200, Stefano Garzarella wrote:
> On Sat, Jun 10, 2023 at 12:58:32AM +0000, Bobby Eshleman wrote:
> > This commit adds a feature bit for virtio vsock to support datagrams.
> >
> > Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
> > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> > ---
> > include/uapi/linux/virtio_vsock.h | 1 +
> > 1 file changed, 1 insertion(+)
>
> LGTM, but I'll give the R-b when we merge the virtio-spec.
>
> Stefano
>
Roger that.
> >
> > diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> > index 64738838bee5..9c25f267bbc0 100644
> > --- a/include/uapi/linux/virtio_vsock.h
> > +++ b/include/uapi/linux/virtio_vsock.h
> > @@ -40,6 +40,7 @@
> >
> > /* The feature bitmap for virtio vsock */
> > #define VIRTIO_VSOCK_F_SEQPACKET 1 /* SOCK_SEQPACKET supported */
> > +#define VIRTIO_VSOCK_F_DGRAM 3 /* SOCK_DGRAM supported */
> >
> > struct virtio_vsock_config {
> > __le64 guest_cid;
> >
> > --
> > 2.30.2
> >
>
> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH RFC net-next v4 8/8] tests: add vsock dgram tests
From: Bobby Eshleman @ 2023-06-22 23:16 UTC (permalink / raw)
To: Arseniy Krasnov
Cc: Bobby Eshleman, Stefan Hajnoczi, Stefano Garzarella,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
VMware PV-Drivers Reviewers, Dan Carpenter, Simon Horman, kvm,
virtualization, netdev, linux-kernel, linux-hyperv, bpf,
Jiang Wang
In-Reply-To: <484c7b6f-5ce5-ce43-2dfd-0ae3063f1e47@gmail.com>
On Sun, Jun 11, 2023 at 11:54:57PM +0300, Arseniy Krasnov wrote:
> Hello Bobby!
>
> Sorry, may be I become a little bit annoying:), but I tried to run vsock_test with
> this v4 version, and again get the same crash:
Haha not annoying at all. I appreciate the testing!
>
> # cat client.sh
> ./vsock_test --mode=client --control-host=192.168.1.1 --control-port=12345 --peer-cid=2
> # ./client.sh
> Control socket connected to 192.168.1.1:12345.
> 0 - SOCK_STREAM connection reset...[ 20.065237] BUG: kernel NULL pointer dereference, addre0
> [ 20.065895] #PF: supervisor read access in kernel mode
> [ 20.065895] #PF: error_code(0x0000) - not-present page
> [ 20.065895] PGD 0 P4D 0
> [ 20.065895] Oops: 0000 [#1] PREEMPT SMP PTI
> [ 20.065895] CPU: 0 PID: 111 Comm: vsock_test Not tainted 6.4.0-rc3-gefcccba07069 #385
> [ 20.065895] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd44
> [ 20.065895] RIP: 0010:static_key_count+0x0/0x20
> [ 20.065895] Code: 04 4c 8b 46 08 49 29 c0 4c 01 c8 4c 89 47 08 89 0e 89 56 04 48 89 46 08 f
> [ 20.065895] RSP: 0018:ffffbbb000223dc0 EFLAGS: 00010202
> [ 20.065895] RAX: ffffffff85709880 RBX: ffffffffc0079140 RCX: 0000000000000000
> [ 20.065895] RDX: ffff9f73c2175700 RSI: 0000000000000000 RDI: 0000000000000000
> [ 20.065895] RBP: ffff9f73c2385900 R08: ffffbbb000223d30 R09: ffff9f73ff896000
> [ 20.065895] R10: 0000000000001000 R11: 0000000000000000 R12: ffffbbb000223e80
> [ 20.065895] R13: 0000000000000000 R14: 0000000000000002 R15: ffff9f73c1cfaa80
> [ 20.065895] FS: 00007f1ad82f55c0(0000) GS:ffff9f73fe400000(0000) knlGS:0000000000000000
> [ 20.065895] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 20.065895] CR2: 0000000000000000 CR3: 000000003f954000 CR4: 00000000000006f0
> [ 20.065895] Call Trace:
> [ 20.065895] <TASK>
> [ 20.065895] once_deferred+0xd/0x30
> [ 20.065895] vsock_assign_transport+0x9a/0x1b0 [vsock]
> [ 20.065895] vsock_connect+0xb4/0x3a0 [vsock]
> [ 20.065895] ? var_wake_function+0x60/0x60
> [ 20.065895] __sys_connect+0x9e/0xd0
> [ 20.065895] ? _raw_spin_unlock_irq+0xe/0x30
> [ 20.065895] ? do_setitimer+0x128/0x1f0
> [ 20.065895] ? alarm_setitimer+0x4c/0x90
> [ 20.065895] ? fpregs_assert_state_consistent+0x1d/0x50
> [ 20.065895] ? exit_to_user_mode_prepare+0x36/0x130
> [ 20.065895] __x64_sys_connect+0x11/0x20
> [ 20.065895] do_syscall_64+0x3b/0xc0
> [ 20.065895] entry_SYSCALL_64_after_hwframe+0x4b/0xb5
> [ 20.065895] RIP: 0033:0x7f1ad822dd13
> [ 20.065895] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 64 8
> [ 20.065895] RSP: 002b:00007ffc513e3c98 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
> [ 20.065895] RAX: ffffffffffffffda RBX: 000055aed298e020 RCX: 00007f1ad822dd13
> [ 20.065895] RDX: 0000000000000010 RSI: 00007ffc513e3cb0 RDI: 0000000000000004
> [ 20.065895] RBP: 0000000000000004 R08: 000055aed32b2018 R09: 0000000000000000
> [ 20.065895] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> [ 20.065895] R13: 000055aed298acb1 R14: 00007ffc513e3cb0 R15: 00007ffc513e3d40
> [ 20.065895] </TASK>
> [ 20.065895] Modules linked in: vsock_loopback vhost_vsock vmw_vsock_virtio_transport vmw_vb
^ I'm guessing this is the difference between our setups. I have been
going all built-in, let me see if I can reproduce w/ modules...
> [ 20.065895] CR2: 0000000000000000
> [ 20.154060] ---[ end trace 0000000000000000 ]---
> [ 20.155519] RIP: 0010:static_key_count+0x0/0x20
> [ 20.156932] Code: 04 4c 8b 46 08 49 29 c0 4c 01 c8 4c 89 47 08 89 0e 89 56 04 48 89 46 08 f
> [ 20.161367] RSP: 0018:ffffbbb000223dc0 EFLAGS: 00010202
> [ 20.162613] RAX: ffffffff85709880 RBX: ffffffffc0079140 RCX: 0000000000000000
> [ 20.164262] RDX: ffff9f73c2175700 RSI: 0000000000000000 RDI: 0000000000000000
> [ 20.165934] RBP: ffff9f73c2385900 R08: ffffbbb000223d30 R09: ffff9f73ff896000
> [ 20.167684] R10: 0000000000001000 R11: 0000000000000000 R12: ffffbbb000223e80
> [ 20.169427] R13: 0000000000000000 R14: 0000000000000002 R15: ffff9f73c1cfaa80
> [ 20.171109] FS: 00007f1ad82f55c0(0000) GS:ffff9f73fe400000(0000) knlGS:0000000000000000
> [ 20.173000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 20.174381] CR2: 0000000000000000 CR3: 000000003f954000 CR4: 00000000000006f0
>
> So, what HEAD do You use? May be You have some specific config (I use x86-64 defconfig + vsock/vhost
> related things) ?
>
For this series I used net-next:
28cfea989d6f55c3d10608eba2a2bae609c5bf3e
> Thanks, Arseniy
>
As always, thanks for the bug finding! I'll report back when I
reproduce or with questions if I can't.
Best,
Bobby
>
> On 10.06.2023 03:58, Bobby Eshleman wrote:
> > From: Jiang Wang <jiang.wang@bytedance.com>
> >
> > This patch adds tests for vsock datagram.
> >
> > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> > Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
> > ---
> > tools/testing/vsock/util.c | 141 ++++++++++++-
> > tools/testing/vsock/util.h | 6 +
> > tools/testing/vsock/vsock_test.c | 432 +++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 578 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
> > index 01b636d3039a..811e70d7cf1e 100644
> > --- a/tools/testing/vsock/util.c
> > +++ b/tools/testing/vsock/util.c
> > @@ -99,7 +99,8 @@ static int vsock_connect(unsigned int cid, unsigned int port, int type)
> > int ret;
> > int fd;
> >
> > - control_expectln("LISTENING");
> > + if (type != SOCK_DGRAM)
> > + control_expectln("LISTENING");
> >
> > fd = socket(AF_VSOCK, type, 0);
> >
> > @@ -130,6 +131,11 @@ int vsock_seqpacket_connect(unsigned int cid, unsigned int port)
> > return vsock_connect(cid, port, SOCK_SEQPACKET);
> > }
> >
> > +int vsock_dgram_connect(unsigned int cid, unsigned int port)
> > +{
> > + return vsock_connect(cid, port, SOCK_DGRAM);
> > +}
> > +
> > /* Listen on <cid, port> and return the first incoming connection. The remote
> > * address is stored to clientaddrp. clientaddrp may be NULL.
> > */
> > @@ -211,6 +217,34 @@ int vsock_seqpacket_accept(unsigned int cid, unsigned int port,
> > return vsock_accept(cid, port, clientaddrp, SOCK_SEQPACKET);
> > }
> >
> > +int vsock_dgram_bind(unsigned int cid, unsigned int port)
> > +{
> > + union {
> > + struct sockaddr sa;
> > + struct sockaddr_vm svm;
> > + } addr = {
> > + .svm = {
> > + .svm_family = AF_VSOCK,
> > + .svm_port = port,
> > + .svm_cid = cid,
> > + },
> > + };
> > + int fd;
> > +
> > + fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> > + if (fd < 0) {
> > + perror("socket");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
> > + perror("bind");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + return fd;
> > +}
> > +
> > /* Transmit one byte and check the return value.
> > *
> > * expected_ret:
> > @@ -260,6 +294,57 @@ void send_byte(int fd, int expected_ret, int flags)
> > }
> > }
> >
> > +/* Transmit one byte and check the return value.
> > + *
> > + * expected_ret:
> > + * <0 Negative errno (for testing errors)
> > + * 0 End-of-file
> > + * 1 Success
> > + */
> > +void sendto_byte(int fd, const struct sockaddr *dest_addr, int len, int expected_ret,
> > + int flags)
> > +{
> > + const uint8_t byte = 'A';
> > + ssize_t nwritten;
> > +
> > + timeout_begin(TIMEOUT);
> > + do {
> > + nwritten = sendto(fd, &byte, sizeof(byte), flags, dest_addr,
> > + len);
> > + timeout_check("write");
> > + } while (nwritten < 0 && errno == EINTR);
> > + timeout_end();
> > +
> > + if (expected_ret < 0) {
> > + if (nwritten != -1) {
> > + fprintf(stderr, "bogus sendto(2) return value %zd\n",
> > + nwritten);
> > + exit(EXIT_FAILURE);
> > + }
> > + if (errno != -expected_ret) {
> > + perror("write");
> > + exit(EXIT_FAILURE);
> > + }
> > + return;
> > + }
> > +
> > + if (nwritten < 0) {
> > + perror("write");
> > + exit(EXIT_FAILURE);
> > + }
> > + if (nwritten == 0) {
> > + if (expected_ret == 0)
> > + return;
> > +
> > + fprintf(stderr, "unexpected EOF while sending byte\n");
> > + exit(EXIT_FAILURE);
> > + }
> > + if (nwritten != sizeof(byte)) {
> > + fprintf(stderr, "bogus sendto(2) return value %zd\n", nwritten);
> > + exit(EXIT_FAILURE);
> > + }
> > +}
> > +
> > /* Receive one byte and check the return value.
> > *
> > * expected_ret:
> > @@ -313,6 +398,60 @@ void recv_byte(int fd, int expected_ret, int flags)
> > }
> > }
> >
> > +/* Receive one byte and check the return value.
> > + *
> > + * expected_ret:
> > + * <0 Negative errno (for testing errors)
> > + * 0 End-of-file
> > + * 1 Success
> > + */
> > +void recvfrom_byte(int fd, struct sockaddr *src_addr, socklen_t *addrlen,
> > + int expected_ret, int flags)
> > +{
> > + uint8_t byte;
> > + ssize_t nread;
> > +
> > + timeout_begin(TIMEOUT);
> > + do {
> > + nread = recvfrom(fd, &byte, sizeof(byte), flags, src_addr, addrlen);
> > + timeout_check("read");
> > + } while (nread < 0 && errno == EINTR);
> > + timeout_end();
> > +
> > + if (expected_ret < 0) {
> > + if (nread != -1) {
> > + fprintf(stderr, "bogus recvfrom(2) return value %zd\n",
> > + nread);
> > + exit(EXIT_FAILURE);
> > + }
> > + if (errno != -expected_ret) {
> > + perror("read");
> > + exit(EXIT_FAILURE);
> > + }
> > + return;
> > + }
> > +
> > + if (nread < 0) {
> > + perror("read");
> > + exit(EXIT_FAILURE);
> > + }
> > + if (nread == 0) {
> > + if (expected_ret == 0)
> > + return;
> > +
> > + fprintf(stderr, "unexpected EOF while receiving byte\n");
> > + exit(EXIT_FAILURE);
> > + }
> > + if (nread != sizeof(byte)) {
> > + fprintf(stderr, "bogus recvfrom(2) return value %zd\n", nread);
> > + exit(EXIT_FAILURE);
> > + }
> > + if (byte != 'A') {
> > + fprintf(stderr, "unexpected byte read %c\n", byte);
> > + exit(EXIT_FAILURE);
> > + }
> > +}
> > +
> > /* Run test cases. The program terminates if a failure occurs. */
> > void run_tests(const struct test_case *test_cases,
> > const struct test_opts *opts)
> > diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
> > index fb99208a95ea..a69e128d120c 100644
> > --- a/tools/testing/vsock/util.h
> > +++ b/tools/testing/vsock/util.h
> > @@ -37,13 +37,19 @@ void init_signals(void);
> > unsigned int parse_cid(const char *str);
> > int vsock_stream_connect(unsigned int cid, unsigned int port);
> > int vsock_seqpacket_connect(unsigned int cid, unsigned int port);
> > +int vsock_dgram_connect(unsigned int cid, unsigned int port);
> > int vsock_stream_accept(unsigned int cid, unsigned int port,
> > struct sockaddr_vm *clientaddrp);
> > int vsock_seqpacket_accept(unsigned int cid, unsigned int port,
> > struct sockaddr_vm *clientaddrp);
> > +int vsock_dgram_bind(unsigned int cid, unsigned int port);
> > void vsock_wait_remote_close(int fd);
> > void send_byte(int fd, int expected_ret, int flags);
> > +void sendto_byte(int fd, const struct sockaddr *dest_addr, int len, int expected_ret,
> > + int flags);
> > void recv_byte(int fd, int expected_ret, int flags);
> > +void recvfrom_byte(int fd, struct sockaddr *src_addr, socklen_t *addrlen,
> > + int expected_ret, int flags);
> > void run_tests(const struct test_case *test_cases,
> > const struct test_opts *opts);
> > void list_tests(const struct test_case *test_cases);
> > diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
> > index ac1bd3ac1533..ded82d39ee5d 100644
> > --- a/tools/testing/vsock/vsock_test.c
> > +++ b/tools/testing/vsock/vsock_test.c
> > @@ -1053,6 +1053,413 @@ static void test_stream_virtio_skb_merge_server(const struct test_opts *opts)
> > close(fd);
> > }
> >
> > +static void test_dgram_sendto_client(const struct test_opts *opts)
> > +{
> > + union {
> > + struct sockaddr sa;
> > + struct sockaddr_vm svm;
> > + } addr = {
> > + .svm = {
> > + .svm_family = AF_VSOCK,
> > + .svm_port = 1234,
> > + .svm_cid = opts->peer_cid,
> > + },
> > + };
> > + int fd;
> > +
> > + /* Wait for the server to be ready */
> > + control_expectln("BIND");
> > +
> > + fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> > + if (fd < 0) {
> > + perror("socket");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + sendto_byte(fd, &addr.sa, sizeof(addr.svm), 1, 0);
> > +
> > + /* Notify the server that the client has finished */
> > + control_writeln("DONE");
> > +
> > + close(fd);
> > +}
> > +
> > +static void test_dgram_sendto_server(const struct test_opts *opts)
> > +{
> > + union {
> > + struct sockaddr sa;
> > + struct sockaddr_vm svm;
> > + } addr = {
> > + .svm = {
> > + .svm_family = AF_VSOCK,
> > + .svm_port = 1234,
> > + .svm_cid = VMADDR_CID_ANY,
> > + },
> > + };
> > + int len = sizeof(addr.sa);
> > + int fd;
> > +
> > + fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> > + if (fd < 0) {
> > + perror("socket");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
> > + perror("bind");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + /* Notify the client that the server is ready */
> > + control_writeln("BIND");
> > +
> > + recvfrom_byte(fd, &addr.sa, &len, 1, 0);
> > +
> > + /* Wait for the client to finish */
> > + control_expectln("DONE");
> > +
> > + close(fd);
> > +}
> > +
> > +static void test_dgram_connect_client(const struct test_opts *opts)
> > +{
> > + union {
> > + struct sockaddr sa;
> > + struct sockaddr_vm svm;
> > + } addr = {
> > + .svm = {
> > + .svm_family = AF_VSOCK,
> > + .svm_port = 1234,
> > + .svm_cid = opts->peer_cid,
> > + },
> > + };
> > + int ret;
> > + int fd;
> > +
> > + /* Wait for the server to be ready */
> > + control_expectln("BIND");
> > +
> > + fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> > + if (fd < 0) {
> > + perror("bind");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + ret = connect(fd, &addr.sa, sizeof(addr.svm));
> > + if (ret < 0) {
> > + perror("connect");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + send_byte(fd, 1, 0);
> > +
> > + /* Notify the server that the client has finished */
> > + control_writeln("DONE");
> > +
> > + close(fd);
> > +}
> > +
> > +static void test_dgram_connect_server(const struct test_opts *opts)
> > +{
> > + test_dgram_sendto_server(opts);
> > +}
> > +
> > +static void test_dgram_multiconn_sendto_client(const struct test_opts *opts)
> > +{
> > + union {
> > + struct sockaddr sa;
> > + struct sockaddr_vm svm;
> > + } addr = {
> > + .svm = {
> > + .svm_family = AF_VSOCK,
> > + .svm_port = 1234,
> > + .svm_cid = opts->peer_cid,
> > + },
> > + };
> > + int fds[MULTICONN_NFDS];
> > + int i;
> > +
> > + /* Wait for the server to be ready */
> > + control_expectln("BIND");
> > +
> > + for (i = 0; i < MULTICONN_NFDS; i++) {
> > + fds[i] = socket(AF_VSOCK, SOCK_DGRAM, 0);
> > + if (fds[i] < 0) {
> > + perror("socket");
> > + exit(EXIT_FAILURE);
> > + }
> > + }
> > +
> > + for (i = 0; i < MULTICONN_NFDS; i++)
> > + sendto_byte(fds[i], &addr.sa, sizeof(addr.svm), 1, 0);
> > +
> > + /* Notify the server that the client has finished */
> > + control_writeln("DONE");
> > +
> > + for (i = 0; i < MULTICONN_NFDS; i++)
> > + close(fds[i]);
> > +}
> > +
> > +static void test_dgram_multiconn_sendto_server(const struct test_opts *opts)
> > +{
> > + union {
> > + struct sockaddr sa;
> > + struct sockaddr_vm svm;
> > + } addr = {
> > + .svm = {
> > + .svm_family = AF_VSOCK,
> > + .svm_port = 1234,
> > + .svm_cid = VMADDR_CID_ANY,
> > + },
> > + };
> > + int len = sizeof(addr.sa);
> > + int fd;
> > + int i;
> > +
> > + fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> > + if (fd < 0) {
> > + perror("socket");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
> > + perror("bind");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + /* Notify the client that the server is ready */
> > + control_writeln("BIND");
> > +
> > + for (i = 0; i < MULTICONN_NFDS; i++)
> > + recvfrom_byte(fd, &addr.sa, &len, 1, 0);
> > +
> > + /* Wait for the client to finish */
> > + control_expectln("DONE");
> > +
> > + close(fd);
> > +}
> > +
> > +static void test_dgram_multiconn_send_client(const struct test_opts *opts)
> > +{
> > + int fds[MULTICONN_NFDS];
> > + int i;
> > +
> > + /* Wait for the server to be ready */
> > + control_expectln("BIND");
> > +
> > + for (i = 0; i < MULTICONN_NFDS; i++) {
> > + fds[i] = vsock_dgram_connect(opts->peer_cid, 1234);
> > + if (fds[i] < 0) {
> > + perror("socket");
> > + exit(EXIT_FAILURE);
> > + }
> > + }
> > +
> > + for (i = 0; i < MULTICONN_NFDS; i++)
> > + send_byte(fds[i], 1, 0);
> > +
> > + /* Notify the server that the client has finished */
> > + control_writeln("DONE");
> > +
> > + for (i = 0; i < MULTICONN_NFDS; i++)
> > + close(fds[i]);
> > +}
> > +
> > +static void test_dgram_multiconn_send_server(const struct test_opts *opts)
> > +{
> > + union {
> > + struct sockaddr sa;
> > + struct sockaddr_vm svm;
> > + } addr = {
> > + .svm = {
> > + .svm_family = AF_VSOCK,
> > + .svm_port = 1234,
> > + .svm_cid = VMADDR_CID_ANY,
> > + },
> > + };
> > + int fd;
> > + int i;
> > +
> > + fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> > + if (fd < 0) {
> > + perror("socket");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
> > + perror("bind");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + /* Notify the client that the server is ready */
> > + control_writeln("BIND");
> > +
> > + for (i = 0; i < MULTICONN_NFDS; i++)
> > + recv_byte(fd, 1, 0);
> > +
> > + /* Wait for the client to finish */
> > + control_expectln("DONE");
> > +
> > + close(fd);
> > +}
> > +
> > +static void test_dgram_msg_bounds_client(const struct test_opts *opts)
> > +{
> > + unsigned long recv_buf_size;
> > + int page_size;
> > + int msg_cnt;
> > + int fd;
> > +
> > + fd = vsock_dgram_connect(opts->peer_cid, 1234);
> > + if (fd < 0) {
> > + perror("connect");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + /* Let the server know the client is ready */
> > + control_writeln("CLNTREADY");
> > +
> > + msg_cnt = control_readulong();
> > + recv_buf_size = control_readulong();
> > +
> > + /* Wait, until receiver sets buffer size. */
> > + control_expectln("SRVREADY");
> > +
> > + page_size = getpagesize();
> > +
> > + for (int i = 0; i < msg_cnt; i++) {
> > + unsigned long curr_hash;
> > + ssize_t send_size;
> > + size_t buf_size;
> > + void *buf;
> > +
> > + /* Use "small" buffers and "big" buffers. */
> > + if (i & 1)
> > + buf_size = page_size +
> > + (rand() % (MAX_MSG_SIZE - page_size));
> > + else
> > + buf_size = 1 + (rand() % page_size);
> > +
> > + buf_size = min(buf_size, recv_buf_size);
> > +
> > + buf = malloc(buf_size);
> > +
> > + if (!buf) {
> > + perror("malloc");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + memset(buf, rand() & 0xff, buf_size);
> > + /* Set at least one MSG_EOR + some random. */
> > +
> > + send_size = send(fd, buf, buf_size, 0);
> > +
> > + if (send_size < 0) {
> > + perror("send");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + if (send_size != buf_size) {
> > + fprintf(stderr, "Invalid send size\n");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + /* In theory the implementation isn't required to transmit
> > + * these packets in order, so we use this SYNC control message
> > + * so that server and client coordinate sending and receiving
> > + * one packet at a time. The client sends a packet and waits
> > + * until it has been received before sending another.
> > + */
> > + control_writeln("PKTSENT");
> > + control_expectln("PKTRECV");
> > +
> > + /* Send the server a hash of the packet */
> > + curr_hash = hash_djb2(buf, buf_size);
> > + control_writeulong(curr_hash);
> > + free(buf);
> > + }
> > +
> > + control_writeln("SENDDONE");
> > + close(fd);
> > +}
> > +
> > +static void test_dgram_msg_bounds_server(const struct test_opts *opts)
> > +{
> > + const unsigned long msg_cnt = 16;
> > + unsigned long sock_buf_size;
> > + struct msghdr msg = {0};
> > + struct iovec iov = {0};
> > + char buf[MAX_MSG_SIZE];
> > + socklen_t len;
> > + int fd;
> > + int i;
> > +
> > + fd = vsock_dgram_bind(VMADDR_CID_ANY, 1234);
> > +
> > + if (fd < 0) {
> > + perror("bind");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + /* Set receive buffer to maximum */
> > + sock_buf_size = -1;
> > + if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
> > + &sock_buf_size, sizeof(sock_buf_size))) {
> > + perror("setsockopt(SO_RECVBUF)");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + /* Retrieve the receive buffer size */
> > + len = sizeof(sock_buf_size);
> > + if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF,
> > + &sock_buf_size, &len)) {
> > + perror("getsockopt(SO_RECVBUF)");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + /* Client ready to receive parameters */
> > + control_expectln("CLNTREADY");
> > +
> > + control_writeulong(msg_cnt);
> > + control_writeulong(sock_buf_size);
> > +
> > + /* Ready to receive data. */
> > + control_writeln("SRVREADY");
> > +
> > + iov.iov_base = buf;
> > + iov.iov_len = sizeof(buf);
> > + msg.msg_iov = &iov;
> > + msg.msg_iovlen = 1;
> > +
> > + for (i = 0; i < msg_cnt; i++) {
> > + unsigned long remote_hash;
> > + unsigned long curr_hash;
> > + ssize_t recv_size;
> > +
> > + control_expectln("PKTSENT");
> > + recv_size = recvmsg(fd, &msg, 0);
> > + control_writeln("PKTRECV");
> > +
> > + if (!recv_size)
> > + break;
> > +
> > + if (recv_size < 0) {
> > + perror("recvmsg");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + curr_hash = hash_djb2(msg.msg_iov[0].iov_base, recv_size);
> > + remote_hash = control_readulong();
> > +
> > + if (curr_hash != remote_hash) {
> > + fprintf(stderr, "Message bounds broken\n");
> > + exit(EXIT_FAILURE);
> > + }
> > + }
> > +
> > + close(fd);
> > +}
> > +
> > static struct test_case test_cases[] = {
> > {
> > .name = "SOCK_STREAM connection reset",
> > @@ -1128,6 +1535,31 @@ static struct test_case test_cases[] = {
> > .run_client = test_stream_virtio_skb_merge_client,
> > .run_server = test_stream_virtio_skb_merge_server,
> > },
> > + {
> > + .name = "SOCK_DGRAM client sendto",
> > + .run_client = test_dgram_sendto_client,
> > + .run_server = test_dgram_sendto_server,
> > + },
> > + {
> > + .name = "SOCK_DGRAM client connect",
> > + .run_client = test_dgram_connect_client,
> > + .run_server = test_dgram_connect_server,
> > + },
> > + {
> > + .name = "SOCK_DGRAM multiple connections using sendto",
> > + .run_client = test_dgram_multiconn_sendto_client,
> > + .run_server = test_dgram_multiconn_sendto_server,
> > + },
> > + {
> > + .name = "SOCK_DGRAM multiple connections using send",
> > + .run_client = test_dgram_multiconn_send_client,
> > + .run_server = test_dgram_multiconn_send_server,
> > + },
> > + {
> > + .name = "SOCK_DGRAM msg bounds",
> > + .run_client = test_dgram_msg_bounds_client,
> > + .run_server = test_dgram_msg_bounds_server,
> > + },
> > {},
> > };
> >
> >
^ permalink raw reply
* Re: [PATCH RFC net-next v4 1/8] vsock/dgram: generalize recvmsg and drop transport->dgram_dequeue
From: Bobby Eshleman @ 2023-06-22 23:25 UTC (permalink / raw)
To: Arseniy Krasnov
Cc: Bobby Eshleman, Stefan Hajnoczi, Stefano Garzarella,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
VMware PV-Drivers Reviewers, Dan Carpenter, Simon Horman, kvm,
virtualization, netdev, linux-kernel, linux-hyperv, bpf
In-Reply-To: <3eb6216b-a3d2-e1ef-270c-8a0032a4a8a5@gmail.com>
On Sun, Jun 11, 2023 at 11:43:15PM +0300, Arseniy Krasnov wrote:
> Hello Bobby! Thanks for this patchset! Small comment below:
>
> On 10.06.2023 03:58, Bobby Eshleman wrote:
> > This commit drops the transport->dgram_dequeue callback and makes
> > vsock_dgram_recvmsg() generic. It also adds additional transport
> > callbacks for use by the generic vsock_dgram_recvmsg(), such as for
> > parsing skbs for CID/port which vary in format per transport.
> >
> > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> > ---
> > drivers/vhost/vsock.c | 4 +-
> > include/linux/virtio_vsock.h | 3 ++
> > include/net/af_vsock.h | 13 ++++++-
> > net/vmw_vsock/af_vsock.c | 51 ++++++++++++++++++++++++-
> > net/vmw_vsock/hyperv_transport.c | 17 +++++++--
> > net/vmw_vsock/virtio_transport.c | 4 +-
> > net/vmw_vsock/virtio_transport_common.c | 18 +++++++++
> > net/vmw_vsock/vmci_transport.c | 68 +++++++++++++--------------------
> > net/vmw_vsock/vsock_loopback.c | 4 +-
> > 9 files changed, 132 insertions(+), 50 deletions(-)
> >
> > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> > index 6578db78f0ae..c8201c070b4b 100644
> > --- a/drivers/vhost/vsock.c
> > +++ b/drivers/vhost/vsock.c
> > @@ -410,9 +410,11 @@ static struct virtio_transport vhost_transport = {
> > .cancel_pkt = vhost_transport_cancel_pkt,
> >
> > .dgram_enqueue = virtio_transport_dgram_enqueue,
> > - .dgram_dequeue = virtio_transport_dgram_dequeue,
> > .dgram_bind = virtio_transport_dgram_bind,
> > .dgram_allow = virtio_transport_dgram_allow,
> > + .dgram_get_cid = virtio_transport_dgram_get_cid,
> > + .dgram_get_port = virtio_transport_dgram_get_port,
> > + .dgram_get_length = virtio_transport_dgram_get_length,
> >
> > .stream_enqueue = virtio_transport_stream_enqueue,
> > .stream_dequeue = virtio_transport_stream_dequeue,
> > diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> > index c58453699ee9..23521a318cf0 100644
> > --- a/include/linux/virtio_vsock.h
> > +++ b/include/linux/virtio_vsock.h
> > @@ -219,6 +219,9 @@ bool virtio_transport_stream_allow(u32 cid, u32 port);
> > int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> > struct sockaddr_vm *addr);
> > bool virtio_transport_dgram_allow(u32 cid, u32 port);
> > +int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid);
> > +int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port);
> > +int virtio_transport_dgram_get_length(struct sk_buff *skb, size_t *len);
> >
> > int virtio_transport_connect(struct vsock_sock *vsk);
> >
> > diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> > index 0e7504a42925..7bedb9ee7e3e 100644
> > --- a/include/net/af_vsock.h
> > +++ b/include/net/af_vsock.h
> > @@ -120,11 +120,20 @@ struct vsock_transport {
> >
> > /* DGRAM. */
> > int (*dgram_bind)(struct vsock_sock *, struct sockaddr_vm *);
> > - int (*dgram_dequeue)(struct vsock_sock *vsk, struct msghdr *msg,
> > - size_t len, int flags);
> > int (*dgram_enqueue)(struct vsock_sock *, struct sockaddr_vm *,
> > struct msghdr *, size_t len);
> > bool (*dgram_allow)(u32 cid, u32 port);
> > + int (*dgram_get_cid)(struct sk_buff *skb, unsigned int *cid);
> > + int (*dgram_get_port)(struct sk_buff *skb, unsigned int *port);
> > + int (*dgram_get_length)(struct sk_buff *skb, size_t *length);
> > +
> > + /* The number of bytes into the buffer at which the payload starts, as
> > + * first seen by the receiving socket layer. For example, if the
> > + * transport presets the skb pointers using skb_pull(sizeof(header))
> > + * than this would be zero, otherwise it would be the size of the
> > + * header.
> > + */
> > + const size_t dgram_payload_offset;
> >
> > /* STREAM. */
> > /* TODO: stream_bind() */
> > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> > index efb8a0937a13..ffb4dd8b6ea7 100644
> > --- a/net/vmw_vsock/af_vsock.c
> > +++ b/net/vmw_vsock/af_vsock.c
> > @@ -1271,11 +1271,15 @@ static int vsock_dgram_connect(struct socket *sock,
> > int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
> > size_t len, int flags)
> > {
> > + const struct vsock_transport *transport;
> > #ifdef CONFIG_BPF_SYSCALL
> > const struct proto *prot;
> > #endif
> > struct vsock_sock *vsk;
> > + struct sk_buff *skb;
> > + size_t payload_len;
> > struct sock *sk;
> > + int err;
> >
> > sk = sock->sk;
> > vsk = vsock_sk(sk);
> > @@ -1286,7 +1290,52 @@ int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
> > return prot->recvmsg(sk, msg, len, flags, NULL);
> > #endif
> >
> > - return vsk->transport->dgram_dequeue(vsk, msg, len, flags);
> > + if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
> > + return -EOPNOTSUPP;
> > +
> > + transport = vsk->transport;
> > +
> > + /* Retrieve the head sk_buff from the socket's receive queue. */
> > + err = 0;
> > + skb = skb_recv_datagram(sk_vsock(vsk), flags, &err);
> > + if (!skb)
> > + return err;
> > +
> > + err = transport->dgram_get_length(skb, &payload_len);
> > + if (err)
> > + goto out;
> > +
> > + if (payload_len > len) {
> > + payload_len = len;
> > + msg->msg_flags |= MSG_TRUNC;
> > + }
> > +
> > + /* Place the datagram payload in the user's iovec. */
> > + err = skb_copy_datagram_msg(skb, transport->dgram_payload_offset, msg, payload_len);
> > + if (err)
> > + goto out;
> > +
> > + if (msg->msg_name) {
> > + /* Provide the address of the sender. */
> > + DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
> > + unsigned int cid, port;
> > +
> > + err = transport->dgram_get_cid(skb, &cid);
> > + if (err)
> > + goto out;
> > +
> > + err = transport->dgram_get_port(skb, &port);
> > + if (err)
> > + goto out;
>
> Maybe we can merge 'dgram_get_cid' and 'dgram_get_port' to a single callback? Because I see that this is
> the only place where both are used (correct me if i'm wrong) and logically both operates with addresses:
> CID and port. E.g. something like that: dgram_get_cid_n_port().
I like this idea.
>
> Moreover, I'm not sure, but is it good "tradeoff" here: remove transport specific callback for dgram receive
> where we already have 'msghdr' with both data buffer and buffer for 'sockaddr_vm' and instead of it add new
> several fields (callbacks) to transports like dgram_get_cid(), dgram_get_port()? I agree, that in each transport
> specific callback we will have same copying logic by calling 'skb_copy_datagram_msg()' and filling address
> by using 'vsock_addr_init()', but in this case we don't need to update transports too much. For example HyperV
> still unchanged as it does not support SOCK_DGRAM. For VMCI You just need to add 'vsock_addr_init()' logic
> to it's dgram dequeue callback.
>
> What do You think?
>
> Thanks, Arseniy
>
I tend to agree with your point here that adding this many callbacks is
not the big win in complexity reduction that we're hoping for.
I also agree with Stefano's assessment that having two near identical
implementations is not good either.
Hopefully having one simpler callback will bring the best of both
worlds?
Best,
Bobby
> > +
> > + vsock_addr_init(vm_addr, cid, port);
> > + msg->msg_namelen = sizeof(*vm_addr);
> > + }
> > + err = payload_len;
> > +
> > +out:
> > + skb_free_datagram(&vsk->sk, skb);
> > + return err;
> > }
> > EXPORT_SYMBOL_GPL(vsock_dgram_recvmsg);
> >
> > diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
> > index 7cb1a9d2cdb4..ff6e87e25fa0 100644
> > --- a/net/vmw_vsock/hyperv_transport.c
> > +++ b/net/vmw_vsock/hyperv_transport.c
> > @@ -556,8 +556,17 @@ static int hvs_dgram_bind(struct vsock_sock *vsk, struct sockaddr_vm *addr)
> > return -EOPNOTSUPP;
> > }
> >
> > -static int hvs_dgram_dequeue(struct vsock_sock *vsk, struct msghdr *msg,
> > - size_t len, int flags)
> > +static int hvs_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
> > +{
> > + return -EOPNOTSUPP;
> > +}
> > +
> > +static int hvs_dgram_get_port(struct sk_buff *skb, unsigned int *port)
> > +{
> > + return -EOPNOTSUPP;
> > +}
> > +
> > +static int hvs_dgram_get_length(struct sk_buff *skb, size_t *len)
> > {
> > return -EOPNOTSUPP;
> > }
> > @@ -833,7 +842,9 @@ static struct vsock_transport hvs_transport = {
> > .shutdown = hvs_shutdown,
> >
> > .dgram_bind = hvs_dgram_bind,
> > - .dgram_dequeue = hvs_dgram_dequeue,
> > + .dgram_get_cid = hvs_dgram_get_cid,
> > + .dgram_get_port = hvs_dgram_get_port,
> > + .dgram_get_length = hvs_dgram_get_length,
> > .dgram_enqueue = hvs_dgram_enqueue,
> > .dgram_allow = hvs_dgram_allow,
> >
> > diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> > index e95df847176b..5763cdf13804 100644
> > --- a/net/vmw_vsock/virtio_transport.c
> > +++ b/net/vmw_vsock/virtio_transport.c
> > @@ -429,9 +429,11 @@ static struct virtio_transport virtio_transport = {
> > .cancel_pkt = virtio_transport_cancel_pkt,
> >
> > .dgram_bind = virtio_transport_dgram_bind,
> > - .dgram_dequeue = virtio_transport_dgram_dequeue,
> > .dgram_enqueue = virtio_transport_dgram_enqueue,
> > .dgram_allow = virtio_transport_dgram_allow,
> > + .dgram_get_cid = virtio_transport_dgram_get_cid,
> > + .dgram_get_port = virtio_transport_dgram_get_port,
> > + .dgram_get_length = virtio_transport_dgram_get_length,
> >
> > .stream_dequeue = virtio_transport_stream_dequeue,
> > .stream_enqueue = virtio_transport_stream_enqueue,
> > diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> > index b769fc258931..e6903c719964 100644
> > --- a/net/vmw_vsock/virtio_transport_common.c
> > +++ b/net/vmw_vsock/virtio_transport_common.c
> > @@ -797,6 +797,24 @@ int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> > }
> > EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
> >
> > +int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
> > +{
> > + return -EOPNOTSUPP;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_cid);
> > +
> > +int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port)
> > +{
> > + return -EOPNOTSUPP;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_port);
> > +
> > +int virtio_transport_dgram_get_length(struct sk_buff *skb, size_t *len)
> > +{
> > + return -EOPNOTSUPP;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_length);
> > +
> > bool virtio_transport_dgram_allow(u32 cid, u32 port)
> > {
> > return false;
> > diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
> > index b370070194fa..bbc63826bf48 100644
> > --- a/net/vmw_vsock/vmci_transport.c
> > +++ b/net/vmw_vsock/vmci_transport.c
> > @@ -1731,57 +1731,40 @@ static int vmci_transport_dgram_enqueue(
> > return err - sizeof(*dg);
> > }
> >
> > -static int vmci_transport_dgram_dequeue(struct vsock_sock *vsk,
> > - struct msghdr *msg, size_t len,
> > - int flags)
> > +static int vmci_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
> > {
> > - int err;
> > struct vmci_datagram *dg;
> > - size_t payload_len;
> > - struct sk_buff *skb;
> >
> > - if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
> > - return -EOPNOTSUPP;
> > + dg = (struct vmci_datagram *)skb->data;
> > + if (!dg)
> > + return -EINVAL;
> >
> > - /* Retrieve the head sk_buff from the socket's receive queue. */
> > - err = 0;
> > - skb = skb_recv_datagram(&vsk->sk, flags, &err);
> > - if (!skb)
> > - return err;
> > + *cid = dg->src.context;
> > + return 0;
> > +}
> > +
> > +static int vmci_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port)
> > +{
> > + struct vmci_datagram *dg;
> >
> > dg = (struct vmci_datagram *)skb->data;
> > if (!dg)
> > - /* err is 0, meaning we read zero bytes. */
> > - goto out;
> > -
> > - payload_len = dg->payload_size;
> > - /* Ensure the sk_buff matches the payload size claimed in the packet. */
> > - if (payload_len != skb->len - sizeof(*dg)) {
> > - err = -EINVAL;
> > - goto out;
> > - }
> > + return -EINVAL;
> >
> > - if (payload_len > len) {
> > - payload_len = len;
> > - msg->msg_flags |= MSG_TRUNC;
> > - }
> > + *port = dg->src.resource;
> > + return 0;
> > +}
> >
> > - /* Place the datagram payload in the user's iovec. */
> > - err = skb_copy_datagram_msg(skb, sizeof(*dg), msg, payload_len);
> > - if (err)
> > - goto out;
> > +static int vmci_transport_dgram_get_length(struct sk_buff *skb, size_t *len)
> > +{
> > + struct vmci_datagram *dg;
> >
> > - if (msg->msg_name) {
> > - /* Provide the address of the sender. */
> > - DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
> > - vsock_addr_init(vm_addr, dg->src.context, dg->src.resource);
> > - msg->msg_namelen = sizeof(*vm_addr);
> > - }
> > - err = payload_len;
> > + dg = (struct vmci_datagram *)skb->data;
> > + if (!dg)
> > + return -EINVAL;
> >
> > -out:
> > - skb_free_datagram(&vsk->sk, skb);
> > - return err;
> > + *len = dg->payload_size;
> > + return 0;
> > }
> >
> > static bool vmci_transport_dgram_allow(u32 cid, u32 port)
> > @@ -2040,9 +2023,12 @@ static struct vsock_transport vmci_transport = {
> > .release = vmci_transport_release,
> > .connect = vmci_transport_connect,
> > .dgram_bind = vmci_transport_dgram_bind,
> > - .dgram_dequeue = vmci_transport_dgram_dequeue,
> > .dgram_enqueue = vmci_transport_dgram_enqueue,
> > .dgram_allow = vmci_transport_dgram_allow,
> > + .dgram_get_cid = vmci_transport_dgram_get_cid,
> > + .dgram_get_port = vmci_transport_dgram_get_port,
> > + .dgram_get_length = vmci_transport_dgram_get_length,
> > + .dgram_payload_offset = sizeof(struct vmci_datagram),
> > .stream_dequeue = vmci_transport_stream_dequeue,
> > .stream_enqueue = vmci_transport_stream_enqueue,
> > .stream_has_data = vmci_transport_stream_has_data,
> > diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
> > index 5c6360df1f31..2f3cabc79ee5 100644
> > --- a/net/vmw_vsock/vsock_loopback.c
> > +++ b/net/vmw_vsock/vsock_loopback.c
> > @@ -62,9 +62,11 @@ static struct virtio_transport loopback_transport = {
> > .cancel_pkt = vsock_loopback_cancel_pkt,
> >
> > .dgram_bind = virtio_transport_dgram_bind,
> > - .dgram_dequeue = virtio_transport_dgram_dequeue,
> > .dgram_enqueue = virtio_transport_dgram_enqueue,
> > .dgram_allow = virtio_transport_dgram_allow,
> > + .dgram_get_cid = virtio_transport_dgram_get_cid,
> > + .dgram_get_port = virtio_transport_dgram_get_port,
> > + .dgram_get_length = virtio_transport_dgram_get_length,
> >
> > .stream_dequeue = virtio_transport_stream_dequeue,
> > .stream_enqueue = virtio_transport_stream_enqueue,
> >
^ permalink raw reply
* Re: [PATCH RFC net-next v4 1/8] vsock/dgram: generalize recvmsg and drop transport->dgram_dequeue
From: Bobby Eshleman @ 2023-06-22 23:34 UTC (permalink / raw)
To: Arseniy Krasnov
Cc: Stefano Garzarella, Bobby Eshleman, Stefan Hajnoczi,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
VMware PV-Drivers Reviewers, Dan Carpenter, Simon Horman, kvm,
virtualization, netdev, linux-kernel, linux-hyperv, bpf
In-Reply-To: <2a0c29d5-d696-ac24-2a4a-1d691eef8daf@gmail.com>
On Thu, Jun 22, 2023 at 10:23:26PM +0300, Arseniy Krasnov wrote:
>
>
> On 22.06.2023 17:51, Stefano Garzarella wrote:
> > On Sun, Jun 11, 2023 at 11:43:15PM +0300, Arseniy Krasnov wrote:
> >> Hello Bobby! Thanks for this patchset! Small comment below:
> >>
> >> On 10.06.2023 03:58, Bobby Eshleman wrote:
> >>> This commit drops the transport->dgram_dequeue callback and makes
> >>> vsock_dgram_recvmsg() generic. It also adds additional transport
> >>> callbacks for use by the generic vsock_dgram_recvmsg(), such as for
> >>> parsing skbs for CID/port which vary in format per transport.
> >>>
> >>> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> >>> ---
> >>> drivers/vhost/vsock.c | 4 +-
> >>> include/linux/virtio_vsock.h | 3 ++
> >>> include/net/af_vsock.h | 13 ++++++-
> >>> net/vmw_vsock/af_vsock.c | 51 ++++++++++++++++++++++++-
> >>> net/vmw_vsock/hyperv_transport.c | 17 +++++++--
> >>> net/vmw_vsock/virtio_transport.c | 4 +-
> >>> net/vmw_vsock/virtio_transport_common.c | 18 +++++++++
> >>> net/vmw_vsock/vmci_transport.c | 68 +++++++++++++--------------------
> >>> net/vmw_vsock/vsock_loopback.c | 4 +-
> >>> 9 files changed, 132 insertions(+), 50 deletions(-)
> >>>
> >>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> >>> index 6578db78f0ae..c8201c070b4b 100644
> >>> --- a/drivers/vhost/vsock.c
> >>> +++ b/drivers/vhost/vsock.c
> >>> @@ -410,9 +410,11 @@ static struct virtio_transport vhost_transport = {
> >>> .cancel_pkt = vhost_transport_cancel_pkt,
> >>>
> >>> .dgram_enqueue = virtio_transport_dgram_enqueue,
> >>> - .dgram_dequeue = virtio_transport_dgram_dequeue,
> >>> .dgram_bind = virtio_transport_dgram_bind,
> >>> .dgram_allow = virtio_transport_dgram_allow,
> >>> + .dgram_get_cid = virtio_transport_dgram_get_cid,
> >>> + .dgram_get_port = virtio_transport_dgram_get_port,
> >>> + .dgram_get_length = virtio_transport_dgram_get_length,
> >>>
> >>> .stream_enqueue = virtio_transport_stream_enqueue,
> >>> .stream_dequeue = virtio_transport_stream_dequeue,
> >>> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> >>> index c58453699ee9..23521a318cf0 100644
> >>> --- a/include/linux/virtio_vsock.h
> >>> +++ b/include/linux/virtio_vsock.h
> >>> @@ -219,6 +219,9 @@ bool virtio_transport_stream_allow(u32 cid, u32 port);
> >>> int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> >>> struct sockaddr_vm *addr);
> >>> bool virtio_transport_dgram_allow(u32 cid, u32 port);
> >>> +int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid);
> >>> +int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port);
> >>> +int virtio_transport_dgram_get_length(struct sk_buff *skb, size_t *len);
> >>>
> >>> int virtio_transport_connect(struct vsock_sock *vsk);
> >>>
> >>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> >>> index 0e7504a42925..7bedb9ee7e3e 100644
> >>> --- a/include/net/af_vsock.h
> >>> +++ b/include/net/af_vsock.h
> >>> @@ -120,11 +120,20 @@ struct vsock_transport {
> >>>
> >>> /* DGRAM. */
> >>> int (*dgram_bind)(struct vsock_sock *, struct sockaddr_vm *);
> >>> - int (*dgram_dequeue)(struct vsock_sock *vsk, struct msghdr *msg,
> >>> - size_t len, int flags);
> >>> int (*dgram_enqueue)(struct vsock_sock *, struct sockaddr_vm *,
> >>> struct msghdr *, size_t len);
> >>> bool (*dgram_allow)(u32 cid, u32 port);
> >>> + int (*dgram_get_cid)(struct sk_buff *skb, unsigned int *cid);
> >>> + int (*dgram_get_port)(struct sk_buff *skb, unsigned int *port);
> >>> + int (*dgram_get_length)(struct sk_buff *skb, size_t *length);
> >>> +
> >>> + /* The number of bytes into the buffer at which the payload starts, as
> >>> + * first seen by the receiving socket layer. For example, if the
> >>> + * transport presets the skb pointers using skb_pull(sizeof(header))
> >>> + * than this would be zero, otherwise it would be the size of the
> >>> + * header.
> >>> + */
> >>> + const size_t dgram_payload_offset;
> >>>
> >>> /* STREAM. */
> >>> /* TODO: stream_bind() */
> >>> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> >>> index efb8a0937a13..ffb4dd8b6ea7 100644
> >>> --- a/net/vmw_vsock/af_vsock.c
> >>> +++ b/net/vmw_vsock/af_vsock.c
> >>> @@ -1271,11 +1271,15 @@ static int vsock_dgram_connect(struct socket *sock,
> >>> int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
> >>> size_t len, int flags)
> >>> {
> >>> + const struct vsock_transport *transport;
> >>> #ifdef CONFIG_BPF_SYSCALL
> >>> const struct proto *prot;
> >>> #endif
> >>> struct vsock_sock *vsk;
> >>> + struct sk_buff *skb;
> >>> + size_t payload_len;
> >>> struct sock *sk;
> >>> + int err;
> >>>
> >>> sk = sock->sk;
> >>> vsk = vsock_sk(sk);
> >>> @@ -1286,7 +1290,52 @@ int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
> >>> return prot->recvmsg(sk, msg, len, flags, NULL);
> >>> #endif
> >>>
> >>> - return vsk->transport->dgram_dequeue(vsk, msg, len, flags);
> >>> + if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
> >>> + return -EOPNOTSUPP;
> >>> +
> >>> + transport = vsk->transport;
> >>> +
> >>> + /* Retrieve the head sk_buff from the socket's receive queue. */
> >>> + err = 0;
> >>> + skb = skb_recv_datagram(sk_vsock(vsk), flags, &err);
> >>> + if (!skb)
> >>> + return err;
> >>> +
> >>> + err = transport->dgram_get_length(skb, &payload_len);
> >
> > What about ssize_t return value here?
> >
> > Or maybe a single callback that return both length and offset?
> >
> > .dgram_get_payload_info(skb, &payload_len, &payload_off)
>
> Just architectural question:
>
> May be we can avoid this callback for length? IIUC concept of skbuff is that
> current level of network stack already have pointer to its data 'skb->data' and
> length of the payload 'skb->len' (both are set by previous stack handler - transport in
> this case), so here we can use just 'skb->len' and thats all. There is no need to ask
> lower level of network stack for length of payload? I see that VMCI stores metadata
> with payload in 'data' buffer, but may be it is more correctly to do 'skb_pull()'
> in vmci before inserting skbuff to sockets queue? In this case field with dgram payload
> offset could be removed from transport.
>
> Thanks, Arseniy
>
I agree, I think introducing the skb_pull() to vmci would honestly be
best. I don't think it should break anything based on my reading of the
code if it is called just prior to sk_receive_skb().
Thanks,
Bobby
> >
> >>> + if (err)
> >>> + goto out;
> >>> +
> >>> + if (payload_len > len) {
> >>> + payload_len = len;
> >>> + msg->msg_flags |= MSG_TRUNC;
> >>> + }
> >>> +
> >>> + /* Place the datagram payload in the user's iovec. */
> >>> + err = skb_copy_datagram_msg(skb, transport->dgram_payload_offset, msg, payload_len);
> >>> + if (err)
> >>> + goto out;
> >>> +
> >>> + if (msg->msg_name) {
> >>> + /* Provide the address of the sender. */
> >>> + DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
> >>> + unsigned int cid, port;
> >>> +
> >>> + err = transport->dgram_get_cid(skb, &cid);
> >>> + if (err)
> >>> + goto out;
> >>> +
> >>> + err = transport->dgram_get_port(skb, &port);
> >>> + if (err)
> >>> + goto out;
> >>
> >> Maybe we can merge 'dgram_get_cid' and 'dgram_get_port' to a single callback? Because I see that this is
> >> the only place where both are used (correct me if i'm wrong) and logically both operates with addresses:
> >> CID and port. E.g. something like that: dgram_get_cid_n_port().
> >
> > What about .dgram_addr_init(struct sk_buff *skb, struct sockaddr_vm *addr)
> > and the transport can set cid and port?
> >
> >>
> >> Moreover, I'm not sure, but is it good "tradeoff" here: remove transport specific callback for dgram receive
> >> where we already have 'msghdr' with both data buffer and buffer for 'sockaddr_vm' and instead of it add new
> >> several fields (callbacks) to transports like dgram_get_cid(), dgram_get_port()? I agree, that in each transport
> >> specific callback we will have same copying logic by calling 'skb_copy_datagram_msg()' and filling address
> >> by using 'vsock_addr_init()', but in this case we don't need to update transports too much. For example HyperV
> >> still unchanged as it does not support SOCK_DGRAM. For VMCI You just need to add 'vsock_addr_init()' logic
> >> to it's dgram dequeue callback.
> >>
> >> What do You think?
> >
> > Honestly, I'd rather avoid duplicate code than reduce changes in
> > transports that don't support dgram.
> >
> > One thing I do agree on though is minimizing the number of callbacks
> > to call to reduce the number of indirection (more performance?).
> >
> > Thanks,
> > Stefano
> >
> >>
> >> Thanks, Arseniy
> >>
> >>> +
> >>> + vsock_addr_init(vm_addr, cid, port);
> >>> + msg->msg_namelen = sizeof(*vm_addr);
> >>> + }
> >>> + err = payload_len;
> >>> +
> >>> +out:
> >>> + skb_free_datagram(&vsk->sk, skb);
> >>> + return err;
> >>> }
> >>> EXPORT_SYMBOL_GPL(vsock_dgram_recvmsg);
> >>>
> >>> diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
> >>> index 7cb1a9d2cdb4..ff6e87e25fa0 100644
> >>> --- a/net/vmw_vsock/hyperv_transport.c
> >>> +++ b/net/vmw_vsock/hyperv_transport.c
> >>> @@ -556,8 +556,17 @@ static int hvs_dgram_bind(struct vsock_sock *vsk, struct sockaddr_vm *addr)
> >>> return -EOPNOTSUPP;
> >>> }
> >>>
> >>> -static int hvs_dgram_dequeue(struct vsock_sock *vsk, struct msghdr *msg,
> >>> - size_t len, int flags)
> >>> +static int hvs_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
> >>> +{
> >>> + return -EOPNOTSUPP;
> >>> +}
> >>> +
> >>> +static int hvs_dgram_get_port(struct sk_buff *skb, unsigned int *port)
> >>> +{
> >>> + return -EOPNOTSUPP;
> >>> +}
> >>> +
> >>> +static int hvs_dgram_get_length(struct sk_buff *skb, size_t *len)
> >>> {
> >>> return -EOPNOTSUPP;
> >>> }
> >>> @@ -833,7 +842,9 @@ static struct vsock_transport hvs_transport = {
> >>> .shutdown = hvs_shutdown,
> >>>
> >>> .dgram_bind = hvs_dgram_bind,
> >>> - .dgram_dequeue = hvs_dgram_dequeue,
> >>> + .dgram_get_cid = hvs_dgram_get_cid,
> >>> + .dgram_get_port = hvs_dgram_get_port,
> >>> + .dgram_get_length = hvs_dgram_get_length,
> >>> .dgram_enqueue = hvs_dgram_enqueue,
> >>> .dgram_allow = hvs_dgram_allow,
> >>>
> >>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> >>> index e95df847176b..5763cdf13804 100644
> >>> --- a/net/vmw_vsock/virtio_transport.c
> >>> +++ b/net/vmw_vsock/virtio_transport.c
> >>> @@ -429,9 +429,11 @@ static struct virtio_transport virtio_transport = {
> >>> .cancel_pkt = virtio_transport_cancel_pkt,
> >>>
> >>> .dgram_bind = virtio_transport_dgram_bind,
> >>> - .dgram_dequeue = virtio_transport_dgram_dequeue,
> >>> .dgram_enqueue = virtio_transport_dgram_enqueue,
> >>> .dgram_allow = virtio_transport_dgram_allow,
> >>> + .dgram_get_cid = virtio_transport_dgram_get_cid,
> >>> + .dgram_get_port = virtio_transport_dgram_get_port,
> >>> + .dgram_get_length = virtio_transport_dgram_get_length,
> >>>
> >>> .stream_dequeue = virtio_transport_stream_dequeue,
> >>> .stream_enqueue = virtio_transport_stream_enqueue,
> >>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> >>> index b769fc258931..e6903c719964 100644
> >>> --- a/net/vmw_vsock/virtio_transport_common.c
> >>> +++ b/net/vmw_vsock/virtio_transport_common.c
> >>> @@ -797,6 +797,24 @@ int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> >>> }
> >>> EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
> >>>
> >>> +int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
> >>> +{
> >>> + return -EOPNOTSUPP;
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_cid);
> >>> +
> >>> +int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port)
> >>> +{
> >>> + return -EOPNOTSUPP;
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_port);
> >>> +
> >>> +int virtio_transport_dgram_get_length(struct sk_buff *skb, size_t *len)
> >>> +{
> >>> + return -EOPNOTSUPP;
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_length);
> >>> +
> >>> bool virtio_transport_dgram_allow(u32 cid, u32 port)
> >>> {
> >>> return false;
> >>> diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
> >>> index b370070194fa..bbc63826bf48 100644
> >>> --- a/net/vmw_vsock/vmci_transport.c
> >>> +++ b/net/vmw_vsock/vmci_transport.c
> >>> @@ -1731,57 +1731,40 @@ static int vmci_transport_dgram_enqueue(
> >>> return err - sizeof(*dg);
> >>> }
> >>>
> >>> -static int vmci_transport_dgram_dequeue(struct vsock_sock *vsk,
> >>> - struct msghdr *msg, size_t len,
> >>> - int flags)
> >>> +static int vmci_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
> >>> {
> >>> - int err;
> >>> struct vmci_datagram *dg;
> >>> - size_t payload_len;
> >>> - struct sk_buff *skb;
> >>>
> >>> - if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
> >>> - return -EOPNOTSUPP;
> >>> + dg = (struct vmci_datagram *)skb->data;
> >>> + if (!dg)
> >>> + return -EINVAL;
> >>>
> >>> - /* Retrieve the head sk_buff from the socket's receive queue. */
> >>> - err = 0;
> >>> - skb = skb_recv_datagram(&vsk->sk, flags, &err);
> >>> - if (!skb)
> >>> - return err;
> >>> + *cid = dg->src.context;
> >>> + return 0;
> >>> +}
> >>> +
> >>> +static int vmci_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port)
> >>> +{
> >>> + struct vmci_datagram *dg;
> >>>
> >>> dg = (struct vmci_datagram *)skb->data;
> >>> if (!dg)
> >>> - /* err is 0, meaning we read zero bytes. */
> >>> - goto out;
> >>> -
> >>> - payload_len = dg->payload_size;
> >>> - /* Ensure the sk_buff matches the payload size claimed in the packet. */
> >>> - if (payload_len != skb->len - sizeof(*dg)) {
> >>> - err = -EINVAL;
> >>> - goto out;
> >>> - }
> >>> + return -EINVAL;
> >>>
> >>> - if (payload_len > len) {
> >>> - payload_len = len;
> >>> - msg->msg_flags |= MSG_TRUNC;
> >>> - }
> >>> + *port = dg->src.resource;
> >>> + return 0;
> >>> +}
> >>>
> >>> - /* Place the datagram payload in the user's iovec. */
> >>> - err = skb_copy_datagram_msg(skb, sizeof(*dg), msg, payload_len);
> >>> - if (err)
> >>> - goto out;
> >>> +static int vmci_transport_dgram_get_length(struct sk_buff *skb, size_t *len)
> >>> +{
> >>> + struct vmci_datagram *dg;
> >>>
> >>> - if (msg->msg_name) {
> >>> - /* Provide the address of the sender. */
> >>> - DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
> >>> - vsock_addr_init(vm_addr, dg->src.context, dg->src.resource);
> >>> - msg->msg_namelen = sizeof(*vm_addr);
> >>> - }
> >>> - err = payload_len;
> >>> + dg = (struct vmci_datagram *)skb->data;
> >>> + if (!dg)
> >>> + return -EINVAL;
> >>>
> >>> -out:
> >>> - skb_free_datagram(&vsk->sk, skb);
> >>> - return err;
> >>> + *len = dg->payload_size;
> >>> + return 0;
> >>> }
> >>>
> >>> static bool vmci_transport_dgram_allow(u32 cid, u32 port)
> >>> @@ -2040,9 +2023,12 @@ static struct vsock_transport vmci_transport = {
> >>> .release = vmci_transport_release,
> >>> .connect = vmci_transport_connect,
> >>> .dgram_bind = vmci_transport_dgram_bind,
> >>> - .dgram_dequeue = vmci_transport_dgram_dequeue,
> >>> .dgram_enqueue = vmci_transport_dgram_enqueue,
> >>> .dgram_allow = vmci_transport_dgram_allow,
> >>> + .dgram_get_cid = vmci_transport_dgram_get_cid,
> >>> + .dgram_get_port = vmci_transport_dgram_get_port,
> >>> + .dgram_get_length = vmci_transport_dgram_get_length,
> >>> + .dgram_payload_offset = sizeof(struct vmci_datagram),
> >>> .stream_dequeue = vmci_transport_stream_dequeue,
> >>> .stream_enqueue = vmci_transport_stream_enqueue,
> >>> .stream_has_data = vmci_transport_stream_has_data,
> >>> diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
> >>> index 5c6360df1f31..2f3cabc79ee5 100644
> >>> --- a/net/vmw_vsock/vsock_loopback.c
> >>> +++ b/net/vmw_vsock/vsock_loopback.c
> >>> @@ -62,9 +62,11 @@ static struct virtio_transport loopback_transport = {
> >>> .cancel_pkt = vsock_loopback_cancel_pkt,
> >>>
> >>> .dgram_bind = virtio_transport_dgram_bind,
> >>> - .dgram_dequeue = virtio_transport_dgram_dequeue,
> >>> .dgram_enqueue = virtio_transport_dgram_enqueue,
> >>> .dgram_allow = virtio_transport_dgram_allow,
> >>> + .dgram_get_cid = virtio_transport_dgram_get_cid,
> >>> + .dgram_get_port = virtio_transport_dgram_get_port,
> >>> + .dgram_get_length = virtio_transport_dgram_get_length,
> >>>
> >>> .stream_dequeue = virtio_transport_stream_dequeue,
> >>> .stream_enqueue = virtio_transport_stream_enqueue,
> >>>
> >>
> >
^ permalink raw reply
* Re: [PATCH RFC net-next v4 1/8] vsock/dgram: generalize recvmsg and drop transport->dgram_dequeue
From: Bobby Eshleman @ 2023-06-22 23:37 UTC (permalink / raw)
To: Stefano Garzarella
Cc: Arseniy Krasnov, linux-hyperv, Bobby Eshleman, kvm,
Michael S. Tsirkin, VMware PV-Drivers Reviewers, Simon Horman,
virtualization, Eric Dumazet, Dan Carpenter, Xuan Zhuo, Wei Liu,
Dexuan Cui, Bryan Tan, Jakub Kicinski, Paolo Abeni, Haiyang Zhang,
Stefan Hajnoczi, Vishnu Dasa, netdev, linux-kernel, bpf,
David S. Miller
In-Reply-To: <63ko2n5fwjdefot6rzcxdftfh6pilg6vmqn66v4ue5dgf4oz53@tntmdijw4ghr>
On Thu, Jun 22, 2023 at 04:51:41PM +0200, Stefano Garzarella wrote:
> On Sun, Jun 11, 2023 at 11:43:15PM +0300, Arseniy Krasnov wrote:
> > Hello Bobby! Thanks for this patchset! Small comment below:
> >
> > On 10.06.2023 03:58, Bobby Eshleman wrote:
> > > This commit drops the transport->dgram_dequeue callback and makes
> > > vsock_dgram_recvmsg() generic. It also adds additional transport
> > > callbacks for use by the generic vsock_dgram_recvmsg(), such as for
> > > parsing skbs for CID/port which vary in format per transport.
> > >
> > > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> > > ---
> > > drivers/vhost/vsock.c | 4 +-
> > > include/linux/virtio_vsock.h | 3 ++
> > > include/net/af_vsock.h | 13 ++++++-
> > > net/vmw_vsock/af_vsock.c | 51 ++++++++++++++++++++++++-
> > > net/vmw_vsock/hyperv_transport.c | 17 +++++++--
> > > net/vmw_vsock/virtio_transport.c | 4 +-
> > > net/vmw_vsock/virtio_transport_common.c | 18 +++++++++
> > > net/vmw_vsock/vmci_transport.c | 68 +++++++++++++--------------------
> > > net/vmw_vsock/vsock_loopback.c | 4 +-
> > > 9 files changed, 132 insertions(+), 50 deletions(-)
> > >
> > > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> > > index 6578db78f0ae..c8201c070b4b 100644
> > > --- a/drivers/vhost/vsock.c
> > > +++ b/drivers/vhost/vsock.c
> > > @@ -410,9 +410,11 @@ static struct virtio_transport vhost_transport = {
> > > .cancel_pkt = vhost_transport_cancel_pkt,
> > >
> > > .dgram_enqueue = virtio_transport_dgram_enqueue,
> > > - .dgram_dequeue = virtio_transport_dgram_dequeue,
> > > .dgram_bind = virtio_transport_dgram_bind,
> > > .dgram_allow = virtio_transport_dgram_allow,
> > > + .dgram_get_cid = virtio_transport_dgram_get_cid,
> > > + .dgram_get_port = virtio_transport_dgram_get_port,
> > > + .dgram_get_length = virtio_transport_dgram_get_length,
> > >
> > > .stream_enqueue = virtio_transport_stream_enqueue,
> > > .stream_dequeue = virtio_transport_stream_dequeue,
> > > diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> > > index c58453699ee9..23521a318cf0 100644
> > > --- a/include/linux/virtio_vsock.h
> > > +++ b/include/linux/virtio_vsock.h
> > > @@ -219,6 +219,9 @@ bool virtio_transport_stream_allow(u32 cid, u32 port);
> > > int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> > > struct sockaddr_vm *addr);
> > > bool virtio_transport_dgram_allow(u32 cid, u32 port);
> > > +int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid);
> > > +int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port);
> > > +int virtio_transport_dgram_get_length(struct sk_buff *skb, size_t *len);
> > >
> > > int virtio_transport_connect(struct vsock_sock *vsk);
> > >
> > > diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> > > index 0e7504a42925..7bedb9ee7e3e 100644
> > > --- a/include/net/af_vsock.h
> > > +++ b/include/net/af_vsock.h
> > > @@ -120,11 +120,20 @@ struct vsock_transport {
> > >
> > > /* DGRAM. */
> > > int (*dgram_bind)(struct vsock_sock *, struct sockaddr_vm *);
> > > - int (*dgram_dequeue)(struct vsock_sock *vsk, struct msghdr *msg,
> > > - size_t len, int flags);
> > > int (*dgram_enqueue)(struct vsock_sock *, struct sockaddr_vm *,
> > > struct msghdr *, size_t len);
> > > bool (*dgram_allow)(u32 cid, u32 port);
> > > + int (*dgram_get_cid)(struct sk_buff *skb, unsigned int *cid);
> > > + int (*dgram_get_port)(struct sk_buff *skb, unsigned int *port);
> > > + int (*dgram_get_length)(struct sk_buff *skb, size_t *length);
> > > +
> > > + /* The number of bytes into the buffer at which the payload starts, as
> > > + * first seen by the receiving socket layer. For example, if the
> > > + * transport presets the skb pointers using skb_pull(sizeof(header))
> > > + * than this would be zero, otherwise it would be the size of the
> > > + * header.
> > > + */
> > > + const size_t dgram_payload_offset;
> > >
> > > /* STREAM. */
> > > /* TODO: stream_bind() */
> > > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> > > index efb8a0937a13..ffb4dd8b6ea7 100644
> > > --- a/net/vmw_vsock/af_vsock.c
> > > +++ b/net/vmw_vsock/af_vsock.c
> > > @@ -1271,11 +1271,15 @@ static int vsock_dgram_connect(struct socket *sock,
> > > int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
> > > size_t len, int flags)
> > > {
> > > + const struct vsock_transport *transport;
> > > #ifdef CONFIG_BPF_SYSCALL
> > > const struct proto *prot;
> > > #endif
> > > struct vsock_sock *vsk;
> > > + struct sk_buff *skb;
> > > + size_t payload_len;
> > > struct sock *sk;
> > > + int err;
> > >
> > > sk = sock->sk;
> > > vsk = vsock_sk(sk);
> > > @@ -1286,7 +1290,52 @@ int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
> > > return prot->recvmsg(sk, msg, len, flags, NULL);
> > > #endif
> > >
> > > - return vsk->transport->dgram_dequeue(vsk, msg, len, flags);
> > > + if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
> > > + return -EOPNOTSUPP;
> > > +
> > > + transport = vsk->transport;
> > > +
> > > + /* Retrieve the head sk_buff from the socket's receive queue. */
> > > + err = 0;
> > > + skb = skb_recv_datagram(sk_vsock(vsk), flags, &err);
> > > + if (!skb)
> > > + return err;
> > > +
> > > + err = transport->dgram_get_length(skb, &payload_len);
>
> What about ssize_t return value here?
>
> Or maybe a single callback that return both length and offset?
>
> .dgram_get_payload_info(skb, &payload_len, &payload_off)
>
What are your thoughts on Arseniy's idea of using skb->len and adding a
skb_pull() just before vmci adds the skb to the sk receive queue?
> > > + if (err)
> > > + goto out;
> > > +
> > > + if (payload_len > len) {
> > > + payload_len = len;
> > > + msg->msg_flags |= MSG_TRUNC;
> > > + }
> > > +
> > > + /* Place the datagram payload in the user's iovec. */
> > > + err = skb_copy_datagram_msg(skb, transport->dgram_payload_offset, msg, payload_len);
> > > + if (err)
> > > + goto out;
> > > +
> > > + if (msg->msg_name) {
> > > + /* Provide the address of the sender. */
> > > + DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
> > > + unsigned int cid, port;
> > > +
> > > + err = transport->dgram_get_cid(skb, &cid);
> > > + if (err)
> > > + goto out;
> > > +
> > > + err = transport->dgram_get_port(skb, &port);
> > > + if (err)
> > > + goto out;
> >
> > Maybe we can merge 'dgram_get_cid' and 'dgram_get_port' to a single callback? Because I see that this is
> > the only place where both are used (correct me if i'm wrong) and logically both operates with addresses:
> > CID and port. E.g. something like that: dgram_get_cid_n_port().
>
> What about .dgram_addr_init(struct sk_buff *skb, struct sockaddr_vm *addr)
> and the transport can set cid and port?
>
LGTM!
> >
> > Moreover, I'm not sure, but is it good "tradeoff" here: remove transport specific callback for dgram receive
> > where we already have 'msghdr' with both data buffer and buffer for 'sockaddr_vm' and instead of it add new
> > several fields (callbacks) to transports like dgram_get_cid(), dgram_get_port()? I agree, that in each transport
> > specific callback we will have same copying logic by calling 'skb_copy_datagram_msg()' and filling address
> > by using 'vsock_addr_init()', but in this case we don't need to update transports too much. For example HyperV
> > still unchanged as it does not support SOCK_DGRAM. For VMCI You just need to add 'vsock_addr_init()' logic
> > to it's dgram dequeue callback.
> >
> > What do You think?
>
> Honestly, I'd rather avoid duplicate code than reduce changes in
> transports that don't support dgram.
>
> One thing I do agree on though is minimizing the number of callbacks
> to call to reduce the number of indirection (more performance?).
>
> Thanks,
> Stefano
>
Thanks!
Best,
Bobby
> >
> > Thanks, Arseniy
> >
> > > +
> > > + vsock_addr_init(vm_addr, cid, port);
> > > + msg->msg_namelen = sizeof(*vm_addr);
> > > + }
> > > + err = payload_len;
> > > +
> > > +out:
> > > + skb_free_datagram(&vsk->sk, skb);
> > > + return err;
> > > }
> > > EXPORT_SYMBOL_GPL(vsock_dgram_recvmsg);
> > >
> > > diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
> > > index 7cb1a9d2cdb4..ff6e87e25fa0 100644
> > > --- a/net/vmw_vsock/hyperv_transport.c
> > > +++ b/net/vmw_vsock/hyperv_transport.c
> > > @@ -556,8 +556,17 @@ static int hvs_dgram_bind(struct vsock_sock *vsk, struct sockaddr_vm *addr)
> > > return -EOPNOTSUPP;
> > > }
> > >
> > > -static int hvs_dgram_dequeue(struct vsock_sock *vsk, struct msghdr *msg,
> > > - size_t len, int flags)
> > > +static int hvs_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
> > > +{
> > > + return -EOPNOTSUPP;
> > > +}
> > > +
> > > +static int hvs_dgram_get_port(struct sk_buff *skb, unsigned int *port)
> > > +{
> > > + return -EOPNOTSUPP;
> > > +}
> > > +
> > > +static int hvs_dgram_get_length(struct sk_buff *skb, size_t *len)
> > > {
> > > return -EOPNOTSUPP;
> > > }
> > > @@ -833,7 +842,9 @@ static struct vsock_transport hvs_transport = {
> > > .shutdown = hvs_shutdown,
> > >
> > > .dgram_bind = hvs_dgram_bind,
> > > - .dgram_dequeue = hvs_dgram_dequeue,
> > > + .dgram_get_cid = hvs_dgram_get_cid,
> > > + .dgram_get_port = hvs_dgram_get_port,
> > > + .dgram_get_length = hvs_dgram_get_length,
> > > .dgram_enqueue = hvs_dgram_enqueue,
> > > .dgram_allow = hvs_dgram_allow,
> > >
> > > diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> > > index e95df847176b..5763cdf13804 100644
> > > --- a/net/vmw_vsock/virtio_transport.c
> > > +++ b/net/vmw_vsock/virtio_transport.c
> > > @@ -429,9 +429,11 @@ static struct virtio_transport virtio_transport = {
> > > .cancel_pkt = virtio_transport_cancel_pkt,
> > >
> > > .dgram_bind = virtio_transport_dgram_bind,
> > > - .dgram_dequeue = virtio_transport_dgram_dequeue,
> > > .dgram_enqueue = virtio_transport_dgram_enqueue,
> > > .dgram_allow = virtio_transport_dgram_allow,
> > > + .dgram_get_cid = virtio_transport_dgram_get_cid,
> > > + .dgram_get_port = virtio_transport_dgram_get_port,
> > > + .dgram_get_length = virtio_transport_dgram_get_length,
> > >
> > > .stream_dequeue = virtio_transport_stream_dequeue,
> > > .stream_enqueue = virtio_transport_stream_enqueue,
> > > diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> > > index b769fc258931..e6903c719964 100644
> > > --- a/net/vmw_vsock/virtio_transport_common.c
> > > +++ b/net/vmw_vsock/virtio_transport_common.c
> > > @@ -797,6 +797,24 @@ int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> > > }
> > > EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
> > >
> > > +int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
> > > +{
> > > + return -EOPNOTSUPP;
> > > +}
> > > +EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_cid);
> > > +
> > > +int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port)
> > > +{
> > > + return -EOPNOTSUPP;
> > > +}
> > > +EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_port);
> > > +
> > > +int virtio_transport_dgram_get_length(struct sk_buff *skb, size_t *len)
> > > +{
> > > + return -EOPNOTSUPP;
> > > +}
> > > +EXPORT_SYMBOL_GPL(virtio_transport_dgram_get_length);
> > > +
> > > bool virtio_transport_dgram_allow(u32 cid, u32 port)
> > > {
> > > return false;
> > > diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
> > > index b370070194fa..bbc63826bf48 100644
> > > --- a/net/vmw_vsock/vmci_transport.c
> > > +++ b/net/vmw_vsock/vmci_transport.c
> > > @@ -1731,57 +1731,40 @@ static int vmci_transport_dgram_enqueue(
> > > return err - sizeof(*dg);
> > > }
> > >
> > > -static int vmci_transport_dgram_dequeue(struct vsock_sock *vsk,
> > > - struct msghdr *msg, size_t len,
> > > - int flags)
> > > +static int vmci_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
> > > {
> > > - int err;
> > > struct vmci_datagram *dg;
> > > - size_t payload_len;
> > > - struct sk_buff *skb;
> > >
> > > - if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
> > > - return -EOPNOTSUPP;
> > > + dg = (struct vmci_datagram *)skb->data;
> > > + if (!dg)
> > > + return -EINVAL;
> > >
> > > - /* Retrieve the head sk_buff from the socket's receive queue. */
> > > - err = 0;
> > > - skb = skb_recv_datagram(&vsk->sk, flags, &err);
> > > - if (!skb)
> > > - return err;
> > > + *cid = dg->src.context;
> > > + return 0;
> > > +}
> > > +
> > > +static int vmci_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port)
> > > +{
> > > + struct vmci_datagram *dg;
> > >
> > > dg = (struct vmci_datagram *)skb->data;
> > > if (!dg)
> > > - /* err is 0, meaning we read zero bytes. */
> > > - goto out;
> > > -
> > > - payload_len = dg->payload_size;
> > > - /* Ensure the sk_buff matches the payload size claimed in the packet. */
> > > - if (payload_len != skb->len - sizeof(*dg)) {
> > > - err = -EINVAL;
> > > - goto out;
> > > - }
> > > + return -EINVAL;
> > >
> > > - if (payload_len > len) {
> > > - payload_len = len;
> > > - msg->msg_flags |= MSG_TRUNC;
> > > - }
> > > + *port = dg->src.resource;
> > > + return 0;
> > > +}
> > >
> > > - /* Place the datagram payload in the user's iovec. */
> > > - err = skb_copy_datagram_msg(skb, sizeof(*dg), msg, payload_len);
> > > - if (err)
> > > - goto out;
> > > +static int vmci_transport_dgram_get_length(struct sk_buff *skb, size_t *len)
> > > +{
> > > + struct vmci_datagram *dg;
> > >
> > > - if (msg->msg_name) {
> > > - /* Provide the address of the sender. */
> > > - DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
> > > - vsock_addr_init(vm_addr, dg->src.context, dg->src.resource);
> > > - msg->msg_namelen = sizeof(*vm_addr);
> > > - }
> > > - err = payload_len;
> > > + dg = (struct vmci_datagram *)skb->data;
> > > + if (!dg)
> > > + return -EINVAL;
> > >
> > > -out:
> > > - skb_free_datagram(&vsk->sk, skb);
> > > - return err;
> > > + *len = dg->payload_size;
> > > + return 0;
> > > }
> > >
> > > static bool vmci_transport_dgram_allow(u32 cid, u32 port)
> > > @@ -2040,9 +2023,12 @@ static struct vsock_transport vmci_transport = {
> > > .release = vmci_transport_release,
> > > .connect = vmci_transport_connect,
> > > .dgram_bind = vmci_transport_dgram_bind,
> > > - .dgram_dequeue = vmci_transport_dgram_dequeue,
> > > .dgram_enqueue = vmci_transport_dgram_enqueue,
> > > .dgram_allow = vmci_transport_dgram_allow,
> > > + .dgram_get_cid = vmci_transport_dgram_get_cid,
> > > + .dgram_get_port = vmci_transport_dgram_get_port,
> > > + .dgram_get_length = vmci_transport_dgram_get_length,
> > > + .dgram_payload_offset = sizeof(struct vmci_datagram),
> > > .stream_dequeue = vmci_transport_stream_dequeue,
> > > .stream_enqueue = vmci_transport_stream_enqueue,
> > > .stream_has_data = vmci_transport_stream_has_data,
> > > diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
> > > index 5c6360df1f31..2f3cabc79ee5 100644
> > > --- a/net/vmw_vsock/vsock_loopback.c
> > > +++ b/net/vmw_vsock/vsock_loopback.c
> > > @@ -62,9 +62,11 @@ static struct virtio_transport loopback_transport = {
> > > .cancel_pkt = vsock_loopback_cancel_pkt,
> > >
> > > .dgram_bind = virtio_transport_dgram_bind,
> > > - .dgram_dequeue = virtio_transport_dgram_dequeue,
> > > .dgram_enqueue = virtio_transport_dgram_enqueue,
> > > .dgram_allow = virtio_transport_dgram_allow,
> > > + .dgram_get_cid = virtio_transport_dgram_get_cid,
> > > + .dgram_get_port = virtio_transport_dgram_get_port,
> > > + .dgram_get_length = virtio_transport_dgram_get_length,
> > >
> > > .stream_dequeue = virtio_transport_stream_dequeue,
> > > .stream_enqueue = virtio_transport_stream_enqueue,
> > >
> >
>
> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* RE: [PATCH v2] net: mana: Batch ringing RX queue doorbell on receiving packets
From: Dexuan Cui @ 2023-06-23 3:00 UTC (permalink / raw)
To: longli@linuxonhyperv.com, Jason Gunthorpe, Leon Romanovsky,
Ajay Sharma, KY Srinivasan, Haiyang Zhang, Wei Liu,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Long Li,
stable@vger.kernel.org
In-Reply-To: <1687450956-6407-1-git-send-email-longli@linuxonhyperv.com>
> From: longli@linuxonhyperv.com
> Sent: Thursday, June 22, 2023 9:23 AM
> ...
>
> It's inefficient to ring the doorbell page every time a WQE is posted to
> the received queue.
>
> Move the code for ringing doorbell page to where after we have posted all
> WQEs to the receive queue during a callback from napi_poll().
>
> Tests showed no regression in network latency benchmarks.
>
> Cc: stable@vger.kernel.org
> Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network
> Adapter (MANA)")
> Signed-off-by: Long Li <longli@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
^ permalink raw reply
* RE: [PATCH] Drivers: hv: Change hv_free_hyperv_page() to take void * argument
From: Dexuan Cui @ 2023-06-23 3:09 UTC (permalink / raw)
To: Kameron Carr, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
arnd@arndb.de, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
In-Reply-To: <1687464427-4722-1-git-send-email-kameroncarr@linux.microsoft.com>
> From: Kameron Carr <kameroncarr@linux.microsoft.com>
> Sent: Thursday, June 22, 2023 1:07 PM
> ...
> Currently hv_free_hyperv_page() takes an unsigned long argument, which
> is inconsistent with the void * return value from the corresponding
> hv_alloc_hyperv_page() function and variants. This creates unnecessary
> extra casting.
>
> Change the hv_free_hyperv_page() argument type to void *.
> Also remove redundant casts from invocations of
> hv_alloc_hyperv_page() and variants.
> ---
Hi Kameron, you forgot to add your Signed-off-by :-)
Otherwise, it looks good to me.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
^ permalink raw reply
* [PATCH V2 net] net: mana: Fix MANA VF unload when host is unresponsive
From: souradeep chakrabarti @ 2023-06-23 7:29 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni,
longli, sharmaajay, leon, cai.huoqing, ssengar, vkuznets, tglx,
linux-hyperv, netdev, linux-kernel, linux-rdma
Cc: stable, schakrabarti, Souradeep Chakrabarti
From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
This patch addresses the VF unload issue, where mana_dealloc_queues()
gets stuck in infinite while loop, because of host unresponsiveness.
It adds a timeout in the while loop, to fix it.
Also this patch adds a new attribute in mana_context, which gets set when
mana_hwc_send_request() hits a timeout because of host unresponsiveness.
This flag then helps to avoid the timeouts in successive calls.
Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
---
V1 -> V2:
* Added net branch
* Removed the typecasting to (struct mana_context*) of void pointer
* Repositioned timeout variable in mana_dealloc_queues()
* Repositioned vf_unload_timeout in mana_context struct, to utilise the
6 bytes hole
---
.../net/ethernet/microsoft/mana/gdma_main.c | 4 +++-
.../net/ethernet/microsoft/mana/hw_channel.c | 12 ++++++++++-
drivers/net/ethernet/microsoft/mana/mana_en.c | 21 +++++++++++++++++--
include/net/mana/mana.h | 2 ++
4 files changed, 35 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 8f3f78b68592..6411f01be0d9 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -946,10 +946,12 @@ int mana_gd_deregister_device(struct gdma_dev *gd)
struct gdma_context *gc = gd->gdma_context;
struct gdma_general_resp resp = {};
struct gdma_general_req req = {};
+ struct mana_context *ac;
int err;
if (gd->pdid == INVALID_PDID)
return -EINVAL;
+ ac = gd->driver_data;
mana_gd_init_req_hdr(&req.hdr, GDMA_DEREGISTER_DEVICE, sizeof(req),
sizeof(resp));
@@ -957,7 +959,7 @@ int mana_gd_deregister_device(struct gdma_dev *gd)
req.hdr.dev_id = gd->dev_id;
err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
- if (err || resp.hdr.status) {
+ if ((err || resp.hdr.status) && !ac->vf_unload_timeout) {
dev_err(gc->dev, "Failed to deregister device: %d, 0x%x\n",
err, resp.hdr.status);
if (!err)
diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c
index 9d1507eba5b9..492cb2c6e2cb 100644
--- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
+++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
@@ -1,8 +1,10 @@
// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
/* Copyright (c) 2021, Microsoft Corporation. */
+#include "asm-generic/errno.h"
#include <net/mana/gdma.h>
#include <net/mana/hw_channel.h>
+#include <net/mana/mana.h>
static int mana_hwc_get_msg_index(struct hw_channel_context *hwc, u16 *msg_id)
{
@@ -786,12 +788,19 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
struct hwc_wq *txq = hwc->txq;
struct gdma_req_hdr *req_msg;
struct hwc_caller_ctx *ctx;
+ struct mana_context *ac;
u32 dest_vrcq = 0;
u32 dest_vrq = 0;
u16 msg_id;
int err;
mana_hwc_get_msg_index(hwc, &msg_id);
+ ac = hwc->gdma_dev->driver_data;
+ if (ac->vf_unload_timeout) {
+ dev_err(hwc->dev, "HWC: vport is already unloaded.\n");
+ err = -ETIMEDOUT;
+ goto out;
+ }
tx_wr = &txq->msg_buf->reqs[msg_id];
@@ -825,9 +834,10 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
goto out;
}
- if (!wait_for_completion_timeout(&ctx->comp_event, 30 * HZ)) {
+ if (!wait_for_completion_timeout(&ctx->comp_event, 5 * HZ)) {
dev_err(hwc->dev, "HWC: Request timed out!\n");
err = -ETIMEDOUT;
+ ac->vf_unload_timeout = true;
goto out;
}
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index d907727c7b7a..cb2080b3a00c 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2329,7 +2329,10 @@ static int mana_dealloc_queues(struct net_device *ndev)
{
struct mana_port_context *apc = netdev_priv(ndev);
struct gdma_dev *gd = apc->ac->gdma_dev;
+ unsigned long timeout;
struct mana_txq *txq;
+ struct sk_buff *skb;
+ struct mana_cq *cq;
int i, err;
if (apc->port_is_up)
@@ -2348,13 +2351,26 @@ static int mana_dealloc_queues(struct net_device *ndev)
*
* Drain all the in-flight TX packets
*/
+
+ timeout = jiffies + 120 * HZ;
for (i = 0; i < apc->num_queues; i++) {
txq = &apc->tx_qp[i].txq;
-
- while (atomic_read(&txq->pending_sends) > 0)
+ while (atomic_read(&txq->pending_sends) > 0 &&
+ time_before(jiffies, timeout)) {
usleep_range(1000, 2000);
+ }
}
+ for (i = 0; i < apc->num_queues; i++) {
+ txq = &apc->tx_qp[i].txq;
+ cq = &apc->tx_qp[i].tx_cq;
+ while (atomic_read(&txq->pending_sends)) {
+ skb = skb_dequeue(&txq->pending_skbs);
+ mana_unmap_skb(skb, apc);
+ napi_consume_skb(skb, cq->budget);
+ atomic_sub(1, &txq->pending_sends);
+ }
+ }
/* We're 100% sure the queues can no longer be woken up, because
* we're sure now mana_poll_tx_cq() can't be running.
*/
@@ -2605,6 +2621,7 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
}
}
+ ac->vf_unload_timeout = false;
err = add_adev(gd);
out:
if (err)
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 9eef19972845..5f5affdca1eb 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -358,6 +358,8 @@ struct mana_context {
u16 num_ports;
+ bool vf_unload_timeout;
+
struct mana_eq *eqs;
struct net_device *ports[MAX_PORTS_IN_MANA_DEV];
--
2.34.1
^ permalink raw reply related
* Re: [PATCH RFC net-next v4 1/8] vsock/dgram: generalize recvmsg and drop transport->dgram_dequeue
From: Stefano Garzarella @ 2023-06-23 8:14 UTC (permalink / raw)
To: Bobby Eshleman
Cc: Arseniy Krasnov, linux-hyperv, Bobby Eshleman, kvm,
Michael S. Tsirkin, VMware PV-Drivers Reviewers, Simon Horman,
virtualization, Eric Dumazet, Dan Carpenter, Xuan Zhuo, Wei Liu,
Dexuan Cui, Bryan Tan, Jakub Kicinski, Paolo Abeni, Haiyang Zhang,
Stefan Hajnoczi, Vishnu Dasa, netdev, linux-kernel, bpf,
David S. Miller
In-Reply-To: <ZJTbRsU5kQfLEV9c@bullseye>
On Thu, Jun 22, 2023 at 11:37:42PM +0000, Bobby Eshleman wrote:
>On Thu, Jun 22, 2023 at 04:51:41PM +0200, Stefano Garzarella wrote:
>> On Sun, Jun 11, 2023 at 11:43:15PM +0300, Arseniy Krasnov wrote:
>> > Hello Bobby! Thanks for this patchset! Small comment below:
>> >
>> > On 10.06.2023 03:58, Bobby Eshleman wrote:
>> > > This commit drops the transport->dgram_dequeue callback and makes
>> > > vsock_dgram_recvmsg() generic. It also adds additional transport
>> > > callbacks for use by the generic vsock_dgram_recvmsg(), such as for
>> > > parsing skbs for CID/port which vary in format per transport.
>> > >
>> > > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
>> > > ---
>> > > drivers/vhost/vsock.c | 4 +-
>> > > include/linux/virtio_vsock.h | 3 ++
>> > > include/net/af_vsock.h | 13 ++++++-
>> > > net/vmw_vsock/af_vsock.c | 51 ++++++++++++++++++++++++-
>> > > net/vmw_vsock/hyperv_transport.c | 17 +++++++--
>> > > net/vmw_vsock/virtio_transport.c | 4 +-
>> > > net/vmw_vsock/virtio_transport_common.c | 18 +++++++++
>> > > net/vmw_vsock/vmci_transport.c | 68 +++++++++++++--------------------
>> > > net/vmw_vsock/vsock_loopback.c | 4 +-
>> > > 9 files changed, 132 insertions(+), 50 deletions(-)
>> > >
>> > > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> > > index 6578db78f0ae..c8201c070b4b 100644
>> > > --- a/drivers/vhost/vsock.c
>> > > +++ b/drivers/vhost/vsock.c
>> > > @@ -410,9 +410,11 @@ static struct virtio_transport vhost_transport = {
>> > > .cancel_pkt = vhost_transport_cancel_pkt,
>> > >
>> > > .dgram_enqueue = virtio_transport_dgram_enqueue,
>> > > - .dgram_dequeue = virtio_transport_dgram_dequeue,
>> > > .dgram_bind = virtio_transport_dgram_bind,
>> > > .dgram_allow = virtio_transport_dgram_allow,
>> > > + .dgram_get_cid = virtio_transport_dgram_get_cid,
>> > > + .dgram_get_port = virtio_transport_dgram_get_port,
>> > > + .dgram_get_length = virtio_transport_dgram_get_length,
>> > >
>> > > .stream_enqueue = virtio_transport_stream_enqueue,
>> > > .stream_dequeue = virtio_transport_stream_dequeue,
>> > > diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>> > > index c58453699ee9..23521a318cf0 100644
>> > > --- a/include/linux/virtio_vsock.h
>> > > +++ b/include/linux/virtio_vsock.h
>> > > @@ -219,6 +219,9 @@ bool virtio_transport_stream_allow(u32 cid, u32 port);
>> > > int virtio_transport_dgram_bind(struct vsock_sock *vsk,
>> > > struct sockaddr_vm *addr);
>> > > bool virtio_transport_dgram_allow(u32 cid, u32 port);
>> > > +int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid);
>> > > +int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port);
>> > > +int virtio_transport_dgram_get_length(struct sk_buff *skb, size_t *len);
>> > >
>> > > int virtio_transport_connect(struct vsock_sock *vsk);
>> > >
>> > > diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>> > > index 0e7504a42925..7bedb9ee7e3e 100644
>> > > --- a/include/net/af_vsock.h
>> > > +++ b/include/net/af_vsock.h
>> > > @@ -120,11 +120,20 @@ struct vsock_transport {
>> > >
>> > > /* DGRAM. */
>> > > int (*dgram_bind)(struct vsock_sock *, struct sockaddr_vm *);
>> > > - int (*dgram_dequeue)(struct vsock_sock *vsk, struct msghdr *msg,
>> > > - size_t len, int flags);
>> > > int (*dgram_enqueue)(struct vsock_sock *, struct sockaddr_vm *,
>> > > struct msghdr *, size_t len);
>> > > bool (*dgram_allow)(u32 cid, u32 port);
>> > > + int (*dgram_get_cid)(struct sk_buff *skb, unsigned int *cid);
>> > > + int (*dgram_get_port)(struct sk_buff *skb, unsigned int *port);
>> > > + int (*dgram_get_length)(struct sk_buff *skb, size_t *length);
>> > > +
>> > > + /* The number of bytes into the buffer at which the payload starts, as
>> > > + * first seen by the receiving socket layer. For example, if the
>> > > + * transport presets the skb pointers using skb_pull(sizeof(header))
>> > > + * than this would be zero, otherwise it would be the size of the
>> > > + * header.
>> > > + */
>> > > + const size_t dgram_payload_offset;
>> > >
>> > > /* STREAM. */
>> > > /* TODO: stream_bind() */
>> > > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>> > > index efb8a0937a13..ffb4dd8b6ea7 100644
>> > > --- a/net/vmw_vsock/af_vsock.c
>> > > +++ b/net/vmw_vsock/af_vsock.c
>> > > @@ -1271,11 +1271,15 @@ static int vsock_dgram_connect(struct socket *sock,
>> > > int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
>> > > size_t len, int flags)
>> > > {
>> > > + const struct vsock_transport *transport;
>> > > #ifdef CONFIG_BPF_SYSCALL
>> > > const struct proto *prot;
>> > > #endif
>> > > struct vsock_sock *vsk;
>> > > + struct sk_buff *skb;
>> > > + size_t payload_len;
>> > > struct sock *sk;
>> > > + int err;
>> > >
>> > > sk = sock->sk;
>> > > vsk = vsock_sk(sk);
>> > > @@ -1286,7 +1290,52 @@ int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
>> > > return prot->recvmsg(sk, msg, len, flags, NULL);
>> > > #endif
>> > >
>> > > - return vsk->transport->dgram_dequeue(vsk, msg, len, flags);
>> > > + if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
>> > > + return -EOPNOTSUPP;
>> > > +
>> > > + transport = vsk->transport;
>> > > +
>> > > + /* Retrieve the head sk_buff from the socket's receive queue. */
>> > > + err = 0;
>> > > + skb = skb_recv_datagram(sk_vsock(vsk), flags, &err);
>> > > + if (!skb)
>> > > + return err;
>> > > +
>> > > + err = transport->dgram_get_length(skb, &payload_len);
>>
>> What about ssize_t return value here?
>>
>> Or maybe a single callback that return both length and offset?
>>
>> .dgram_get_payload_info(skb, &payload_len, &payload_off)
>>
>
>What are your thoughts on Arseniy's idea of using skb->len and adding a
>skb_pull() just before vmci adds the skb to the sk receive queue?
Yep, I agree on that!
Thanks,
Stefano
^ permalink raw reply
* Re: [PATCH RFC net-next v4 4/8] vsock: make vsock bind reusable
From: Stefano Garzarella @ 2023-06-23 8:15 UTC (permalink / raw)
To: Bobby Eshleman
Cc: Bobby Eshleman, Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang,
Xuan Zhuo, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Bryan Tan, Vishnu Dasa, VMware PV-Drivers Reviewers,
Dan Carpenter, Simon Horman, Krasnov Arseniy, kvm, virtualization,
netdev, linux-kernel, linux-hyperv, bpf
In-Reply-To: <ZJTTx0XJ2LeITNh0@bullseye>
On Thu, Jun 22, 2023 at 11:05:43PM +0000, Bobby Eshleman wrote:
>On Thu, Jun 22, 2023 at 05:25:55PM +0200, Stefano Garzarella wrote:
>> On Sat, Jun 10, 2023 at 12:58:31AM +0000, Bobby Eshleman wrote:
>> > This commit makes the bind table management functions in vsock usable
>> > for different bind tables. For use by datagrams in a future patch.
>> >
>> > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
>> > ---
>> > net/vmw_vsock/af_vsock.c | 33 ++++++++++++++++++++++++++-------
>> > 1 file changed, 26 insertions(+), 7 deletions(-)
>> >
>> > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>> > index ef86765f3765..7a3ca4270446 100644
>> > --- a/net/vmw_vsock/af_vsock.c
>> > +++ b/net/vmw_vsock/af_vsock.c
>> > @@ -230,11 +230,12 @@ static void __vsock_remove_connected(struct vsock_sock *vsk)
>> > sock_put(&vsk->sk);
>> > }
>> >
>> > -static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr)
>> > +struct sock *vsock_find_bound_socket_common(struct sockaddr_vm *addr,
>> > + struct list_head *bind_table)
>> > {
>> > struct vsock_sock *vsk;
>> >
>> > - list_for_each_entry(vsk, vsock_bound_sockets(addr), bound_table) {
>> > + list_for_each_entry(vsk, bind_table, bound_table) {
>> > if (vsock_addr_equals_addr(addr, &vsk->local_addr))
>> > return sk_vsock(vsk);
>> >
>> > @@ -247,6 +248,11 @@ static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr)
>> > return NULL;
>> > }
>> >
>> > +static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr)
>> > +{
>> > + return vsock_find_bound_socket_common(addr, vsock_bound_sockets(addr));
>> > +}
>> > +
>> > static struct sock *__vsock_find_connected_socket(struct sockaddr_vm *src,
>> > struct sockaddr_vm *dst)
>> > {
>> > @@ -646,12 +652,17 @@ static void vsock_pending_work(struct work_struct *work)
>> >
>> > /**** SOCKET OPERATIONS ****/
>> >
>> > -static int __vsock_bind_connectible(struct vsock_sock *vsk,
>> > - struct sockaddr_vm *addr)
>> > +static int vsock_bind_common(struct vsock_sock *vsk,
>> > + struct sockaddr_vm *addr,
>> > + struct list_head *bind_table,
>> > + size_t table_size)
>> > {
>> > static u32 port;
>> > struct sockaddr_vm new_addr;
>> >
>> > + if (table_size < VSOCK_HASH_SIZE)
>> > + return -1;
>>
>> Why we need this check now?
>>
>
>If the table_size is not at least VSOCK_HASH_SIZE then the
>VSOCK_HASH(addr) used later could overflow the table.
>
>Maybe this really deserves a WARN() and a comment?
Yes, please WARN_ONCE() should be enough.
Stefano
^ permalink raw reply
* Re: [PATCH V2 net] net: mana: Fix MANA VF unload when host is unresponsive
From: Jiri Pirko @ 2023-06-23 9:43 UTC (permalink / raw)
To: souradeep chakrabarti
Cc: kys, haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni,
longli, sharmaajay, leon, cai.huoqing, ssengar, vkuznets, tglx,
linux-hyperv, netdev, linux-kernel, linux-rdma, stable,
schakrabarti
In-Reply-To: <1687505355-29212-1-git-send-email-schakrabarti@linux.microsoft.com>
Fri, Jun 23, 2023 at 09:29:15AM CEST, schakrabarti@linux.microsoft.com wrote:
>From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
>
>This patch addresses the VF unload issue, where mana_dealloc_queues()
double space here ^^
>gets stuck in infinite while loop, because of host unresponsiveness.
>It adds a timeout in the while loop, to fix it.
>
>Also this patch adds a new attribute in mana_context, which gets set when
>mana_hwc_send_request() hits a timeout because of host unresponsiveness.
>This flag then helps to avoid the timeouts in successive calls.
You aparently combine 2 patches together. Please split.
>
>Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
Please provide a "Fixes" tag with pointer to the commit that introduced
the problem.
>---
>V1 -> V2:
>* Added net branch
>* Removed the typecasting to (struct mana_context*) of void pointer
>* Repositioned timeout variable in mana_dealloc_queues()
>* Repositioned vf_unload_timeout in mana_context struct, to utilise the
> 6 bytes hole
>---
> .../net/ethernet/microsoft/mana/gdma_main.c | 4 +++-
> .../net/ethernet/microsoft/mana/hw_channel.c | 12 ++++++++++-
> drivers/net/ethernet/microsoft/mana/mana_en.c | 21 +++++++++++++++++--
> include/net/mana/mana.h | 2 ++
> 4 files changed, 35 insertions(+), 4 deletions(-)
>
>diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
>index 8f3f78b68592..6411f01be0d9 100644
>--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
>+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
>@@ -946,10 +946,12 @@ int mana_gd_deregister_device(struct gdma_dev *gd)
> struct gdma_context *gc = gd->gdma_context;
> struct gdma_general_resp resp = {};
> struct gdma_general_req req = {};
>+ struct mana_context *ac;
> int err;
>
> if (gd->pdid == INVALID_PDID)
> return -EINVAL;
>+ ac = gd->driver_data;
>
> mana_gd_init_req_hdr(&req.hdr, GDMA_DEREGISTER_DEVICE, sizeof(req),
> sizeof(resp));
>@@ -957,7 +959,7 @@ int mana_gd_deregister_device(struct gdma_dev *gd)
> req.hdr.dev_id = gd->dev_id;
>
> err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
>- if (err || resp.hdr.status) {
>+ if ((err || resp.hdr.status) && !ac->vf_unload_timeout) {
> dev_err(gc->dev, "Failed to deregister device: %d, 0x%x\n",
> err, resp.hdr.status);
> if (!err)
>diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c
>index 9d1507eba5b9..492cb2c6e2cb 100644
>--- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
>+++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
>@@ -1,8 +1,10 @@
> // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
> /* Copyright (c) 2021, Microsoft Corporation. */
>
>+#include "asm-generic/errno.h"
> #include <net/mana/gdma.h>
> #include <net/mana/hw_channel.h>
>+#include <net/mana/mana.h>
>
> static int mana_hwc_get_msg_index(struct hw_channel_context *hwc, u16 *msg_id)
> {
>@@ -786,12 +788,19 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
> struct hwc_wq *txq = hwc->txq;
> struct gdma_req_hdr *req_msg;
> struct hwc_caller_ctx *ctx;
>+ struct mana_context *ac;
> u32 dest_vrcq = 0;
> u32 dest_vrq = 0;
> u16 msg_id;
> int err;
>
> mana_hwc_get_msg_index(hwc, &msg_id);
>+ ac = hwc->gdma_dev->driver_data;
>+ if (ac->vf_unload_timeout) {
>+ dev_err(hwc->dev, "HWC: vport is already unloaded.\n");
>+ err = -ETIMEDOUT;
>+ goto out;
>+ }
>
> tx_wr = &txq->msg_buf->reqs[msg_id];
>
>@@ -825,9 +834,10 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
> goto out;
> }
>
>- if (!wait_for_completion_timeout(&ctx->comp_event, 30 * HZ)) {
>+ if (!wait_for_completion_timeout(&ctx->comp_event, 5 * HZ)) {
> dev_err(hwc->dev, "HWC: Request timed out!\n");
> err = -ETIMEDOUT;
>+ ac->vf_unload_timeout = true;
> goto out;
> }
>
>diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
>index d907727c7b7a..cb2080b3a00c 100644
>--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
>+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
>@@ -2329,7 +2329,10 @@ static int mana_dealloc_queues(struct net_device *ndev)
> {
> struct mana_port_context *apc = netdev_priv(ndev);
> struct gdma_dev *gd = apc->ac->gdma_dev;
>+ unsigned long timeout;
> struct mana_txq *txq;
>+ struct sk_buff *skb;
>+ struct mana_cq *cq;
> int i, err;
>
> if (apc->port_is_up)
>@@ -2348,13 +2351,26 @@ static int mana_dealloc_queues(struct net_device *ndev)
> *
> * Drain all the in-flight TX packets
> */
>+
>+ timeout = jiffies + 120 * HZ;
> for (i = 0; i < apc->num_queues; i++) {
> txq = &apc->tx_qp[i].txq;
>-
>- while (atomic_read(&txq->pending_sends) > 0)
>+ while (atomic_read(&txq->pending_sends) > 0 &&
>+ time_before(jiffies, timeout)) {
> usleep_range(1000, 2000);
>+ }
> }
>
>+ for (i = 0; i < apc->num_queues; i++) {
>+ txq = &apc->tx_qp[i].txq;
>+ cq = &apc->tx_qp[i].tx_cq;
>+ while (atomic_read(&txq->pending_sends)) {
>+ skb = skb_dequeue(&txq->pending_skbs);
>+ mana_unmap_skb(skb, apc);
>+ napi_consume_skb(skb, cq->budget);
>+ atomic_sub(1, &txq->pending_sends);
>+ }
>+ }
> /* We're 100% sure the queues can no longer be woken up, because
> * we're sure now mana_poll_tx_cq() can't be running.
> */
>@@ -2605,6 +2621,7 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
> }
> }
>
>+ ac->vf_unload_timeout = false;
Pointless init. You have the struct zeroed during allocation.
> err = add_adev(gd);
> out:
> if (err)
>diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
>index 9eef19972845..5f5affdca1eb 100644
>--- a/include/net/mana/mana.h
>+++ b/include/net/mana/mana.h
>@@ -358,6 +358,8 @@ struct mana_context {
>
> u16 num_ports;
>
>+ bool vf_unload_timeout;
>+
> struct mana_eq *eqs;
>
> struct net_device *ports[MAX_PORTS_IN_MANA_DEV];
>--
>2.34.1
>
>
^ permalink raw reply
* [PATCH v3] x86/hyperv: add noop functions to x86_init mpparse functions
From: Saurabh Sengar @ 2023-06-23 16:28 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
mikelley, linux-kernel, linux-hyperv, hpa
Hyper-V can run VMs at different privilege "levels" known as Virtual
Trust Levels (VTL). Sometimes, it chooses to run two different VMs
at different levels but they share some of their address space. In
such setups VTL2 (higher level VM) has visibility of all of the
VTL0 (level 0) memory space.
When the CONFIG_X86_MPPARSE is enabled for VTL2, the VTL2 kernel
performs a search within the low memory to locate MP tables. However,
in systems where VTL0 manages the low memory and may contain valid
tables, this scanning can result in incorrect MP table information
being provided to the VTL2 kernel, mistakenly considering VTL0's MP
table as its own
Add noop functions to avoid MP parse scan by VTL2.
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
---
[V3]
- modify commit message.
arch/x86/hyperv/hv_vtl.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c
index 85d38b9f3586..db5d2ea39fc0 100644
--- a/arch/x86/hyperv/hv_vtl.c
+++ b/arch/x86/hyperv/hv_vtl.c
@@ -25,6 +25,10 @@ void __init hv_vtl_init_platform(void)
x86_init.irqs.pre_vector_init = x86_init_noop;
x86_init.timers.timer_init = x86_init_noop;
+ /* Avoid searching for BIOS MP tables */
+ x86_init.mpparse.find_smp_config = x86_init_noop;
+ x86_init.mpparse.get_smp_config = x86_init_uint_noop;
+
x86_platform.get_wallclock = get_rtc_noop;
x86_platform.set_wallclock = set_rtc_noop;
x86_platform.get_nmi_reason = hv_get_nmi_reason;
--
2.34.1
^ permalink raw reply related
* Re: [PATCH V2 net] net: mana: Fix MANA VF unload when host is unresponsive
From: Simon Horman @ 2023-06-23 17:44 UTC (permalink / raw)
To: souradeep chakrabarti
Cc: kys, haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni,
longli, sharmaajay, leon, cai.huoqing, ssengar, vkuznets, tglx,
linux-hyperv, netdev, linux-kernel, linux-rdma, stable,
schakrabarti
In-Reply-To: <1687505355-29212-1-git-send-email-schakrabarti@linux.microsoft.com>
On Fri, Jun 23, 2023 at 12:29:15AM -0700, souradeep chakrabarti wrote:
> From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
>
> This patch addresses the VF unload issue, where mana_dealloc_queues()
> gets stuck in infinite while loop, because of host unresponsiveness.
> It adds a timeout in the while loop, to fix it.
>
> Also this patch adds a new attribute in mana_context, which gets set when
> mana_hwc_send_request() hits a timeout because of host unresponsiveness.
> This flag then helps to avoid the timeouts in successive calls.
>
> Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
...
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index d907727c7b7a..cb2080b3a00c 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
...
> @@ -2348,13 +2351,26 @@ static int mana_dealloc_queues(struct net_device *ndev)
> *
> * Drain all the in-flight TX packets
> */
> +
> + timeout = jiffies + 120 * HZ;
> for (i = 0; i < apc->num_queues; i++) {
> txq = &apc->tx_qp[i].txq;
> -
> - while (atomic_read(&txq->pending_sends) > 0)
> + while (atomic_read(&txq->pending_sends) > 0 &&
> + time_before(jiffies, timeout)) {
> usleep_range(1000, 2000);
> + }
Hi Souradeep,
minor feedback from my side, as it seems there will be a new version anyway:
I think the braces - '{' '}' - are unnecessary above.
...
--
pw-bot: cr
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox