Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* RE: [PATCH 1/2 V3 net] net: mana: Fix MANA VF unload when host is unresponsive
From: Dexuan Cui @ 2023-06-26 20:06 UTC (permalink / raw)
  To: Simon Horman, souradeep chakrabarti
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, Long Li, Ajay Sharma, leon@kernel.org,
	cai.huoqing@linux.dev, ssengar@linux.microsoft.com,
	vkuznets@redhat.com, tglx@linutronix.de,
	linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
	stable@vger.kernel.org, Souradeep Chakrabarti
In-Reply-To: <ZJmNBKA3ygDryP7i@corigine.com>

> From: Simon Horman
> Sent: Monday, June 26, 2023 6:05 AM
> > ...
> > Fixes: ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f (net: mana: Add a
> > driver for
> > Microsoft Azure Network Adapter)
> 
> nit: A correct format of this fixes tag is:
> 
> In particular:
> * All on lone line
> * Description in double quotes.
> 
> Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network
> Adapter (MANA)")

Hi Souradeep, FYI I often refer to:
https://marc.info/?l=linux-pci&m=150905742808166&w=2

The link mentions:
alias gsr='git --no-pager show -s --abbrev-commit --abbrev=12 --pretty=format:"%h (\"%s\")%n"'

"gsr ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f" produces:
ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")

^ permalink raw reply

* RE: [PATCH 2/2 V3 net] net: mana: Fix MANA VF unload when host is unresponsive
From: Haiyang Zhang @ 2023-06-26 16:02 UTC (permalink / raw)
  To: Haiyang Zhang, souradeep chakrabarti, KY Srinivasan,
	wei.liu@kernel.org, Dexuan Cui, davem@davemloft.net,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com, Long Li,
	Ajay Sharma, leon@kernel.org, cai.huoqing@linux.dev,
	ssengar@linux.microsoft.com, vkuznets@redhat.com,
	tglx@linutronix.de, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org
  Cc: stable@vger.kernel.org, Souradeep Chakrabarti
In-Reply-To: <PH7PR21MB3116B7C4826A19F3103D6304CA26A@PH7PR21MB3116.namprd21.prod.outlook.com>



> -----Original Message-----
> From: Haiyang Zhang <haiyangz@microsoft.com>
> Sent: Monday, June 26, 2023 11:54 AM
> To: souradeep chakrabarti <schakrabarti@linux.microsoft.com>; KY Srinivasan
> <kys@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <decui@microsoft.com>; davem@davemloft.net; edumazet@google.com;
> kuba@kernel.org; pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay
> Sharma <sharmaajay@microsoft.com>; leon@kernel.org;
> cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com;
> tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org
> Cc: stable@vger.kernel.org; Souradeep Chakrabarti
> <schakrabarti@microsoft.com>
> Subject: RE: [PATCH 2/2 V3 net] net: mana: Fix MANA VF unload when host is
> unresponsive
> 
> 
> 
> > -----Original Message-----
> > From: souradeep chakrabarti <schakrabarti@linux.microsoft.com>
> > Sent: Monday, June 26, 2023 5:20 AM
> > To: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> > <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> > <decui@microsoft.com>; davem@davemloft.net; edumazet@google.com;
> > kuba@kernel.org; pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay
> > Sharma <sharmaajay@microsoft.com>; leon@kernel.org;
> > cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com;
> > tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org;
> > linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org
> > Cc: stable@vger.kernel.org; Souradeep Chakrabarti
> > <schakrabarti@microsoft.com>; Souradeep Chakrabarti
> > <schakrabarti@linux.microsoft.com>
> > Subject: [PATCH 2/2 V3 net] net: mana: Fix MANA VF unload when host is
> > unresponsive
> 
> In general, two patches shouldn't have the same Subject.
> 

If two patches are preferred, please use more descriptive subjects like these:

1/2: Fix the infinite waiting on pending_sends during host unresponsiveness
2/2: Avoid extra waits if host not responding in earlier steps

Thanks,
- Haiyang


^ permalink raw reply

* RE: [PATCH 2/2 V3 net] net: mana: Fix MANA VF unload when host is unresponsive
From: Haiyang Zhang @ 2023-06-26 15:53 UTC (permalink / raw)
  To: souradeep chakrabarti, KY Srinivasan, wei.liu@kernel.org,
	Dexuan Cui, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, Long Li, Ajay Sharma,
	leon@kernel.org, cai.huoqing@linux.dev,
	ssengar@linux.microsoft.com, vkuznets@redhat.com,
	tglx@linutronix.de, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org
  Cc: stable@vger.kernel.org, Souradeep Chakrabarti
In-Reply-To: <1687771224-27162-1-git-send-email-schakrabarti@linux.microsoft.com>



> -----Original Message-----
> From: souradeep chakrabarti <schakrabarti@linux.microsoft.com>
> Sent: Monday, June 26, 2023 5:20 AM
> To: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <decui@microsoft.com>; davem@davemloft.net; edumazet@google.com;
> kuba@kernel.org; pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay
> Sharma <sharmaajay@microsoft.com>; leon@kernel.org;
> cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com;
> tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org
> Cc: stable@vger.kernel.org; Souradeep Chakrabarti
> <schakrabarti@microsoft.com>; Souradeep Chakrabarti
> <schakrabarti@linux.microsoft.com>
> Subject: [PATCH 2/2 V3 net] net: mana: Fix MANA VF unload when host is
> unresponsive

In general, two patches shouldn't have the same Subject.

For this patch set, the two patches are two steps to fix the unloading issue, and
they are not very long. IMHO, they should be in one patch.

Thanks,
- Haiyang


^ permalink raw reply

* RE: [PATCH 1/2 V3 net] net: mana: Fix MANA VF unload when host is unresponsive
From: Michael Kelley (LINUX) @ 2023-06-26 15:35 UTC (permalink / raw)
  To: souradeep chakrabarti, KY Srinivasan, Haiyang Zhang,
	wei.liu@kernel.org, Dexuan Cui, davem@davemloft.net,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com, Long Li,
	Ajay Sharma, leon@kernel.org, cai.huoqing@linux.dev,
	ssengar@linux.microsoft.com, vkuznets@redhat.com,
	tglx@linutronix.de, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org
  Cc: stable@vger.kernel.org, Souradeep Chakrabarti
In-Reply-To: <1687771137-26911-1-git-send-email-schakrabarti@linux.microsoft.com>

From: souradeep chakrabarti <schakrabarti@linux.microsoft.com> Sent: Monday, June 26, 2023 2:19 AM
> 
> This patch addresses the VF unload issue, where mana_dealloc_queues()
> gets stuck in infinite while loop, because of host unresponsiveness.
> It adds a timeout in the while loop, to fix it.

For a patch series, the cover letter (patch 0 of the series) does not get
included in the commit log anywhere.   The cover letter can provide
overall motivation and describe how the patches fit together, but the
commit message for each patch should be as self-contained as possible.

The commit message here refers to "the VF unload issue", and there's
no context for understanding what that issue is, though you do provide
some description in the text following "where". Could you provide a
commit message that is a bit more self-contained?

Same comment applies to commit message for the 2nd patch of this
series.

Also, avoid text like "this patch".   See the "Describe your changes"
section in Documentation/process/submitting-patches.rst where the
use of imperative mood is mentioned.  If you like, I can provide some
offline help on writing a good commit message.

Michael

> 
> Fixes: ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f (net: mana: Add a driver for
> Microsoft Azure Network Adapter)
> Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> ---
> V2 -> V3:
> * Splitted the patch in two parts.
> * Removed the unnecessary braces from mana_dealloc_queues().
> ---
>  drivers/net/ethernet/microsoft/mana/mana_en.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index d907727c7b7a..cb5c43c3c47e 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -2329,7 +2329,10 @@ static int mana_dealloc_queues(struct net_device *ndev)
>  {
>  	struct mana_port_context *apc = netdev_priv(ndev);
>  	struct gdma_dev *gd = apc->ac->gdma_dev;
> +	unsigned long timeout;
>  	struct mana_txq *txq;
> +	struct sk_buff *skb;
> +	struct mana_cq *cq;
>  	int i, err;
> 
>  	if (apc->port_is_up)
> @@ -2348,13 +2351,25 @@ static int mana_dealloc_queues(struct net_device *ndev)
>  	 *
>  	 * Drain all the in-flight TX packets
>  	 */
> +
> +	timeout = jiffies + 120 * HZ;
>  	for (i = 0; i < apc->num_queues; i++) {
>  		txq = &apc->tx_qp[i].txq;
> -
> -		while (atomic_read(&txq->pending_sends) > 0)
> +		while (atomic_read(&txq->pending_sends) > 0 &&
> +		       time_before(jiffies, timeout))
>  			usleep_range(1000, 2000);
>  	}
> 
> +	for (i = 0; i < apc->num_queues; i++) {
> +		txq = &apc->tx_qp[i].txq;
> +		cq = &apc->tx_qp[i].tx_cq;
> +		while (atomic_read(&txq->pending_sends)) {
> +			skb = skb_dequeue(&txq->pending_skbs);
> +			mana_unmap_skb(skb, apc);
> +			napi_consume_skb(skb, cq->budget);
> +			atomic_sub(1, &txq->pending_sends);
> +		}
> +	}
>  	/* We're 100% sure the queues can no longer be woken up, because
>  	 * we're sure now mana_poll_tx_cq() can't be running.
>  	 */
> --
> 2.34.1


^ permalink raw reply

* Re: [PATCH RFC net-next v4 6/8] virtio/vsock: support dgrams
From: Stefano Garzarella @ 2023-06-26 15:03 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers, Dan Carpenter, Simon Horman, kvm,
	virtualization, netdev, linux-kernel, linux-hyperv, bpf
In-Reply-To: <ZJUho6NbpCgGatap@bullseye>

On Fri, Jun 23, 2023 at 04:37:55AM +0000, Bobby Eshleman wrote:
>On Thu, Jun 22, 2023 at 06:09:12PM +0200, Stefano Garzarella wrote:
>> On Sun, Jun 11, 2023 at 11:49:02PM +0300, Arseniy Krasnov wrote:
>> > Hello Bobby!
>> >
>> > On 10.06.2023 03:58, Bobby Eshleman wrote:
>> > > This commit adds support for datagrams over virtio/vsock.
>> > >
>> > > Message boundaries are preserved on a per-skb and per-vq entry basis.
>> >
>> > I'm a little bit confused about the following case: let vhost sends 4097 bytes
>> > datagram to the guest. Guest uses 4096 RX buffers in it's virtio queue, each
>> > buffer has attached empty skb to it. Vhost places first 4096 bytes to the first
>> > buffer of guests RX queue, and 1 last byte to the second buffer. Now IIUC guest
>> > has two skb in it rx queue, and user in guest wants to read data - does it read
>> > 4097 bytes, while guest has two skb - 4096 bytes and 1 bytes? In seqpacket there is
>> > special marker in header which shows where message ends, and how it works here?
>>
>> I think the main difference is that DGRAM is not connection-oriented, so
>> we don't have a stream and we can't split the packet into 2 (maybe we
>> could, but we have no guarantee that the second one for example will be
>> not discarded because there is no space).
>>
>> So I think it is acceptable as a restriction to keep it simple.
>>
>> My only doubt is, should we make the RX buffer size configurable,
>> instead of always using 4k?
>>
>I think that is a really good idea. What mechanism do you imagine?

Some parameter in sysfs?

>
>For sendmsg() with buflen > VQ_BUF_SIZE, I think I'd like -ENOBUFS

For the guest it should be easy since it allocates the buffers, but for
the host?

Maybe we should add a field in the configuration space that reports some
sort of MTU.

Something in addition to what Laura had proposed here:
https://markmail.org/message/ymhz7wllutdxji3e

>returned even though it is uncharacteristic of Linux sockets.
>Alternatively, silently dropping is okay... but seems needlessly
>unhelpful.

UDP takes advantage of IP fragmentation, right?
But what happens if a fragment is lost?

We should try to behave in a similar way.

>
>FYI, this patch is broken for h2g because it requeues partially sent
>skbs, so probably doesn't need much code review until we decided on the
>policy.

Got it.

Thanks,
Stefano


^ permalink raw reply

* Re: [PATCH RFC net-next v4 3/8] vsock: support multi-transport datagrams
From: Stefano Garzarella @ 2023-06-26 14:50 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Bobby Eshleman, linux-hyperv, Stefan Hajnoczi, kvm,
	Michael S. Tsirkin, VMware PV-Drivers Reviewers, Simon Horman,
	virtualization, Eric Dumazet, Dan Carpenter, Xuan Zhuo, Wei Liu,
	Dexuan Cui, Bryan Tan, Jakub Kicinski, Paolo Abeni, Haiyang Zhang,
	Krasnov Arseniy, Vishnu Dasa, netdev, linux-kernel, bpf,
	David S. Miller
In-Reply-To: <ZJUKi+NtOajbplQg@bullseye>

On Fri, Jun 23, 2023 at 02:59:23AM +0000, Bobby Eshleman wrote:
>On Fri, Jun 23, 2023 at 02:50:01AM +0000, Bobby Eshleman wrote:
>> On Thu, Jun 22, 2023 at 05:19:08PM +0200, Stefano Garzarella wrote:
>> > On Sat, Jun 10, 2023 at 12:58:30AM +0000, Bobby Eshleman wrote:
>> > > This patch adds support for multi-transport datagrams.
>> > >
>> > > This includes:
>> > > - Per-packet lookup of transports when using sendto(sockaddr_vm)
>> > > - Selecting H2G or G2H transport using VMADDR_FLAG_TO_HOST and CID in
>> > >  sockaddr_vm
>> > >
>> > > To preserve backwards compatibility with VMCI, some important changes
>> > > were made. The "transport_dgram" / VSOCK_TRANSPORT_F_DGRAM is changed to
>> > > be used for dgrams iff there is not yet a g2h or h2g transport that has
>> >
>> > s/iff/if
>> >
>> > > been registered that can transmit the packet. If there is a g2h/h2g
>> > > transport for that remote address, then that transport will be used and
>> > > not "transport_dgram". This essentially makes "transport_dgram" a
>> > > fallback transport for when h2g/g2h has not yet gone online, which
>> > > appears to be the exact use case for VMCI.
>> > >
>> > > This design makes sense, because there is no reason that the
>> > > transport_{g2h,h2g} cannot also service datagrams, which makes the role
>> > > of transport_dgram difficult to understand outside of the VMCI context.
>> > >
>> > > The logic around "transport_dgram" had to be retained to prevent
>> > > breaking VMCI:
>> > >
>> > > 1) VMCI datagrams appear to function outside of the h2g/g2h
>> > >   paradigm. When the vmci transport becomes online, it registers itself
>> > >   with the DGRAM feature, but not H2G/G2H. Only later when the
>> > >   transport has more information about its environment does it register
>> > >   H2G or G2H. In the case that a datagram socket becomes active
>> > >   after DGRAM registration but before G2H/H2G registration, the
>> > >   "transport_dgram" transport needs to be used.
>> >
>> > IIRC we did this, because at that time only VMCI supported DGRAM. Now that
>> > there are more transports, maybe DGRAM can follow the h2g/g2h paradigm.
>> >
>>
>> Totally makes sense. I'll add the detail above that the prior design was
>> a result of chronology.
>>
>> > >
>> > > 2) VMCI seems to require special message be sent by the transport when a
>> > >   datagram socket calls bind(). Under the h2g/g2h model, the transport
>> > >   is selected using the remote_addr which is set by connect(). At
>> > >   bind time there is no remote_addr because often no connect() has been
>> > >   called yet: the transport is null. Therefore, with a null transport
>> > >   there doesn't seem to be any good way for a datagram socket a tell the
>> > >   VMCI transport that it has just had bind() called upon it.
>> >
>> > @Vishnu, @Bryan do you think we can avoid this in some way?
>> >
>> > >
>> > > Only transports with a special datagram fallback use-case such as VMCI
>> > > need to register VSOCK_TRANSPORT_F_DGRAM.
>> >
>> > Maybe we should rename it in VSOCK_TRANSPORT_F_DGRAM_FALLBACK or
>> > something like that.
>> >
>> > In any case, we definitely need to update the comment in
>> > include/net/af_vsock.h on top of VSOCK_TRANSPORT_F_DGRAM mentioning
>> > this.
>> >
>>
>> Agreed. I'll rename to VSOCK_TRANSPORT_F_DGRAM_FALLBACK, unless we find
>> there is a better way altogether.
>>
>> > >
>> > > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
>> > > ---
>> > > drivers/vhost/vsock.c                   |  1 -
>> > > include/linux/virtio_vsock.h            |  2 -
>> > > net/vmw_vsock/af_vsock.c                | 78 +++++++++++++++++++++++++--------
>> > > net/vmw_vsock/hyperv_transport.c        |  6 ---
>> > > net/vmw_vsock/virtio_transport.c        |  1 -
>> > > net/vmw_vsock/virtio_transport_common.c |  7 ---
>> > > net/vmw_vsock/vsock_loopback.c          |  1 -
>> > > 7 files changed, 60 insertions(+), 36 deletions(-)
>> > >
>> > > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> > > index c8201c070b4b..8f0082da5e70 100644
>> > > --- a/drivers/vhost/vsock.c
>> > > +++ b/drivers/vhost/vsock.c
>> > > @@ -410,7 +410,6 @@ static struct virtio_transport vhost_transport = {
>> > > 		.cancel_pkt               = vhost_transport_cancel_pkt,
>> > >
>> > > 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
>> > > -		.dgram_bind               = virtio_transport_dgram_bind,
>> > > 		.dgram_allow              = virtio_transport_dgram_allow,
>> > > 		.dgram_get_cid		  = virtio_transport_dgram_get_cid,
>> > > 		.dgram_get_port		  = virtio_transport_dgram_get_port,
>> > > diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>> > > index 23521a318cf0..73afa09f4585 100644
>> > > --- a/include/linux/virtio_vsock.h
>> > > +++ b/include/linux/virtio_vsock.h
>> > > @@ -216,8 +216,6 @@ void virtio_transport_notify_buffer_size(struct vsock_sock *vsk, u64 *val);
>> > > u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk);
>> > > bool virtio_transport_stream_is_active(struct vsock_sock *vsk);
>> > > bool virtio_transport_stream_allow(u32 cid, u32 port);
>> > > -int virtio_transport_dgram_bind(struct vsock_sock *vsk,
>> > > -				struct sockaddr_vm *addr);
>> > > bool virtio_transport_dgram_allow(u32 cid, u32 port);
>> > > int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid);
>> > > int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port);
>> > > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>> > > index 74358f0b47fa..ef86765f3765 100644
>> > > --- a/net/vmw_vsock/af_vsock.c
>> > > +++ b/net/vmw_vsock/af_vsock.c
>> > > @@ -438,6 +438,18 @@ vsock_connectible_lookup_transport(unsigned int cid, __u8 flags)
>> > > 	return transport;
>> > > }
>> > >
>> > > +static const struct vsock_transport *
>> > > +vsock_dgram_lookup_transport(unsigned int cid, __u8 flags)
>> > > +{
>> > > +	const struct vsock_transport *transport;
>> > > +
>> > > +	transport = vsock_connectible_lookup_transport(cid, flags);
>> > > +	if (transport)
>> > > +		return transport;
>> > > +
>> > > +	return transport_dgram;
>> > > +}
>> > > +
>> > > /* Assign a transport to a socket and call the .init transport callback.
>> > >  *
>> > >  * Note: for connection oriented socket this must be called when vsk->remote_addr
>> > > @@ -474,7 +486,8 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
>> > >
>> > > 	switch (sk->sk_type) {
>> > > 	case SOCK_DGRAM:
>> > > -		new_transport = transport_dgram;
>> > > +		new_transport = vsock_dgram_lookup_transport(remote_cid,
>> > > +							     remote_flags);
>> > > 		break;
>> > > 	case SOCK_STREAM:
>> > > 	case SOCK_SEQPACKET:
>> > > @@ -691,6 +704,9 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
>> > > static int __vsock_bind_dgram(struct vsock_sock *vsk,
>> > > 			      struct sockaddr_vm *addr)
>> > > {
>> > > +	if (!vsk->transport || !vsk->transport->dgram_bind)
>> > > +		return -EINVAL;
>> > > +
>> > > 	return vsk->transport->dgram_bind(vsk, addr);
>> > > }
>> > >
>> > > @@ -1172,19 +1188,24 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
>> > >
>> > > 	lock_sock(sk);
>> > >
>> > > -	transport = vsk->transport;
>> > > -
>> > > -	err = vsock_auto_bind(vsk);
>> > > -	if (err)
>> > > -		goto out;
>> > > -
>> > > -
>> > > 	/* If the provided message contains an address, use that.  Otherwise
>> > > 	 * fall back on the socket's remote handle (if it has been connected).
>> > > 	 */
>> > > 	if (msg->msg_name &&
>> > > 	    vsock_addr_cast(msg->msg_name, msg->msg_namelen,
>> > > 			    &remote_addr) == 0) {
>> > > +		transport = vsock_dgram_lookup_transport(remote_addr->svm_cid,
>> > > +							 remote_addr->svm_flags);
>> > > +		if (!transport) {
>> > > +			err = -EINVAL;
>> > > +			goto out;
>> > > +		}
>> > > +
>> > > +		if (!try_module_get(transport->module)) {
>> > > +			err = -ENODEV;
>> > > +			goto out;
>> > > +		}
>> > > +
>> > > 		/* Ensure this address is of the right type and is a valid
>> > > 		 * destination.
>> > > 		 */
>> > > @@ -1193,11 +1214,27 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
>> > > 			remote_addr->svm_cid = transport->get_local_cid();
>> > >
>> >
>> > From here ...
>> >
>> > > 		if (!vsock_addr_bound(remote_addr)) {
>> > > +			module_put(transport->module);
>> > > +			err = -EINVAL;
>> > > +			goto out;
>> > > +		}
>> > > +
>> > > +		if (!transport->dgram_allow(remote_addr->svm_cid,
>> > > +					    remote_addr->svm_port)) {
>> > > +			module_put(transport->module);
>> > > 			err = -EINVAL;
>> > > 			goto out;
>> > > 		}
>> > > +
>> > > +		err = transport->dgram_enqueue(vsk, remote_addr, msg, len);
>> >
>> > ... to here, looks like duplicate code, can we get it out of the if block?
>> >
>>
>> Yes, I think using something like this:
>>
>> [...]
>> 	bool module_got = false;
>>
>> [...]
>> 		if (!try_module_get(transport->module)) {
>> 			err = -ENODEV;
>> 			goto out;
>> 		}
>> 		module_got = true;
>>
>> [...]
>>
>> out:
>> 	if (likely(transport && !err && module_got))
>
>Actually, just...
>
>	if (module_got)
>

Yep, I think it should work ;-)

Thanks,
Stefano


^ permalink raw reply

* Re: [PATCH 1/2 V3 net] net: mana: Fix MANA VF unload when host is unresponsive
From: Praveen Kumar @ 2023-06-26 14:20 UTC (permalink / raw)
  To: souradeep chakrabarti, kys, haiyangz, wei.liu, decui, davem,
	edumazet, kuba, pabeni, longli, sharmaajay, leon, cai.huoqing,
	ssengar, vkuznets, tglx, linux-hyperv, netdev, linux-kernel,
	linux-rdma
  Cc: stable, schakrabarti
In-Reply-To: <1687771137-26911-1-git-send-email-schakrabarti@linux.microsoft.com>

On 6/26/2023 2:48 PM, souradeep chakrabarti wrote:
> From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> 
> This patch addresses the VF unload issue, where mana_dealloc_queues()
> gets stuck in infinite while loop, because of host unresponsiveness.
> It adds a timeout in the while loop, to fix it.
> 
> Fixes: ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f (net: mana: Add a driver for
> Microsoft Azure Network Adapter)
> Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> ---
> V2 -> V3:
> * Splitted the patch in two parts.
> * Removed the unnecessary braces from mana_dealloc_queues().
> ---
>  drivers/net/ethernet/microsoft/mana/mana_en.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index d907727c7b7a..cb5c43c3c47e 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -2329,7 +2329,10 @@ static int mana_dealloc_queues(struct net_device *ndev)
>  {
>  	struct mana_port_context *apc = netdev_priv(ndev);
>  	struct gdma_dev *gd = apc->ac->gdma_dev;
> +	unsigned long timeout;
>  	struct mana_txq *txq;
> +	struct sk_buff *skb;
> +	struct mana_cq *cq;
>  	int i, err;
>  
>  	if (apc->port_is_up)
> @@ -2348,13 +2351,25 @@ static int mana_dealloc_queues(struct net_device *ndev)
>  	 *
>  	 * Drain all the in-flight TX packets
>  	 */
> +
> +	timeout = jiffies + 120 * HZ;
>  	for (i = 0; i < apc->num_queues; i++) {
>  		txq = &apc->tx_qp[i].txq;
> -
> -		while (atomic_read(&txq->pending_sends) > 0)
> +		while (atomic_read(&txq->pending_sends) > 0 &&
> +		       time_before(jiffies, timeout))
>  			usleep_range(1000, 2000);
>  	}
>  
> +	for (i = 0; i < apc->num_queues; i++) {
> +		txq = &apc->tx_qp[i].txq;
> +		cq = &apc->tx_qp[i].tx_cq;
> +		while (atomic_read(&txq->pending_sends)) {
> +			skb = skb_dequeue(&txq->pending_skbs);
> +			mana_unmap_skb(skb, apc);
> +			napi_consume_skb(skb, cq->budget);
> +			atomic_sub(1, &txq->pending_sends);
> +		}
> +	}

Can we combine these 2 loops into 1 something like this ?

	for (i = 0; i < apc->num_queues; i++) {
		txq = &apc->tx_qp[i].txq;
		cq = &apc->tx_qp[i].tx_cq;
		while (atomic_read(&txq->pending_sends)) {
			if (time_before(jiffies, timeout)) {
				usleep_range(1000, 2000);
			} else {
				skb = skb_dequeue(&txq->pending_skbs);
				mana_unmap_skb(skb, apc);
				napi_consume_skb(skb, cq->budget);
				atomic_sub(1, &txq->pending_sends);
			}
		}
	}
>  	/* We're 100% sure the queues can no longer be woken up, because
>  	 * we're sure now mana_poll_tx_cq() can't be running.
>  	 */


^ permalink raw reply

* Re: [PATCH 2/2 V3 net] net: mana: Fix MANA VF unload when host is unresponsive
From: Praveen Kumar @ 2023-06-26 14:13 UTC (permalink / raw)
  To: souradeep chakrabarti, kys, haiyangz, wei.liu, decui, davem,
	edumazet, kuba, pabeni, longli, sharmaajay, leon, cai.huoqing,
	ssengar, vkuznets, tglx, linux-hyperv, netdev, linux-kernel,
	linux-rdma
  Cc: stable, schakrabarti
In-Reply-To: <1687771224-27162-1-git-send-email-schakrabarti@linux.microsoft.com>

On 6/26/2023 2:50 PM, souradeep chakrabarti wrote:
> From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> 
> This is the second part of the fix.
> 
> Also this patch adds a new attribute in mana_context, which gets set when
> mana_hwc_send_request() hits a timeout because of host unresponsiveness.
> This flag then helps to avoid the timeouts in successive calls.
> 
> Fixes: ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f (net: mana: Add a driver for
> Microsoft Azure Network Adapter)
> Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> ---
> V2 -> V3:
> * Removed the initialization of vf_unload_timeout
> * Splitted the patch in two.
> * Fixed extra space from the commit message.
> ---
>  drivers/net/ethernet/microsoft/mana/gdma_main.c  |  4 +++-
>  drivers/net/ethernet/microsoft/mana/hw_channel.c | 12 +++++++++++-
>  include/net/mana/mana.h                          |  2 ++
>  3 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 8f3f78b68592..6411f01be0d9 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -946,10 +946,12 @@ int mana_gd_deregister_device(struct gdma_dev *gd)
>  	struct gdma_context *gc = gd->gdma_context;
>  	struct gdma_general_resp resp = {};
>  	struct gdma_general_req req = {};
> +	struct mana_context *ac;
>  	int err;
>  
>  	if (gd->pdid == INVALID_PDID)
>  		return -EINVAL;
> +	ac = gd->driver_data;
>  
>  	mana_gd_init_req_hdr(&req.hdr, GDMA_DEREGISTER_DEVICE, sizeof(req),
>  			     sizeof(resp));
> @@ -957,7 +959,7 @@ int mana_gd_deregister_device(struct gdma_dev *gd)
>  	req.hdr.dev_id = gd->dev_id;
>  
>  	err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
> -	if (err || resp.hdr.status) {
> +	if ((err || resp.hdr.status) && !ac->vf_unload_timeout) {
>  		dev_err(gc->dev, "Failed to deregister device: %d, 0x%x\n",
>  			err, resp.hdr.status);

With !ac->vf_unload_timeout option, this message may not be correctly showing err, status. Probably you want to add explicit information during timeouts so that it give right information ? Or have the err, status field properly updated.

>  		if (!err)
> diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c
> index 9d1507eba5b9..492cb2c6e2cb 100644
> --- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
> +++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
> @@ -1,8 +1,10 @@
>  // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
>  /* Copyright (c) 2021, Microsoft Corporation. */
>  
> +#include "asm-generic/errno.h"
>  #include <net/mana/gdma.h>
>  #include <net/mana/hw_channel.h>
> +#include <net/mana/mana.h>
>  
>  static int mana_hwc_get_msg_index(struct hw_channel_context *hwc, u16 *msg_id)
>  {
> @@ -786,12 +788,19 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
>  	struct hwc_wq *txq = hwc->txq;
>  	struct gdma_req_hdr *req_msg;
>  	struct hwc_caller_ctx *ctx;
> +	struct mana_context *ac;
>  	u32 dest_vrcq = 0;
>  	u32 dest_vrq = 0;
>  	u16 msg_id;
>  	int err;
>  
>  	mana_hwc_get_msg_index(hwc, &msg_id);
> +	ac = hwc->gdma_dev->driver_data;

Is there a case where gdma_dev be invalid here ? If so, lets check the state and then proceed further ?

> +	if (ac->vf_unload_timeout) {
> +		dev_err(hwc->dev, "HWC: vport is already unloaded.\n");
> +		err = -ETIMEDOUT;
> +		goto out;
> +	}
>  
>  	tx_wr = &txq->msg_buf->reqs[msg_id];
>  
> @@ -825,9 +834,10 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
>  		goto out;
>  	}
>  
> -	if (!wait_for_completion_timeout(&ctx->comp_event, 30 * HZ)) {
> +	if (!wait_for_completion_timeout(&ctx->comp_event, 5 * HZ)) {

IMHO we should have macros instead of magic numbers (5 , 30 or so). But would like others to comment here.

>  		dev_err(hwc->dev, "HWC: Request timed out!\n");
>  		err = -ETIMEDOUT;
> +		ac->vf_unload_timeout = true;
>  		goto out;
>  	}
>  
> diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
> index 9eef19972845..5f5affdca1eb 100644
> --- a/include/net/mana/mana.h
> +++ b/include/net/mana/mana.h
> @@ -358,6 +358,8 @@ struct mana_context {
>  
>  	u16 num_ports;
>  
> +	bool vf_unload_timeout;
> +
>  	struct mana_eq *eqs;
>  
>  	struct net_device *ports[MAX_PORTS_IN_MANA_DEV];


^ permalink raw reply

* Re: [PATCH 1/2 V3 net] net: mana: Fix MANA VF unload when host is unresponsive
From: Simon Horman @ 2023-06-26 13:05 UTC (permalink / raw)
  To: souradeep chakrabarti
  Cc: kys, haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni,
	longli, sharmaajay, leon, cai.huoqing, ssengar, vkuznets, tglx,
	linux-hyperv, netdev, linux-kernel, linux-rdma, stable,
	schakrabarti
In-Reply-To: <1687771098-26775-1-git-send-email-schakrabarti@linux.microsoft.com>

On Mon, Jun 26, 2023 at 02:18:18AM -0700, souradeep chakrabarti wrote:
> From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> 
> This patch addresses the VF unload issue, where mana_dealloc_queues()
> gets stuck in infinite while loop, because of host unresponsiveness.
> It adds a timeout in the while loop, to fix it.
> 
> Fixes: ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f (net: mana: Add a driver for
> Microsoft Azure Network Adapter)

nit: A correct format of this fixes tag is:

In particular:
* All on lone line
* Description in double quotes.

Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")

> Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> ---
> V2 -> V3:
> * Splitted the patch in two parts.
> * Removed the unnecessary braces from mana_dealloc_queues().

^ permalink raw reply

* [PATCH 2/2 V3 net] net: mana: Fix MANA VF unload when host is unresponsive
From: souradeep chakrabarti @ 2023-06-26  9:20 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni,
	longli, sharmaajay, leon, cai.huoqing, ssengar, vkuznets, tglx,
	linux-hyperv, netdev, linux-kernel, linux-rdma
  Cc: stable, schakrabarti, Souradeep Chakrabarti
In-Reply-To: <1687771098-26775-1-git-send-email-schakrabarti@linux.microsoft.com>

From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>

This is the second part of the fix.

Also this patch adds a new attribute in mana_context, which gets set when
mana_hwc_send_request() hits a timeout because of host unresponsiveness.
This flag then helps to avoid the timeouts in successive calls.

Fixes: ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f (net: mana: Add a driver for
Microsoft Azure Network Adapter)
Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
---
V2 -> V3:
* Removed the initialization of vf_unload_timeout
* Splitted the patch in two.
* Fixed extra space from the commit message.
---
 drivers/net/ethernet/microsoft/mana/gdma_main.c  |  4 +++-
 drivers/net/ethernet/microsoft/mana/hw_channel.c | 12 +++++++++++-
 include/net/mana/mana.h                          |  2 ++
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 8f3f78b68592..6411f01be0d9 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -946,10 +946,12 @@ int mana_gd_deregister_device(struct gdma_dev *gd)
 	struct gdma_context *gc = gd->gdma_context;
 	struct gdma_general_resp resp = {};
 	struct gdma_general_req req = {};
+	struct mana_context *ac;
 	int err;
 
 	if (gd->pdid == INVALID_PDID)
 		return -EINVAL;
+	ac = gd->driver_data;
 
 	mana_gd_init_req_hdr(&req.hdr, GDMA_DEREGISTER_DEVICE, sizeof(req),
 			     sizeof(resp));
@@ -957,7 +959,7 @@ int mana_gd_deregister_device(struct gdma_dev *gd)
 	req.hdr.dev_id = gd->dev_id;
 
 	err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
-	if (err || resp.hdr.status) {
+	if ((err || resp.hdr.status) && !ac->vf_unload_timeout) {
 		dev_err(gc->dev, "Failed to deregister device: %d, 0x%x\n",
 			err, resp.hdr.status);
 		if (!err)
diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c
index 9d1507eba5b9..492cb2c6e2cb 100644
--- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
+++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
@@ -1,8 +1,10 @@
 // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
 /* Copyright (c) 2021, Microsoft Corporation. */
 
+#include "asm-generic/errno.h"
 #include <net/mana/gdma.h>
 #include <net/mana/hw_channel.h>
+#include <net/mana/mana.h>
 
 static int mana_hwc_get_msg_index(struct hw_channel_context *hwc, u16 *msg_id)
 {
@@ -786,12 +788,19 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
 	struct hwc_wq *txq = hwc->txq;
 	struct gdma_req_hdr *req_msg;
 	struct hwc_caller_ctx *ctx;
+	struct mana_context *ac;
 	u32 dest_vrcq = 0;
 	u32 dest_vrq = 0;
 	u16 msg_id;
 	int err;
 
 	mana_hwc_get_msg_index(hwc, &msg_id);
+	ac = hwc->gdma_dev->driver_data;
+	if (ac->vf_unload_timeout) {
+		dev_err(hwc->dev, "HWC: vport is already unloaded.\n");
+		err = -ETIMEDOUT;
+		goto out;
+	}
 
 	tx_wr = &txq->msg_buf->reqs[msg_id];
 
@@ -825,9 +834,10 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
 		goto out;
 	}
 
-	if (!wait_for_completion_timeout(&ctx->comp_event, 30 * HZ)) {
+	if (!wait_for_completion_timeout(&ctx->comp_event, 5 * HZ)) {
 		dev_err(hwc->dev, "HWC: Request timed out!\n");
 		err = -ETIMEDOUT;
+		ac->vf_unload_timeout = true;
 		goto out;
 	}
 
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 9eef19972845..5f5affdca1eb 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -358,6 +358,8 @@ struct mana_context {
 
 	u16 num_ports;
 
+	bool vf_unload_timeout;
+
 	struct mana_eq *eqs;
 
 	struct net_device *ports[MAX_PORTS_IN_MANA_DEV];
-- 
2.34.1


^ permalink raw reply related

* [PATCH 1/2 V3 net] net: mana: Fix MANA VF unload when host is unresponsive
From: souradeep chakrabarti @ 2023-06-26  9:18 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni,
	longli, sharmaajay, leon, cai.huoqing, ssengar, vkuznets, tglx,
	linux-hyperv, netdev, linux-kernel, linux-rdma
  Cc: stable, schakrabarti, Souradeep Chakrabarti
In-Reply-To: <1687771098-26775-1-git-send-email-schakrabarti@linux.microsoft.com>

From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>

This patch addresses the VF unload issue, where mana_dealloc_queues()
gets stuck in infinite while loop, because of host unresponsiveness.
It adds a timeout in the while loop, to fix it.

Fixes: ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f (net: mana: Add a driver for
Microsoft Azure Network Adapter)
Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
---
V2 -> V3:
* Splitted the patch in two parts.
* Removed the unnecessary braces from mana_dealloc_queues().
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index d907727c7b7a..cb5c43c3c47e 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2329,7 +2329,10 @@ static int mana_dealloc_queues(struct net_device *ndev)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
 	struct gdma_dev *gd = apc->ac->gdma_dev;
+	unsigned long timeout;
 	struct mana_txq *txq;
+	struct sk_buff *skb;
+	struct mana_cq *cq;
 	int i, err;
 
 	if (apc->port_is_up)
@@ -2348,13 +2351,25 @@ static int mana_dealloc_queues(struct net_device *ndev)
 	 *
 	 * Drain all the in-flight TX packets
 	 */
+
+	timeout = jiffies + 120 * HZ;
 	for (i = 0; i < apc->num_queues; i++) {
 		txq = &apc->tx_qp[i].txq;
-
-		while (atomic_read(&txq->pending_sends) > 0)
+		while (atomic_read(&txq->pending_sends) > 0 &&
+		       time_before(jiffies, timeout))
 			usleep_range(1000, 2000);
 	}
 
+	for (i = 0; i < apc->num_queues; i++) {
+		txq = &apc->tx_qp[i].txq;
+		cq = &apc->tx_qp[i].tx_cq;
+		while (atomic_read(&txq->pending_sends)) {
+			skb = skb_dequeue(&txq->pending_skbs);
+			mana_unmap_skb(skb, apc);
+			napi_consume_skb(skb, cq->budget);
+			atomic_sub(1, &txq->pending_sends);
+		}
+	}
 	/* We're 100% sure the queues can no longer be woken up, because
 	 * we're sure now mana_poll_tx_cq() can't be running.
 	 */
-- 
2.34.1


^ permalink raw reply related

* [PATCH 1/2 V3 net] net: mana: Fix MANA VF unload when host is unresponsive
From: souradeep chakrabarti @ 2023-06-26  9:18 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni,
	longli, sharmaajay, leon, cai.huoqing, ssengar, vkuznets, tglx,
	linux-hyperv, netdev, linux-kernel, linux-rdma
  Cc: stable, schakrabarti, Souradeep Chakrabarti
In-Reply-To: <1687771058-26634-1-git-send-email-schakrabarti@linux.microsoft.com>

From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>

This patch addresses the VF unload issue, where mana_dealloc_queues()
gets stuck in infinite while loop, because of host unresponsiveness.
It adds a timeout in the while loop, to fix it.

Fixes: ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f (net: mana: Add a driver for
Microsoft Azure Network Adapter)
Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
---
V2 -> V3:
* Splitted the patch in two parts.
* Removed the unnecessary braces from mana_dealloc_queues().
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index d907727c7b7a..cb5c43c3c47e 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2329,7 +2329,10 @@ static int mana_dealloc_queues(struct net_device *ndev)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
 	struct gdma_dev *gd = apc->ac->gdma_dev;
+	unsigned long timeout;
 	struct mana_txq *txq;
+	struct sk_buff *skb;
+	struct mana_cq *cq;
 	int i, err;
 
 	if (apc->port_is_up)
@@ -2348,13 +2351,25 @@ static int mana_dealloc_queues(struct net_device *ndev)
 	 *
 	 * Drain all the in-flight TX packets
 	 */
+
+	timeout = jiffies + 120 * HZ;
 	for (i = 0; i < apc->num_queues; i++) {
 		txq = &apc->tx_qp[i].txq;
-
-		while (atomic_read(&txq->pending_sends) > 0)
+		while (atomic_read(&txq->pending_sends) > 0 &&
+		       time_before(jiffies, timeout))
 			usleep_range(1000, 2000);
 	}
 
+	for (i = 0; i < apc->num_queues; i++) {
+		txq = &apc->tx_qp[i].txq;
+		cq = &apc->tx_qp[i].tx_cq;
+		while (atomic_read(&txq->pending_sends)) {
+			skb = skb_dequeue(&txq->pending_skbs);
+			mana_unmap_skb(skb, apc);
+			napi_consume_skb(skb, cq->budget);
+			atomic_sub(1, &txq->pending_sends);
+		}
+	}
 	/* We're 100% sure the queues can no longer be woken up, because
 	 * we're sure now mana_poll_tx_cq() can't be running.
 	 */
-- 
2.34.1


^ permalink raw reply related

* [PATCH 0/2 V3 net] net: mana: Fix MANA VF unload when host is unresponsive
From: souradeep chakrabarti @ 2023-06-26  9:17 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni,
	longli, sharmaajay, leon, cai.huoqing, ssengar, vkuznets, tglx,
	linux-hyperv, netdev, linux-kernel, linux-rdma
  Cc: stable, schakrabarti, Souradeep Chakrabarti

From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>

VF unload gets stuck in MANA driver, when the host is not responding.
The function mana_dealloc_queues() tries to clear the inflight packets,
and gets stuck in while loop. Another problem in this scenario is the
timeout from hwc send request.
These patch add fix for the same.
In mana driver we are adding a timeout in the while loop, to fix it.
Also we are adding a new attribute in mana_context, which gets set when
mana_hwc_send_request() hits a timeout because of host unresponsiveness.

Souradeep Chakrabarti (2):
  net: mana: Fix MANA VF unload when host is unresponsive
  net: mana: Fix MANA VF unload when host is unresponsive

 .../net/ethernet/microsoft/mana/gdma_main.c   |  4 +++-
 .../net/ethernet/microsoft/mana/hw_channel.c  | 12 +++++++++++-
 drivers/net/ethernet/microsoft/mana/mana_en.c | 19 +++++++++++++++++--
 include/net/mana/mana.h                       |  2 ++
 4 files changed, 33 insertions(+), 4 deletions(-)

-- 
2.34.1

^ permalink raw reply

* Re: [PATCH v2] net: mana: Batch ringing RX queue doorbell on receiving packets
From: Jakub Kicinski @ 2023-06-24 22:10 UTC (permalink / raw)
  To: longli
  Cc: Jason Gunthorpe, Leon Romanovsky, Ajay Sharma, Dexuan Cui,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, David S. Miller,
	Eric Dumazet, Paolo Abeni, linux-rdma, linux-hyperv, netdev,
	linux-kernel, Long Li, stable
In-Reply-To: <1687450956-6407-1-git-send-email-longli@linuxonhyperv.com>

On Thu, 22 Jun 2023 09:22:36 -0700 longli@linuxonhyperv.com wrote:
> It's inefficient to ring the doorbell page every time a WQE is posted to
> the received queue.
> 
> Move the code for ringing doorbell page to where after we have posted all
> WQEs to the receive queue during a callback from napi_poll().
> 
> Tests showed no regression in network latency benchmarks.
> 
> Cc: stable@vger.kernel.org
> Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")

If this is supposed to be a fix, you need to clearly explain what the
performance loss was, so that backporters can make an informed decision.

>  drivers/net/ethernet/microsoft/mana/gdma_main.c |  5 ++++-
>  drivers/net/ethernet/microsoft/mana/mana_en.c   | 10 ++++++++--
>  2 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 8f3f78b68592..ef11d09a3655 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -300,8 +300,11 @@ static void mana_gd_ring_doorbell(struct gdma_context *gc, u32 db_index,
>  
>  void mana_gd_wq_ring_doorbell(struct gdma_context *gc, struct gdma_queue *queue)
>  {
> +	/* BNIC Spec specifies that client should set 0 for rq.wqe_cnt
> +	 * This value is not used in sq
> +	 */
>  	mana_gd_ring_doorbell(gc, queue->gdma_dev->doorbell, queue->type,
> -			      queue->id, queue->head * GDMA_WQE_BU_SIZE, 1);
> +			      queue->id, queue->head * GDMA_WQE_BU_SIZE, 0);
>  }

This change needs to be explained in the commit message, or should be 
a separate patch.
-- 
pw-bot: cr

^ permalink raw reply

* Re: [PATCH 22/26] net: mana: use array_size
From: Simon Horman @ 2023-06-24 15:48 UTC (permalink / raw)
  To: Julia Lawall
  Cc: K. Y. Srinivasan, keescook, kernel-janitors, Haiyang Zhang,
	Wei Liu, Dexuan Cui, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, linux-hyperv, netdev, linux-kernel
In-Reply-To: <20230623211457.102544-23-Julia.Lawall@inria.fr>

On Fri, Jun 23, 2023 at 11:14:53PM +0200, Julia Lawall wrote:
> Use array_size to protect against multiplication overflows.
> 
> The changes were done using the following Coccinelle semantic patch:
> 
> // <smpl>
> @@
>     expression E1, E2;
>     constant C1, C2;
>     identifier alloc = {vmalloc,vzalloc};
> @@
>     
> (
>       alloc(C1 * C2,...)
> |
>       alloc(
> -           (E1) * (E2)
> +           array_size(E1, E2)
>       ,...)
> )
> // </smpl>
> 
> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>

Reviewed-by: Simon Horman <simon.horman@corigine.com>


^ permalink raw reply

* Re: [PATCH RFC net-next v4 8/8] tests: add vsock dgram tests
From: Bobby Eshleman @ 2023-06-23  6:33 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Bobby Eshleman, Stefan Hajnoczi, Stefano Garzarella,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers, Dan Carpenter, Simon Horman, kvm,
	virtualization, netdev, linux-kernel, linux-hyperv, bpf,
	Jiang Wang
In-Reply-To: <d0db1e6b-3b9d-3b90-88f8-85aa5bd7ee86@gmail.com>

On Fri, Jun 23, 2023 at 09:34:51PM +0300, Arseniy Krasnov wrote:
> 
> 
> On 23.06.2023 02:16, Bobby Eshleman wrote:
> > On Sun, Jun 11, 2023 at 11:54:57PM +0300, Arseniy Krasnov wrote:
> >> Hello Bobby!
> >>
> >> Sorry, may be I become a little bit annoying:), but I tried to run vsock_test with
> >> this v4 version, and again get the same crash:
> > 
> > Haha not annoying at all. I appreciate the testing!
> > 
> >>
> >> # cat client.sh 
> >> ./vsock_test  --mode=client --control-host=192.168.1.1 --control-port=12345 --peer-cid=2
> >> # ./client.sh 
> >> Control socket connected to 192.168.1.1:12345.
> >> 0 - SOCK_STREAM connection reset...[   20.065237] BUG: kernel NULL pointer dereference, addre0
> >> [   20.065895] #PF: supervisor read access in kernel mode
> >> [   20.065895] #PF: error_code(0x0000) - not-present page
> >> [   20.065895] PGD 0 P4D 0 
> >> [   20.065895] Oops: 0000 [#1] PREEMPT SMP PTI
> >> [   20.065895] CPU: 0 PID: 111 Comm: vsock_test Not tainted 6.4.0-rc3-gefcccba07069 #385
> >> [   20.065895] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd44
> >> [   20.065895] RIP: 0010:static_key_count+0x0/0x20
> >> [   20.065895] Code: 04 4c 8b 46 08 49 29 c0 4c 01 c8 4c 89 47 08 89 0e 89 56 04 48 89 46 08 f
> >> [   20.065895] RSP: 0018:ffffbbb000223dc0 EFLAGS: 00010202
> >> [   20.065895] RAX: ffffffff85709880 RBX: ffffffffc0079140 RCX: 0000000000000000
> >> [   20.065895] RDX: ffff9f73c2175700 RSI: 0000000000000000 RDI: 0000000000000000
> >> [   20.065895] RBP: ffff9f73c2385900 R08: ffffbbb000223d30 R09: ffff9f73ff896000
> >> [   20.065895] R10: 0000000000001000 R11: 0000000000000000 R12: ffffbbb000223e80
> >> [   20.065895] R13: 0000000000000000 R14: 0000000000000002 R15: ffff9f73c1cfaa80
> >> [   20.065895] FS:  00007f1ad82f55c0(0000) GS:ffff9f73fe400000(0000) knlGS:0000000000000000
> >> [   20.065895] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [   20.065895] CR2: 0000000000000000 CR3: 000000003f954000 CR4: 00000000000006f0
> >> [   20.065895] Call Trace:
> >> [   20.065895]  <TASK>
> >> [   20.065895]  once_deferred+0xd/0x30
> >> [   20.065895]  vsock_assign_transport+0x9a/0x1b0 [vsock]
> >> [   20.065895]  vsock_connect+0xb4/0x3a0 [vsock]
> >> [   20.065895]  ? var_wake_function+0x60/0x60
> >> [   20.065895]  __sys_connect+0x9e/0xd0
> >> [   20.065895]  ? _raw_spin_unlock_irq+0xe/0x30
> >> [   20.065895]  ? do_setitimer+0x128/0x1f0
> >> [   20.065895]  ? alarm_setitimer+0x4c/0x90
> >> [   20.065895]  ? fpregs_assert_state_consistent+0x1d/0x50
> >> [   20.065895]  ? exit_to_user_mode_prepare+0x36/0x130
> >> [   20.065895]  __x64_sys_connect+0x11/0x20
> >> [   20.065895]  do_syscall_64+0x3b/0xc0
> >> [   20.065895]  entry_SYSCALL_64_after_hwframe+0x4b/0xb5
> >> [   20.065895] RIP: 0033:0x7f1ad822dd13
> >> [   20.065895] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 64 8
> >> [   20.065895] RSP: 002b:00007ffc513e3c98 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
> >> [   20.065895] RAX: ffffffffffffffda RBX: 000055aed298e020 RCX: 00007f1ad822dd13
> >> [   20.065895] RDX: 0000000000000010 RSI: 00007ffc513e3cb0 RDI: 0000000000000004
> >> [   20.065895] RBP: 0000000000000004 R08: 000055aed32b2018 R09: 0000000000000000
> >> [   20.065895] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> >> [   20.065895] R13: 000055aed298acb1 R14: 00007ffc513e3cb0 R15: 00007ffc513e3d40
> >> [   20.065895]  </TASK>
> >> [   20.065895] Modules linked in: vsock_loopback vhost_vsock vmw_vsock_virtio_transport vmw_vb
> > 
> > ^ I'm guessing this is the difference between our setups. I have been
> > going all built-in, let me see if I can reproduce w/ modules...
> 
> Ah ok, I think using modules is good practice here, because it could test that Your symbols
> set is valid to work as modules and 'rmmod' could check problems with the forgotten references
> for example for socket or skb. I'm working with modules most of the time.
> 

Agreed, definitely good practice. I test combinations of 'm' and 'y' for
the subsystem as part of a more rigorous testing process that I do when
I feel it is getting closer to losing the RFC tag. I'll definitely test
it for the rest of this series so you don't run into issues though.  You
may have just convinced to change my environment around to let me use
modules by default...

Thanks,
Bobby

> > 
> >> [   20.065895] CR2: 0000000000000000
> >> [   20.154060] ---[ end trace 0000000000000000 ]---
> >> [   20.155519] RIP: 0010:static_key_count+0x0/0x20
> >> [   20.156932] Code: 04 4c 8b 46 08 49 29 c0 4c 01 c8 4c 89 47 08 89 0e 89 56 04 48 89 46 08 f
> >> [   20.161367] RSP: 0018:ffffbbb000223dc0 EFLAGS: 00010202
> >> [   20.162613] RAX: ffffffff85709880 RBX: ffffffffc0079140 RCX: 0000000000000000
> >> [   20.164262] RDX: ffff9f73c2175700 RSI: 0000000000000000 RDI: 0000000000000000
> >> [   20.165934] RBP: ffff9f73c2385900 R08: ffffbbb000223d30 R09: ffff9f73ff896000
> >> [   20.167684] R10: 0000000000001000 R11: 0000000000000000 R12: ffffbbb000223e80
> >> [   20.169427] R13: 0000000000000000 R14: 0000000000000002 R15: ffff9f73c1cfaa80
> >> [   20.171109] FS:  00007f1ad82f55c0(0000) GS:ffff9f73fe400000(0000) knlGS:0000000000000000
> >> [   20.173000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [   20.174381] CR2: 0000000000000000 CR3: 000000003f954000 CR4: 00000000000006f0
> >>
> >> So, what HEAD do You use? May be You have some specific config (I use x86-64 defconfig + vsock/vhost
> >> related things) ?
> >>
> > 
> > For this series I used net-next:
> > 	28cfea989d6f55c3d10608eba2a2bae609c5bf3e
> > 
> >> Thanks, Arseniy
> >>
> > 
> > As always, thanks for the bug finding! I'll report back when I
> > reproduce or with questions if I can't.
> 
> Thanks!
> 
> Thanks, Arseniy
> 
> > 
> > Best,
> > Bobby
> > 
> >>
> >> On 10.06.2023 03:58, Bobby Eshleman wrote:
> >>> From: Jiang Wang <jiang.wang@bytedance.com>
> >>>
> >>> This patch adds tests for vsock datagram.
> >>>
> >>> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> >>> Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
> >>> ---
> >>>  tools/testing/vsock/util.c       | 141 ++++++++++++-
> >>>  tools/testing/vsock/util.h       |   6 +
> >>>  tools/testing/vsock/vsock_test.c | 432 +++++++++++++++++++++++++++++++++++++++
> >>>  3 files changed, 578 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
> >>> index 01b636d3039a..811e70d7cf1e 100644
> >>> --- a/tools/testing/vsock/util.c
> >>> +++ b/tools/testing/vsock/util.c
> >>> @@ -99,7 +99,8 @@ static int vsock_connect(unsigned int cid, unsigned int port, int type)
> >>>  	int ret;
> >>>  	int fd;
> >>>  
> >>> -	control_expectln("LISTENING");
> >>> +	if (type != SOCK_DGRAM)
> >>> +		control_expectln("LISTENING");
> >>>  
> >>>  	fd = socket(AF_VSOCK, type, 0);
> >>>  
> >>> @@ -130,6 +131,11 @@ int vsock_seqpacket_connect(unsigned int cid, unsigned int port)
> >>>  	return vsock_connect(cid, port, SOCK_SEQPACKET);
> >>>  }
> >>>  
> >>> +int vsock_dgram_connect(unsigned int cid, unsigned int port)
> >>> +{
> >>> +	return vsock_connect(cid, port, SOCK_DGRAM);
> >>> +}
> >>> +
> >>>  /* Listen on <cid, port> and return the first incoming connection.  The remote
> >>>   * address is stored to clientaddrp.  clientaddrp may be NULL.
> >>>   */
> >>> @@ -211,6 +217,34 @@ int vsock_seqpacket_accept(unsigned int cid, unsigned int port,
> >>>  	return vsock_accept(cid, port, clientaddrp, SOCK_SEQPACKET);
> >>>  }
> >>>  
> >>> +int vsock_dgram_bind(unsigned int cid, unsigned int port)
> >>> +{
> >>> +	union {
> >>> +		struct sockaddr sa;
> >>> +		struct sockaddr_vm svm;
> >>> +	} addr = {
> >>> +		.svm = {
> >>> +			.svm_family = AF_VSOCK,
> >>> +			.svm_port = port,
> >>> +			.svm_cid = cid,
> >>> +		},
> >>> +	};
> >>> +	int fd;
> >>> +
> >>> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> >>> +	if (fd < 0) {
> >>> +		perror("socket");
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +
> >>> +	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
> >>> +		perror("bind");
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +
> >>> +	return fd;
> >>> +}
> >>> +
> >>>  /* Transmit one byte and check the return value.
> >>>   *
> >>>   * expected_ret:
> >>> @@ -260,6 +294,57 @@ void send_byte(int fd, int expected_ret, int flags)
> >>>  	}
> >>>  }
> >>>  
> >>> +/* Transmit one byte and check the return value.
> >>> + *
> >>> + * expected_ret:
> >>> + *  <0 Negative errno (for testing errors)
> >>> + *   0 End-of-file
> >>> + *   1 Success
> >>> + */
> >>> +void sendto_byte(int fd, const struct sockaddr *dest_addr, int len, int expected_ret,
> >>> +		 int flags)
> >>> +{
> >>> +	const uint8_t byte = 'A';
> >>> +	ssize_t nwritten;
> >>> +
> >>> +	timeout_begin(TIMEOUT);
> >>> +	do {
> >>> +		nwritten = sendto(fd, &byte, sizeof(byte), flags, dest_addr,
> >>> +				  len);
> >>> +		timeout_check("write");
> >>> +	} while (nwritten < 0 && errno == EINTR);
> >>> +	timeout_end();
> >>> +
> >>> +	if (expected_ret < 0) {
> >>> +		if (nwritten != -1) {
> >>> +			fprintf(stderr, "bogus sendto(2) return value %zd\n",
> >>> +				nwritten);
> >>> +			exit(EXIT_FAILURE);
> >>> +		}
> >>> +		if (errno != -expected_ret) {
> >>> +			perror("write");
> >>> +			exit(EXIT_FAILURE);
> >>> +		}
> >>> +		return;
> >>> +	}
> >>> +
> >>> +	if (nwritten < 0) {
> >>> +		perror("write");
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +	if (nwritten == 0) {
> >>> +		if (expected_ret == 0)
> >>> +			return;
> >>> +
> >>> +		fprintf(stderr, "unexpected EOF while sending byte\n");
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +	if (nwritten != sizeof(byte)) {
> >>> +		fprintf(stderr, "bogus sendto(2) return value %zd\n", nwritten);
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +}
> >>> +
> >>>  /* Receive one byte and check the return value.
> >>>   *
> >>>   * expected_ret:
> >>> @@ -313,6 +398,60 @@ void recv_byte(int fd, int expected_ret, int flags)
> >>>  	}
> >>>  }
> >>>  
> >>> +/* Receive one byte and check the return value.
> >>> + *
> >>> + * expected_ret:
> >>> + *  <0 Negative errno (for testing errors)
> >>> + *   0 End-of-file
> >>> + *   1 Success
> >>> + */
> >>> +void recvfrom_byte(int fd, struct sockaddr *src_addr, socklen_t *addrlen,
> >>> +		   int expected_ret, int flags)
> >>> +{
> >>> +	uint8_t byte;
> >>> +	ssize_t nread;
> >>> +
> >>> +	timeout_begin(TIMEOUT);
> >>> +	do {
> >>> +		nread = recvfrom(fd, &byte, sizeof(byte), flags, src_addr, addrlen);
> >>> +		timeout_check("read");
> >>> +	} while (nread < 0 && errno == EINTR);
> >>> +	timeout_end();
> >>> +
> >>> +	if (expected_ret < 0) {
> >>> +		if (nread != -1) {
> >>> +			fprintf(stderr, "bogus recvfrom(2) return value %zd\n",
> >>> +				nread);
> >>> +			exit(EXIT_FAILURE);
> >>> +		}
> >>> +		if (errno != -expected_ret) {
> >>> +			perror("read");
> >>> +			exit(EXIT_FAILURE);
> >>> +		}
> >>> +		return;
> >>> +	}
> >>> +
> >>> +	if (nread < 0) {
> >>> +		perror("read");
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +	if (nread == 0) {
> >>> +		if (expected_ret == 0)
> >>> +			return;
> >>> +
> >>> +		fprintf(stderr, "unexpected EOF while receiving byte\n");
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +	if (nread != sizeof(byte)) {
> >>> +		fprintf(stderr, "bogus recvfrom(2) return value %zd\n", nread);
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +	if (byte != 'A') {
> >>> +		fprintf(stderr, "unexpected byte read %c\n", byte);
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +}
> >>> +
> >>>  /* Run test cases.  The program terminates if a failure occurs. */
> >>>  void run_tests(const struct test_case *test_cases,
> >>>  	       const struct test_opts *opts)
> >>> diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
> >>> index fb99208a95ea..a69e128d120c 100644
> >>> --- a/tools/testing/vsock/util.h
> >>> +++ b/tools/testing/vsock/util.h
> >>> @@ -37,13 +37,19 @@ void init_signals(void);
> >>>  unsigned int parse_cid(const char *str);
> >>>  int vsock_stream_connect(unsigned int cid, unsigned int port);
> >>>  int vsock_seqpacket_connect(unsigned int cid, unsigned int port);
> >>> +int vsock_dgram_connect(unsigned int cid, unsigned int port);
> >>>  int vsock_stream_accept(unsigned int cid, unsigned int port,
> >>>  			struct sockaddr_vm *clientaddrp);
> >>>  int vsock_seqpacket_accept(unsigned int cid, unsigned int port,
> >>>  			   struct sockaddr_vm *clientaddrp);
> >>> +int vsock_dgram_bind(unsigned int cid, unsigned int port);
> >>>  void vsock_wait_remote_close(int fd);
> >>>  void send_byte(int fd, int expected_ret, int flags);
> >>> +void sendto_byte(int fd, const struct sockaddr *dest_addr, int len, int expected_ret,
> >>> +		 int flags);
> >>>  void recv_byte(int fd, int expected_ret, int flags);
> >>> +void recvfrom_byte(int fd, struct sockaddr *src_addr, socklen_t *addrlen,
> >>> +		   int expected_ret, int flags);
> >>>  void run_tests(const struct test_case *test_cases,
> >>>  	       const struct test_opts *opts);
> >>>  void list_tests(const struct test_case *test_cases);
> >>> diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
> >>> index ac1bd3ac1533..ded82d39ee5d 100644
> >>> --- a/tools/testing/vsock/vsock_test.c
> >>> +++ b/tools/testing/vsock/vsock_test.c
> >>> @@ -1053,6 +1053,413 @@ static void test_stream_virtio_skb_merge_server(const struct test_opts *opts)
> >>>  	close(fd);
> >>>  }
> >>>  
> >>> +static void test_dgram_sendto_client(const struct test_opts *opts)
> >>> +{
> >>> +	union {
> >>> +		struct sockaddr sa;
> >>> +		struct sockaddr_vm svm;
> >>> +	} addr = {
> >>> +		.svm = {
> >>> +			.svm_family = AF_VSOCK,
> >>> +			.svm_port = 1234,
> >>> +			.svm_cid = opts->peer_cid,
> >>> +		},
> >>> +	};
> >>> +	int fd;
> >>> +
> >>> +	/* Wait for the server to be ready */
> >>> +	control_expectln("BIND");
> >>> +
> >>> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> >>> +	if (fd < 0) {
> >>> +		perror("socket");
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +
> >>> +	sendto_byte(fd, &addr.sa, sizeof(addr.svm), 1, 0);
> >>> +
> >>> +	/* Notify the server that the client has finished */
> >>> +	control_writeln("DONE");
> >>> +
> >>> +	close(fd);
> >>> +}
> >>> +
> >>> +static void test_dgram_sendto_server(const struct test_opts *opts)
> >>> +{
> >>> +	union {
> >>> +		struct sockaddr sa;
> >>> +		struct sockaddr_vm svm;
> >>> +	} addr = {
> >>> +		.svm = {
> >>> +			.svm_family = AF_VSOCK,
> >>> +			.svm_port = 1234,
> >>> +			.svm_cid = VMADDR_CID_ANY,
> >>> +		},
> >>> +	};
> >>> +	int len = sizeof(addr.sa);
> >>> +	int fd;
> >>> +
> >>> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> >>> +	if (fd < 0) {
> >>> +		perror("socket");
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +
> >>> +	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
> >>> +		perror("bind");
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +
> >>> +	/* Notify the client that the server is ready */
> >>> +	control_writeln("BIND");
> >>> +
> >>> +	recvfrom_byte(fd, &addr.sa, &len, 1, 0);
> >>> +
> >>> +	/* Wait for the client to finish */
> >>> +	control_expectln("DONE");
> >>> +
> >>> +	close(fd);
> >>> +}
> >>> +
> >>> +static void test_dgram_connect_client(const struct test_opts *opts)
> >>> +{
> >>> +	union {
> >>> +		struct sockaddr sa;
> >>> +		struct sockaddr_vm svm;
> >>> +	} addr = {
> >>> +		.svm = {
> >>> +			.svm_family = AF_VSOCK,
> >>> +			.svm_port = 1234,
> >>> +			.svm_cid = opts->peer_cid,
> >>> +		},
> >>> +	};
> >>> +	int ret;
> >>> +	int fd;
> >>> +
> >>> +	/* Wait for the server to be ready */
> >>> +	control_expectln("BIND");
> >>> +
> >>> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> >>> +	if (fd < 0) {
> >>> +		perror("bind");
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +
> >>> +	ret = connect(fd, &addr.sa, sizeof(addr.svm));
> >>> +	if (ret < 0) {
> >>> +		perror("connect");
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +
> >>> +	send_byte(fd, 1, 0);
> >>> +
> >>> +	/* Notify the server that the client has finished */
> >>> +	control_writeln("DONE");
> >>> +
> >>> +	close(fd);
> >>> +}
> >>> +
> >>> +static void test_dgram_connect_server(const struct test_opts *opts)
> >>> +{
> >>> +	test_dgram_sendto_server(opts);
> >>> +}
> >>> +
> >>> +static void test_dgram_multiconn_sendto_client(const struct test_opts *opts)
> >>> +{
> >>> +	union {
> >>> +		struct sockaddr sa;
> >>> +		struct sockaddr_vm svm;
> >>> +	} addr = {
> >>> +		.svm = {
> >>> +			.svm_family = AF_VSOCK,
> >>> +			.svm_port = 1234,
> >>> +			.svm_cid = opts->peer_cid,
> >>> +		},
> >>> +	};
> >>> +	int fds[MULTICONN_NFDS];
> >>> +	int i;
> >>> +
> >>> +	/* Wait for the server to be ready */
> >>> +	control_expectln("BIND");
> >>> +
> >>> +	for (i = 0; i < MULTICONN_NFDS; i++) {
> >>> +		fds[i] = socket(AF_VSOCK, SOCK_DGRAM, 0);
> >>> +		if (fds[i] < 0) {
> >>> +			perror("socket");
> >>> +			exit(EXIT_FAILURE);
> >>> +		}
> >>> +	}
> >>> +
> >>> +	for (i = 0; i < MULTICONN_NFDS; i++)
> >>> +		sendto_byte(fds[i], &addr.sa, sizeof(addr.svm), 1, 0);
> >>> +
> >>> +	/* Notify the server that the client has finished */
> >>> +	control_writeln("DONE");
> >>> +
> >>> +	for (i = 0; i < MULTICONN_NFDS; i++)
> >>> +		close(fds[i]);
> >>> +}
> >>> +
> >>> +static void test_dgram_multiconn_sendto_server(const struct test_opts *opts)
> >>> +{
> >>> +	union {
> >>> +		struct sockaddr sa;
> >>> +		struct sockaddr_vm svm;
> >>> +	} addr = {
> >>> +		.svm = {
> >>> +			.svm_family = AF_VSOCK,
> >>> +			.svm_port = 1234,
> >>> +			.svm_cid = VMADDR_CID_ANY,
> >>> +		},
> >>> +	};
> >>> +	int len = sizeof(addr.sa);
> >>> +	int fd;
> >>> +	int i;
> >>> +
> >>> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> >>> +	if (fd < 0) {
> >>> +		perror("socket");
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +
> >>> +	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
> >>> +		perror("bind");
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +
> >>> +	/* Notify the client that the server is ready */
> >>> +	control_writeln("BIND");
> >>> +
> >>> +	for (i = 0; i < MULTICONN_NFDS; i++)
> >>> +		recvfrom_byte(fd, &addr.sa, &len, 1, 0);
> >>> +
> >>> +	/* Wait for the client to finish */
> >>> +	control_expectln("DONE");
> >>> +
> >>> +	close(fd);
> >>> +}
> >>> +
> >>> +static void test_dgram_multiconn_send_client(const struct test_opts *opts)
> >>> +{
> >>> +	int fds[MULTICONN_NFDS];
> >>> +	int i;
> >>> +
> >>> +	/* Wait for the server to be ready */
> >>> +	control_expectln("BIND");
> >>> +
> >>> +	for (i = 0; i < MULTICONN_NFDS; i++) {
> >>> +		fds[i] = vsock_dgram_connect(opts->peer_cid, 1234);
> >>> +		if (fds[i] < 0) {
> >>> +			perror("socket");
> >>> +			exit(EXIT_FAILURE);
> >>> +		}
> >>> +	}
> >>> +
> >>> +	for (i = 0; i < MULTICONN_NFDS; i++)
> >>> +		send_byte(fds[i], 1, 0);
> >>> +
> >>> +	/* Notify the server that the client has finished */
> >>> +	control_writeln("DONE");
> >>> +
> >>> +	for (i = 0; i < MULTICONN_NFDS; i++)
> >>> +		close(fds[i]);
> >>> +}
> >>> +
> >>> +static void test_dgram_multiconn_send_server(const struct test_opts *opts)
> >>> +{
> >>> +	union {
> >>> +		struct sockaddr sa;
> >>> +		struct sockaddr_vm svm;
> >>> +	} addr = {
> >>> +		.svm = {
> >>> +			.svm_family = AF_VSOCK,
> >>> +			.svm_port = 1234,
> >>> +			.svm_cid = VMADDR_CID_ANY,
> >>> +		},
> >>> +	};
> >>> +	int fd;
> >>> +	int i;
> >>> +
> >>> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> >>> +	if (fd < 0) {
> >>> +		perror("socket");
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +
> >>> +	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
> >>> +		perror("bind");
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +
> >>> +	/* Notify the client that the server is ready */
> >>> +	control_writeln("BIND");
> >>> +
> >>> +	for (i = 0; i < MULTICONN_NFDS; i++)
> >>> +		recv_byte(fd, 1, 0);
> >>> +
> >>> +	/* Wait for the client to finish */
> >>> +	control_expectln("DONE");
> >>> +
> >>> +	close(fd);
> >>> +}
> >>> +
> >>> +static void test_dgram_msg_bounds_client(const struct test_opts *opts)
> >>> +{
> >>> +	unsigned long recv_buf_size;
> >>> +	int page_size;
> >>> +	int msg_cnt;
> >>> +	int fd;
> >>> +
> >>> +	fd = vsock_dgram_connect(opts->peer_cid, 1234);
> >>> +	if (fd < 0) {
> >>> +		perror("connect");
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +
> >>> +	/* Let the server know the client is ready */
> >>> +	control_writeln("CLNTREADY");
> >>> +
> >>> +	msg_cnt = control_readulong();
> >>> +	recv_buf_size = control_readulong();
> >>> +
> >>> +	/* Wait, until receiver sets buffer size. */
> >>> +	control_expectln("SRVREADY");
> >>> +
> >>> +	page_size = getpagesize();
> >>> +
> >>> +	for (int i = 0; i < msg_cnt; i++) {
> >>> +		unsigned long curr_hash;
> >>> +		ssize_t send_size;
> >>> +		size_t buf_size;
> >>> +		void *buf;
> >>> +
> >>> +		/* Use "small" buffers and "big" buffers. */
> >>> +		if (i & 1)
> >>> +			buf_size = page_size +
> >>> +					(rand() % (MAX_MSG_SIZE - page_size));
> >>> +		else
> >>> +			buf_size = 1 + (rand() % page_size);
> >>> +
> >>> +		buf_size = min(buf_size, recv_buf_size);
> >>> +
> >>> +		buf = malloc(buf_size);
> >>> +
> >>> +		if (!buf) {
> >>> +			perror("malloc");
> >>> +			exit(EXIT_FAILURE);
> >>> +		}
> >>> +
> >>> +		memset(buf, rand() & 0xff, buf_size);
> >>> +		/* Set at least one MSG_EOR + some random. */
> >>> +
> >>> +		send_size = send(fd, buf, buf_size, 0);
> >>> +
> >>> +		if (send_size < 0) {
> >>> +			perror("send");
> >>> +			exit(EXIT_FAILURE);
> >>> +		}
> >>> +
> >>> +		if (send_size != buf_size) {
> >>> +			fprintf(stderr, "Invalid send size\n");
> >>> +			exit(EXIT_FAILURE);
> >>> +		}
> >>> +
> >>> +		/* In theory the implementation isn't required to transmit
> >>> +		 * these packets in order, so we use this SYNC control message
> >>> +		 * so that server and client coordinate sending and receiving
> >>> +		 * one packet at a time. The client sends a packet and waits
> >>> +		 * until it has been received before sending another.
> >>> +		 */
> >>> +		control_writeln("PKTSENT");
> >>> +		control_expectln("PKTRECV");
> >>> +
> >>> +		/* Send the server a hash of the packet */
> >>> +		curr_hash = hash_djb2(buf, buf_size);
> >>> +		control_writeulong(curr_hash);
> >>> +		free(buf);
> >>> +	}
> >>> +
> >>> +	control_writeln("SENDDONE");
> >>> +	close(fd);
> >>> +}
> >>> +
> >>> +static void test_dgram_msg_bounds_server(const struct test_opts *opts)
> >>> +{
> >>> +	const unsigned long msg_cnt = 16;
> >>> +	unsigned long sock_buf_size;
> >>> +	struct msghdr msg = {0};
> >>> +	struct iovec iov = {0};
> >>> +	char buf[MAX_MSG_SIZE];
> >>> +	socklen_t len;
> >>> +	int fd;
> >>> +	int i;
> >>> +
> >>> +	fd = vsock_dgram_bind(VMADDR_CID_ANY, 1234);
> >>> +
> >>> +	if (fd < 0) {
> >>> +		perror("bind");
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +
> >>> +	/* Set receive buffer to maximum */
> >>> +	sock_buf_size = -1;
> >>> +	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
> >>> +		       &sock_buf_size, sizeof(sock_buf_size))) {
> >>> +		perror("setsockopt(SO_RECVBUF)");
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +
> >>> +	/* Retrieve the receive buffer size */
> >>> +	len = sizeof(sock_buf_size);
> >>> +	if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF,
> >>> +		       &sock_buf_size, &len)) {
> >>> +		perror("getsockopt(SO_RECVBUF)");
> >>> +		exit(EXIT_FAILURE);
> >>> +	}
> >>> +
> >>> +	/* Client ready to receive parameters */
> >>> +	control_expectln("CLNTREADY");
> >>> +
> >>> +	control_writeulong(msg_cnt);
> >>> +	control_writeulong(sock_buf_size);
> >>> +
> >>> +	/* Ready to receive data. */
> >>> +	control_writeln("SRVREADY");
> >>> +
> >>> +	iov.iov_base = buf;
> >>> +	iov.iov_len = sizeof(buf);
> >>> +	msg.msg_iov = &iov;
> >>> +	msg.msg_iovlen = 1;
> >>> +
> >>> +	for (i = 0; i < msg_cnt; i++) {
> >>> +		unsigned long remote_hash;
> >>> +		unsigned long curr_hash;
> >>> +		ssize_t recv_size;
> >>> +
> >>> +		control_expectln("PKTSENT");
> >>> +		recv_size = recvmsg(fd, &msg, 0);
> >>> +		control_writeln("PKTRECV");
> >>> +
> >>> +		if (!recv_size)
> >>> +			break;
> >>> +
> >>> +		if (recv_size < 0) {
> >>> +			perror("recvmsg");
> >>> +			exit(EXIT_FAILURE);
> >>> +		}
> >>> +
> >>> +		curr_hash = hash_djb2(msg.msg_iov[0].iov_base, recv_size);
> >>> +		remote_hash = control_readulong();
> >>> +
> >>> +		if (curr_hash != remote_hash) {
> >>> +			fprintf(stderr, "Message bounds broken\n");
> >>> +			exit(EXIT_FAILURE);
> >>> +		}
> >>> +	}
> >>> +
> >>> +	close(fd);
> >>> +}
> >>> +
> >>>  static struct test_case test_cases[] = {
> >>>  	{
> >>>  		.name = "SOCK_STREAM connection reset",
> >>> @@ -1128,6 +1535,31 @@ static struct test_case test_cases[] = {
> >>>  		.run_client = test_stream_virtio_skb_merge_client,
> >>>  		.run_server = test_stream_virtio_skb_merge_server,
> >>>  	},
> >>> +	{
> >>> +		.name = "SOCK_DGRAM client sendto",
> >>> +		.run_client = test_dgram_sendto_client,
> >>> +		.run_server = test_dgram_sendto_server,
> >>> +	},
> >>> +	{
> >>> +		.name = "SOCK_DGRAM client connect",
> >>> +		.run_client = test_dgram_connect_client,
> >>> +		.run_server = test_dgram_connect_server,
> >>> +	},
> >>> +	{
> >>> +		.name = "SOCK_DGRAM multiple connections using sendto",
> >>> +		.run_client = test_dgram_multiconn_sendto_client,
> >>> +		.run_server = test_dgram_multiconn_sendto_server,
> >>> +	},
> >>> +	{
> >>> +		.name = "SOCK_DGRAM multiple connections using send",
> >>> +		.run_client = test_dgram_multiconn_send_client,
> >>> +		.run_server = test_dgram_multiconn_send_server,
> >>> +	},
> >>> +	{
> >>> +		.name = "SOCK_DGRAM msg bounds",
> >>> +		.run_client = test_dgram_msg_bounds_client,
> >>> +		.run_server = test_dgram_msg_bounds_server,
> >>> +	},
> >>>  	{},
> >>>  };
> >>>  
> >>>

^ permalink raw reply

* RE: [PATCH v2] Drivers: hv: Change hv_free_hyperv_page() to take void * argument
From: Dexuan Cui @ 2023-06-23 22:19 UTC (permalink / raw)
  To: Kameron Carr, arnd@arndb.de, Haiyang Zhang, KY Srinivasan,
	linux-arch@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, wei.liu@kernel.org
In-Reply-To: <1687558189-19734-1-git-send-email-kameroncarr@linux.microsoft.com>

> From: Kameron Carr <kameroncarr@linux.microsoft.com>
> Sent: Friday, June 23, 2023 3:10 PM
>  ...
> Currently hv_free_hyperv_page() takes an unsigned long argument, which
> is inconsistent with the void * return value from the corresponding
> hv_alloc_hyperv_page() function and variants. This creates unnecessary
> extra casting.
> 
> Change the hv_free_hyperv_page() argument type to void *.
> Also remove redundant casts from invocations of
> hv_alloc_hyperv_page() and variants.
> 
> Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
> ---
> V1 -> V2: Added Signed-off-by

Reviewed-by: Dexuan Cui <decui@microsoft.com>

^ permalink raw reply

* Re: [PATCH RFC net-next v4 6/8] virtio/vsock: support dgrams
From: Bobby Eshleman @ 2023-06-23  4:37 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Arseniy Krasnov, Bobby Eshleman, Stefan Hajnoczi,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers, Dan Carpenter, Simon Horman, kvm,
	virtualization, netdev, linux-kernel, linux-hyperv, bpf
In-Reply-To: <ppx75eomyyb354knfkwbwin3il2ot7hf5cefwrt6ztpcbc3pps@q736cq5v4bdh>

On Thu, Jun 22, 2023 at 06:09:12PM +0200, Stefano Garzarella wrote:
> On Sun, Jun 11, 2023 at 11:49:02PM +0300, Arseniy Krasnov wrote:
> > Hello Bobby!
> > 
> > On 10.06.2023 03:58, Bobby Eshleman wrote:
> > > This commit adds support for datagrams over virtio/vsock.
> > > 
> > > Message boundaries are preserved on a per-skb and per-vq entry basis.
> > 
> > I'm a little bit confused about the following case: let vhost sends 4097 bytes
> > datagram to the guest. Guest uses 4096 RX buffers in it's virtio queue, each
> > buffer has attached empty skb to it. Vhost places first 4096 bytes to the first
> > buffer of guests RX queue, and 1 last byte to the second buffer. Now IIUC guest
> > has two skb in it rx queue, and user in guest wants to read data - does it read
> > 4097 bytes, while guest has two skb - 4096 bytes and 1 bytes? In seqpacket there is
> > special marker in header which shows where message ends, and how it works here?
> 
> I think the main difference is that DGRAM is not connection-oriented, so
> we don't have a stream and we can't split the packet into 2 (maybe we
> could, but we have no guarantee that the second one for example will be
> not discarded because there is no space).
> 
> So I think it is acceptable as a restriction to keep it simple.
> 
> My only doubt is, should we make the RX buffer size configurable,
> instead of always using 4k?
> 
I think that is a really good idea. What mechanism do you imagine?

For sendmsg() with buflen > VQ_BUF_SIZE, I think I'd like -ENOBUFS
returned even though it is uncharacteristic of Linux sockets.
Alternatively, silently dropping is okay... but seems needlessly
unhelpful.

FYI, this patch is broken for h2g because it requeues partially sent
skbs, so probably doesn't need much code review until we decided on the
policy.

Best,
Bobby

^ permalink raw reply

* [PATCH v2] Drivers: hv: Change hv_free_hyperv_page() to take void * argument
From: Kameron Carr @ 2023-06-23 22:09 UTC (permalink / raw)
  To: arnd, decui, haiyangz, kys, linux-arch, linux-hyperv,
	linux-kernel, wei.liu

Currently hv_free_hyperv_page() takes an unsigned long argument, which
is inconsistent with the void * return value from the corresponding
hv_alloc_hyperv_page() function and variants. This creates unnecessary
extra casting.

Change the hv_free_hyperv_page() argument type to void *.
Also remove redundant casts from invocations of
hv_alloc_hyperv_page() and variants.

Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
---
V1 -> V2: Added Signed-off-by

 drivers/hv/connection.c        | 13 ++++++-------
 drivers/hv/hv_common.c         | 10 +++++-----
 include/asm-generic/mshyperv.h |  2 +-
 3 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 5978e9d..ebf15f3 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -209,8 +209,7 @@ int vmbus_connect(void)
 	 * Setup the vmbus event connection for channel interrupt
 	 * abstraction stuff
 	 */
-	vmbus_connection.int_page =
-	(void *)hv_alloc_hyperv_zeroed_page();
+	vmbus_connection.int_page = hv_alloc_hyperv_zeroed_page();
 	if (vmbus_connection.int_page == NULL) {
 		ret = -ENOMEM;
 		goto cleanup;
@@ -225,8 +224,8 @@ int vmbus_connect(void)
 	 * Setup the monitor notification facility. The 1st page for
 	 * parent->child and the 2nd page for child->parent
 	 */
-	vmbus_connection.monitor_pages[0] = (void *)hv_alloc_hyperv_page();
-	vmbus_connection.monitor_pages[1] = (void *)hv_alloc_hyperv_page();
+	vmbus_connection.monitor_pages[0] = hv_alloc_hyperv_page();
+	vmbus_connection.monitor_pages[1] = hv_alloc_hyperv_page();
 	if ((vmbus_connection.monitor_pages[0] == NULL) ||
 	    (vmbus_connection.monitor_pages[1] == NULL)) {
 		ret = -ENOMEM;
@@ -333,15 +332,15 @@ void vmbus_disconnect(void)
 		destroy_workqueue(vmbus_connection.work_queue);
 
 	if (vmbus_connection.int_page) {
-		hv_free_hyperv_page((unsigned long)vmbus_connection.int_page);
+		hv_free_hyperv_page(vmbus_connection.int_page);
 		vmbus_connection.int_page = NULL;
 	}
 
 	set_memory_encrypted((unsigned long)vmbus_connection.monitor_pages[0], 1);
 	set_memory_encrypted((unsigned long)vmbus_connection.monitor_pages[1], 1);
 
-	hv_free_hyperv_page((unsigned long)vmbus_connection.monitor_pages[0]);
-	hv_free_hyperv_page((unsigned long)vmbus_connection.monitor_pages[1]);
+	hv_free_hyperv_page(vmbus_connection.monitor_pages[0]);
+	hv_free_hyperv_page(vmbus_connection.monitor_pages[1]);
 	vmbus_connection.monitor_pages[0] = NULL;
 	vmbus_connection.monitor_pages[1] = NULL;
 }
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 542a1d5..6a2258f 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -115,12 +115,12 @@ void *hv_alloc_hyperv_zeroed_page(void)
 }
 EXPORT_SYMBOL_GPL(hv_alloc_hyperv_zeroed_page);
 
-void hv_free_hyperv_page(unsigned long addr)
+void hv_free_hyperv_page(void *addr)
 {
 	if (PAGE_SIZE == HV_HYP_PAGE_SIZE)
-		free_page(addr);
+		free_page((unsigned long)addr);
 	else
-		kfree((void *)addr);
+		kfree(addr);
 }
 EXPORT_SYMBOL_GPL(hv_free_hyperv_page);
 
@@ -253,7 +253,7 @@ static void hv_kmsg_dump_unregister(void)
 	atomic_notifier_chain_unregister(&panic_notifier_list,
 					 &hyperv_panic_report_block);
 
-	hv_free_hyperv_page((unsigned long)hv_panic_page);
+	hv_free_hyperv_page(hv_panic_page);
 	hv_panic_page = NULL;
 }
 
@@ -270,7 +270,7 @@ static void hv_kmsg_dump_register(void)
 	ret = kmsg_dump_register(&hv_kmsg_dumper);
 	if (ret) {
 		pr_err("Hyper-V: kmsg dump register error 0x%x\n", ret);
-		hv_free_hyperv_page((unsigned long)hv_panic_page);
+		hv_free_hyperv_page(hv_panic_page);
 		hv_panic_page = NULL;
 	}
 }
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index 402a8c1..a8f4b65 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -190,7 +190,7 @@ static inline void vmbus_signal_eom(struct hv_message *msg, u32 old_msg_type)
 
 void *hv_alloc_hyperv_page(void);
 void *hv_alloc_hyperv_zeroed_page(void);
-void hv_free_hyperv_page(unsigned long addr);
+void hv_free_hyperv_page(void *addr);
 
 /**
  * hv_cpu_number_to_vp_number() - Map CPU to VP.
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH 22/26] net: mana: use array_size
From: Julia Lawall @ 2023-06-23 21:14 UTC (permalink / raw)
  To: K. Y. Srinivasan
  Cc: keescook, kernel-janitors, Haiyang Zhang, Wei Liu, Dexuan Cui,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <20230623211457.102544-1-Julia.Lawall@inria.fr>

Use array_size to protect against multiplication overflows.

The changes were done using the following Coccinelle semantic patch:

// <smpl>
@@
    expression E1, E2;
    constant C1, C2;
    identifier alloc = {vmalloc,vzalloc};
@@
    
(
      alloc(C1 * C2,...)
|
      alloc(
-           (E1) * (E2)
+           array_size(E1, E2)
      ,...)
)
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>

---
 drivers/net/ethernet/microsoft/mana/hw_channel.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c
index 9d1507eba5b9..e82c513760f9 100644
--- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
+++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
@@ -627,7 +627,7 @@ static int mana_hwc_establish_channel(struct gdma_context *gc, u16 *q_depth,
 	if (WARN_ON(cq->id >= gc->max_num_cqs))
 		return -EPROTO;
 
-	gc->cq_table = vzalloc(gc->max_num_cqs * sizeof(struct gdma_queue *));
+	gc->cq_table = vzalloc(array_size(gc->max_num_cqs, sizeof(struct gdma_queue *)));
 	if (!gc->cq_table)
 		return -ENOMEM;
 


^ permalink raw reply related

* [PATCH 00/26] use array_size
From: Julia Lawall @ 2023-06-23 21:14 UTC (permalink / raw)
  To: linux-staging
  Cc: keescook, kernel-janitors, Tianshu Qiu, Bingbu Cao, linux-sgx,
	H. Peter Anvin, Dave Hansen, kasan-dev, Andrey Konovalov,
	Dmitry Vyukov, iommu, linux-tegra, Robin Murphy, Krishna Reddy,
	linux-scsi, linux-rdma, dri-devel, linux-kernel, netdev,
	Shailend Chand, Benjamin Gaignard, Liam Mark, Laura Abbott,
	Brian Starkey, John Stultz, linux-media, linaro-mm-sig, Xuan Zhuo,
	virtualization, mhi, linux-arm-msm, linux-btrfs, intel-gvt-dev,
	intel-gfx, VMware Graphics Reviewers, linux-hyperv

Use array_size to protect against multiplication overflows.

This follows up on the following patches by Kees Cook from 2018.

42bc47b35320 ("treewide: Use array_size() in vmalloc()")
fad953ce0b22 ("treewide: Use array_size() in vzalloc()")

The changes were done using the following Coccinelle semantic patch,
adapted from the one posted by Kees.

// Drop single-byte sizes and redundant parens.
@@
    expression COUNT;
    typedef u8;
    typedef __u8;
    type t = {u8,__u8,char,unsigned char};
    identifier alloc = {vmalloc,vzalloc};
@@
      alloc(
-           (sizeof(t)) * (COUNT)
+           COUNT
      , ...)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
    expression COUNT;
    size_t e1, e2, e3;
    identifier alloc = {vmalloc,vzalloc};
@@

(    
      alloc(
-           (e1) * (e2) * (e3)
+           array3_size(e1, e2, e3)
      ,...)
|
      alloc(
-           (e1) * (e2) * (COUNT)
+           array3_size(COUNT, e1, e2)
      ,...)
)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
    expression STRIDE, COUNT;
    size_t e;
    identifier alloc = {vmalloc,vzalloc};
@@

      alloc(
-           (e) * (COUNT) * (STRIDE)
+           array3_size(COUNT, STRIDE, e)
      ,...)

// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
    expression E1, E2, E3;
    constant C1, C2, C3;
    identifier alloc = {vmalloc,vzalloc};
@@
    
(
      alloc(C1 * C2 * C3,...)
|
      alloc(
-           (E1) * (E2) * (E3)
+           array3_size(E1, E2, E3)
      ,...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
    size_t e1,e2;
    expression COUNT;
    identifier alloc = {vmalloc,vzalloc};
@@

(
      alloc(
-           (e1) * (e2)
+           array_size(e1, e2)
      ,...)
|
      alloc(
-           (e1) * (COUNT)
+           array_size(COUNT, e1)
      ,...)
)
    
// And then all remaining 2 factors products when they're not all constants.
@@
    expression E1, E2;
    constant C1, C2;
    identifier alloc = {vmalloc,vzalloc};
@@
    
(
      alloc(C1 * C2,...)
|
      alloc(
-           (E1) * (E2)
+           array_size(E1, E2)
      ,...)
)


---

 arch/x86/kernel/cpu/sgx/main.c                    |    3 ++-
 drivers/accel/habanalabs/common/device.c          |    3 ++-
 drivers/accel/habanalabs/common/state_dump.c      |    6 +++---
 drivers/bus/mhi/host/init.c                       |    4 ++--
 drivers/comedi/comedi_buf.c                       |    4 ++--
 drivers/dma-buf/heaps/system_heap.c               |    2 +-
 drivers/gpu/drm/gud/gud_pipe.c                    |    2 +-
 drivers/gpu/drm/i915/gvt/gtt.c                    |    6 ++++--
 drivers/gpu/drm/vmwgfx/vmwgfx_devcaps.c           |    2 +-
 drivers/infiniband/hw/bnxt_re/qplib_res.c         |    4 ++--
 drivers/infiniband/hw/erdma/erdma_verbs.c         |    4 ++--
 drivers/infiniband/sw/siw/siw_qp.c                |    4 ++--
 drivers/infiniband/sw/siw/siw_verbs.c             |    6 +++---
 drivers/iommu/tegra-gart.c                        |    4 ++--
 drivers/net/ethernet/amd/pds_core/core.c          |    4 ++--
 drivers/net/ethernet/freescale/enetc/enetc.c      |    4 ++--
 drivers/net/ethernet/google/gve/gve_tx.c          |    2 +-
 drivers/net/ethernet/marvell/octeon_ep/octep_rx.c |    2 +-
 drivers/net/ethernet/microsoft/mana/hw_channel.c  |    2 +-
 drivers/net/ethernet/pensando/ionic/ionic_lif.c   |    4 ++--
 drivers/scsi/fnic/fnic_trace.c                    |    2 +-
 drivers/scsi/qla2xxx/qla_init.c                   |    4 ++--
 drivers/staging/media/ipu3/ipu3-mmu.c             |    2 +-
 drivers/vdpa/vdpa_user/iova_domain.c              |    3 +--
 drivers/virtio/virtio_mem.c                       |    6 +++---
 fs/btrfs/zoned.c                                  |    5 +++--
 kernel/kcov.c                                     |    2 +-
 lib/test_vmalloc.c                                |   12 ++++++------
 28 files changed, 56 insertions(+), 52 deletions(-)

^ permalink raw reply

* Re: [PATCH RFC net-next v4 3/8] vsock: support multi-transport datagrams
From: Bobby Eshleman @ 2023-06-23  2:59 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Bobby Eshleman, linux-hyperv, Stefan Hajnoczi, kvm,
	Michael S. Tsirkin, VMware PV-Drivers Reviewers, Simon Horman,
	virtualization, Eric Dumazet, Dan Carpenter, Xuan Zhuo, Wei Liu,
	Dexuan Cui, Bryan Tan, Jakub Kicinski, Paolo Abeni, Haiyang Zhang,
	Krasnov Arseniy, Vishnu Dasa, netdev, linux-kernel, bpf,
	David S. Miller
In-Reply-To: <ZJUIWcgg13F7DNBm@bullseye>

On Fri, Jun 23, 2023 at 02:50:01AM +0000, Bobby Eshleman wrote:
> On Thu, Jun 22, 2023 at 05:19:08PM +0200, Stefano Garzarella wrote:
> > On Sat, Jun 10, 2023 at 12:58:30AM +0000, Bobby Eshleman wrote:
> > > This patch adds support for multi-transport datagrams.
> > > 
> > > This includes:
> > > - Per-packet lookup of transports when using sendto(sockaddr_vm)
> > > - Selecting H2G or G2H transport using VMADDR_FLAG_TO_HOST and CID in
> > >  sockaddr_vm
> > > 
> > > To preserve backwards compatibility with VMCI, some important changes
> > > were made. The "transport_dgram" / VSOCK_TRANSPORT_F_DGRAM is changed to
> > > be used for dgrams iff there is not yet a g2h or h2g transport that has
> > 
> > s/iff/if
> > 
> > > been registered that can transmit the packet. If there is a g2h/h2g
> > > transport for that remote address, then that transport will be used and
> > > not "transport_dgram". This essentially makes "transport_dgram" a
> > > fallback transport for when h2g/g2h has not yet gone online, which
> > > appears to be the exact use case for VMCI.
> > > 
> > > This design makes sense, because there is no reason that the
> > > transport_{g2h,h2g} cannot also service datagrams, which makes the role
> > > of transport_dgram difficult to understand outside of the VMCI context.
> > > 
> > > The logic around "transport_dgram" had to be retained to prevent
> > > breaking VMCI:
> > > 
> > > 1) VMCI datagrams appear to function outside of the h2g/g2h
> > >   paradigm. When the vmci transport becomes online, it registers itself
> > >   with the DGRAM feature, but not H2G/G2H. Only later when the
> > >   transport has more information about its environment does it register
> > >   H2G or G2H. In the case that a datagram socket becomes active
> > >   after DGRAM registration but before G2H/H2G registration, the
> > >   "transport_dgram" transport needs to be used.
> > 
> > IIRC we did this, because at that time only VMCI supported DGRAM. Now that
> > there are more transports, maybe DGRAM can follow the h2g/g2h paradigm.
> > 
> 
> Totally makes sense. I'll add the detail above that the prior design was
> a result of chronology.
> 
> > > 
> > > 2) VMCI seems to require special message be sent by the transport when a
> > >   datagram socket calls bind(). Under the h2g/g2h model, the transport
> > >   is selected using the remote_addr which is set by connect(). At
> > >   bind time there is no remote_addr because often no connect() has been
> > >   called yet: the transport is null. Therefore, with a null transport
> > >   there doesn't seem to be any good way for a datagram socket a tell the
> > >   VMCI transport that it has just had bind() called upon it.
> > 
> > @Vishnu, @Bryan do you think we can avoid this in some way?
> > 
> > > 
> > > Only transports with a special datagram fallback use-case such as VMCI
> > > need to register VSOCK_TRANSPORT_F_DGRAM.
> > 
> > Maybe we should rename it in VSOCK_TRANSPORT_F_DGRAM_FALLBACK or
> > something like that.
> > 
> > In any case, we definitely need to update the comment in
> > include/net/af_vsock.h on top of VSOCK_TRANSPORT_F_DGRAM mentioning
> > this.
> > 
> 
> Agreed. I'll rename to VSOCK_TRANSPORT_F_DGRAM_FALLBACK, unless we find
> there is a better way altogether.
> 
> > > 
> > > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> > > ---
> > > drivers/vhost/vsock.c                   |  1 -
> > > include/linux/virtio_vsock.h            |  2 -
> > > net/vmw_vsock/af_vsock.c                | 78 +++++++++++++++++++++++++--------
> > > net/vmw_vsock/hyperv_transport.c        |  6 ---
> > > net/vmw_vsock/virtio_transport.c        |  1 -
> > > net/vmw_vsock/virtio_transport_common.c |  7 ---
> > > net/vmw_vsock/vsock_loopback.c          |  1 -
> > > 7 files changed, 60 insertions(+), 36 deletions(-)
> > > 
> > > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> > > index c8201c070b4b..8f0082da5e70 100644
> > > --- a/drivers/vhost/vsock.c
> > > +++ b/drivers/vhost/vsock.c
> > > @@ -410,7 +410,6 @@ static struct virtio_transport vhost_transport = {
> > > 		.cancel_pkt               = vhost_transport_cancel_pkt,
> > > 
> > > 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
> > > -		.dgram_bind               = virtio_transport_dgram_bind,
> > > 		.dgram_allow              = virtio_transport_dgram_allow,
> > > 		.dgram_get_cid		  = virtio_transport_dgram_get_cid,
> > > 		.dgram_get_port		  = virtio_transport_dgram_get_port,
> > > diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> > > index 23521a318cf0..73afa09f4585 100644
> > > --- a/include/linux/virtio_vsock.h
> > > +++ b/include/linux/virtio_vsock.h
> > > @@ -216,8 +216,6 @@ void virtio_transport_notify_buffer_size(struct vsock_sock *vsk, u64 *val);
> > > u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk);
> > > bool virtio_transport_stream_is_active(struct vsock_sock *vsk);
> > > bool virtio_transport_stream_allow(u32 cid, u32 port);
> > > -int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> > > -				struct sockaddr_vm *addr);
> > > bool virtio_transport_dgram_allow(u32 cid, u32 port);
> > > int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid);
> > > int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port);
> > > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> > > index 74358f0b47fa..ef86765f3765 100644
> > > --- a/net/vmw_vsock/af_vsock.c
> > > +++ b/net/vmw_vsock/af_vsock.c
> > > @@ -438,6 +438,18 @@ vsock_connectible_lookup_transport(unsigned int cid, __u8 flags)
> > > 	return transport;
> > > }
> > > 
> > > +static const struct vsock_transport *
> > > +vsock_dgram_lookup_transport(unsigned int cid, __u8 flags)
> > > +{
> > > +	const struct vsock_transport *transport;
> > > +
> > > +	transport = vsock_connectible_lookup_transport(cid, flags);
> > > +	if (transport)
> > > +		return transport;
> > > +
> > > +	return transport_dgram;
> > > +}
> > > +
> > > /* Assign a transport to a socket and call the .init transport callback.
> > >  *
> > >  * Note: for connection oriented socket this must be called when vsk->remote_addr
> > > @@ -474,7 +486,8 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
> > > 
> > > 	switch (sk->sk_type) {
> > > 	case SOCK_DGRAM:
> > > -		new_transport = transport_dgram;
> > > +		new_transport = vsock_dgram_lookup_transport(remote_cid,
> > > +							     remote_flags);
> > > 		break;
> > > 	case SOCK_STREAM:
> > > 	case SOCK_SEQPACKET:
> > > @@ -691,6 +704,9 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
> > > static int __vsock_bind_dgram(struct vsock_sock *vsk,
> > > 			      struct sockaddr_vm *addr)
> > > {
> > > +	if (!vsk->transport || !vsk->transport->dgram_bind)
> > > +		return -EINVAL;
> > > +
> > > 	return vsk->transport->dgram_bind(vsk, addr);
> > > }
> > > 
> > > @@ -1172,19 +1188,24 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
> > > 
> > > 	lock_sock(sk);
> > > 
> > > -	transport = vsk->transport;
> > > -
> > > -	err = vsock_auto_bind(vsk);
> > > -	if (err)
> > > -		goto out;
> > > -
> > > -
> > > 	/* If the provided message contains an address, use that.  Otherwise
> > > 	 * fall back on the socket's remote handle (if it has been connected).
> > > 	 */
> > > 	if (msg->msg_name &&
> > > 	    vsock_addr_cast(msg->msg_name, msg->msg_namelen,
> > > 			    &remote_addr) == 0) {
> > > +		transport = vsock_dgram_lookup_transport(remote_addr->svm_cid,
> > > +							 remote_addr->svm_flags);
> > > +		if (!transport) {
> > > +			err = -EINVAL;
> > > +			goto out;
> > > +		}
> > > +
> > > +		if (!try_module_get(transport->module)) {
> > > +			err = -ENODEV;
> > > +			goto out;
> > > +		}
> > > +
> > > 		/* Ensure this address is of the right type and is a valid
> > > 		 * destination.
> > > 		 */
> > > @@ -1193,11 +1214,27 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
> > > 			remote_addr->svm_cid = transport->get_local_cid();
> > > 
> > 
> > From here ...
> > 
> > > 		if (!vsock_addr_bound(remote_addr)) {
> > > +			module_put(transport->module);
> > > +			err = -EINVAL;
> > > +			goto out;
> > > +		}
> > > +
> > > +		if (!transport->dgram_allow(remote_addr->svm_cid,
> > > +					    remote_addr->svm_port)) {
> > > +			module_put(transport->module);
> > > 			err = -EINVAL;
> > > 			goto out;
> > > 		}
> > > +
> > > +		err = transport->dgram_enqueue(vsk, remote_addr, msg, len);
> > 
> > ... to here, looks like duplicate code, can we get it out of the if block?
> > 
> 
> Yes, I think using something like this:
> 
> [...]
> 	bool module_got = false;
> 
> [...]
> 		if (!try_module_get(transport->module)) {
> 			err = -ENODEV;
> 			goto out;
> 		}
> 		module_got = true;
> 
> [...]
> 
> out:
> 	if (likely(transport && !err && module_got))

Actually, just...

	if (module_got)

> 		module_put(transport->module)
> 
> > > +		module_put(transport->module);
> > > 	} else if (sock->state == SS_CONNECTED) {
> > > 		remote_addr = &vsk->remote_addr;
> > > +		transport = vsk->transport;
> > > +
> > > +		err = vsock_auto_bind(vsk);
> > > +		if (err)
> > > +			goto out;
> > > 
> > > 		if (remote_addr->svm_cid == VMADDR_CID_ANY)
> > > 			remote_addr->svm_cid = transport->get_local_cid();
> > > @@ -1205,23 +1242,23 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
> > > 		/* XXX Should connect() or this function ensure remote_addr is
> > > 		 * bound?
> > > 		 */
> > > -		if (!vsock_addr_bound(&vsk->remote_addr)) {
> > > +		if (!vsock_addr_bound(remote_addr)) {
> > > 			err = -EINVAL;
> > > 			goto out;
> > > 		}
> > > -	} else {
> > > -		err = -EINVAL;
> > > -		goto out;
> > > -	}
> > > 
> > > -	if (!transport->dgram_allow(remote_addr->svm_cid,
> > > -				    remote_addr->svm_port)) {
> > > +		if (!transport->dgram_allow(remote_addr->svm_cid,
> > > +					    remote_addr->svm_port)) {
> > > +			err = -EINVAL;
> > > +			goto out;
> > > +		}
> > > +
> > > +		err = transport->dgram_enqueue(vsk, remote_addr, msg, len);
> > > +	} else {
> > > 		err = -EINVAL;
> > > 		goto out;
> > > 	}
> > > 
> > > -	err = transport->dgram_enqueue(vsk, remote_addr, msg, len);
> > > -
> > > out:
> > > 	release_sock(sk);
> > > 	return err;
> > > @@ -1255,13 +1292,18 @@ static int vsock_dgram_connect(struct socket *sock,
> > > 	if (err)
> > > 		goto out;
> > > 
> > > +	memcpy(&vsk->remote_addr, remote_addr, sizeof(vsk->remote_addr));
> > > +
> > > +	err = vsock_assign_transport(vsk, NULL);
> > > +	if (err)
> > > +		goto out;
> > > +
> > > 	if (!vsk->transport->dgram_allow(remote_addr->svm_cid,
> > > 					 remote_addr->svm_port)) {
> > > 		err = -EINVAL;
> > > 		goto out;
> > > 	}
> > > 
> > > -	memcpy(&vsk->remote_addr, remote_addr, sizeof(vsk->remote_addr));
> > > 	sock->state = SS_CONNECTED;
> > > 
> > > 	/* sock map disallows redirection of non-TCP sockets with sk_state !=
> > > diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
> > > index ff6e87e25fa0..c00bc5da769a 100644
> > > --- a/net/vmw_vsock/hyperv_transport.c
> > > +++ b/net/vmw_vsock/hyperv_transport.c
> > > @@ -551,11 +551,6 @@ static void hvs_destruct(struct vsock_sock *vsk)
> > > 	kfree(hvs);
> > > }
> > > 
> > > -static int hvs_dgram_bind(struct vsock_sock *vsk, struct sockaddr_vm *addr)
> > > -{
> > > -	return -EOPNOTSUPP;
> > > -}
> > > -
> > > static int hvs_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
> > > {
> > > 	return -EOPNOTSUPP;
> > > @@ -841,7 +836,6 @@ static struct vsock_transport hvs_transport = {
> > > 	.connect                  = hvs_connect,
> > > 	.shutdown                 = hvs_shutdown,
> > > 
> > > -	.dgram_bind               = hvs_dgram_bind,
> > > 	.dgram_get_cid		  = hvs_dgram_get_cid,
> > > 	.dgram_get_port		  = hvs_dgram_get_port,
> > > 	.dgram_get_length	  = hvs_dgram_get_length,
> > > diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> > > index 5763cdf13804..1b7843a7779a 100644
> > > --- a/net/vmw_vsock/virtio_transport.c
> > > +++ b/net/vmw_vsock/virtio_transport.c
> > > @@ -428,7 +428,6 @@ static struct virtio_transport virtio_transport = {
> > > 		.shutdown                 = virtio_transport_shutdown,
> > > 		.cancel_pkt               = virtio_transport_cancel_pkt,
> > > 
> > > -		.dgram_bind               = virtio_transport_dgram_bind,
> > > 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
> > > 		.dgram_allow              = virtio_transport_dgram_allow,
> > > 		.dgram_get_cid		  = virtio_transport_dgram_get_cid,
> > > diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> > > index e6903c719964..d5a3c8efe84b 100644
> > > --- a/net/vmw_vsock/virtio_transport_common.c
> > > +++ b/net/vmw_vsock/virtio_transport_common.c
> > > @@ -790,13 +790,6 @@ bool virtio_transport_stream_allow(u32 cid, u32 port)
> > > }
> > > EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
> > > 
> > > -int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> > > -				struct sockaddr_vm *addr)
> > > -{
> > > -	return -EOPNOTSUPP;
> > > -}
> > > -EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
> > > -
> > > int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
> > > {
> > > 	return -EOPNOTSUPP;
> > > diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
> > > index 2f3cabc79ee5..e9de45a26fbd 100644
> > > --- a/net/vmw_vsock/vsock_loopback.c
> > > +++ b/net/vmw_vsock/vsock_loopback.c
> > > @@ -61,7 +61,6 @@ static struct virtio_transport loopback_transport = {
> > > 		.shutdown                 = virtio_transport_shutdown,
> > > 		.cancel_pkt               = vsock_loopback_cancel_pkt,
> > > 
> > > -		.dgram_bind               = virtio_transport_dgram_bind,
> > > 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
> > > 		.dgram_allow              = virtio_transport_dgram_allow,
> > > 		.dgram_get_cid		  = virtio_transport_dgram_get_cid,
> > > 
> > > -- 
> > > 2.30.2
> > > 
> > 
> > The rest LGTM!
> > 
> > Stefano
> 
> Thanks,
> Bobby

^ permalink raw reply

* Re: [PATCH RFC net-next v4 3/8] vsock: support multi-transport datagrams
From: Bobby Eshleman @ 2023-06-23  2:50 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Bobby Eshleman, linux-hyperv, Stefan Hajnoczi, kvm,
	Michael S. Tsirkin, VMware PV-Drivers Reviewers, Simon Horman,
	virtualization, Eric Dumazet, Dan Carpenter, Xuan Zhuo, Wei Liu,
	Dexuan Cui, Bryan Tan, Jakub Kicinski, Paolo Abeni, Haiyang Zhang,
	Krasnov Arseniy, Vishnu Dasa, netdev, linux-kernel, bpf,
	David S. Miller
In-Reply-To: <tngyeva5by3aldrhlixajjin2hqmcl6uruvuoed7hyrndlesfd@bbv7aphqye2q>

On Thu, Jun 22, 2023 at 05:19:08PM +0200, Stefano Garzarella wrote:
> On Sat, Jun 10, 2023 at 12:58:30AM +0000, Bobby Eshleman wrote:
> > This patch adds support for multi-transport datagrams.
> > 
> > This includes:
> > - Per-packet lookup of transports when using sendto(sockaddr_vm)
> > - Selecting H2G or G2H transport using VMADDR_FLAG_TO_HOST and CID in
> >  sockaddr_vm
> > 
> > To preserve backwards compatibility with VMCI, some important changes
> > were made. The "transport_dgram" / VSOCK_TRANSPORT_F_DGRAM is changed to
> > be used for dgrams iff there is not yet a g2h or h2g transport that has
> 
> s/iff/if
> 
> > been registered that can transmit the packet. If there is a g2h/h2g
> > transport for that remote address, then that transport will be used and
> > not "transport_dgram". This essentially makes "transport_dgram" a
> > fallback transport for when h2g/g2h has not yet gone online, which
> > appears to be the exact use case for VMCI.
> > 
> > This design makes sense, because there is no reason that the
> > transport_{g2h,h2g} cannot also service datagrams, which makes the role
> > of transport_dgram difficult to understand outside of the VMCI context.
> > 
> > The logic around "transport_dgram" had to be retained to prevent
> > breaking VMCI:
> > 
> > 1) VMCI datagrams appear to function outside of the h2g/g2h
> >   paradigm. When the vmci transport becomes online, it registers itself
> >   with the DGRAM feature, but not H2G/G2H. Only later when the
> >   transport has more information about its environment does it register
> >   H2G or G2H. In the case that a datagram socket becomes active
> >   after DGRAM registration but before G2H/H2G registration, the
> >   "transport_dgram" transport needs to be used.
> 
> IIRC we did this, because at that time only VMCI supported DGRAM. Now that
> there are more transports, maybe DGRAM can follow the h2g/g2h paradigm.
> 

Totally makes sense. I'll add the detail above that the prior design was
a result of chronology.

> > 
> > 2) VMCI seems to require special message be sent by the transport when a
> >   datagram socket calls bind(). Under the h2g/g2h model, the transport
> >   is selected using the remote_addr which is set by connect(). At
> >   bind time there is no remote_addr because often no connect() has been
> >   called yet: the transport is null. Therefore, with a null transport
> >   there doesn't seem to be any good way for a datagram socket a tell the
> >   VMCI transport that it has just had bind() called upon it.
> 
> @Vishnu, @Bryan do you think we can avoid this in some way?
> 
> > 
> > Only transports with a special datagram fallback use-case such as VMCI
> > need to register VSOCK_TRANSPORT_F_DGRAM.
> 
> Maybe we should rename it in VSOCK_TRANSPORT_F_DGRAM_FALLBACK or
> something like that.
> 
> In any case, we definitely need to update the comment in
> include/net/af_vsock.h on top of VSOCK_TRANSPORT_F_DGRAM mentioning
> this.
> 

Agreed. I'll rename to VSOCK_TRANSPORT_F_DGRAM_FALLBACK, unless we find
there is a better way altogether.

> > 
> > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> > ---
> > drivers/vhost/vsock.c                   |  1 -
> > include/linux/virtio_vsock.h            |  2 -
> > net/vmw_vsock/af_vsock.c                | 78 +++++++++++++++++++++++++--------
> > net/vmw_vsock/hyperv_transport.c        |  6 ---
> > net/vmw_vsock/virtio_transport.c        |  1 -
> > net/vmw_vsock/virtio_transport_common.c |  7 ---
> > net/vmw_vsock/vsock_loopback.c          |  1 -
> > 7 files changed, 60 insertions(+), 36 deletions(-)
> > 
> > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> > index c8201c070b4b..8f0082da5e70 100644
> > --- a/drivers/vhost/vsock.c
> > +++ b/drivers/vhost/vsock.c
> > @@ -410,7 +410,6 @@ static struct virtio_transport vhost_transport = {
> > 		.cancel_pkt               = vhost_transport_cancel_pkt,
> > 
> > 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
> > -		.dgram_bind               = virtio_transport_dgram_bind,
> > 		.dgram_allow              = virtio_transport_dgram_allow,
> > 		.dgram_get_cid		  = virtio_transport_dgram_get_cid,
> > 		.dgram_get_port		  = virtio_transport_dgram_get_port,
> > diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> > index 23521a318cf0..73afa09f4585 100644
> > --- a/include/linux/virtio_vsock.h
> > +++ b/include/linux/virtio_vsock.h
> > @@ -216,8 +216,6 @@ void virtio_transport_notify_buffer_size(struct vsock_sock *vsk, u64 *val);
> > u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk);
> > bool virtio_transport_stream_is_active(struct vsock_sock *vsk);
> > bool virtio_transport_stream_allow(u32 cid, u32 port);
> > -int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> > -				struct sockaddr_vm *addr);
> > bool virtio_transport_dgram_allow(u32 cid, u32 port);
> > int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid);
> > int virtio_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port);
> > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> > index 74358f0b47fa..ef86765f3765 100644
> > --- a/net/vmw_vsock/af_vsock.c
> > +++ b/net/vmw_vsock/af_vsock.c
> > @@ -438,6 +438,18 @@ vsock_connectible_lookup_transport(unsigned int cid, __u8 flags)
> > 	return transport;
> > }
> > 
> > +static const struct vsock_transport *
> > +vsock_dgram_lookup_transport(unsigned int cid, __u8 flags)
> > +{
> > +	const struct vsock_transport *transport;
> > +
> > +	transport = vsock_connectible_lookup_transport(cid, flags);
> > +	if (transport)
> > +		return transport;
> > +
> > +	return transport_dgram;
> > +}
> > +
> > /* Assign a transport to a socket and call the .init transport callback.
> >  *
> >  * Note: for connection oriented socket this must be called when vsk->remote_addr
> > @@ -474,7 +486,8 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
> > 
> > 	switch (sk->sk_type) {
> > 	case SOCK_DGRAM:
> > -		new_transport = transport_dgram;
> > +		new_transport = vsock_dgram_lookup_transport(remote_cid,
> > +							     remote_flags);
> > 		break;
> > 	case SOCK_STREAM:
> > 	case SOCK_SEQPACKET:
> > @@ -691,6 +704,9 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
> > static int __vsock_bind_dgram(struct vsock_sock *vsk,
> > 			      struct sockaddr_vm *addr)
> > {
> > +	if (!vsk->transport || !vsk->transport->dgram_bind)
> > +		return -EINVAL;
> > +
> > 	return vsk->transport->dgram_bind(vsk, addr);
> > }
> > 
> > @@ -1172,19 +1188,24 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
> > 
> > 	lock_sock(sk);
> > 
> > -	transport = vsk->transport;
> > -
> > -	err = vsock_auto_bind(vsk);
> > -	if (err)
> > -		goto out;
> > -
> > -
> > 	/* If the provided message contains an address, use that.  Otherwise
> > 	 * fall back on the socket's remote handle (if it has been connected).
> > 	 */
> > 	if (msg->msg_name &&
> > 	    vsock_addr_cast(msg->msg_name, msg->msg_namelen,
> > 			    &remote_addr) == 0) {
> > +		transport = vsock_dgram_lookup_transport(remote_addr->svm_cid,
> > +							 remote_addr->svm_flags);
> > +		if (!transport) {
> > +			err = -EINVAL;
> > +			goto out;
> > +		}
> > +
> > +		if (!try_module_get(transport->module)) {
> > +			err = -ENODEV;
> > +			goto out;
> > +		}
> > +
> > 		/* Ensure this address is of the right type and is a valid
> > 		 * destination.
> > 		 */
> > @@ -1193,11 +1214,27 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
> > 			remote_addr->svm_cid = transport->get_local_cid();
> > 
> 
> From here ...
> 
> > 		if (!vsock_addr_bound(remote_addr)) {
> > +			module_put(transport->module);
> > +			err = -EINVAL;
> > +			goto out;
> > +		}
> > +
> > +		if (!transport->dgram_allow(remote_addr->svm_cid,
> > +					    remote_addr->svm_port)) {
> > +			module_put(transport->module);
> > 			err = -EINVAL;
> > 			goto out;
> > 		}
> > +
> > +		err = transport->dgram_enqueue(vsk, remote_addr, msg, len);
> 
> ... to here, looks like duplicate code, can we get it out of the if block?
> 

Yes, I think using something like this:

[...]
	bool module_got = false;

[...]
		if (!try_module_get(transport->module)) {
			err = -ENODEV;
			goto out;
		}
		module_got = true;

[...]

out:
	if (likely(transport && !err && module_got))
		module_put(transport->module)

> > +		module_put(transport->module);
> > 	} else if (sock->state == SS_CONNECTED) {
> > 		remote_addr = &vsk->remote_addr;
> > +		transport = vsk->transport;
> > +
> > +		err = vsock_auto_bind(vsk);
> > +		if (err)
> > +			goto out;
> > 
> > 		if (remote_addr->svm_cid == VMADDR_CID_ANY)
> > 			remote_addr->svm_cid = transport->get_local_cid();
> > @@ -1205,23 +1242,23 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
> > 		/* XXX Should connect() or this function ensure remote_addr is
> > 		 * bound?
> > 		 */
> > -		if (!vsock_addr_bound(&vsk->remote_addr)) {
> > +		if (!vsock_addr_bound(remote_addr)) {
> > 			err = -EINVAL;
> > 			goto out;
> > 		}
> > -	} else {
> > -		err = -EINVAL;
> > -		goto out;
> > -	}
> > 
> > -	if (!transport->dgram_allow(remote_addr->svm_cid,
> > -				    remote_addr->svm_port)) {
> > +		if (!transport->dgram_allow(remote_addr->svm_cid,
> > +					    remote_addr->svm_port)) {
> > +			err = -EINVAL;
> > +			goto out;
> > +		}
> > +
> > +		err = transport->dgram_enqueue(vsk, remote_addr, msg, len);
> > +	} else {
> > 		err = -EINVAL;
> > 		goto out;
> > 	}
> > 
> > -	err = transport->dgram_enqueue(vsk, remote_addr, msg, len);
> > -
> > out:
> > 	release_sock(sk);
> > 	return err;
> > @@ -1255,13 +1292,18 @@ static int vsock_dgram_connect(struct socket *sock,
> > 	if (err)
> > 		goto out;
> > 
> > +	memcpy(&vsk->remote_addr, remote_addr, sizeof(vsk->remote_addr));
> > +
> > +	err = vsock_assign_transport(vsk, NULL);
> > +	if (err)
> > +		goto out;
> > +
> > 	if (!vsk->transport->dgram_allow(remote_addr->svm_cid,
> > 					 remote_addr->svm_port)) {
> > 		err = -EINVAL;
> > 		goto out;
> > 	}
> > 
> > -	memcpy(&vsk->remote_addr, remote_addr, sizeof(vsk->remote_addr));
> > 	sock->state = SS_CONNECTED;
> > 
> > 	/* sock map disallows redirection of non-TCP sockets with sk_state !=
> > diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
> > index ff6e87e25fa0..c00bc5da769a 100644
> > --- a/net/vmw_vsock/hyperv_transport.c
> > +++ b/net/vmw_vsock/hyperv_transport.c
> > @@ -551,11 +551,6 @@ static void hvs_destruct(struct vsock_sock *vsk)
> > 	kfree(hvs);
> > }
> > 
> > -static int hvs_dgram_bind(struct vsock_sock *vsk, struct sockaddr_vm *addr)
> > -{
> > -	return -EOPNOTSUPP;
> > -}
> > -
> > static int hvs_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
> > {
> > 	return -EOPNOTSUPP;
> > @@ -841,7 +836,6 @@ static struct vsock_transport hvs_transport = {
> > 	.connect                  = hvs_connect,
> > 	.shutdown                 = hvs_shutdown,
> > 
> > -	.dgram_bind               = hvs_dgram_bind,
> > 	.dgram_get_cid		  = hvs_dgram_get_cid,
> > 	.dgram_get_port		  = hvs_dgram_get_port,
> > 	.dgram_get_length	  = hvs_dgram_get_length,
> > diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> > index 5763cdf13804..1b7843a7779a 100644
> > --- a/net/vmw_vsock/virtio_transport.c
> > +++ b/net/vmw_vsock/virtio_transport.c
> > @@ -428,7 +428,6 @@ static struct virtio_transport virtio_transport = {
> > 		.shutdown                 = virtio_transport_shutdown,
> > 		.cancel_pkt               = virtio_transport_cancel_pkt,
> > 
> > -		.dgram_bind               = virtio_transport_dgram_bind,
> > 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
> > 		.dgram_allow              = virtio_transport_dgram_allow,
> > 		.dgram_get_cid		  = virtio_transport_dgram_get_cid,
> > diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> > index e6903c719964..d5a3c8efe84b 100644
> > --- a/net/vmw_vsock/virtio_transport_common.c
> > +++ b/net/vmw_vsock/virtio_transport_common.c
> > @@ -790,13 +790,6 @@ bool virtio_transport_stream_allow(u32 cid, u32 port)
> > }
> > EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
> > 
> > -int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> > -				struct sockaddr_vm *addr)
> > -{
> > -	return -EOPNOTSUPP;
> > -}
> > -EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
> > -
> > int virtio_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
> > {
> > 	return -EOPNOTSUPP;
> > diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
> > index 2f3cabc79ee5..e9de45a26fbd 100644
> > --- a/net/vmw_vsock/vsock_loopback.c
> > +++ b/net/vmw_vsock/vsock_loopback.c
> > @@ -61,7 +61,6 @@ static struct virtio_transport loopback_transport = {
> > 		.shutdown                 = virtio_transport_shutdown,
> > 		.cancel_pkt               = vsock_loopback_cancel_pkt,
> > 
> > -		.dgram_bind               = virtio_transport_dgram_bind,
> > 		.dgram_enqueue            = virtio_transport_dgram_enqueue,
> > 		.dgram_allow              = virtio_transport_dgram_allow,
> > 		.dgram_get_cid		  = virtio_transport_dgram_get_cid,
> > 
> > -- 
> > 2.30.2
> > 
> 
> The rest LGTM!
> 
> Stefano

Thanks,
Bobby

^ permalink raw reply

* Re: [PATCH RFC net-next v4 8/8] tests: add vsock dgram tests
From: Arseniy Krasnov @ 2023-06-23 18:34 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Bobby Eshleman, Stefan Hajnoczi, Stefano Garzarella,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers, Dan Carpenter, Simon Horman, kvm,
	virtualization, netdev, linux-kernel, linux-hyperv, bpf,
	Jiang Wang
In-Reply-To: <ZJTWYRGd95xl+yRE@bullseye>



On 23.06.2023 02:16, Bobby Eshleman wrote:
> On Sun, Jun 11, 2023 at 11:54:57PM +0300, Arseniy Krasnov wrote:
>> Hello Bobby!
>>
>> Sorry, may be I become a little bit annoying:), but I tried to run vsock_test with
>> this v4 version, and again get the same crash:
> 
> Haha not annoying at all. I appreciate the testing!
> 
>>
>> # cat client.sh 
>> ./vsock_test  --mode=client --control-host=192.168.1.1 --control-port=12345 --peer-cid=2
>> # ./client.sh 
>> Control socket connected to 192.168.1.1:12345.
>> 0 - SOCK_STREAM connection reset...[   20.065237] BUG: kernel NULL pointer dereference, addre0
>> [   20.065895] #PF: supervisor read access in kernel mode
>> [   20.065895] #PF: error_code(0x0000) - not-present page
>> [   20.065895] PGD 0 P4D 0 
>> [   20.065895] Oops: 0000 [#1] PREEMPT SMP PTI
>> [   20.065895] CPU: 0 PID: 111 Comm: vsock_test Not tainted 6.4.0-rc3-gefcccba07069 #385
>> [   20.065895] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd44
>> [   20.065895] RIP: 0010:static_key_count+0x0/0x20
>> [   20.065895] Code: 04 4c 8b 46 08 49 29 c0 4c 01 c8 4c 89 47 08 89 0e 89 56 04 48 89 46 08 f
>> [   20.065895] RSP: 0018:ffffbbb000223dc0 EFLAGS: 00010202
>> [   20.065895] RAX: ffffffff85709880 RBX: ffffffffc0079140 RCX: 0000000000000000
>> [   20.065895] RDX: ffff9f73c2175700 RSI: 0000000000000000 RDI: 0000000000000000
>> [   20.065895] RBP: ffff9f73c2385900 R08: ffffbbb000223d30 R09: ffff9f73ff896000
>> [   20.065895] R10: 0000000000001000 R11: 0000000000000000 R12: ffffbbb000223e80
>> [   20.065895] R13: 0000000000000000 R14: 0000000000000002 R15: ffff9f73c1cfaa80
>> [   20.065895] FS:  00007f1ad82f55c0(0000) GS:ffff9f73fe400000(0000) knlGS:0000000000000000
>> [   20.065895] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   20.065895] CR2: 0000000000000000 CR3: 000000003f954000 CR4: 00000000000006f0
>> [   20.065895] Call Trace:
>> [   20.065895]  <TASK>
>> [   20.065895]  once_deferred+0xd/0x30
>> [   20.065895]  vsock_assign_transport+0x9a/0x1b0 [vsock]
>> [   20.065895]  vsock_connect+0xb4/0x3a0 [vsock]
>> [   20.065895]  ? var_wake_function+0x60/0x60
>> [   20.065895]  __sys_connect+0x9e/0xd0
>> [   20.065895]  ? _raw_spin_unlock_irq+0xe/0x30
>> [   20.065895]  ? do_setitimer+0x128/0x1f0
>> [   20.065895]  ? alarm_setitimer+0x4c/0x90
>> [   20.065895]  ? fpregs_assert_state_consistent+0x1d/0x50
>> [   20.065895]  ? exit_to_user_mode_prepare+0x36/0x130
>> [   20.065895]  __x64_sys_connect+0x11/0x20
>> [   20.065895]  do_syscall_64+0x3b/0xc0
>> [   20.065895]  entry_SYSCALL_64_after_hwframe+0x4b/0xb5
>> [   20.065895] RIP: 0033:0x7f1ad822dd13
>> [   20.065895] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 64 8
>> [   20.065895] RSP: 002b:00007ffc513e3c98 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
>> [   20.065895] RAX: ffffffffffffffda RBX: 000055aed298e020 RCX: 00007f1ad822dd13
>> [   20.065895] RDX: 0000000000000010 RSI: 00007ffc513e3cb0 RDI: 0000000000000004
>> [   20.065895] RBP: 0000000000000004 R08: 000055aed32b2018 R09: 0000000000000000
>> [   20.065895] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>> [   20.065895] R13: 000055aed298acb1 R14: 00007ffc513e3cb0 R15: 00007ffc513e3d40
>> [   20.065895]  </TASK>
>> [   20.065895] Modules linked in: vsock_loopback vhost_vsock vmw_vsock_virtio_transport vmw_vb
> 
> ^ I'm guessing this is the difference between our setups. I have been
> going all built-in, let me see if I can reproduce w/ modules...

Ah ok, I think using modules is good practice here, because it could test that Your symbols
set is valid to work as modules and 'rmmod' could check problems with the forgotten references
for example for socket or skb. I'm working with modules most of the time.

> 
>> [   20.065895] CR2: 0000000000000000
>> [   20.154060] ---[ end trace 0000000000000000 ]---
>> [   20.155519] RIP: 0010:static_key_count+0x0/0x20
>> [   20.156932] Code: 04 4c 8b 46 08 49 29 c0 4c 01 c8 4c 89 47 08 89 0e 89 56 04 48 89 46 08 f
>> [   20.161367] RSP: 0018:ffffbbb000223dc0 EFLAGS: 00010202
>> [   20.162613] RAX: ffffffff85709880 RBX: ffffffffc0079140 RCX: 0000000000000000
>> [   20.164262] RDX: ffff9f73c2175700 RSI: 0000000000000000 RDI: 0000000000000000
>> [   20.165934] RBP: ffff9f73c2385900 R08: ffffbbb000223d30 R09: ffff9f73ff896000
>> [   20.167684] R10: 0000000000001000 R11: 0000000000000000 R12: ffffbbb000223e80
>> [   20.169427] R13: 0000000000000000 R14: 0000000000000002 R15: ffff9f73c1cfaa80
>> [   20.171109] FS:  00007f1ad82f55c0(0000) GS:ffff9f73fe400000(0000) knlGS:0000000000000000
>> [   20.173000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   20.174381] CR2: 0000000000000000 CR3: 000000003f954000 CR4: 00000000000006f0
>>
>> So, what HEAD do You use? May be You have some specific config (I use x86-64 defconfig + vsock/vhost
>> related things) ?
>>
> 
> For this series I used net-next:
> 	28cfea989d6f55c3d10608eba2a2bae609c5bf3e
> 
>> Thanks, Arseniy
>>
> 
> As always, thanks for the bug finding! I'll report back when I
> reproduce or with questions if I can't.

Thanks!

Thanks, Arseniy

> 
> Best,
> Bobby
> 
>>
>> On 10.06.2023 03:58, Bobby Eshleman wrote:
>>> From: Jiang Wang <jiang.wang@bytedance.com>
>>>
>>> This patch adds tests for vsock datagram.
>>>
>>> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
>>> Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
>>> ---
>>>  tools/testing/vsock/util.c       | 141 ++++++++++++-
>>>  tools/testing/vsock/util.h       |   6 +
>>>  tools/testing/vsock/vsock_test.c | 432 +++++++++++++++++++++++++++++++++++++++
>>>  3 files changed, 578 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
>>> index 01b636d3039a..811e70d7cf1e 100644
>>> --- a/tools/testing/vsock/util.c
>>> +++ b/tools/testing/vsock/util.c
>>> @@ -99,7 +99,8 @@ static int vsock_connect(unsigned int cid, unsigned int port, int type)
>>>  	int ret;
>>>  	int fd;
>>>  
>>> -	control_expectln("LISTENING");
>>> +	if (type != SOCK_DGRAM)
>>> +		control_expectln("LISTENING");
>>>  
>>>  	fd = socket(AF_VSOCK, type, 0);
>>>  
>>> @@ -130,6 +131,11 @@ int vsock_seqpacket_connect(unsigned int cid, unsigned int port)
>>>  	return vsock_connect(cid, port, SOCK_SEQPACKET);
>>>  }
>>>  
>>> +int vsock_dgram_connect(unsigned int cid, unsigned int port)
>>> +{
>>> +	return vsock_connect(cid, port, SOCK_DGRAM);
>>> +}
>>> +
>>>  /* Listen on <cid, port> and return the first incoming connection.  The remote
>>>   * address is stored to clientaddrp.  clientaddrp may be NULL.
>>>   */
>>> @@ -211,6 +217,34 @@ int vsock_seqpacket_accept(unsigned int cid, unsigned int port,
>>>  	return vsock_accept(cid, port, clientaddrp, SOCK_SEQPACKET);
>>>  }
>>>  
>>> +int vsock_dgram_bind(unsigned int cid, unsigned int port)
>>> +{
>>> +	union {
>>> +		struct sockaddr sa;
>>> +		struct sockaddr_vm svm;
>>> +	} addr = {
>>> +		.svm = {
>>> +			.svm_family = AF_VSOCK,
>>> +			.svm_port = port,
>>> +			.svm_cid = cid,
>>> +		},
>>> +	};
>>> +	int fd;
>>> +
>>> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
>>> +	if (fd < 0) {
>>> +		perror("socket");
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +
>>> +	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
>>> +		perror("bind");
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +
>>> +	return fd;
>>> +}
>>> +
>>>  /* Transmit one byte and check the return value.
>>>   *
>>>   * expected_ret:
>>> @@ -260,6 +294,57 @@ void send_byte(int fd, int expected_ret, int flags)
>>>  	}
>>>  }
>>>  
>>> +/* Transmit one byte and check the return value.
>>> + *
>>> + * expected_ret:
>>> + *  <0 Negative errno (for testing errors)
>>> + *   0 End-of-file
>>> + *   1 Success
>>> + */
>>> +void sendto_byte(int fd, const struct sockaddr *dest_addr, int len, int expected_ret,
>>> +		 int flags)
>>> +{
>>> +	const uint8_t byte = 'A';
>>> +	ssize_t nwritten;
>>> +
>>> +	timeout_begin(TIMEOUT);
>>> +	do {
>>> +		nwritten = sendto(fd, &byte, sizeof(byte), flags, dest_addr,
>>> +				  len);
>>> +		timeout_check("write");
>>> +	} while (nwritten < 0 && errno == EINTR);
>>> +	timeout_end();
>>> +
>>> +	if (expected_ret < 0) {
>>> +		if (nwritten != -1) {
>>> +			fprintf(stderr, "bogus sendto(2) return value %zd\n",
>>> +				nwritten);
>>> +			exit(EXIT_FAILURE);
>>> +		}
>>> +		if (errno != -expected_ret) {
>>> +			perror("write");
>>> +			exit(EXIT_FAILURE);
>>> +		}
>>> +		return;
>>> +	}
>>> +
>>> +	if (nwritten < 0) {
>>> +		perror("write");
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +	if (nwritten == 0) {
>>> +		if (expected_ret == 0)
>>> +			return;
>>> +
>>> +		fprintf(stderr, "unexpected EOF while sending byte\n");
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +	if (nwritten != sizeof(byte)) {
>>> +		fprintf(stderr, "bogus sendto(2) return value %zd\n", nwritten);
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +}
>>> +
>>>  /* Receive one byte and check the return value.
>>>   *
>>>   * expected_ret:
>>> @@ -313,6 +398,60 @@ void recv_byte(int fd, int expected_ret, int flags)
>>>  	}
>>>  }
>>>  
>>> +/* Receive one byte and check the return value.
>>> + *
>>> + * expected_ret:
>>> + *  <0 Negative errno (for testing errors)
>>> + *   0 End-of-file
>>> + *   1 Success
>>> + */
>>> +void recvfrom_byte(int fd, struct sockaddr *src_addr, socklen_t *addrlen,
>>> +		   int expected_ret, int flags)
>>> +{
>>> +	uint8_t byte;
>>> +	ssize_t nread;
>>> +
>>> +	timeout_begin(TIMEOUT);
>>> +	do {
>>> +		nread = recvfrom(fd, &byte, sizeof(byte), flags, src_addr, addrlen);
>>> +		timeout_check("read");
>>> +	} while (nread < 0 && errno == EINTR);
>>> +	timeout_end();
>>> +
>>> +	if (expected_ret < 0) {
>>> +		if (nread != -1) {
>>> +			fprintf(stderr, "bogus recvfrom(2) return value %zd\n",
>>> +				nread);
>>> +			exit(EXIT_FAILURE);
>>> +		}
>>> +		if (errno != -expected_ret) {
>>> +			perror("read");
>>> +			exit(EXIT_FAILURE);
>>> +		}
>>> +		return;
>>> +	}
>>> +
>>> +	if (nread < 0) {
>>> +		perror("read");
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +	if (nread == 0) {
>>> +		if (expected_ret == 0)
>>> +			return;
>>> +
>>> +		fprintf(stderr, "unexpected EOF while receiving byte\n");
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +	if (nread != sizeof(byte)) {
>>> +		fprintf(stderr, "bogus recvfrom(2) return value %zd\n", nread);
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +	if (byte != 'A') {
>>> +		fprintf(stderr, "unexpected byte read %c\n", byte);
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +}
>>> +
>>>  /* Run test cases.  The program terminates if a failure occurs. */
>>>  void run_tests(const struct test_case *test_cases,
>>>  	       const struct test_opts *opts)
>>> diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
>>> index fb99208a95ea..a69e128d120c 100644
>>> --- a/tools/testing/vsock/util.h
>>> +++ b/tools/testing/vsock/util.h
>>> @@ -37,13 +37,19 @@ void init_signals(void);
>>>  unsigned int parse_cid(const char *str);
>>>  int vsock_stream_connect(unsigned int cid, unsigned int port);
>>>  int vsock_seqpacket_connect(unsigned int cid, unsigned int port);
>>> +int vsock_dgram_connect(unsigned int cid, unsigned int port);
>>>  int vsock_stream_accept(unsigned int cid, unsigned int port,
>>>  			struct sockaddr_vm *clientaddrp);
>>>  int vsock_seqpacket_accept(unsigned int cid, unsigned int port,
>>>  			   struct sockaddr_vm *clientaddrp);
>>> +int vsock_dgram_bind(unsigned int cid, unsigned int port);
>>>  void vsock_wait_remote_close(int fd);
>>>  void send_byte(int fd, int expected_ret, int flags);
>>> +void sendto_byte(int fd, const struct sockaddr *dest_addr, int len, int expected_ret,
>>> +		 int flags);
>>>  void recv_byte(int fd, int expected_ret, int flags);
>>> +void recvfrom_byte(int fd, struct sockaddr *src_addr, socklen_t *addrlen,
>>> +		   int expected_ret, int flags);
>>>  void run_tests(const struct test_case *test_cases,
>>>  	       const struct test_opts *opts);
>>>  void list_tests(const struct test_case *test_cases);
>>> diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
>>> index ac1bd3ac1533..ded82d39ee5d 100644
>>> --- a/tools/testing/vsock/vsock_test.c
>>> +++ b/tools/testing/vsock/vsock_test.c
>>> @@ -1053,6 +1053,413 @@ static void test_stream_virtio_skb_merge_server(const struct test_opts *opts)
>>>  	close(fd);
>>>  }
>>>  
>>> +static void test_dgram_sendto_client(const struct test_opts *opts)
>>> +{
>>> +	union {
>>> +		struct sockaddr sa;
>>> +		struct sockaddr_vm svm;
>>> +	} addr = {
>>> +		.svm = {
>>> +			.svm_family = AF_VSOCK,
>>> +			.svm_port = 1234,
>>> +			.svm_cid = opts->peer_cid,
>>> +		},
>>> +	};
>>> +	int fd;
>>> +
>>> +	/* Wait for the server to be ready */
>>> +	control_expectln("BIND");
>>> +
>>> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
>>> +	if (fd < 0) {
>>> +		perror("socket");
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +
>>> +	sendto_byte(fd, &addr.sa, sizeof(addr.svm), 1, 0);
>>> +
>>> +	/* Notify the server that the client has finished */
>>> +	control_writeln("DONE");
>>> +
>>> +	close(fd);
>>> +}
>>> +
>>> +static void test_dgram_sendto_server(const struct test_opts *opts)
>>> +{
>>> +	union {
>>> +		struct sockaddr sa;
>>> +		struct sockaddr_vm svm;
>>> +	} addr = {
>>> +		.svm = {
>>> +			.svm_family = AF_VSOCK,
>>> +			.svm_port = 1234,
>>> +			.svm_cid = VMADDR_CID_ANY,
>>> +		},
>>> +	};
>>> +	int len = sizeof(addr.sa);
>>> +	int fd;
>>> +
>>> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
>>> +	if (fd < 0) {
>>> +		perror("socket");
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +
>>> +	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
>>> +		perror("bind");
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +
>>> +	/* Notify the client that the server is ready */
>>> +	control_writeln("BIND");
>>> +
>>> +	recvfrom_byte(fd, &addr.sa, &len, 1, 0);
>>> +
>>> +	/* Wait for the client to finish */
>>> +	control_expectln("DONE");
>>> +
>>> +	close(fd);
>>> +}
>>> +
>>> +static void test_dgram_connect_client(const struct test_opts *opts)
>>> +{
>>> +	union {
>>> +		struct sockaddr sa;
>>> +		struct sockaddr_vm svm;
>>> +	} addr = {
>>> +		.svm = {
>>> +			.svm_family = AF_VSOCK,
>>> +			.svm_port = 1234,
>>> +			.svm_cid = opts->peer_cid,
>>> +		},
>>> +	};
>>> +	int ret;
>>> +	int fd;
>>> +
>>> +	/* Wait for the server to be ready */
>>> +	control_expectln("BIND");
>>> +
>>> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
>>> +	if (fd < 0) {
>>> +		perror("bind");
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +
>>> +	ret = connect(fd, &addr.sa, sizeof(addr.svm));
>>> +	if (ret < 0) {
>>> +		perror("connect");
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +
>>> +	send_byte(fd, 1, 0);
>>> +
>>> +	/* Notify the server that the client has finished */
>>> +	control_writeln("DONE");
>>> +
>>> +	close(fd);
>>> +}
>>> +
>>> +static void test_dgram_connect_server(const struct test_opts *opts)
>>> +{
>>> +	test_dgram_sendto_server(opts);
>>> +}
>>> +
>>> +static void test_dgram_multiconn_sendto_client(const struct test_opts *opts)
>>> +{
>>> +	union {
>>> +		struct sockaddr sa;
>>> +		struct sockaddr_vm svm;
>>> +	} addr = {
>>> +		.svm = {
>>> +			.svm_family = AF_VSOCK,
>>> +			.svm_port = 1234,
>>> +			.svm_cid = opts->peer_cid,
>>> +		},
>>> +	};
>>> +	int fds[MULTICONN_NFDS];
>>> +	int i;
>>> +
>>> +	/* Wait for the server to be ready */
>>> +	control_expectln("BIND");
>>> +
>>> +	for (i = 0; i < MULTICONN_NFDS; i++) {
>>> +		fds[i] = socket(AF_VSOCK, SOCK_DGRAM, 0);
>>> +		if (fds[i] < 0) {
>>> +			perror("socket");
>>> +			exit(EXIT_FAILURE);
>>> +		}
>>> +	}
>>> +
>>> +	for (i = 0; i < MULTICONN_NFDS; i++)
>>> +		sendto_byte(fds[i], &addr.sa, sizeof(addr.svm), 1, 0);
>>> +
>>> +	/* Notify the server that the client has finished */
>>> +	control_writeln("DONE");
>>> +
>>> +	for (i = 0; i < MULTICONN_NFDS; i++)
>>> +		close(fds[i]);
>>> +}
>>> +
>>> +static void test_dgram_multiconn_sendto_server(const struct test_opts *opts)
>>> +{
>>> +	union {
>>> +		struct sockaddr sa;
>>> +		struct sockaddr_vm svm;
>>> +	} addr = {
>>> +		.svm = {
>>> +			.svm_family = AF_VSOCK,
>>> +			.svm_port = 1234,
>>> +			.svm_cid = VMADDR_CID_ANY,
>>> +		},
>>> +	};
>>> +	int len = sizeof(addr.sa);
>>> +	int fd;
>>> +	int i;
>>> +
>>> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
>>> +	if (fd < 0) {
>>> +		perror("socket");
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +
>>> +	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
>>> +		perror("bind");
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +
>>> +	/* Notify the client that the server is ready */
>>> +	control_writeln("BIND");
>>> +
>>> +	for (i = 0; i < MULTICONN_NFDS; i++)
>>> +		recvfrom_byte(fd, &addr.sa, &len, 1, 0);
>>> +
>>> +	/* Wait for the client to finish */
>>> +	control_expectln("DONE");
>>> +
>>> +	close(fd);
>>> +}
>>> +
>>> +static void test_dgram_multiconn_send_client(const struct test_opts *opts)
>>> +{
>>> +	int fds[MULTICONN_NFDS];
>>> +	int i;
>>> +
>>> +	/* Wait for the server to be ready */
>>> +	control_expectln("BIND");
>>> +
>>> +	for (i = 0; i < MULTICONN_NFDS; i++) {
>>> +		fds[i] = vsock_dgram_connect(opts->peer_cid, 1234);
>>> +		if (fds[i] < 0) {
>>> +			perror("socket");
>>> +			exit(EXIT_FAILURE);
>>> +		}
>>> +	}
>>> +
>>> +	for (i = 0; i < MULTICONN_NFDS; i++)
>>> +		send_byte(fds[i], 1, 0);
>>> +
>>> +	/* Notify the server that the client has finished */
>>> +	control_writeln("DONE");
>>> +
>>> +	for (i = 0; i < MULTICONN_NFDS; i++)
>>> +		close(fds[i]);
>>> +}
>>> +
>>> +static void test_dgram_multiconn_send_server(const struct test_opts *opts)
>>> +{
>>> +	union {
>>> +		struct sockaddr sa;
>>> +		struct sockaddr_vm svm;
>>> +	} addr = {
>>> +		.svm = {
>>> +			.svm_family = AF_VSOCK,
>>> +			.svm_port = 1234,
>>> +			.svm_cid = VMADDR_CID_ANY,
>>> +		},
>>> +	};
>>> +	int fd;
>>> +	int i;
>>> +
>>> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
>>> +	if (fd < 0) {
>>> +		perror("socket");
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +
>>> +	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
>>> +		perror("bind");
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +
>>> +	/* Notify the client that the server is ready */
>>> +	control_writeln("BIND");
>>> +
>>> +	for (i = 0; i < MULTICONN_NFDS; i++)
>>> +		recv_byte(fd, 1, 0);
>>> +
>>> +	/* Wait for the client to finish */
>>> +	control_expectln("DONE");
>>> +
>>> +	close(fd);
>>> +}
>>> +
>>> +static void test_dgram_msg_bounds_client(const struct test_opts *opts)
>>> +{
>>> +	unsigned long recv_buf_size;
>>> +	int page_size;
>>> +	int msg_cnt;
>>> +	int fd;
>>> +
>>> +	fd = vsock_dgram_connect(opts->peer_cid, 1234);
>>> +	if (fd < 0) {
>>> +		perror("connect");
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +
>>> +	/* Let the server know the client is ready */
>>> +	control_writeln("CLNTREADY");
>>> +
>>> +	msg_cnt = control_readulong();
>>> +	recv_buf_size = control_readulong();
>>> +
>>> +	/* Wait, until receiver sets buffer size. */
>>> +	control_expectln("SRVREADY");
>>> +
>>> +	page_size = getpagesize();
>>> +
>>> +	for (int i = 0; i < msg_cnt; i++) {
>>> +		unsigned long curr_hash;
>>> +		ssize_t send_size;
>>> +		size_t buf_size;
>>> +		void *buf;
>>> +
>>> +		/* Use "small" buffers and "big" buffers. */
>>> +		if (i & 1)
>>> +			buf_size = page_size +
>>> +					(rand() % (MAX_MSG_SIZE - page_size));
>>> +		else
>>> +			buf_size = 1 + (rand() % page_size);
>>> +
>>> +		buf_size = min(buf_size, recv_buf_size);
>>> +
>>> +		buf = malloc(buf_size);
>>> +
>>> +		if (!buf) {
>>> +			perror("malloc");
>>> +			exit(EXIT_FAILURE);
>>> +		}
>>> +
>>> +		memset(buf, rand() & 0xff, buf_size);
>>> +		/* Set at least one MSG_EOR + some random. */
>>> +
>>> +		send_size = send(fd, buf, buf_size, 0);
>>> +
>>> +		if (send_size < 0) {
>>> +			perror("send");
>>> +			exit(EXIT_FAILURE);
>>> +		}
>>> +
>>> +		if (send_size != buf_size) {
>>> +			fprintf(stderr, "Invalid send size\n");
>>> +			exit(EXIT_FAILURE);
>>> +		}
>>> +
>>> +		/* In theory the implementation isn't required to transmit
>>> +		 * these packets in order, so we use this SYNC control message
>>> +		 * so that server and client coordinate sending and receiving
>>> +		 * one packet at a time. The client sends a packet and waits
>>> +		 * until it has been received before sending another.
>>> +		 */
>>> +		control_writeln("PKTSENT");
>>> +		control_expectln("PKTRECV");
>>> +
>>> +		/* Send the server a hash of the packet */
>>> +		curr_hash = hash_djb2(buf, buf_size);
>>> +		control_writeulong(curr_hash);
>>> +		free(buf);
>>> +	}
>>> +
>>> +	control_writeln("SENDDONE");
>>> +	close(fd);
>>> +}
>>> +
>>> +static void test_dgram_msg_bounds_server(const struct test_opts *opts)
>>> +{
>>> +	const unsigned long msg_cnt = 16;
>>> +	unsigned long sock_buf_size;
>>> +	struct msghdr msg = {0};
>>> +	struct iovec iov = {0};
>>> +	char buf[MAX_MSG_SIZE];
>>> +	socklen_t len;
>>> +	int fd;
>>> +	int i;
>>> +
>>> +	fd = vsock_dgram_bind(VMADDR_CID_ANY, 1234);
>>> +
>>> +	if (fd < 0) {
>>> +		perror("bind");
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +
>>> +	/* Set receive buffer to maximum */
>>> +	sock_buf_size = -1;
>>> +	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
>>> +		       &sock_buf_size, sizeof(sock_buf_size))) {
>>> +		perror("setsockopt(SO_RECVBUF)");
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +
>>> +	/* Retrieve the receive buffer size */
>>> +	len = sizeof(sock_buf_size);
>>> +	if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF,
>>> +		       &sock_buf_size, &len)) {
>>> +		perror("getsockopt(SO_RECVBUF)");
>>> +		exit(EXIT_FAILURE);
>>> +	}
>>> +
>>> +	/* Client ready to receive parameters */
>>> +	control_expectln("CLNTREADY");
>>> +
>>> +	control_writeulong(msg_cnt);
>>> +	control_writeulong(sock_buf_size);
>>> +
>>> +	/* Ready to receive data. */
>>> +	control_writeln("SRVREADY");
>>> +
>>> +	iov.iov_base = buf;
>>> +	iov.iov_len = sizeof(buf);
>>> +	msg.msg_iov = &iov;
>>> +	msg.msg_iovlen = 1;
>>> +
>>> +	for (i = 0; i < msg_cnt; i++) {
>>> +		unsigned long remote_hash;
>>> +		unsigned long curr_hash;
>>> +		ssize_t recv_size;
>>> +
>>> +		control_expectln("PKTSENT");
>>> +		recv_size = recvmsg(fd, &msg, 0);
>>> +		control_writeln("PKTRECV");
>>> +
>>> +		if (!recv_size)
>>> +			break;
>>> +
>>> +		if (recv_size < 0) {
>>> +			perror("recvmsg");
>>> +			exit(EXIT_FAILURE);
>>> +		}
>>> +
>>> +		curr_hash = hash_djb2(msg.msg_iov[0].iov_base, recv_size);
>>> +		remote_hash = control_readulong();
>>> +
>>> +		if (curr_hash != remote_hash) {
>>> +			fprintf(stderr, "Message bounds broken\n");
>>> +			exit(EXIT_FAILURE);
>>> +		}
>>> +	}
>>> +
>>> +	close(fd);
>>> +}
>>> +
>>>  static struct test_case test_cases[] = {
>>>  	{
>>>  		.name = "SOCK_STREAM connection reset",
>>> @@ -1128,6 +1535,31 @@ static struct test_case test_cases[] = {
>>>  		.run_client = test_stream_virtio_skb_merge_client,
>>>  		.run_server = test_stream_virtio_skb_merge_server,
>>>  	},
>>> +	{
>>> +		.name = "SOCK_DGRAM client sendto",
>>> +		.run_client = test_dgram_sendto_client,
>>> +		.run_server = test_dgram_sendto_server,
>>> +	},
>>> +	{
>>> +		.name = "SOCK_DGRAM client connect",
>>> +		.run_client = test_dgram_connect_client,
>>> +		.run_server = test_dgram_connect_server,
>>> +	},
>>> +	{
>>> +		.name = "SOCK_DGRAM multiple connections using sendto",
>>> +		.run_client = test_dgram_multiconn_sendto_client,
>>> +		.run_server = test_dgram_multiconn_sendto_server,
>>> +	},
>>> +	{
>>> +		.name = "SOCK_DGRAM multiple connections using send",
>>> +		.run_client = test_dgram_multiconn_send_client,
>>> +		.run_server = test_dgram_multiconn_send_server,
>>> +	},
>>> +	{
>>> +		.name = "SOCK_DGRAM msg bounds",
>>> +		.run_client = test_dgram_msg_bounds_client,
>>> +		.run_server = test_dgram_msg_bounds_server,
>>> +	},
>>>  	{},
>>>  };
>>>  
>>>

^ permalink raw reply

* Re: [PATCH V2 net] net: mana: Fix MANA VF unload when host is unresponsive
From: Simon Horman @ 2023-06-23 17:44 UTC (permalink / raw)
  To: souradeep chakrabarti
  Cc: kys, haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni,
	longli, sharmaajay, leon, cai.huoqing, ssengar, vkuznets, tglx,
	linux-hyperv, netdev, linux-kernel, linux-rdma, stable,
	schakrabarti
In-Reply-To: <1687505355-29212-1-git-send-email-schakrabarti@linux.microsoft.com>

On Fri, Jun 23, 2023 at 12:29:15AM -0700, souradeep chakrabarti wrote:
> From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> 
> This patch addresses  the VF unload issue, where mana_dealloc_queues()
> gets stuck in infinite while loop, because of host unresponsiveness.
> It adds a timeout in the while loop, to fix it.
> 
> Also this patch adds a new attribute in mana_context, which gets set when
> mana_hwc_send_request() hits a timeout because of host unresponsiveness.
> This flag then helps to avoid the timeouts in successive calls.
> 
> Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>

...

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index d907727c7b7a..cb2080b3a00c 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c

...

> @@ -2348,13 +2351,26 @@ static int mana_dealloc_queues(struct net_device *ndev)
>  	 *
>  	 * Drain all the in-flight TX packets
>  	 */
> +
> +	timeout = jiffies + 120 * HZ;
>  	for (i = 0; i < apc->num_queues; i++) {
>  		txq = &apc->tx_qp[i].txq;
> -
> -		while (atomic_read(&txq->pending_sends) > 0)
> +		while (atomic_read(&txq->pending_sends) > 0 &&
> +		       time_before(jiffies, timeout)) {
>  			usleep_range(1000, 2000);
> +		}

Hi Souradeep,

minor feedback from my side, as it seems there will be a new version anyway:
I think the braces - '{' '}' - are unnecessary above.

...

-- 
pw-bot: cr


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox